Joint probabilistic-logical refinement of multiple protein feature predictors
Department of Information Engineering and Computer Science, Università degli Studi di Trento, Trento, Italy
BMC Bioinformatics 2014, 15:16 doi:10.1186/1471-2105-15-16Published: 15 January 2014
Computational methods for the prediction of protein features from sequence are a long-standing focus of bioinformatics. A key observation is that several protein features are closely inter-related, that is, they are conditioned on each other. Researchers invested a lot of effort into designing predictors that exploit this fact. Most existing methods leverage inter-feature constraints by including known (or predicted) correlated features as inputs to the predictor, thus conditioning the result.
By including correlated features as inputs, existing methods only rely on one side of the relation: the output feature is conditioned on the known input features. Here we show how to jointly improve the outputs of multiple correlated predictors by means of a probabilistic-logical consistency layer. The logical layer enforces a set of weighted first-order rules encoding biological constraints between the features, and improves the raw predictions so that they least violate the constraints. In particular, we show how to integrate three stand-alone predictors of correlated features: subcellular localization (Loctree [J Mol Biol 348:85–100, 2005]), disulfide bonding state (Disulfind [Nucleic Acids Res 34:W177–W181, 2006]), and metal bonding state (MetalDetector [Bioinformatics 24:2094–2095, 2008]), in a way that takes into account the respective strengths and weaknesses, and does not require any change to the predictors themselves. We also compare our methodology against two alternative refinement pipelines based on state-of-the-art sequential prediction methods.
The proposed framework is able to improve the performance of the underlying predictors by removing rule violations. We show that different predictors offer complementary advantages, and our method is able to integrate them using non-trivial constraints, generating more consistent predictions. In addition, our framework is fully general, and could in principle be applied to a vast array of heterogeneous predictions without requiring any change to the underlying software. On the other hand, the alternative strategies are more specific and tend to favor one task at the expense of the others, as shown by our experimental evaluation. The ultimate goal of our framework is to seamlessly integrate full prediction suites, such as Distill [BMC Bioinformatics 7:402, 2006] and PredictProtein [Nucleic Acids Res 32:W321–W326, 2004].