School of Computing, University of Leeds, Leeds, LS2 9JT, UK
Institute of Molecular and Cellular Biology, University of Leeds, Leeds, LS2 9JT, UK
Abstract
Background
A number of methods that use both protein structural and evolutionary information are available to predict the functional consequences of missense mutations. However, many of these methods break down if either of the two types of data is missing. Furthermore, there is a lack of rigorous assessment of how important the different factors are to prediction.
Results
Here we use Bayesian networks to predict whether or not a missense mutation will affect the function of the protein. Bayesian networks provide a concise representation for inferring models from data, and are known to generalise well to new data. More importantly, they can handle the noisy, incomplete and uncertain nature of biological data. Our Bayesian network achieved comparable performance with previous machine learning methods. The predictive performance of learned model structures was no better than a naïve Bayes classifier. However, analysis of the posterior distribution of model structures allows biologically meaningful interpretation of relationships between the input variables.
Conclusion
The ability of the Bayesian network to make predictions when only structural or evolutionary data was observed allowed us to conclude that structural information is a significantly better predictor of the functional consequences of a missense mutation than evolutionary information, for the dataset used. Analysis of the posterior distribution of model structures revealed that the top three strongest connections with the class node all involved structural nodes. With this in mind, we derived a simplified Bayesian network that used just these three structural descriptors, with comparable performance to that of an all node network.
Background
An important aspect of the postgenomic era is to understand the biological effects of inherited variations between individuals. For instance, a key problem for the pharmaceutical industry is to understand variations in drug treatment responses among individuals at the molecular level. A single nucleotide polymorphism (SNP) is a mutation, such as an insertion, deletion or substitution, observed in the genomic DNA of individuals of the same species. When the SNP results in an amino acid substitution in the protein product of the gene, it is called a missense mutation. A missense mutation can have various phenotypic effects although we restrict ourselves here to the simplified task of predicting whether a missense mutation has an effect or no effect on protein function.
The wealth of SNP data now available
All these methods require either structural or evolutionary data to be available for predictions to be possible. However, there are many proteins that lack any detectable sequence homology to known proteins or a solved 3D structure. In these cases, many prediction methods break down. Therefore a method is needed that can combine both structural and evolutionary information, but at the same time tolerate the absence of either without manual intervention. With this in mind we have applied Bayesian networks to the problem of predicting the consequences of a missense mutation on protein function. Bayesian networks are probabilistic graphical models which provide a compact representation for expressing joint probability distributions and for performing inference. The representation and use of probability theory makes Bayesian networks suitable for learning from incomplete datasets, expressing causal relationships, combining domain knowledge and data, and avoiding overfitting a model to training data. As such, a host of applications in computational biology (for example, see
Bayesian networks
Our recent primer
Learning from complete data
The Bayesian learning paradigm can be summarised as:

P(x | D) = ∫ P(x | θ) P(θ | D) dθ

I.e., the predictive distribution for a new example observation x, given a set of training examples D, is an average of the model's predictions weighted by the posterior probability of the model parameters θ.
Learning from incomplete data
One advantage of using Bayesian networks is that it is possible to learn model parameters from incomplete training data, i.e. in cases where values of some variables are missing. To learn from incomplete data, we used the Expectation-Maximisation (EM) algorithm, which estimates the missing values by computing their expected values, and updates the parameters using these expected values as if they were observed values.
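As a concrete illustration of EM parameter learning with missing values, the sketch below estimates the conditional probability tables of a toy chain A → B → C in which B is sometimes unobserved. The structure, records and starting values are all illustrative, not taken from this study:

```python
# Toy EM for the CPTs of a chain A -> B -> C where B is sometimes missing.
# Each record is (a, b, c); b may be None (unobserved); values are binary.
data = [
    (1, 1, 1), (1, 1, 1), (1, 0, 0), (0, 0, 0), (0, 0, 1),
    (1, None, 1), (0, None, 0), (1, None, 0), (0, None, 1), (1, None, 1),
]

p_b_given_a = {0: 0.5, 1: 0.5}   # current estimate of P(B=1 | A=a)
p_c_given_b = {0: 0.5, 1: 0.5}   # current estimate of P(C=1 | B=b)

for _ in range(100):
    nb = {0: [0.0, 0.0], 1: [0.0, 0.0]}   # a -> [expected count of B=1, total]
    nc = {0: [0.0, 0.0], 1: [0.0, 0.0]}   # b -> [expected count of C=1, total]
    for a, b, c in data:
        if b is None:
            # E-step: P(B=1 | a, c) under the current parameter estimates
            like1 = p_b_given_a[a] * (p_c_given_b[1] if c else 1 - p_c_given_b[1])
            like0 = (1 - p_b_given_a[a]) * (p_c_given_b[0] if c else 1 - p_c_given_b[0])
            eb = like1 / (like1 + like0)
        else:
            eb = float(b)
        nb[a][0] += eb
        nb[a][1] += 1.0
        nc[1][0] += eb * c
        nc[1][1] += eb
        nc[0][0] += (1.0 - eb) * c
        nc[0][1] += 1.0 - eb
    # M-step: re-estimate the CPTs from the expected counts
    p_b_given_a = {a: nb[a][0] / nb[a][1] for a in nb}
    p_c_given_b = {b: nc[b][0] / nc[b][1] for b in nc}
```

The expected counts for the missing B values play exactly the role of observed counts in the M-step, which is the behaviour described above.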
Structure learning
A fully connected network structure captures relationships (dependencies) between all of the variables. A simpler, more compact model may be produced if conditional independencies between variables are learned. To do this, we used the greedy search algorithm from the Matlab-based structure learning package (SLP)
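The scores driving such a greedy search trade data fit against model complexity. A minimal sketch of the BIC score in one common convention (the SLP package's exact form may differ):

```python
import math

def cpt_params(n_states, parent_states):
    """Free parameters in one node's conditional probability table:
    (states - 1) independent probabilities per parent configuration."""
    configs = 1
    for s in parent_states:
        configs *= s
    return (n_states - 1) * configs

def bic(log_likelihood, n_params, n_samples):
    """BIC network score: data fit minus a complexity penalty that
    grows with the number of free parameters."""
    return log_likelihood - 0.5 * n_params * math.log(n_samples)
```

A greedy search would evaluate each single-edge addition, deletion or reversal, keep the move with the best score, and stop when no move improves it.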
Inference with missing data
Knowledge of the conditional probability distributions between variables allows us to make predictions about the expected states of variables even if some variables are missing from the test data. For example, if structural information about a test missense mutation is not available, we can still infer whether the mutation has a functional effect on the protein or not by marginalising over the unknown variables. This is illustrated in a very simple Bayesian network with three nodes, A, B, C, which can take the values {
Each of the probabilities can be expressed as a conditional probability table in this discrete case. If we wish to infer the value of
Example 3-node Bayesian network.
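Assuming, for illustration, a chain A → B → C over binary values, marginalisation over unobserved nodes can be carried out directly from the conditional probability tables (the CPT numbers below are invented):

```python
# Exact inference by marginalisation in a toy chain A -> B -> C.
p_a = {True: 0.3, False: 0.7}                   # P(A)
p_b_given_a = {True: {True: 0.8, False: 0.2},   # P(B | A)
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.9, False: 0.1},   # P(C | B)
               False: {True: 0.25, False: 0.75}}

def marginal_c(c):
    """P(C=c) with A and B unobserved: sum the joint over their states."""
    return sum(p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
               for a in (True, False) for b in (True, False))

def c_given_a(c, a):
    """P(C=c | A=a), marginalising only over the unobserved B."""
    return sum(p_b_given_a[a][b] * p_c_given_b[b][c] for b in (True, False))
```

The same pattern, summing out whichever variables are unobserved, is what allows predictions when structural or evolutionary nodes are missing at test time.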
Results and discussion
The systematic unbiased mutagenesis dataset of lac repressor
A total of fourteen variables were used to predict whether or not a missense mutation affects protein function (Table 1).
Attributes used for predicting functional effects of missense mutations
| Type | Description | Information |
|---|---|---|
| Discrete | Effect of mutation on functionality | Class |
| Continuous | Solvent accessible area of native AA | Structural |
| Continuous | Accessibility relative to maximum accessibility in training set | Structural |
| Continuous | Normalised B-factor of native AA | Structural |
| Continuous | Normalised B-factor of structural neighbourhood of native AA | Structural |
| Discrete | Mutant AA is charged AA at buried site | Structural |
| Discrete | Mutant AA occurs at glycine or proline in a turn | Structural |
| Discrete | Mutant AA occurs in helical region and involves glycine or proline | Structural |
| Discrete | Native AA is near subunit interface | Structural |
| Continuous | Phylogenetic entropy of structural neighbourhood of native AA | Structural + Evolutionary |
| Continuous | Normalised phylogenetic entropy of native AA | Evolutionary |
| Discrete | Native AA is at conserved position in phylogenetic profile | Evolutionary |
| Discrete | Native AA is near conserved position in phylogenetic profile | Evolutionary |
| Discrete | Mutant AA is not in phylogenetic profile | Evolutionary |
| Discrete | Mutant AA is not in the smallest AA class that includes the phylogenetic profile | Evolutionary |
We used two basic types of Bayesian network structure in this study: naïve and learned. In the naïve structure, the class node is the parent of every other node, so the attributes are assumed conditionally independent given the class. Seven combinations of training and test conditions were examined:

• all:all – trained on all variables, tested with all variables observed
• all:noS – trained on all variables, tested without any structural information (only evolutionary variables observed)
• noS:noS – trained and tested using only the five evolutionary nodes
• all:noE – trained on all variables, tested without any evolutionary information (only structural variables observed)
• noE:noE – trained and tested using only the eight structural nodes
• all:key – trained on all variables, tested with only the three key variables observed (see later section)
• key:key – trained and tested using only the three key variables
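In a naïve structure, marginalising out an unobserved attribute simply drops its likelihood term, since its values sum to one given the class. A minimal sketch with invented probabilities and hypothetical feature names:

```python
import math

prior = {"effect": 0.25, "no_effect": 0.75}
# P(feature value | class); the features stand in for discretised
# structural/evolutionary descriptors and are illustrative only.
likelihood = {
    "buried_charge": {"effect": {True: 0.6, False: 0.4},
                      "no_effect": {True: 0.1, False: 0.9}},
    "conserved":     {"effect": {True: 0.7, False: 0.3},
                      "no_effect": {True: 0.3, False: 0.7}},
}

def posterior(observation):
    """Posterior over the class; feature values of None are treated as
    unobserved and marginalised out (their terms are dropped)."""
    log_p = {c: math.log(prior[c]) for c in prior}
    for feat, value in observation.items():
        if value is None:          # missing attribute: marginalised out
            continue
        for c in log_p:
            log_p[c] += math.log(likelihood[feat][c][value])
    m = max(log_p.values())
    w = {c: math.exp(v - m) for c, v in log_p.items()}
    z = sum(w.values())
    return {c: v / z for c, v in w.items()}
```

For example, `posterior({"buried_charge": True, "conserved": None})` uses only the observed structural feature, mirroring the all:noE style of test.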
Results of these experiments are presented in Tables 2 and 3.
Results with a naïve Bayes classifier.
| Dataset | Metric | all:all | all:noS | noS:noS | all:noE | noE:noE | all:key | key:key |
|---|---|---|---|---|---|---|---|---|
| Cross-validation: mixed | AUC | 0.83 ± 0.01 | 0.70 ± 0.02 | 0.70 ± 0.02 | 0.81 ± 0.02 | 0.81 ± 0.02 | 0.80 ± 0.02 | 0.79 ± 0.01 |
| | MCC | 0.44 ± 0.04 | 0.27 ± 0.03 | 0.27 ± 0.03 | 0.43 ± 0.03 | 0.43 ± 0.03 | 0.41 ± 0.02 | 0.35 ± 0.06 |
| | Overall error rate | 0.19 ± 0.01 | 0.24 ± 0.01 | 0.24 ± 0.01 | 0.18 ± 0.01 | 0.18 ± 0.01 | 0.18 ± 0.01 | 0.21 ± 0.00 |
| | Effect error rate | 0.35 ± 0.05 | 0.52 ± 0.03 | 0.52 ± 0.03 | 0.26 ± 0.07 | 0.26 ± 0.07 | 0.24 ± 0.07 | 0.41 ± 0.04 |
| | No effect error rate | 0.15 ± 0.02 | 0.18 ± 0.01 | 0.18 ± 0.01 | 0.17 ± 0.01 | 0.17 ± 0.01 | 0.17 ± 0.02 | 0.17 ± 0.03 |
| | Sensitivity | 0.47 ± 0.12 | 0.37 ± 0.06 | 0.37 ± 0.06 | 0.37 ± 0.06 | 0.37 ± 0.06 | 0.36 ± 0.09 | 0.38 ± 0.16 |
| | Specificity | 0.92 ± 0.03 | 0.88 ± 0.02 | 0.88 ± 0.02 | 0.96 ± 0.02 | 0.96 ± 0.02 | 0.96 ± 0.03 | 0.92 ± 0.05 |
| Cross-validation: lac rep | AUC | 0.84 ± 0.02 | 0.74 ± 0.02 | 0.74 ± 0.02 | 0.82 ± 0.02 | 0.82 ± 0.02 | 0.80 ± 0.02 | 0.80 ± 0.02 |
| | MCC | 0.47 ± 0.03 | 0.33 ± 0.06 | 0.33 ± 0.06 | 0.46 ± 0.04 | 0.46 ± 0.04 | 0.44 ± 0.03 | 0.39 ± 0.05 |
| | Overall error rate | 0.18 ± 0.01 | 0.23 ± 0.01 | 0.23 ± 0.01 | 0.18 ± 0.01 | 0.18 ± 0.01 | 0.19 ± 0.01 | 0.21 ± 0.00 |
| | Effect error rate | 0.27 ± 0.05 | 0.40 ± 0.04 | 0.40 ± 0.04 | 0.20 ± 0.06 | 0.20 ± 0.06 | 0.18 ± 0.09 | 0.36 ± 0.05 |
| | No effect error rate | 0.16 ± 0.02 | 0.19 ± 0.02 | 0.19 ± 0.02 | 0.18 ± 0.02 | 0.18 ± 0.02 | 0.19 ± 0.03 | 0.18 ± 0.03 |
| | Sensitivity | 0.47 ± 0.10 | 0.36 ± 0.12 | 0.36 ± 0.12 | 0.37 ± 0.08 | 0.38 ± 0.08 | 0.34 ± 0.12 | 0.41 ± 0.13 |
| | Specificity | 0.93 ± 0.03 | 0.92 ± 0.04 | 0.92 ± 0.04 | 0.96 ± 0.02 | 0.96 ± 0.02 | 0.97 ± 0.04 | 0.92 ± 0.04 |
| Cross-validation: lysozyme | AUC | 0.83 ± 0.02 | 0.68 ± 0.04 | 0.68 ± 0.05 | 0.81 ± 0.04 | 0.81 ± 0.04 | 0.78 ± 0.04 | 0.77 ± 0.04 |
| | MCC | 0.40 ± 0.05 | 0.23 ± 0.06 | 0.23 ± 0.06 | 0.38 ± 0.08 | 0.38 ± 0.08 | 0.36 ± 0.11 | 0.28 ± 0.09 |
| | Overall error rate | 0.17 ± 0.02 | 0.24 ± 0.01 | 0.24 ± 0.02 | 0.17 ± 0.03 | 0.17 ± 0.03 | 0.16 ± 0.02 | 0.21 ± 0.03 |
| | Effect error rate | 0.40 ± 0.05 | 0.63 ± 0.05 | 0.63 ± 0.05 | 0.39 ± 0.12 | 0.39 ± 0.12 | 0.33 ± 0.13 | 0.54 ± 0.09 |
| | No effect error rate | 0.13 ± 0.02 | 0.15 ± 0.01 | 0.15 ± 0.01 | 0.13 ± 0.03 | 0.13 ± 0.03 | 0.15 ± 0.02 | 0.14 ± 0.02 |
| | Sensitivity | 0.43 ± 0.11 | 0.39 ± 0.07 | 0.39 ± 0.07 | 0.38 ± 0.17 | 0.38 ± 0.17 | 0.28 ± 0.09 | 0.36 ± 0.11 |
| | Specificity | 0.93 ± 0.03 | 0.84 ± 0.02 | 0.84 ± 0.02 | 0.93 ± 0.07 | 0.93 ± 0.07 | 0.97 ± 0.01 | 0.89 ± 0.04 |
| Train: lac rep, test: lysozyme | AUC | 0.80 | 0.66 | 0.67 | 0.78 | 0.78 | 0.77 | 0.77 |
| | MCC | 0.40 | 0.23 | 0.23 | 0.35 | 0.35 | 0.35 | 0.35 |
| | Overall error rate | 0.20 | 0.27 | 0.24 | 0.17 | 0.17 | 0.16 | 0.16 |
| | Effect error rate | 0.52 | 0.65 | 0.63 | 0.41 | 0.41 | 0.32 | 0.32 |
| | No effect error rate | 0.10 | 0.14 | 0.15 | 0.14 | 0.14 | 0.15 | 0.16 |
| | Sensitivity | 0.58 | 0.46 | 0.39 | 0.33 | 0.33 | 0.26 | 0.26 |
| | Specificity | 0.85 | 0.80 | 0.84 | 0.95 | 0.95 | 0.97 | 0.97 |
| Train: lysozyme, test: lac rep | AUC | 0.81 | 0.71 | 0.71 | 0.80 | 0.80 | 0.79 | 0.79 |
| | MCC | 0.43 | 0.37 | 0.37 | 0.41 | 0.41 | 0.42 | 0.42 |
| | Overall error rate | 0.20 | 0.22 | 0.22 | 0.20 | 0.20 | 0.19 | 0.19 |
| | Effect error rate | 0.34 | 0.43 | 0.43 | 0.25 | 0.25 | 0.18 | 0.18 |
| | No effect error rate | 0.17 | 0.17 | 0.17 | 0.19 | 0.19 | 0.20 | 0.20 |
| | Sensitivity | 0.45 | 0.46 | 0.46 | 0.33 | 0.33 | 0.30 | 0.30 |
| | Specificity | 0.92 | 0.88 | 0.88 | 0.96 | 0.96 | 0.98 | 0.98 |
Column: (1) trained on all variables, tested with all variables observed; (2) trained on all variables, tested without any structural information (NoS) – only evolutionary variables observed; (3) trained and tested using only five evolutionary nodes; (4) trained on all variables, tested without any evolutionary information (NoE) – only structural variables observed; (5) trained and tested using only eight structural nodes; (6) trained on all variables, tested with only key variables observed (see later section); (7) trained and tested using only the three key variables.
Results with a learned Bayesian network.
| Dataset | Metric | all:all | all:noS | noS:noS | all:noE | noE:noE | all:key | key:key |
|---|---|---|---|---|---|---|---|---|
| Cross-validation: mixed | AUC | 0.84 ± 0.01 | 0.64 ± 0.01 | 0.70 ± 0.02 | 0.72 ± 0.02 | 0.82 ± 0.02 | 0.63 ± 0.03 | 0.80 ± 0.02 |
| | MCC | 0.46 ± 0.03 | 0.11 ± 0.03 | 0.10 ± 0.16 | 0.26 ± 0.22 | 0.44 ± 0.03 | 0.40 ± 0.04 | 0.40 ± 0.04 |
| | Overall error rate | 0.17 ± 0.01 | 0.67 ± 0.01 | 0.23 ± 0.00 | 0.36 ± 0.28 | 0.18 ± 0.01 | 0.18 ± 0.01 | 0.18 ± 0.01 |
| | Effect error rate | 0.27 ± 0.05 | 0.75 ± 0.01 | 0.15 ± 0.24 | 0.40 ± 0.25 | 0.29 ± 0.07 | 0.24 ± 0.06 | 0.25 ± 0.05 |
| | No effect error rate | 0.16 ± 0.01 | 0.11 ± 0.03 | 0.21 ± 0.03 | 0.29 ± 0.18 | 0.16 ± 0.02 | 0.18 ± 0.01 | 0.18 ± 0.01 |
| | Sensitivity | 0.41 ± 0.07 | 0.93 ± 0.01 | 0.13 ± 0.21 | 0.51 ± 0.33 | 0.41 ± 0.08 | 0.31 ± 0.04 | 0.31 ± 0.09 |
| | Specificity | 0.95 ± 0.02 | 0.15 ± 0.02 | 0.96 ± 0.07 | 0.68 ± 0.47 | 0.95 ± 0.03 | 0.97 ± 0.01 | 0.97 ± 0.01 |
| Cross-validation: lac rep | AUC | 0.85 ± 0.01 | 0.47 ± 0.03 | 0.73 ± 0.02 | 0.70 ± 0.02 | 0.82 ± 0.02 | 0.61 ± 0.02 | 0.81 ± 0.02 |
| | MCC | 0.52 ± 0.02 | 0.11 ± 0.03 | 0.32 ± 0.04 | 0.43 ± 0.04 | 0.46 ± 0.05 | 0.42 ± 0.04 | 0.42 ± 0.03 |
| | Overall error rate | 0.17 ± 0.01 | 0.60 ± 0.01 | 0.24 ± 0.01 | 0.19 ± 0.01 | 0.18 ± 0.01 | 0.19 ± 0.01 | 0.19 ± 0.01 |
| | Effect error rate | 0.25 ± 0.03 | 0.72 ± 0.01 | 0.46 ± 0.03 | 0.20 ± 0.06 | 0.21 ± 0.05 | 0.17 ± 0.07 | 0.22 ± 0.06 |
| | No effect error rate | 0.15 ± 0.01 | 0.16 ± 0.02 | 0.19 ± 0.01 | 0.19 ± 0.01 | 0.18 ± 0.01 | 0.20 ± 0.01 | 0.19 ± 0.01 |
| | Sensitivity | 0.51 ± 0.03 | 0.86 ± 0.02 | 0.40 ± 0.03 | 0.33 ± 0.03 | 0.38 ± 0.06 | 0.30 ± 0.02 | 0.33 ± 0.02 |
| | Specificity | 0.94 ± 0.01 | 0.24 ± 0.02 | 0.88 ± 0.01 | 0.97 ± 0.01 | 0.96 ± 0.01 | 0.98 ± 0.01 | 0.97 ± 0.01 |
| Cross-validation: lysozyme | AUC | 0.86 ± 0.02 | 0.51 ± 0.06 | 0.67 ± 0.05 | 0.78 ± 0.04 | 0.83 ± 0.05 | 0.70 ± 0.04 | 0.78 ± 0.05 |
| | MCC | 0.47 ± 0.06 | 0.09 ± 0.05 | – | 0.37 ± 0.10 | 0.40 ± 0.10 | 0.37 ± 0.12 | 0.34 ± 0.12 |
| | Overall error rate | 0.17 ± 0.03 | 0.75 ± 0.02 | 0.19 ± 0.00 | 0.16 ± 0.02 | 0.16 ± 0.02 | 0.16 ± 0.02 | 0.16 ± 0.02 |
| | Effect error rate | 0.38 ± 0.14 | 0.80 ± 0.01 | – | 0.30 ± 0.13 | 0.34 ± 0.11 | 0.32 ± 0.13 | 0.33 ± 0.14 |
| | No effect error rate | 0.10 ± 0.03 | 0.05 ± 0.08 | 0.19 ± 0.00 | 0.15 ± 0.02 | 0.14 ± 0.02 | 0.15 ± 0.02 | 0.15 ± 0.02 |
| | Sensitivity | 0.55 ± 0.19 | 0.98 ± 0.02 | 0.00 ± 0.00 | 0.29 ± 0.10 | 0.36 ± 0.09 | 0.30 ± 0.09 | 0.26 ± 0.09 |
| | Specificity | 0.90 ± 0.07 | 0.07 ± 0.02 | 1.00 ± 1.00 | 0.97 ± 0.02 | 0.95 ± 0.02 | 0.97 ± 0.01 | 0.97 ± 0.01 |
| Train: lac rep, test: lysozyme | AUC | 0.72 | 0.43 | 0.68 | 0.70 | 0.77 | 0.57 | 0.75 |
| | MCC | 0.30 | – | 0.23 | 0.21 | 0.36 | 0.34 | 0.35 |
| | Overall error rate | 0.17 | 0.19 | 0.27 | 0.21 | 0.17 | 0.17 | 0.17 |
| | Effect error rate | 0.33 | – | 0.65 | 0.57 | 0.41 | 0.35 | 0.35 |
| | No effect error rate | 0.16 | 0.19 | 0.14 | 0.16 | 0.14 | 0.15 | 0.15 |
| | Sensitivity | 0.20 | 0.00 | 0.46 | 0.25 | 0.35 | 0.26 | 0.26 |
| | Specificity | 0.98 | 1.00 | 0.80 | 0.92 | 0.94 | 0.97 | 0.97 |
| Train: lysozyme, test: lac rep | AUC | 0.79 | 0.44 | 0.65 | 0.58 | 0.78 | 0.66 | 0.78 |
| | MCC | 0.41 | 0.11 | 0.32 | 0.06 | 0.42 | 0.40 | 0.41 |
| | Overall error rate | 0.20 | 0.39 | 0.24 | 0.25 | 0.20 | 0.20 | 0.20 |
| | Effect error rate | 0.22 | 0.84 | 0.46 | 0.30 | 0.26 | 0.23 | 0.23 |
| | No effect error rate | 0.19 | 0.28 | 0.19 | 0.25 | 0.19 | 0.20 | 0.19 |
| | Sensitivity | 0.32 | 0.13 | 0.40 | 0.01 | 0.35 | 0.30 | 0.33 |
| | Specificity | 0.97 | 0.78 | 0.88 | 1.00 | 0.96 | 0.97 | 0.97 |
See Table 2 for column details. Note that the MCC and the effect error rate cannot be shown if all mutations are predicted as 'no effect'.
Naïve Bayes classifier
all:all
As expected, overall error rates of less than 20% were achieved in all cross-validation tests with the
Missing structural information (all:noS and noS:noS)
Performance dropped significantly with a six-node network utilising only evolutionary information (
Missing evolutionary information (all:noE and noE:noE)
In contrast to results achieved without structural information, there was little or no effect on performance when evolutionary information was either missing during testing (
Overall, results suggest that structural information is more important than evolutionary information in predicting the functional consequences of a missense mutation in both lac repressor and T4 lysozyme, for the dataset used. Indeed, although evolutionary information has some predictive power, utilising only structural information appears to be sufficient for accurate prediction, comparable to that of the
A note on structural flexibility
It has previously been suggested that the B-factor and neighbourhood B-factor of the native amino acid are the most important predictors of functional effects of SNPs
Learned structure
Using both the Bayesian and BIC scoring functions employed by the greedy search algorithm, we learned structures from the lac repressor and lysozyme datasets separately, and from the two datasets combined ('mixed'). As with the naïve Bayes classifier, we evaluated each structure using both homogeneous ten-fold and heterogeneous cross-validation. There was little significant difference in performance between the two scoring functions, or between structures learned on different datasets. The main difference was in the number of edges in the resulting DAGs: for our mixed dataset, there were 35 edges with BIC, and 48 with full Bayesian scoring. Using
Learned Bayesian network structure
all:all
Little significant improvement in homogeneous cross validation performance was gained from using structure
Structure
ROC curve for learned structure
Missing structural information (all:noS and noS:noS)
The model learned from all the variables and tested using only evolutionary information (
There could be a number of reasons for the poor performance of the
Missing evolutionary information (all:noE and noE:noE)
When marginalising over unknown evolutionary variables (
Tolerance to incomplete training data
Bayesian networks are capable of learning model parameters from incomplete data. Here we test the tolerance of the Bayesian networks by training on incomplete data. In every training example, we hide
Classifier performance
Classifier performance. Performance of naïve Bayes classifier and structure
Figure
Training set size
In order to assess how much data is needed for training the Bayesian networks, sequential learning of the model parameters was performed. The 'mixed' dataset was divided into two. One half was used as the test validation set, and the Bayesian networks were trained on the other half. Figure
Training set size
Training set size. Performance of naïve Bayes classifier and structure
Interpreting the structures
The learned structure
Posterior distribution of relationships
Posterior distribution of relationships. Strength of relationships between variables, identified through analysis of edges connecting pairs of nodes in MCMC structure learning. A dark square indicates a strong relationship; a white square a weak relationship.
The use of MCMC methods to study the posterior distribution over networks has the advantage of revealing relationships between the input variables. For instance, in Figure
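The 'strength' of each relationship in such an analysis can be estimated as the fraction of MCMC-sampled structures that contain the corresponding edge. A toy sketch with hypothetical node names and made-up samples:

```python
from collections import Counter

# Three hypothetical MCMC samples, each a set of directed edges; a real
# run would use thousands of sampled DAGs and the study's actual nodes.
samples = [
    {("SolventAcc", "Effect"), ("BuriedCharge", "Effect")},
    {("SolventAcc", "Effect"), ("NeighBfactor", "Effect")},
    {("SolventAcc", "Effect"), ("BuriedCharge", "Effect"),
     ("NeighBfactor", "Effect")},
]

counts = Counter(edge for dag in samples for edge in dag)
# Posterior edge probability = fraction of sampled structures containing it
posterior_edges = {edge: n / len(samples) for edge, n in counts.items()}
```

Plotting these fractions for every node pair yields exactly the kind of heat-map described in the figure, with dark squares for edges that appear in most samples.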
However, biologically meaningful relationships between the other variables are also revealed. With the exception of the trivial relationship between
From Figure
Our finding that solvent accessible area of the native amino acid, whether the amino acid is charged at a buried site, and the flexibility of its structural neighbourhood are all important predictors of effect agrees to some extent with Chasman and Adams (2001), who found that structure-based accessibility and B-factor features have the most discriminatory power. The strong performance of accessibility measures probably reflects the finding of
A simplified Bayesian network
Whilst the nodes directly connected to the
We tested this hypothesis by constructing two simple four node networks: a naïve structure (Figure
Simplified Bayesian networks
Simplified Bayesian networks. Four-node networks using the three key variables shown to have the strongest relationship with the effect. (a) Naïve Bayes classifier, (b) learned Bayesian network structure
Across all crossvalidation tests, the four node naïve Bayes classifier trained and tested using only the three key variables (
Conclusion
We have applied Bayesian networks to the task of predicting whether or not missense mutations will affect protein function with comparable performance to other machine learning methods. However, the strength of the Bayesian network lies in its ability to handle incomplete data and to encode relationships between variables; both of which were exploited here to derive some biological insight into how a missense mutation affects protein function.
A number of models were learned in this work. Due to the unbalanced datasets we analysed ROC curves and selected a suitable cost ratio in order to choose a probability threshold for the classifiers. This allowed us to compare classifiers in a meaningful way. From this analysis we concluded that a naïve network structure is sufficient for accurate prediction of the effect of a missense mutation with AUC values around 0.80. We also found that the structural environment of the amino acid is a far better predictor of the functional consequences of a missense mutation than phylogenetic information. This was demonstrated by the more accurate performance of a naïve classifier that just uses structural information compared to that which uses just evolutionary information. There were no significant performance gains when using a learned network structure, however the learned structure did allow relationships between variables to be analysed, in particular by analysing the posterior distribution of model structures, we found the top three strongest connections with the effect node all involved structural nodes. With this in mind, we derived a simplified Bayesian network that used just these three structural descriptors (solvent accessible area of the native amino acid, whether the amino acid is charged at a buried site, and the flexibility of its structural neighbourhood) without significant decrease in performance. Given the importance of structure, it would be interesting to learn if certain amino acid changes are more predictive of effect than others. For example, both cysteine, which forms disulphide bridges, and proline, with its unique ring structure, are often critical to the integrity of a protein structure so one would expect a mutation involving either of these residues to change the structure significantly. This will provide the basis for future work.
Methods
Evaluation measures
A number of measures were applied to evaluate each classifier: error rates (fraction of misclassified examples), sensitivity (true positive rate) and specificity (true negative rate). We also used Matthews' correlation coefficient (MCC), which is a correlation measure designed for comparison of unbalanced datasets such as ours. A value of +1 indicates perfect classification, and −1 indicates misclassification of every example. The MCC is defined as:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
where TP are true positives, TN are true negatives, FP are false positives, and FN are false negatives obtained from evaluating the classifier on the test data.
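A direct implementation of the MCC from these counts (returning 0 in the degenerate case where a marginal is zero, a common convention not stated in the text):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # A marginal of zero (e.g. no positives predicted) leaves the
        # coefficient undefined; 0.0 is a common fallback.
        return 0.0
    return (tp * tn - fp * fn) / denom
```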
Since we have a Bayesian network classifier, with a probability associated with each classification, the metrics above depend on the value of the classification threshold
Choosing a classification threshold
In order to perform a fair comparison of classifiers, we choose the classification threshold
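One way such a threshold can be chosen, given an assumed cost ratio for false negatives versus false positives (the 2:1 ratio below is illustrative, not the paper's value):

```python
def choose_threshold(scores, labels, cost_fn=2.0, cost_fp=1.0):
    """Pick the probability threshold minimising expected misclassification
    cost. `scores` are predicted P(effect); `labels` are 1 for 'effect'.
    The default 2:1 cost ratio is an assumption for illustration."""
    # Candidate thresholds: every distinct score, plus one above the
    # maximum so that 'predict everything negative' is also considered.
    candidates = sorted(set(scores)) + [1.1]
    best_t, best_cost = 0.5, float("inf")
    for t in candidates:
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

Sweeping the threshold like this is equivalent to walking along the ROC curve and selecting the operating point with minimum expected cost.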
Data discretization
There were a number of challenges buried in these data. Continuous data was non-Gaussian, making it unsuitable for modelling as a continuous Gaussian node in a Bayesian network. There were also no obvious boundaries at which to separate the data into discrete categories. Our solution was to fit a number of Gaussians to the data using an Expectation-Maximisation based algorithm that automatically chooses the number of classes. It begins with one Gaussian, and iteratively splits the Gaussian with the largest weight, until adding extra classes does not increase the maximum likelihood of the model. Full details are provided below. This allowed us to form discrete classes from continuous data, which gave better performance than simply splitting the data into three classes of equal range (results not shown). We therefore used this strategy in all our analyses.
The EM algorithm
The Expectation-Maximisation algorithm is a well-established, efficient algorithm for fitting Gaussian mixture models to data. The main drawback of the algorithm is its sensitivity to initialisation, and the need for multiple runs with different numbers of mixtures in order to choose the maximum likelihood model. Here we present an adaptation of the method which is deterministic and automatically chooses the number of Gaussians. It begins with one Gaussian, and iteratively splits the Gaussian with the largest weight, until adding extra mixtures does not increase the maximum likelihood of the model. Given a data set X = {x_1, ..., x_N} of N points,
The probability
For a set of
E-step: compute the responsibilities r_ij = w_j N(x_i; μ_j, σ_j²) / Σ_k w_k N(x_i; μ_k, σ_k²).
M-step: re-estimate the parameters from the responsibilities: w_j = (1/N) Σ_i r_ij, μ_j = Σ_i r_ij x_i / Σ_i r_ij, σ_j² = Σ_i r_ij (x_i − μ_j)² / Σ_i r_ij.
When the ML stops increasing, the Gaussian with the largest weight is split into two, EM is rerun, and the extra class is kept only if it increases the maximum likelihood of the model.
Classification of each data point x_i is taken as a hard classification into the most likely class, given by argmax_j r_ij.
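The splitting procedure described above can be sketched as follows; `max_k` and `min_gain` are our own stopping knobs, not parameters taken from the text:

```python
import math
import statistics

def normal_pdf(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def em_fit(xs, means, sigmas, weights, iters=50):
    """Standard EM for a 1-D Gaussian mixture with a fixed component count."""
    n, k = len(xs), len(means)
    for _ in range(iters):
        # E-step: responsibilities r[i][j] = P(component j | x_i)
        r = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for m, s, w in zip(means, sigmas, weights)]
            z = sum(p) or 1e-300
            r.append([pi / z for pi in p])
        # M-step: re-estimate each component from its expected assignments
        for j in range(k):
            nj = sum(r[i][j] for i in range(n)) or 1e-12
            means[j] = sum(r[i][j] * xs[i] for i in range(n)) / nj
            var = sum(r[i][j] * (xs[i] - means[j]) ** 2 for i in range(n)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))
            weights[j] = nj / n

def log_likelihood(xs, means, sigmas, weights):
    return sum(math.log(max(sum(w * normal_pdf(x, m, s)
                                for m, s, w in zip(means, sigmas, weights)), 1e-300))
               for x in xs)

def fit_with_splitting(xs, max_k=5, min_gain=1e-3):
    """Deterministic variant: start with one Gaussian and repeatedly split
    the largest-weight component, keeping the extra class only while the
    log-likelihood improves."""
    means = [statistics.fmean(xs)]
    sigmas = [max(statistics.pstdev(xs), 1e-3)]
    weights = [1.0]
    em_fit(xs, means, sigmas, weights)
    best_ll = log_likelihood(xs, means, sigmas, weights)
    while len(means) < max_k:
        j = weights.index(max(weights))          # largest-weight component
        means2 = means + [means[j] + sigmas[j]]  # two copies, pushed apart
        sigmas2 = sigmas + [sigmas[j]]
        weights2 = weights + [weights[j] / 2.0]
        means2[j] = means[j] - sigmas[j]
        weights2[j] = weights[j] / 2.0
        em_fit(xs, means2, sigmas2, weights2)
        ll = log_likelihood(xs, means2, sigmas2, weights2)
        if ll - best_ll < min_gain:
            break                                # extra class did not help
        means, sigmas, weights, best_ll = means2, sigmas2, weights2, ll
    return means, sigmas, weights

def discretise(xs, means, sigmas, weights):
    """Hard-assign each point to its most likely component (its discrete class)."""
    return [max(range(len(means)),
                key=lambda j: weights[j] * normal_pdf(x, means[j], sigmas[j]))
            for x in xs]
```

On clearly bimodal data the first split separates the two modes, and the resulting component indices serve directly as the discrete classes used for the continuous nodes.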
Authors' contributions
CJN carried out the study, CAM assisted with data sets and result interpretation, JRB advised on biological aspects and result interpretation, and AJB and DRW suggested the study and assisted with result interpretation. All authors approved the final manuscript.
Acknowledgements
We would like to thank the BBSRC who funded this research under grant BBSB16585.