The identification of functionally or structurally important non-conserved residue sites in protein MSAs is an important challenge for understanding the structural basis and molecular mechanism of protein functions. Despite the rich literature on compensatory mutations as well as sequence conservation analysis for the detection of those important residues, previous methods often rely on classical information-theoretic measures. However, these measures usually do not take into account dis/similarities of amino acids which are likely to be crucial for those residues. In this study, we present a new method, the Quantum Coupled Mutation Finder (QCMF) that incorporates significant dis/similar amino acid pair signals in the prediction of functionally or structurally important sites.
The result of this study is twofold. First, using the essential sites of two human proteins, namely epidermal growth factor receptor (EGFR) and glucokinase (GCK), we tested the QCMF-method. The QCMF includes two metrics based on quantum Jensen-Shannon divergence to measure both sequence conservation and compensatory mutations. We found that the QCMF reaches an improved performance in identifying essential sites from MSAs of both proteins with a significantly higher Matthews correlation coefficient (MCC) value in comparison to previous methods. Second, using a data set of 153 proteins, we made a pairwise comparison between QCMF and three conventional methods. This comparison study strongly suggests that QCMF complements the conventional methods for the identification of correlated mutations in MSAs.
QCMF utilizes the notion of entanglement, which is a major resource of quantum information, to model significant dissimilar and similar amino acid pair signals in the detection of functionally or structurally important sites. Our results suggest that on the one hand QCMF significantly outperforms the previous method, which mainly focuses on dissimilar amino acid signals, to detect essential sites in proteins. On the other hand, it is complementary to the existing methods for the identification of correlated mutations. The method of QCMF is computationally intensive. To ensure a feasible computation time of the QCMF’s algorithm, we leveraged Compute Unified Device Architecture (CUDA).
Multiple sequence alignments (MSAs) of homologous protein sequences give us information about two major features of the proteins of interest. The first one consists of easily detectable highly conserved residue sites that are obviously important for the structure and/or the function of the protein; while the second one corresponds to compensatory (coupled) mutations between two or more residue sites that also contain crucial information on the structural and functional basis of proteins . These compensatory mutations occur according to the functional coupling of mutation positions which might be explained as one mutation in a certain site affecting a compensating mutation at another site, even if both related residue sites are distantly positioned in the protein structure [2-5]. In particular, such mutations at essential residue sites are likely to destroy protein structure which often results in loss of the protein function [6,7]. Thus, recognition of these residue sites is as important as the strictly conserved positions for the understanding of the structural basis of protein functions and for the identification of functionally important residue positions [5,8,9].
Although the strictly conserved residue sites are easily detectable and interpretable in MSAs, the detection of important non-conserved compensatory mutation sites needs more complex approaches. Today, due to the simplicity and efficiency, the mutual-information-based metrics (MI-metrics) are often used to measure the co-evolutionary relationship between residue sites in MSAs [4-6,10-13]. However, the MI-metrics strongly depend on the amino acid distributions observed in the MSA columns rather than on physical or biochemical constraints of amino acids that are likely to be crucial for the detection of functionally or structurally important compensatory mutations in a protein sequence. Further, according to the phylogenetic relationship of protein sequences and background noise, there is always a MI-value between each column pair in an MSA. Therefore, the challenging problems in bioinformatics for the detection of significant compensatory mutation signals are: i) the minimization of the influence of phylogenetic relationships of protein sequences by incorporating physical or biochemical properties of amino acids in the calculation; ii) the separation of significant signals from the background noise or unrelated pair signals.
In order to eliminate the influence of phylogeny and noise effects of MI, Dunn et al.  have introduced the average product correction (APC). Subtracting APC from MI, they obtained their MIp metric. However, in their model the reduction of background noise is not quantified. On the other hand, Gao et al.  have integrated amino acid background distribution (MIB) in the calculation of their MI-metric and focused on only 25 column pairs of each MSA with the highest normalized MI values as significant to reduce noisy effect which seems to be over-conservative, yet specific.
Large efforts have been made in the last few years to improve local-correlation-measure-based approaches to residue co-evolution when it comes to modeling effects that rely on spatial proximity (see  for an overview). In this case, it is necessary to disentangle direct and indirect correlations. Classical mutual information, for example, is high not only if the two sites under study are close in 3D space. Quite the contrary, any local measure of correlation, not just mutual information, is limited by the transitivity effect.
To overcome this problem, global statistical models of protein families are employed. The direct-coupling analysis (DCA) works as follows. Maximizing the entropy subject to preserving the single and pair residue frequencies observed, a joint probability distribution on all possible members of the protein family is derived. Utilizing this distribution, considerable progress in predicting residue-residue contacts in 3-dimensional protein structures was made [15-17]. Protein Sparse Inverse Covariance (PSICOV)  achieves disentanglement of direct and indirect correlations by inverting a residue-residue covariance matrix. In  further progress was made by integrating structural context and sequence co-evolution information.
There is merely a small number of methods that incorporate amino acid similarity in the prediction of functionally or structurally important sites. In this context, it is natural to partition the amino acids into chemically similar groups before applying an information-theoretic measure like the Shannon entropy [20,21]. It was reported that many other methods fail to outperform this simple partition approach . However, quantum information theory supplies a well-studied and powerful framework to integrate such similarity, where the classical Shannon entropy is swapped for the von Neumann entropy (VNE). Caffrey et al.  and Johansson et al.  have firstly introduced VNE to multiple sequence alignment analysis although they did not treat amino acid pair similarity.
Recently, a new method called Coupled Mutation Finder (CMF) has been introduced by Gültas et al.  to deal with phylogenetic noise as well as background signals and to quantify the error made in terms of the false discovery rate. The CMF method only focuses on BLOSUM62-dissimilar amino acid pairs as a model of compensatory mutations and integrated them in the calculation of normalized MI-metrics using a doubly stochastic matrix to transform the empirical pair distribution of the column pair. However, the CMF disregards amino acid pair similarity which can be also crucial for the detection of functionally or structurally important sites in MSAs.
In this study, we present a new method called Quantum Coupled Mutation Finder (QCMF) which extends the CMF algorithm  by additionally incorporating amino acid pair similarity. To this end, the QCMF invokes principles from quantum information theory, in particular for the first time in the context of MSA analysis quantum entanglement as a major resource of quantum information. Amino acid pair distributions are replaced by entangled density matrices from quantum mechanics which encompass in our case both empirical pair distributions, possibly transformed by the doubly stochastic matrix used in , and pair similarity. Following Capra and Singh  who pointed out that it is hard to improve upon metrics based on Jensen-Shannon divergences, we quantify the effect of both amino acid pair similarity and amino acid pair dissimilarity by the quantum Jensen-Shannon divergence between an entangled density matrix and the one that simply represents the amino acid pair frequencies.
The QCMF algorithm is strongly based on the matrix operations that are computationally intensive. When analyzing a single MSA, the computational time of these matrix operations rise very quickly due to the huge number of column pairs. In order to speed up the running time of the QCMF, we implemented its algorithm using Compute Unified Device Architecture (CUDA). CUDA is an efficient parallel computing architecture developed by NVIDIA that utilizes graphic processing units (GPUs) for general-purpose scientific and engineering applications . Nowadays, GPUs are often used for computationally challenging problems in bioinformatics [26-29] and several other scientific fields [30-32].
Our main focus in this study was to investigate whether quantum information theory based measures could contribute beyond conventional measures to the identification of important residue sites. The Results section of this work twofold. First, to test the functionality of QCMF-significant individual residue sites we analysed the essential sites of two human proteins: epidermal growth factor receptor (EGFR) (pdb entry 2J6M) and glucokinase (GCK) (pdb entry 1V4S). The functionally and structurally important sites of both proteins have been experimentally investigated in several studies previously [33-44] and their positions were summarized in  as essential sites. The essential sites of these proteins consist of several non-conserved residue sites which are directly located at or near disease associated amino acid mutation (non-synonymous single nucleotide polymorphisms (nsSNPs)) sites, catalytic sites, protein binding sites and so on, each of which are likely to affect protein stability or functionality (see  and references therein). In addition, residue sites are defined to be in contact according to the “nearby” definition of Nussinov et al.  if their carbon major atoms have a distance of less than or equal to 6 Å. Consequently, we defined an individual QCMF-significant residue site as “functionally or structurally important” if it corresponds to one of these essential sites.
Second, to further investigate the performance of QCMF and to make a comparison with the previous methods (CMF , MIp , and PSICOV ), we selected a non-redundant set of proteins prepared by Janda et al. . Although the dataset contains 216 proteins, we eliminated a few proteins due to inconsistency between corresponding MSAs and PDB files, so that we finally ended up with a dataset of 153 proteins (see Additional file 1).
The MSAs for each protein, which contain after filtering at least 125 independent sequences, were derived from the HSSP-database  that merges primary structure information and tertiary structure information of proteins.
Finally, we define QCMF-significant sites as follows. Let M be an MSA, with the protein of interest being the first row of M. A site pair as well as an individual site of the protein are said to be QCMF-significant with respect to the MSA M, if they are -significant or -significant. The latter two notions and the underlying two co-evolutionary column pair metrics and are defined in the Methods section. If the MSA M is fixed, we speak of -significance and -significance, rather than of -significance and -significance, respectively.
QCMF-significant residue sites in the Human Epidermal Growth Factor Receptor (EGFR) protein
Using the MSA-specific statistical model with a false discovery rate (FDR) of 1% for both QCMF-metrics, we first determined altogether 2688 out of 26079 non-conserved column pairs as significant in corresponding MSA of human EGFR protein. 631 of these significant pairs were detected by -metric, and 2149 pairs were detected by -metric. Only 92 significant column pairs were detected by both metrics. After that, utilizing the connectivity degree technique, we predicted in total 33 residue sites in corresponding sequence of human EGFR protein as QCMF-significant (see Additional file 2). 12 of them are only -significant and 18 residue sites are -significant, the remaining 3 residue sites (A839, A882 and V902) are both -significant and -significant.
10 of the QCMF-significant residue sites are in contact with either catalytic residues or critical active site regions for gefitinib binding site in wild type EGFR kinase [34,37,48] (see Figure 1 and Figure 2). Among these sites, the A839 and R841 have been verified as catalytic residue sites through the Catalytic Site Atlas . The T854 is a gefitinib binding site by itself and the residue sites V845 and A859 are also in contact with nsSNP positions K846, T847 and K860 in human EGFR protein. Moreover, two out of all 33 significant sites are related to disease associated nsSNP positions and their structural localization are illustrated in Figure 1.
Figure 1. QCMF-significant residue positions are in contact with catalytic residues in human EGFR protein (PDB-Entry 2J6M). Red spheres denote positions of the catalytic residues. Yellow spheres show the localization of significant adjacent residue positions found by QCMF which are in contact with these catalytic residues. Moreover, the QCMF-significant sites A839 and R841 are also catalytic residues by themselves. Green spheres show the structural localization of nsSNP positions found by QCMF as significant in the EGFR protein. The circles indicate clusters of catalytic residue sites and their significant adjacent sites.
Figure 2. QCMF-significant residue positions are in contact with gefitinib binding sites in human EGFR protein (PDB-Entry 2J6M). Red spheres show the structural localization of the gefitinib binding sites in the wild type kinase. Yellow spheres show QCMF-significant adjacent residue positions which are in contact with these binding sites. Moreover, the QCMF-significant site T854 is also a binding site by itself and interacts with gefitinib binding site D855. The circles indicate clusters of gefitinib binding sites and their significant adjacent sites.
Additionally, 13 out of all QCMF-significant sites are referred to as essential sites, each of them are either nearby strictly conserved residues or nsSNPs (see Table 1).
Table 1. QCMF-significant essential sites in the human EGFR protein, which are nearby either nsSNPs or strictly conserved sites
According to the essential sites of human EGFR protein, published in , we have shown altogether the structural or functional importance of 25 QCMF-significant sites. The remaining 8 significant residue sites (G729, T851, G779, Q820, M825, L927, G930, Y944) do not fall into essential sites and the reason for their significance and their importance in the EGFR protein is currently unclear.
QCMF-significant residue sites in the Human Glucokinase (GCK) protein
Like human EGFR protein, applying the MSA-specific statistical model with a FDR of 1% for both QCMF-metrics we identified a total of 9853 out of 69645 non-conserved column pairs as significant in the human GCK protein (pdb entry 1V4S). 6070 of them were -significant and 4232 were detected as -significant. Only 449 column pairs were detected as significant with respect to both metrics. Thereupon using the connectivity degree technique, we determined altogether 64 residue sites in the human GCK protein as QCMF-significant (see Additional file 3). 30 of them are determined as -significant and further 30 significant residue sites are determined as -significant. Only four residue sites (T82, G223, V253, and G407) are significant based on both metrics.
13 of QCMF-significant sites are in contact with allosteric sites V62, R63, M210, I211, Y214, Y215, M235, V452, V455 and A456 in the human GCK protein. Among these significant sites, the V62, M210, Y215 are allosteric sites by themselves  and the T209M, G223S and S453del are related to disease associated nsSNP positions. In addition, there are further five QCMF-significant sites (F123L, G162D, G175R, T228M, and E300K,Q) that have been verified as nsSNP positions through annotation databases and previous experimental studies [38-40,42,43,51]. The structural localization of these 18 QCMF-significant sites (contact sites and nsSNPs positions) are illustrated inFigure 3.
Figure 3. QCMF-significant positions that are either in contact with allosteric sites or related to nsSNPs in human GCK protein (PDB-Entry 1V4S). Yellow spheres correspond to structural localization of ten significant residue sites which are in contact with allosteric sites where V62, M210, and Y215 are denoted as allosteric sites by themselves and they are also in contact with an other allosteric sites. Green spheres indicate eight significant nsSNP positions in the GCK protein. Three of them (T209M, G223S and S453del) are further in contact with allosteric sites M210, I211, V452, V455 and A456.
Additionally, eight significant sites T149, G170, F171, T206, V207, A208, Q287 and G294 in contact with glucose binding sites (active sites) T168, K169, D204, D205 and E290 in human GCK protein  (see Figure 4) where V207 and A208 are also in contact with the allosteric sites M210 and I211.
Figure 4. QCMF-significant residue positions are in contact with glucose binding site in human GCK protein (PDB-Entry 1V4S). (A) Red spheres show the structural positions of the glucose binding sites (active sites) and yellow spheres show the localization of significant adjacent residue positions found by QCMF which are in contact with these active sites. The circles indicate clusters of glucose binding sites and their significant adjacent sites.
Moreover, we have also observed that 38 QCMF-significant sites are further included in essential sites since they are nearby nsSNPs or strictly conserved residues in human GCK protein (see Table 2).
Table 2. QCMF-significant essential sites in the human GCK protein, which are nearby either nsSNPs or strictly conserved sites
In total, we have demonstrated here that according to the essential sites of GCK, 62 out of 64 QCMF- significant sites are functionally or structurally important for human GCK protein. The remaining two significant residue sites V89 and N283 do not overlap with essential sites and the reason for their significance and their role in the GCK protein is still unclear.
Individual residue site comparison between QCMF-significant sites and previous CMF-significant sites
We compared QCMF-significant residue sites for both human EGFR and GCK proteins with the significant residue sites given in  of the previous CMF-method. The CMF-method detected for both human proteins, 43 sites in EGFR and 72 sites in GCK as significant.
For the EGFR protein we found that the QCMF-significant residue sites Q791, Q820, G824, K860, Y891, T892, Y900, T909 overlap with results of the CMF-method. Interestingly, one of the unconfirmed residue sites, the Q820, has been predicted by both QCMF-method and CMF-method as significant.
For GCK protein, we observed that in total 24 QCMF- significant sites (T60, T82, N83, F123, F148, T149, F152, H156, F171, N180, T206, T209, T228, E236, G260, L271, S281, N283, Q287, G294, E300, T332, F419 and E443) were also determined by the CMF-method as significant. Although both methods detected residue site N283 as significant, it corresponds to one of the unconfirmed residue sites for GCK, currently.
The CMF has been developed using normalized mutual information (MI) measures in order to detect important residue positions in MSAs. The method mainly focuses on significant BLOSUM62-dissimilar amino acid signals as a model of compensatory mutations and integrates them in the calculation of normalized MI-metrics. As a consequence of mainly taking into account dissimilar amino acid signals, an important part of CMF-significant sites were verified as disease associated nsSNP positions and just a small part of them were located at or near the catalytic sites, allosteric sites and binding sites in both proteins.
Moreover, when statistically evaluating both methods, we have observed that the QCMF significantly outperforms the QCMF-method. The QCMF reaches an improved performance in identifying essential sites from MSAs of both proteins with a significantly higher Matthews correlation coefficient (MCC) value of 0.215 whereas the CMF reaches only a MCC value of 0.133.
Significant residue pair comparison
To analyze whether the quantum-information-theory-based measures proposed in this study complements the coventional methods for the detection of correlated (co-evolutionary) mutations, we made pairwise comparisons between our new QCMF, MIp , PSICOV , and CMF .
All four methods take as input an MSA satisfying certain admissibility criteria. The problem is that QCMF and CMF output the set of QCMF-significant sites and CMF-significant sites of M’s reference protein, respectively, whereas PSICOV and MIp result in sets of important residue pairs. To make these outputs comparable, we extend them in all cases.
Let denote the output of QCMF on any admissible MSA M. We extend this set to what we call the QCMF-significant residue network of M as follows. Any two elements of are connected by an undirected edge belonging to if and only if the corresponding column pair is QCMF-significant.
In order to get a sufficiently large number MIp-significant and PSICOV-significant residue pairs, for every input MSA we simply took the top-ranking 10% as MIp-significant and PSICOV-significant, respectively.
For all four methods we used the 90th, the 95th and the 99th percentile as cut-off values.
We performed the method comparison edge-oriented, with the number of overlapping edges as measure. We applied all four methods to the 153 MSAs (see Additional files 1) described at the very beginning of this section and calculated the numbers , , , , , , , , and on each of them, where the connectivity cut-off ranges over the 90th, the 95th and the 99th percentile, and i=1,2,…,153. Summing up the 153 numbers in each of these groups results in the numbers , , , , , , , , and , which are displayed in Tables 3 and 4.
Table 3. Total number of edges in method-dependent significant residue networks with respect to various connectivity degree cut-offs
Table 4. Total number of edges in two networks of different type with respect to various connectivity degree cut-offs
Table 3 shows that all methods detect with the same connectivity degree cut-off a comparable number of edges in the corresponding significant residue network.
Table 4 highly suggests that all four methods carry distinct information. The overlap between any two of them is less than or equal to 10%. This indicates that, under the assumption that each of them models important aspects of co-evolution, they complement each other perfectly. In particular, this is true for QCMF as a quantum-information-science-based service compared with the other three established tools that are based on conventional methods.
Implementation of QCMF: Parallel computing using CUDA
The computation of both QCMF metrics (Equations 7 and 8) is strongly based on matrix operations. Therefore, we implement QCMF algorithm using CUDA  which is very suitable to perform large number of vector and matrix operations in real time. This results in a dramatic reduction of computational time of QCMF.
In this study, we use the CUDA 4.0 architecture (Toolkit) with several linear algebra libraries such as MAGMA , LAPACK , BLAS , GotoBLAS , CUBLAS  together (see Figure 5) to speed up the running time of the QCMF algorithm. Since our program requires a cooperative multi threading to not fall in any asynchronicity or locks we extended the magma library with dynamic scheduling features according to . Further, in order to be able to compare the performance, we also implemented the QCMF algorithm onto CPU architecture alone. Both implementations were performed on an Intel Core™ i7-3770K Processor operating at 3.9GHz, with 16 GB of DDR3 RAM and a GeForce GTX 680 graphics card using the Ubuntu 13.04 operating system (64-bit version).
Figure 5. Linking of the CUDA environment using C++.
Applying the QCMF algorithm for human EGFR protein with CPU alone and with CUDA acceleration, the average computational time of a column pair was 0.7117 seconds and 0.0301 seconds, respectively. Similarly, for human GCK protein, the average computational time of a column pair was 0.6977 seconds with CPU alone and 0.0299 seconds with CUDA acceleration. Consequently, the algorithm took ∼310 minutes for human EGFR protein and ∼811 minutes for GCK protein with CPU alone. On the other hand, applying the CUDA acceleration it took only ∼13 minutes for EGFR and ∼39 minutes for GCK protein. The comparison between the average times indicates that the required computational time of QCMFalgorithm with the CUDA acceleration was significantly faster than with CPU alone (approximately more than 23 times faster).
We predict important sites of a protein by detecting co-evolving residues. Our measures of co-evolution are quantum-Jensen-Shannon-divergence-based metrics of column pairs of a multiple sequence alignment, with the protein under study being the reference row. The quantum Jensen-Shannon divergence in turn has the von Neumann entropy as main building block.
The von Neumann entropy was originally defined in the framework of quantum mechanics. We elucidate it in the subsequent section as far as it is necessary to understand our methods. Researchers interested in learning more are referred to the excellent textbook due to Vedral . A comprehensive reference book was published by Nielsen and Chuang .
This section is organized as follows. In the first four subsections we recapitulate techniques developed in  which we leverage in this study. This concerns the definition of significant site pairs and of significant individual sites, the preparation of the training data set used, and the computation of a doubly stochastic matrix D as our model of compensatory mutations on grounds of two counting matrices Calt and Cnull. These two matrices also form the basis of the two amino acid pair similarity matrices and , which in turn give rise to our new quantum-information-science-based metrics and . The last four subsection are dedicated to their definitions.
Significant column pairs and significant position with respect to a certain metric
Let M be an MSA, where the protein of interest is represented by M’s first row, and let be a metric which assigns to every MSA column pair (γ1,γ2) a real number . We call a co-evolutionary column pair metric if it models a biologically meaningful co-evolutionary signal: The larger the metric value on (γ1,γ2), the more likely co-evolution between position γ1 and position γ2 has occurred.
Let be the empirical relative amino acid pair frequency of the i-th and the j-th amino acid in column pair (γ1,γ2), where i,j=1,2,…,20. (When choosing a row of this column pair by pure chance, acid pair (i,j) is drawn with probability .) In the subsequent subsection we recapitulate the way developed in  to identify significant columns and significant column pairs with respect to .
In order to identify significant column pairs of the MSA under study with respect to the metric , in  we have pointed out, that the distribution of can be regarded as a mixture of a background β-distribution F0, an unrelated pair distribution G1, and a distribution G2 of presumably co-evolving pairs.
The p-values are then uniformly distributed over [ 0,1] given the underlying -values are F0-distributed. In contrast, p-values tend to zero or one, if -values are G2-distributed or G1-distributed, respectively.
If, moreover, there is a sub-interval of [ 0,1] which contains only data from the background distribution, on grounds of a result due to Storey and Tibshirani [59,60] we determined in  an MSA-dependent threshold for -values. A column pair is said to be -significant, if its -value is above the threshold, where the false discovery rate is bounded by a predefined constant.
Figure 6 is a typical pictorial representation of metric distributions which can be treated that way to detect significant pairs.
Figure 6. p-value distributions of and-values for human EGFR protein (PDB-Entry 2J6M). The blue bars illustrate the p-value distribution of the -values and red bars display the p-value distribution of the -values.
We applied that model in this study.
We utilized the connectivity degree technique, introduced in  and developed further in , in order to define the -significance of individual residue sites. The connectivity degree of a position γ1 is the number of positions γ2 so that the site pair (γ1,γ2) is -significant. A site of the protein of interest is then called -significant, if its connectivity degree cut-off exceeds the 90-th percentile.
Training data set and pre-processing
Following , a redundancy free set of more than 35000 protein structures is our starting point. This collection was compiled in Rainer Merkl’s Lab at the University of Regensburg. The protein structures were taken from the protein data base (http://www.pdb.org/ webcite). The PISCES services  was applied to assess proteins on sequence similarity and equality of 3D-data. The related MSAs were gathered from the HSSP data base (http://swift.cmbi.ru.nl/gv/hssp/ webcite).
Taking pattern from , we filtered every MSA obtained as follows. First, highly similar and dissimilar sequences were deleted to ensure that the sequence identity between any two sequences is at least 20% and no more than 90%. Second, we removed strictly conserved residue columns, where the percentage of identical residues is greater than 95%. Third, we eliminated the residue columns which contain more than 25% gaps. Finally, we discarded all MSAs with less than 125 sequences. More than 17000 MSAs survived the last filtering step. We used approximately 1700 MSAs published in  as our training data set which we randomly chose from this set.
Setting up the counting matrices Calt and Cnull
The entries of the two matrices are frequencies of pair substitutions calculated from our training data set described in the foregoing subsection. Informally spoken, matrix Calt models the signal, whereas Cnull reflects the background.
In line with , we calculated a signal and a null set of column pairs. The signal set consists of all -significant column pairs, where M ranges over all training MSA. The null set consists of sufficiently many column pairs randomly chosen from every training MSA. For both the signal set and the null set we computed a symmetric 400×400 integer-valued matrix of frequencies of pair substitutions Calt and Cnull. To this end, the method used to compute BLOSUM62 matrices  is applied to count residue pair substitutions in MSA column pairs rather than residue substitution in columns.
Computing a doubly stochastic matrix D
According to , a pair ((ai,aj),(ak,al)) of amino acid pairs is said to be a formal dissimilar compensatory mutation, if the BLOSUM62 score both of (ai,ak) and (aj,al) is negative.
Using Calt and Cnull, we define the matrix CCompMut by
where φCompMut((ai,aj),(ak,al))=1 if and only if either (ai,aj)=(ak,al) or ((ai,aj),(ak,al)) is a formal dissimilar compensatory mutation and
By normalizing CCompMut, we obtain a symmetric matrix PCompMut. For ai,aj,ak,al ranging over all amino acids, PCompMut((ai,aj),(ak,al)) represents an empirical probability distribution on pairs of amino acid pairs.
We then calculated the symmetric 400×400-matrix
Having set all negative entries of SCompMut to zero, the doubly stochastic matrix D is computed by means of the canonical iterated row-column normalization procedure .
The doubly stochastic D is used to linearly transform empirical amino acid pair distributions of column pairs. If the pair distribution is regarded as a 400-dimensional row vector, matrix D is multiplied from the right. If then, for example, the resulting distribution is plugged into Equation 1, column pairs containing formal dissimilar compensatory mutations the D-transition probability of which is relatively large tend to be up-scaled.
The idea of the subsequent subsections is to design a model of MSA column pairs that takes formal dissimilar compensatory mutations regarded as pair dissimilarities as well as pair similarities into account. The challenge is to implement this in a way such that these two effects interfere but do not interact. This is necessary since a similarity relation is transitive, whereas a dissimilarity relation is not.
Setting up the two counting matrices Cent and Csep
We set up two significant pair substitution matrices Cent and Csep from Calt and Cnull which form the basis of our new metrics and . The intuition behind Cent is that the component-wise BLOSUM62-based pair similarity is rescaled, whereas Csep leads to a new amino acid pair similarity.
where φent((ai,aj),(ak,al))=1 if and only if either (ai,aj)=(ak,al) or the following two conditions are satisfied. First, the amino acids ai and ak as well as the amino acids aj and al are BLOSUM62-similar. Second,
where φsep((ai,aj),(ak,al))=1 if and only if either (ai,aj)=(ak,al) or Equation 2 is satisfied.
Calculating the two amino acid pair similarity matrices
Recall that a matrix A is positive definite (positive semi-definite), if there is an orthogonal matrix U (defining property U−1=UT) such that UAUT is a diagonal matrix, where the coefficients in the main diagonal are strictly positive (non-negative).
Let us call a 400×400-matrix a amino acid pair similarity matrix, if is positive definite and the entries in the main diagonal are equal to 1, whereas the off-diagonal elements ((g,h)≠(i,j)) are greater than or equal to 0, but less than 1.
Let C be either Cent or Csep. We define
where ((g,h),(i,j)) ranges over all possible 160000 indices of pairs of amino acid pairs including the main diagonal, and α∈(0,1) was set to 0.1 in order to enhance the effect of similarity.
Because of the fact, that matrix B is not in any case positive definite, we finally set
Amino acid pair similarity matrices generalize amino acid similarity matrices used by Johansson et al.  for evaluating amino acid conservation.
Modeling MSA column pairs and single columns by means of density matrices
Let (γ1,γ2) be a column pair of a multiple sequence alignment, let be the empirical amino acid pair distribution in these columns, let be the linear transform of by the doubly stochastic matrix D, and let be an amino acid pair similarity matrix.
Recall, that the trace of a matrix is the sum of its coefficients in the main diagonal.
Taking pattern from quantum mechanics, we model column pair (γ1,γ2) by a positive semi-definite 400×400-matrix the trace of which is equal to 1, a so-called density matrix. Regarding the two distributions and as 400×400-diagonal matrices the main diagonal of which are formed by the probabilities and , respectively, we integrate the classical model into the quantum-mechanics-based one.
Generalizing the approach for amino acid used in  to amino acid pairs, our density matrices are of the shape
In this study, we regard individual MSA columns only as components of column pairs. In the classical case, where MSA-column pair (γ1,γ2) is modeled by an MSA-dependent amino acid pair distribution (either or some derivative), the columns γ1 and γ2 are represented by the corresponding marginals and of .
where i,j=1,2…,20. As opposed to the indices of the marginals, matrix tr1(ρ) models column γ2, whereas matrix tr2(ρ) represents column γ1.
Defining our two new metrics
To begin with, we define the von Neumann entropy VNE(ρ) of a diagonal density matrix ρ as the Shannon entropy of its main diagonal coefficients regarded as a probability distribution.
The crucial property of a density matrix ρ is that there exists an orthogonal matrix U such that UρUT is a diagonal density matrix, where the diagonal elements are uniquely determined up to their order. Thus we are justified to finally define
where U is an orthogonal matrix diagonalizing ρ in a way just mentioned.
In principle, the following holds true. The larger the off-diagonal coefficients of the similarity matrix , the smaller the von Neumann entropy of the density matrix according to Equation 4 compared with the Shannon entropy of the probability distribution (i,j=1,2,…,20).
In order to compare two density matrices ρ and σ of the same dimension, we make use of the quantum Jensen-Shannon divergence:
It can be shown that 0≤QJSD(ρ∥σ)≤1, where 0 is attained if and only if the two density matrices ρ and σ are equal. As oppose to the case of Equation 1, we have thus avoided a normalization.
We are now in a position to define our new two metrics for a certain column pair of a given MSA. As before, the amino acid pair distribution is given by , where D is the 400×400 doubly stochastic matrix described above, is the empirical pair distribution of these two columns, and is the 400×400-identity matrix.
(see Equation 4). This metric measures the difference between a density matrix combining rescaled amino acid pair similarity with dissimilar compensatory mutations and the empirical amino acid pair distribution. The index “ent” indicates that here we make use of quantum entanglement, which in turn is a major resource of quantum information science. (Entangled 400×400-density matrices are those that cannot be represented as a convex combination of Kronecker products of 20×20-density matrices. Note, that the Kronecker product of density matrices is the analog of the classical product of probability distributions).
Using the example of the human EGFR protein (PDB-Entry 2J6M), Figure 6 illustrates that the method we developed in  to determine significant column pairs is well-applicable for both and . The results presented in this work prove that as well as are powerful co-evolutionary column pair metrics.
Grosse et al. observed in  that the Jensen-Shannon divergence (JSD) can be interpreted as mutual information between two (or more) random sources in a special setting particularly appropriate to discriminate between these sources. This is what we need when it comes to predicting important protein sites in an MSA-based approach. It might explain the findings of Capra and Singh  on the predictive power of JSD. These two articles encouraged us to utilize quantum Jensen-Shannon divergence (QJSD) in this study. As a side effect, a normalization is not necessary, since quantum Jensen-Shannon divergence, like its classical counterpart, ranges over the real interval [ 0,1].
Several studies have confirmed the fact that detecting coupled MSA-columns is extremely useful in the prediction of important protein sites (see e.g. [4-6,10-13,65-70]). When using information-theoretic metrics, there is no doubt that it is reasonable to incorporate amino acid pair dissimilarity as well as amino acid similarity in a consistent way such that similarity decreases entropy, whereas dissimilarity increases it. This kind of consistency is important, since entropy is the fundamental building block for most of those metrics. In particular, the Jensen-Shannon divergence between two probability mass functions (pmfs) p and q equals .
In  an amino acid pair dissimilarity model for compensatory mutations is presented. A doubly stochastic matrix transforms the empirical amino acid pair distribution of a column pair.
Rescaled pair similarity of BLOSUM62-similar pairs is to capture an aspect of coupled MSA column pairs orthogonal to the phenomenon of dissimilar compensatory mutations. It models the amino acid pair transition preferences within those column pairs on the average. As suggested by Caffrey et al. as well as Johansson et al., it is promising to incorporate them within the framework of quantum information theory. Therein, density matrices replace pmfs. The counterpart of the entropy of a pmf is the von Neumann entropy (VNE) of a density matrix (see Equation 5). QJSD corresponds then exactly to JSD (see Equation 6).
The challenge was to complement the model presented in  by additionally incorporating amino acid pair similarity in a way that the two effects interfere but do not interact. We model an MSA column pair by means of a 400×400-density matrix, rather than amino acid pair distributions. This provides us with the opportunity to utilize the notion of entanglement, which in turn is a major resource of quantum information. In our model, partial traces play the role of the marginals in the classical case. Pair similarity is reflected by means of positive definite pair similarity matrices (see Equation 3), where positive definiteness, which is a key property of density matrices, can only be ensured by using transitivity of similarity. Since there is no transitivity of dissimilarity, we kept dissimilarity apart from that similarity matrix. Instead, we carried over the CMF dissimilarity model of . Similarity matrix and transformed amino acid pair distribution are joined together by means of Equation 4 in the final step of our density matrix design. That way we minimize the interaction between the two effects of dissimilarity and similarity.
In order to eliminate the noise and to define an MSA-dependent threshold for significant column pairs, we followed the line of . The model presented there seems to be of universal applicability. The same is true for the connectivity degree model introduced in  and further developed in . Combining them results in a reliable and robust method to determine significant residues.
The results we present in this study show that the vast majority of QCMF-significant residue sites are closely related to functionality and structural stability of both human EGFR and GCK proteins. 10 significant residue sites in EGFR and 19 significant sites in GCK are established as functionally important since they are directly located at or close to catalytic sites, allosteric sites and binding sites which are crucial for maintaining protein functions and for understanding the underlying molecular mechanism (see Figures 1,2,3,4). Additionally, 2 significant sites in EGFR and 8 significant sites in GCK (three of them are also in contact with allosteric sites in GCK) are related to disease associated nsSNP regions of both proteins. As has been noted in , most disease-causing mutations at these positions in corresponding sequences destroy structural features of proteins, thus affecting protein stability and often results in loss of protein function.
Although the importance of almost all QCMF-significant sites are verified through essential sites of both human proteins, there are still eight and two unconfirmed significant sites in EGFR and GCK proteins, respectively, which do not fall into essential sites. It is interesting to note that some of these unconfirmed sites are also referred as significant by CMF . We therefore believe that most of these unconfirmed sites identified by our present method may have an importance for the function and structural stability of both proteins notwithstanding the absence of previous experimental data. A further comparison reveals that the overlaps between the results of the QCMF method and the CMF method are quite low, indicating that both methods detect considerably different sets of residue sites as functionally and structurally important. The comparison results clearly show that considering similar and dissimilar amino acid signals simultaneously, our present method is more sensible to catalytic, allosteric and binding sites, while only focusing on dissimilar signals the previous method deals successfully with nsSNP positions in proteins.
The final comparison between QCMF and CMF on EGFR and GCK proteins is made by inspecting several connectivity degree cut-offs. We initially set it to the 90-th percentile at which CMF reaches its maximal MCC value. Going through all possible n-th percentiles for n=80,81,…,99, QCMF reaches its maximal MCC value of 0.231 if n=88. What we got can be summarized as follows. On the one hand QCMF shows a better performance than CMF in identifying important residue sites. On the other hand QCMF complements CMF. This is because of the fact that the method of QCMF is more information rich than that of CMF. QCMF simultaneously uses similar and dissimilar amino acid pair signals, whereas CMF’s method focuses only on amino acid pair dissimilarity.
To confirm the educated guess that QCMF complements conventional methods both from information theory and statistics, we applied QCMF, CMF , MIp  and PSICOV  to the 153 MSAs described at the beginning of the Results section. In sum, each of these methods detects different residue pairs as important, where the pairwise overlap is bounded from above by 10%. The reason for that is that the four methods model different aspects of amino acid pair co-evolution. Consequently, they carry distinct information.
In this work, we report a new method, QCMF, applying principles of quantum information theory. In contrast to the previous method CMF which focused on dissimilar amino acid signals, QCMF simultaneously models similar and dissimilar amino acid pair signals in the detection of functionally or structurally important sites. QCMF includes two metrics based on quantum Jensen-Shannon divergence. While the first metric measures compensatory mutations between pairs of columns, the second metric considers the sequence conservation of columns. Results show that QCMF reaches an improved performance in identifying important sites from MSAs and it predicts a quite different set of residue sites as functionally and structurally important (in comparison to the previous method). Further, results indicate that the residue sites found by QCMF are more sensible to catalytic sites, allosteric sites and binding sites than those found by the previous method. On the top of that, a pairwise comparison with existing methods shows that QCMF is complementary to them when it comes to predicting co-evolving residue site pairs.
The authors declare that they have no competing interests.
SW developed the model underlying QCMF. EW adjusted the model together with SW and interpreted the results. MG developed the model together with SW, designed and implemented the tool together with SH, SJJ and GD, and interpreted the results together with EW and CM. SH designed and carried out the CUDA programming. GD and SJJ supported SW in developing the model and MG in designing and implementing the tool. CM interpreted the results together with EW and MG. All authors read and approved the manuscript.
We thank Rainer Merkl from Regensburg for his helpful advice and insights at early stages of this project and for comments on the methods.
Mehmet Gültas acknowledges financial support by the Deutsche Forschungsgemeinschaft (DFG: WA 766/7-1).
Biochemistry 2005, 44(19):7156-7165.PubMed Abstract | Publisher Full Text
BMC Bioinformatics 2012, 13:225.PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text
PLoS ONE 2010, 5:e8591.
[doi:10.1371%2Fjournal.pone.0008591].PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Mol BioSyst 2013, 9:175-181.
[doi:10.1039/C2MB25317H]PubMed Abstract | Publisher Full Text
Nat Rev Genet Nat Publishing Group 2005, 6(9):678-687.
Bioinformatics 2003, 19(6):750-755.PubMed Abstract | Publisher Full Text
BMC Bioinformatics 2008, 9:151.PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text
BMC Bioinformatics 2011, 12:206.PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text
Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T, Weigt M: Direct-coupling analysis of residue coevolution captures native contacts across many protein families.
Proc Nat Acad Sci 2011, 108(49):E1293-E1301.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Proc Nat Acad Sci 2014.
Proc Nat Acad Sci 2013, 110(39):15674-15679.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
J Bioinformatics Comput Biol 2010, 08(05):809-823.
BMC Bioinformatics 2013, 14:117.PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text
BMC Bioinformatics 2008, 9(Suppl 2):S10. BioMed Central Full Text
Parallel Process Lett 2008, 18(4):531-548. Publisher Full Text
PLoS ONE 2009, 4(10):e7485.
[doi:10.1371%2Fjournal.pone.0007485]PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Yun CH, Boggon TJ, Li Y, Woo MS, Greulich H, Meyerson M, Eck MJ: Structures of lung cancer-derived EGFR mutants and inhibitor complexes: mechanism of activation and insights into differential inhibitor sensitivity.
Cancer Cell 2007, 11(3):217-227.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
J Clin Invest 2007, 117(8):2051-2058.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL, Haserlat SM, Supko JG, Haluska FG, Louis DN, Christiani DC, Settleman J, Haber DA: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to Gefitinib.
N Engl J Med 2004, 350(21):2129-2139.PubMed Abstract | Publisher Full Text
Biochemistry 2009, 48(35):8435-8448.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Tinto N, Zagari A, Capuano M, De Simone A, Capobianco V, Daniele G, Giugliano M, Spadaro R, Franzese A, Sacchetti L: Glucokinase gene mutations: structural and genotype-phenotype analyses in MODY children from South Italy.
PLoS ONE 2008, 3(4):e1870.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Capuano M, Garcia-Herrero CM, Tinto N, Carluccio C, Capobianco V, Coto I, Cola A, Iafusco D, Franzese A, Zagari A, Navas MA, Sacchetti L: Glucokinase (GCK) mutations and their characterization in MODY2 children of Southern Italy.
PLoS ONE 2012, 7(6):e38906.
[doi:10.1371%2Fjournal.pone.0007485]PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Garcia-Herrero CM, Rubio-Cabezas O, Azriel S, Gutierrez-Nogues A, Aragones A, Vincent O, Campos-Barros A, Argente J, Navas MA: Functional characterization of MODY2 mutations highlights the importance of the fine-tuning of glucokinase and its role in glucose sensing.
PLoS ONE 2012, 7:e30518.
[doi:10.1371%2Fjournal.pone.0038906]PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Structure 2004, 12(3):429-438.PubMed Abstract | Publisher Full Text
Nucleic Acids Res 2005, 33(suppl 1):D514-D517.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Nucleic Acids Res 2002, 30:253-254.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Larsson P, Longden I, McLaren W, Overduin B, Pritchard B, Riat HS, Rios D, Ritchie GRS, Ruffier M, Schuster M, et al.: Ensembl 2011.
Nucleic Acids Res 2011, 39(suppl 1):D800-D806.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Protein Sci 2004, 13(4):1043-1055.
[doi:10.1110/ps.03484604]PubMed Abstract | Publisher Full Text | PubMed Central Full Text
BMC Bioinformatics 2012, 13:55.PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text
Nucleic Acids Res 2004, 32(suppl 1):D129-D133.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Nucleic Acids Res 2001, 29:308-311.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
PLoS ONE 2009, 4(10):e7485.
[doi:10.1371%2Fjournal.pone.0030518].PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Valentinova L, Beer NL, Stanik J, Tribble ND, van de Bunt M, Huckova M, Barrett A, Klimes I, Gasperikova D, Gloyn AL: Identification and functional Characterisation of novel glucokinase mutations causing maturity-onset diabetes of the young in Slovakia.
PLoS ONE 2012, 7(4):e34541.
[doi:10.1371%2Fjournal.pone.0007485]PubMed Abstract | Publisher Full Text | PubMed Central Full Text
J Symbolic Comput 1997, 24(3–4):235-265.
[doi:10.1006/jsco.1996.0125]. [Computational algebra and number theory (London, 1993)]
Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, Sorensen D: LAPACK Users’ Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1999.
ACM Trans Math Softw 2002, 28(2):135-151.Publisher Full Text
ACM Trans Math Softw 2008, 34(3):12:1-12:25.
Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International. 2012, 2404-2413.
Proc Acad Sci 2003, 100:9440-9445. Publisher Full Text
Multiple comparisons: Bonferroni corrections and false discovery rates.
Lecture Notes EEB 581, Department of Ecology and Evolutionary Biology, University of Arizona, 2004
Proc Nat Acad Sci 1992, 89(22):10915-10919.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Phys Rev E 2002, 65:041905.
Proteins-Struct Funct Genet 1994, 18(4):309-317. Publisher Full Text
Proc Nat Acad Sci 1994, 91:98-102.PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Protein Eng 1997, 10(6):647-657.PubMed Abstract | Publisher Full Text
Science 1999, 286(5438):295-299.PubMed Abstract | Publisher Full Text