Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada

Chern Institute of Mathematics, College of Mathematical Science and LPMC, Nankai University, Tianjin 300071, PRC

Abstract

Background

Traditionally, it is believed that the native structure of a protein corresponds to a global minimum of its free energy. However, with the growing number of known tertiary (3D) protein structures, researchers have discovered that some proteins can alter their structures in response to a change in their surroundings or with the help of other proteins or ligands. Such structural shifts play a crucial role with respect to the protein function. To this end, we propose a machine learning method for the prediction of the flexible/rigid regions of proteins (referred to as FlexRP); the method is based on a novel sequence representation and feature selection. Knowledge of the flexible/rigid regions may provide insights into the protein folding process and the 3D structure prediction.

Results

The flexible/rigid regions were defined based on a dataset, which includes protein sequences that have multiple experimental structures, and which was previously used to study the structural conservation of proteins. Sequences drawn from this dataset were represented based on feature sets that were proposed in prior research, such as PSI-BLAST profiles, composition vector and binary sequence encoding, and a newly proposed representation based on frequencies of k-spaced amino acid pairs. These representations were processed by feature selection to reduce the dimensionality. Several machine learning methods for the prediction of flexible/rigid regions and two recently proposed methods for the prediction of conformational changes and unstructured regions were compared with the proposed method. The FlexRP method, which applies Logistic Regression and collocation-based representation with 95 features, obtained 79.5% accuracy. The two runner-up methods, which apply the same sequence representation and Support Vector Machines (SVM) and Naïve Bayes classifiers, obtained 79.2% and 78.4% accuracy, respectively. The remaining considered methods are characterized by accuracies below 70%. Finally, the Naïve Bayes method is shown to provide the highest sensitivity for the prediction of flexible regions, while FlexRP and SVM give the highest sensitivity for rigid regions.

Conclusion

A new sequence representation that uses k-spaced amino acid pairs is shown to be the most efficient in the prediction of the flexible/rigid regions of protein sequences. The proposed FlexRP method provides the highest prediction accuracy of about 80%. The experimental tests show that the FlexRP and SVM methods achieved high overall accuracy and the highest sensitivity for rigid regions, while the best quality of the predictions for flexible regions is achieved by the Naïve Bayes method.

Background

The flexibility of protein structures is often related to protein function. Some proteins alter their tertiary (3D) structures due to a change of surroundings or as a result of interaction with other proteins

Additionally, the flexible linker and the rigid domain should be factored in when performing 3D protein structure prediction. Protein is a complex system that can be described by an accurate energy-based model

The knowledge of the flexible/rigid regions would also allow us to gain insights into the process of protein folding. Biological experiments and theoretical calculation have shown that the natural conformation of proteins is usually associated with the minimum of the free energy

Gerstein's group has done a significant amount of work on the related subject of classification of protein motions


**Examples of the three types of flexible regions**. 1) Pair (a1) and (a2) is an example of

Since the regions that are

Results and discussion

Feature-based sequence representation

Four groups of features were compared, and the best set was selected to perform the prediction. The

Using 10-fold cross validation, we compared the proposed FlexRP method, which applies Logistic Regression to the proposed collocation based representation after entropy based feature selection, with four other prediction methods: Support Vector Machines (SVM), C4.5, IB1 and Naïve Bayes. Each of these methods was applied with each of the four representations and the two selection methods, see Table

Prediction accuracy for different protein sequence representations based on 10-fold cross validation tests.

| Feature representation | Feature selection^{2} | FlexRP (Logistic Regression)^{1} | SVM^{1} | C4.5^{1} | IB1^{1} | Naïve Bayes^{1} |
|---|---|---|---|---|---|---|
| Composition vector | N/A | 67.37% | 68.74% | 57.70% | 57.33% | 65.20% |
| PSI-BLAST profile | N/A | 66.38% | 67.35% | 62.47% | 61.62% | 66.24% |
| Binary encoding | No selection | 66.38% | 66.06% | 58.82% | 59.92% | 61.84% |
| Binary encoding | Linear coefficient | 69.58% | 68.74% | 62.82% | 57.05% | 69.10% |
| Binary encoding | Entropy based | 69.19% | 68.74% | 63.24% | 58.21% | 69.00% |
| K-spaced AA pairs | Linear coefficient | 74.37% | 74.60% | 66.04% | 68.74% | 72.97% |
| K-spaced AA pairs | Entropy based | **79.51%**^{3} | 78.46% | 66.25% | 66.93% | 76.01% |

^{1}The tested classifiers include the proposed FlexRP method, Support Vector Machine (SVM), decision tree (C4.5), instance-based learner (IB1), and Naïve Bayes.

^{2 }The sequence representations based on binary codes and frequencies of the k-spaced amino acid pairs were processed using two feature selection methods.

^{3 }The best result is shown in bold.

The proposed FlexRP method obtained the best accuracy, 79.5%, among all tested combinations of the five prediction methods, the four representations and the two feature selection methods. The results for the two worst performing prediction methods, i.e., C4.5 and IB1, differ relatively little between the two feature selection methods. In contrast, the three best performing methods (FlexRP, SVM, and Naïve Bayes) achieve their best accuracy when the entropy based feature selection is applied to the proposed (best performing) representation. The result achieved by the proposed method is about 1% and 3.5% better than the two runner-up results, which were obtained for the same representation with the SVM and Naïve Bayes classifiers, respectively. The other combinations of feature representations and selection methods are, on average over the three best methods, at least 4% less accurate. Therefore, entropy based selection not only reduces the dimensionality of the proposed representation, making the method easier to implement and execute, but also improves accuracy. The superiority of the entropy based selection over the linear correlation based method can be explained by the type of features that constitute the proposed representation. The features take on discrete, integer values, and thus linear correlation coefficients, which prefer continuous values, are characterized by poorer performance.

Among the four sequence representations, the lowest average accuracy (over the five prediction methods) is achieved with the composition vector, while the PSI-BLAST profile and binary encoding give similar, second-best accuracies. The most accurate predictions are obtained with the proposed representation. Since the PSI-BLAST profile is one of the most commonly used representations, we also combined it with the features of the k-spaced AA pairs to verify whether this combination could bring further improvements. The corresponding experiments with the three best performing classifiers, i.e., FlexRP, SVM and Naïve Bayes, show that using both representations in tandem lowers the accuracy. The 10-fold cross validation accuracy equals 77.13%, 76.33%, and 72.99% for FlexRP, SVM, and Naïve Bayes, respectively. Finally, similar experiments that combine all four representations show a further drop in accuracy. The proposed representation based on k-spaced residues not only gives the best accuracy but also uses fewer features than the representations that combine multiple feature sets, and therefore this representation was used to perform the predictions.

A set of features used by the FlexRP method, which were selected using the best performing, entropy based selection method from the proposed representation, is given in Table

Features selected by the entropy based method.

DF, AK, DI, AD, AI, DC, DI, ED, DP, AC, EF, FH, ED, AI, AV, HD, FF, GL, EN, EL, EL, KI, EK, AV, AY, IE, FG, PG, GG, KF, KE, KY, FK, GG, DG, NQ, HP, PS, KC, KG, LI, LL, GG, KQ, DS, PG, IL, TI, RI, LL, LQ, GR, LI, EK, QP, VI, TV, PA, PM, GS, LS, ER, RV, TL, VN, VR, QT, VH, KL, PW, HQ, VL, VR, VI, VL, KS, SG, LL, VV, YC, VP, LL, YH, LV, YL, PS, MV, PD, VI, SQ, VK, TK, VL

Optimization of the prediction of the flexible/rigid regions

Table ^{-8 }of ridge parameter for the Logistic Regression classifier, Naïve Bayes with kernel estimator for numeric attributes

The accuracies of the optimized prediction methods equal 79.51%, 78.41% and 79.22% for the FlexRP, Naïve Bayes and SVM, respectively. To provide a more comprehensive comparison of the achieved performance, additional measures such as sensitivity, specificity, the Matthews Correlation Coefficient (MCC) and the confusion matrix values (TP, FP, FN, and TN) are reported in Table

Prediction accuracy after optimization.

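| Method | Accuracy^{1} | Sensitivity (rigid) | Specificity (rigid) | Sensitivity (flexible) | Specificity (flexible) | MCC | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|---|---|---|
| FlexRP | 79.51% | 88.52% | 82.85% | 59.71% | 70.24% | 0.51 | 3478 | 720 | 451 | 1067 |
| SVM | 79.22% | 88.93% | 82.27% | 57.86% | 70.39% | 0.50 | 3494 | 753 | 435 | 1034 |
| Naïve Bayes | 78.41% | 80.15% | 87.40% | 74.59% | 63.09% | 0.53 | 3149 | 454 | 780 | 1333 |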

^{1 }The results were based on the best performing representation that includes 95 features selected using the entropy based selection method.

The optimization provides relatively marginal improvements. The FlexRP method gives the best overall accuracy and high sensitivity and specificity for the rigid regions. SVM provides the best sensitivity for the rigid regions and the best specificity for the flexible regions, while Naïve Bayes gives the highest MCC and the highest sensitivity for the flexible regions. In summary, the proposed FlexRP method is shown to provide the most accurate prediction of flexible/rigid regions; however, the Naïve Bayes based method provides more accurate prediction for the flexible regions.

Additionally, we studied the impact of the varying values of the maximal spread,


**The prediction accuracy as a function of p for the k-spaced AA pairs where k ≤ p**. The number of features used to represent the sequence increases with the increasing value of

Comparison with similar prediction methods

The FlexRP was also compared with two recent methods that address similar predictions. Boden's group developed a method to predict regions that undergo conformational change via predicted continuum secondary structure

Comparison of performances between FlexRP, IUPred, and Boden's methods.

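| Method | Accuracy | Sensitivity (rigid) | Specificity (rigid) | Sensitivity (flexible) | Specificity (flexible) | MCC | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|---|---|---|
| FlexRP | 79.51% | 88.52% | 82.85% | 59.71% | 70.24% | 0.51 | 3478 | 720 | 451 | 1067 |
| IUPred | 65.64% | 88.88% | 69.58% | 14.55% | 37.30% | 0.05 | 3492 | 1527 | 437 | 260 |
| Boden's method | 56.21% | 56.71% | 73.53% | 55.12% | 36.67% | 0.11 | 2228 | 802 | 1701 | 985 |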

We use an example to further demonstrate differences between the three prediction methods. The prediction was performed for a segment between 11E and 216A in chain A of 1EUL protein, see Figure


**The predictions obtained with Boden's method [29], the IUPred method [30] and the FlexRP method on the 11E to 216A segment in chain A of the 1EUL protein**. In Boden's method, residues with entropy greater than 0.49 are considered to belong to regions undergoing conformational change; the IUPred method predicts all residues for which the probabilistic score is greater than 0.5 as belonging to the disordered regions. FlexRP classifies a residue as belonging to a flexible region if its corresponding probabilistic score is greater than 0.5. The actual flexible regions are identified using the white background.

In Figure

Conclusion

Knowledge of the flexibility/rigidity of protein sequence segments plays a pivotal role in improving the quality of tertiary structure prediction methods and in attempts to fully understand the protein folding process. At the same time, such information requires very detailed knowledge of protein structure, and thus is available only for a small number of proteins. To this end, we propose a novel method, called FlexRP, for the prediction of flexible/rigid regions based on protein sequence. The method is designed and tested using a set of segments for which flexibility/rigidity is defined based on a comprehensive exploration of tertiary structures from PDB

Methods

Dataset

Our previous study that concerns conservation of the tertiary protein structures shows that less than 2% out of 8127 representative segments extracted from the entire Protein Data Bank (PDB)

List of 66 segments with multiple experimental structures.

| Protein ID^{1} | Start AA | End AA | Protein ID^{1} | Start AA | End AA | Protein ID^{1} | Start AA | End AA |
|---|---|---|---|---|---|---|---|---|
| 1eulA | 11E | 216A | 1c0mA | 199K | 268D | 1ic8A | 208P | 276A |
| 121p | 25Q | 166H | 1cdb | 24F | 105R | 1ihgA | 245K | 298E |
| 1a0h | 482V | 575D | 1cejA | 30C | 95S | 1iku | 104W | 189E |
| 1a7lA | 4E | 198L | 1cfpA | 2E | 80I | 1ilf | 7D | 140Q |
| 1a7xA | 31E | 106L | 1cpq | 7L | 128E | 1irf | 27L | 112L |
| 1a90 | 25L | 116Q | 1cto | 46R | 108M | 1jmvA | 68Q | 139R |
| 1ael | 41T | 111N | 1dem | 4R | 59R | 1k0tA | 11I | 80Y |
| 1akk | 34G | 103N | 1dhx | 339A | 430G | 1k9aA | 117T | 316A |
| 1al01 | 42V | 124T | 1dmzA | 613I | 706G | 1kmuR | 299E | 382Y |
| 1aonA | 218P | 371K | 1do0 | 42E | 165E | 1kvnA | 23I | 89R |
| 1ap9 | 104D | 155G | 1ei7A | 60V | 148S | 1l6kA | 8E | 61V |
| 1avfJ | 76G | 155L | 1ej6B | 723A | 928V | 1mfn | 3D | 184T |
| 1az0A | 191S | 244R | 1f2hA | 59L | 164C | 1mkmA | 10I | 215S |
| 1b4m | 42I | 134K | 1ffxA | 148G | 263P | 1o0vA | 265Q | 470M |
| 1b75A | 41E | 94A | 1fm6A | 266T | 430L | 1pbwA | 238L | 297E |
| 1b7eA | 133E | 239I | 1g3gA | 44P | 152K | 1qpmA | 28M | 81T |
| 1b8tA | 12V | 191S | 1gm0 | 15A | 122I | 1sw6A | 346Y | 429S |
| 1ba9 | 72G | 123A | 1go4G | 498F | 578M | 1uaaA | 88G | 537R |
| 1blr | 41V | 96E | 1hqmD | 1039L | 1116T | 1wtuA | 14T | 99K |
| 1boc | 6L | 75Q | 1hryA | 20R | 75R | 2btfA | 4D | 71I |
| 1bqmA | 276V | 400L | 1hstA | 26P | 79G | 2ezm | 1L | 100Y |
| 1bsh | 19L | 138M | 1i84S | 883E | 942E | 5gcn | 36M | 165G |

^{1 }For each segment, one PDB ID together with the start and the end of the segment are listed.

Definition of the flexible regions

Several different definitions of the flexible regions were proposed in the past:

1. all regions with NMR chemical shifts of a random-coil; regions that lack significantly ordered secondary structure (as determined by CD or FTIR); and/or regions that show hydrodynamic dimensions close to those typical of an unfolded polypeptide chain

2. all regions with missing coordinates in X-ray structures

3. stretches of 70 or more sequence-consecutive residues depleted of helices and strands

4. regions with high B factors (normalized) from X-ray structures

In this paper, a data-driven definition of flexible regions, which is based on a comprehensive exploration of the experimental protein structures, is proposed. A given sequence (region) is considered flexible if it has multiple different experimental structures (in different proteins), i.e. the corresponding structure is not conserved. Although two existing methods, i.e., FlexProt $A_1A_2 \ldots A_{i-1}A_iA_{i+1} \ldots A_n$, consists of following

$$A_1A_2 \ldots A_5A_6,\; A_2A_3 \ldots A_6A_7,\; \ldots,\; A_{n-6}A_{n-5} \ldots A_{n-2}A_{n-1},\; A_{n-5}A_{n-4} \ldots A_{n-1}A_n$$

The flexible regions were identified by comparing distance, which was computed using the Root Mean Square Distance for Unit Vectors (URMSD) measure. The structures of the $i$th six-residue segment are denoted as $S_{i,1}, S_{i,2}, \ldots, S_{i,m-1}, S_{i,m}$. Based on results in

if $\max_{1 \le j < k \le m} \mathrm{URMSD}(S_{i,j}, S_{i,k}) > 0.5$,

the $i$th six-residue fragment is defined as flexible; otherwise it is regarded as rigid. In other words, the regions characterized by a maximal URMSD larger than 0.5 are indexed as flexible, while the remaining regions are indexed as rigid. The 66 segments that constitute our dataset include a total of 5716 residues, out of which 3929 were assumed as rigid and 1787 as flexible.
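The fragment enumeration and the thresholding rule described above can be illustrated with a minimal Python sketch. It assumes that the maximal pairwise URMSD of each fragment has already been computed elsewhere; the function names and the input list are hypothetical, and the URMSD computation itself is not shown.

```python
from typing import List, Sequence

FRAGMENT_LENGTH = 6    # six-residue fragments, as stated in the text
URMSD_THRESHOLD = 0.5  # threshold on the maximal URMSD, as stated in the text


def six_residue_fragments(sequence: str) -> List[str]:
    """Return all overlapping six-residue fragments of a sequence."""
    count = len(sequence) - FRAGMENT_LENGTH + 1
    return [sequence[i:i + FRAGMENT_LENGTH] for i in range(count)]


def label_fragments(max_urmsd: Sequence[float]) -> List[str]:
    """Label each fragment from its maximal pairwise URMSD.

    max_urmsd[i] is assumed to hold the largest URMSD observed between
    any two experimental structures of the i-th fragment.
    """
    return ["flexible" if d > URMSD_THRESHOLD else "rigid" for d in max_urmsd]
```

A sequence of n residues yields n - 5 fragments, which matches the enumeration given above; sequences shorter than six residues yield none.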

Next, we use an example, in which we aim to identify the flexible regions for the 88G to 573R segment in chain A and chain B of the 1UAA protein, to contrast the results of the above method with the results of FlexProt and FatCat. Computation of the flexible regions took 10 seconds for FlexProt, 30 seconds for FatCat, and less than a second for the method that was used in this paper. FlexProt identified a flexible region (hinge) between GLY374 and THR375, FatCat gave the same result, and the third method identified TYR369 to PHE377 as the flexible region. While similar flexible regions were identified by all three methods, the method from

FlexRP method

The proposed method performs its prediction as follows:

1. Each residue that constitutes the input sequence is represented by a feature vector. First, a 19-residue-wide window, which is centered on the residue, is established. Next, frequencies of the 95 k-spaced AA pairs given in Table

2. The vector is inputted into a multinomial logistic regression model to predict if the residue should be classified as flexible or rigid.
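The first step above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: a k-spaced pair is taken to be two residues separated by exactly k other residues (k = 0 gives adjacent dipeptides), the feature value is the raw count within the window (the text notes that the features take integer values), and windows near the sequence termini are simply truncated. The function names and the `selected` list format are hypothetical.

```python
from typing import List, Tuple

WINDOW_SIZE = 19  # window width stated in the text


def window_around(sequence: str, center: int, size: int = WINDOW_SIZE) -> str:
    """Return the window centered on `center`, truncated at the termini (assumption)."""
    half = size // 2
    start = max(0, center - half)
    return sequence[start:center + half + 1]


def k_spaced_pair_count(window: str, first: str, second: str, k: int) -> int:
    """Count positions i with window[i] == first and window[i + k + 1] == second."""
    step = k + 1
    return sum(
        1
        for i in range(len(window) - step)
        if window[i] == first and window[i + step] == second
    )


def feature_vector(sequence: str, center: int,
                   selected: List[Tuple[str, str, int]]) -> List[int]:
    """Build the feature vector of one residue from (first, second, k) triples."""
    window = window_around(sequence, center)
    return [k_spaced_pair_count(window, a, b, k) for a, b, k in selected]
```

For example, `feature_vector("AKAAK", 2, [("A", "K", 0), ("A", "K", 1)])` counts the adjacent pair AK and the 1-spaced pair A?K in the (truncated) window around the third residue. The resulting vector would then be passed to the classifier in the second step.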

The evaluation procedure applied in this paper assumes that the original dataset is divided into two disjoint sets: a training set that is used to develop the regression model and a test set that is used to test the quality of the proposed method (and other, considered methods). The logistic regression model is established through a Quasi-Newton optimization based on the training set

Feature-based sequence representation

Four representations, which include PSI-BLAST profile, composition vector, binary encoding, and the proposed collocation based features are applied to test and compare the quality of the proposed FlexRP method. A window that is centered on an AA for which the prediction is computed is used to compute the representation. In this paper, the window size is set to 19, i.e., the central AA and nine AAs on both of its sides. The size was selected based on a recent study that shows that such a window includes information required to predict and analyze folding of local structures and provides optimal results for secondary structure prediction

The 20 AAs are denoted as $AA_1, AA_2, \ldots, AA_{19}$, and $AA_{20}$, and the number of occurrences of $AA_i$ in the local sequence window of size $w$ as $n_i$; the composition vector is defined as

Another popular protein sequence representation is based on binary encoding. For $AA_i$, the $i$th position of the vector is set to 1, and the remaining 19 values are set to 0. Each of the AAs in the local sequence window of size

In the PSI-BLAST profile representation, position $i$ of the window is set to the log-odds score vector (over the 20 possible AAs) derived from the multiple alignment column corresponding to the $i$th position in the window. This method treats each position $i$ as a 21-dimensional vector of real values; the extra dimension is used to indicate whether position $i$ is off the end of the actual protein sequence (0 for within sequence, 0.5 for outside). The log-odds alignment scores are obtained by running PSI-BLAST against Genbank's standard non-redundant protein sequence database for three iterations. In this paper, PSI-BLAST was run with default parameters and a window size of 15 as suggested in
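As an illustration of the binary encoding discussed above, the following Python sketch encodes a 19-residue window as 19 × 20 = 380 binary values, which agrees with the feature count reported for this representation. The alphabetical ordering of the one-letter AA codes and the all-zero encoding of window positions that fall outside the sequence are assumptions made for this sketch, as are the function names.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 AAs


def one_hot(residue: str) -> List[int]:
    """Return a 20-dimensional vector with a single 1 at the residue's index."""
    vector = [0] * len(AMINO_ACIDS)
    index = AMINO_ACIDS.find(residue)
    if index >= 0:  # unknown residues stay all-zero (assumption)
        vector[index] = 1
    return vector


def binary_encode_window(sequence: str, center: int, size: int = 19) -> List[int]:
    """Concatenate the one-hot vectors of all positions in the window."""
    half = size // 2
    features: List[int] = []
    for position in range(center - half, center + half + 1):
        if 0 <= position < len(sequence):
            features.extend(one_hot(sequence[position]))
        else:
            # Positions beyond the termini are encoded as zeros (assumption).
            features.extend([0] * len(AMINO_ACIDS))
    return features
```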

A new representation, which is based on frequency of

Sizes of feature sets for the considered sequence representations.

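| Feature representation | Number of features |
|---|---|
| Composition Vector | 20 |
| PSI-BLAST profile | 315 |
| Binary Encoding | 380 |
| k-spaced AA pairs: adjacent pairs (dipeptides) | 400 |
| k-spaced AA pairs: 1-spaced pairs | 400 |
| ...... | ...... |
| k-spaced AA pairs: p-spaced pairs | 400 |
| k-spaced AA pairs: Total | 400(p+1) |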

Feature selection

The binary encoding and the collocation based representations include a relatively large number of features. Therefore, two selection methods, i.e., correlation and entropy based, were used to reduce the dimensionality and potentially improve the prediction accuracy by selecting a subset of the features.

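The linear correlation coefficient for a pair of variables $(X, Y)$ is defined as

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

where $\bar{x}$ is the mean of $X$ and $\bar{y}$ is the mean of $Y$.

The entropy of a variable $X$ is defined as

$$H(X) = -\sum_i P(x_i) \log_2 P(x_i)$$

where $\{x_i\}$ is a set of values of $X$ and $P(x_i)$ is the prior probability of $x_i$.

The conditional entropy of $X$, given the values of another variable $Y$, is defined as

$$H(X \mid Y) = -\sum_j P(y_j) \sum_i P(x_i \mid y_j) \log_2 P(x_i \mid y_j)$$

where $P(x_i \mid y_j)$ is the posterior probability of a value $x_i$ of $X$ given a value $y_j$ of $Y$.

The amount by which the entropy of $X$ decreases reflects the additional information about $X$ provided by $Y$, and is called the information gain

$$IG(X \mid Y) = H(X) - H(X \mid Y)$$

According to this measure, $Y$ is regarded as more highly correlated with $X$ than with $Z$ if $IG(X \mid Y) > IG(Z \mid Y)$.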
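The entropy based measure can be illustrated with a short Python sketch that estimates the entropy, the conditional entropy and their difference (the information gain) from paired lists of discrete values. It uses the standard textbook definitions with probabilities estimated from observed frequencies; the function names are hypothetical, and the exact ranking and search strategy applied by the selection method used in the paper is not reproduced here.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """H(X|Y): entropy of x within each group of equal y, weighted by group size."""
    n = len(x)
    groups: Dict[Hashable, List[Hashable]] = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum((len(g) / n) * entropy(g) for g in groups.values())


def information_gain(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)
```

With x as the flexible/rigid label and y as one integer-valued feature, a feature that fully determines the label gives an information gain equal to the label entropy, while a feature that is independent of the label gives a gain of zero.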

Logistic regression

Logistic regression is a method suitable to model a relationship between a binary response variable and one or more predictor variables, which may be either discrete or continuous. As such, this model perfectly fits the data used in this paper, i.e., the response variable is a binary flexible/rigid classification of a residue, and the predictor variables are the frequency of the selected k-spaced AA pairs in the local sequence window. We applied a statistical regression model for Bernoulli-distributed dependent variables, which is implemented as a generalized linear model that utilizes the logit as its link function. The model takes the following form

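$$\operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i} = \alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}, \quad i = 1, \ldots, n$$

where $p_i = P(Y_i = 1)$.

The logarithm of the odds (probability divided by 1 – probability) of the outcome is modeled as a linear function of the predictor variables, $X_i$. This can be written equivalently as

$$p_i = P(Y_i = 1 \mid X_i) = \frac{e^{\alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}{1 + e^{\alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}$$

In contrast to the linear regression, in which parameters $\alpha, \beta_1, \ldots, \beta_k$ are calculated using minimal squared error, parameters in the logistic regression are usually estimated by maximum likelihood. More specifically, $(\alpha, \beta_1, \ldots, \beta_k)$ is a set of values that maximizes the following likelihood function

$$L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$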
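A minimal Python sketch of the logit model and its log-likelihood is given below. It only evaluates the model for given parameters; the intercept name `alpha`, the coefficient list `betas` and the function names are assumptions introduced for illustration, and the fitting step (the paper reports a Quasi-Newton optimization within Weka) is not reproduced.

```python
import math
from typing import Sequence


def predict_probability(alpha: float, betas: Sequence[float],
                        x: Sequence[float]) -> float:
    """Return p = 1 / (1 + exp(-(alpha + sum_j beta_j * x_j)))."""
    z = alpha + sum(b * v for b, v in zip(betas, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(alpha: float, betas: Sequence[float],
                   xs: Sequence[Sequence[float]], ys: Sequence[int]) -> float:
    """Sum of y*log(p) + (1-y)*log(1-p) over all examples (labels are 0 or 1)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_probability(alpha, betas, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

Maximum likelihood estimation searches for the parameter values that maximize `log_likelihood` on the training set; a residue would then be assigned to the positive class when its predicted probability exceeds 0.5.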

Experimental setup

The classification systems used to develop and compare the proposed systems were implemented in Weka, which is a comprehensive open-source library of machine learning methods

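The reported results include the following quality indices, in which rigid residues are counted as positives and flexible residues as negatives:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

$$\text{sensitivity}_{\text{rigid}} = \frac{TP}{TP + FN}, \qquad \text{specificity}_{\text{rigid}} = \frac{TP}{TP + FP}$$

$$\text{sensitivity}_{\text{flexible}} = \frac{TN}{TN + FP}, \qquad \text{specificity}_{\text{flexible}} = \frac{TN}{TN + FN}$$

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN denote true positive, true negative, false positive and false negative, respectively.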
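The quality indices can be recomputed from the confusion matrix values with a few lines of Python. The sketch assumes that rigid residues are the positive class and that specificity is computed as the fraction of correct predictions among all predictions of a class; both assumptions are consistent with the TP, FP, FN and TN counts and the percentages reported for FlexRP in the tables above. The function and key names are hypothetical.

```python
import math
from typing import Dict


def quality_indices(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, per-class sensitivity/specificity and MCC."""
    total = tp + fp + fn + tn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity_rigid": tp / (tp + fn),
        "specificity_rigid": tp / (tp + fp),
        "sensitivity_flexible": tn / (tn + fp),
        "specificity_flexible": tn / (tn + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Confusion matrix values reported for FlexRP.
flexrp = quality_indices(tp=3478, fp=720, fn=451, tn=1067)
```

For the FlexRP counts, the accuracy evaluates to about 0.795 and the sensitivity for rigid regions to about 0.885, in line with the reported 79.51% and 88.52%.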

Authors' contributions

KC and LK developed the prediction method and performed the experimental evaluation. KC and JR contributed to the data collection and definition of flexible regions. All authors contributed to writing the manuscript, and read and approved the final version.

Acknowledgements

KC and LAK gratefully acknowledge support from NSERC Canada under the Discovery program and MITACS Canada under the industrial internship program. JR was supported by Liuhui Center for Applied Mathematics, China-Canada exchange program administered by MITACS and NSFC (10271061). The authors would like to thank Dr. Mani Vaidyanathan for copyediting help.