  • Methodology article
  • Open access

Large-scale prediction of long disordered regions in proteins using random forests

Abstract

Background

Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and in high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines, have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions will dramatically increase the complexity of the prediction algorithm and make the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.

Results

A new algorithm, IUPforest-L, for predicting long disordered regions using the random forest learning model is proposed in this paper. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross validation tests, IUPforest-L can achieve an area of 89.5% under the receiver operating characteristic (ROC) curve. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes.

Conclusion

The random forest model based on the auto-correlation functions of the AAIs within a protein fragment and other physicochemical features could effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed to batch predict long disordered regions in proteins, and the server can be accessed from http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php

Background

Intrinsically unstructured/disordered proteins (IUPs/IDPs) contain long disordered regions or are completely disordered [1]. IUPs are abundant in higher organisms and often involved in key biological processes, such as transcriptional and translational regulation, membrane fusion and transport, cell-signal transduction, protein phosphorylation, the storage of small molecules and the regulation of self-assembly of large multi-protein complexes [2–11]. The disordered state in IUPs creates larger intermolecular interfaces [12], which increase the speed of interaction with potential binding partners even in the absence of tight binding, and provide flexibility for binding diverse ligands [2, 5, 11, 13–15]. However, long disordered regions in IUPs cause difficulties in protein structure determination by both X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Efficient prediction of disordered region(s) in IUPs by computational methods can provide valuable information in high-throughput protein structure characterization, and reveal useful information on protein function [15].

Many predictors have been developed to predict disordered regions in proteins, such as PONDR [16], RONN [17, 18], VL2, VL3, VL3H and VL3E from DisProt [1, 19, 20], NORSp [21, 22], DISpro [23], FoldIndex [24], DISOPRED and DISOPRED2 [25–27], GlobPlot [28] and DisEMBL [29], IUPred [30], Prelink [31], DRIP-PRED (MacCallum, online publication http://www.forcasp.org/paper2127.html), FoldUnfold [32], Spritz [33], DisPSSMP [34], VSL1 and VSL2 [35, 36], POODLE-L [37], POODLE-S [38], Ucon [39], PrDOS and metaPrDOS [40, 41]. Among these predictors, neural networks and support vector machines (SVM) are widely used machine learning models.

The accuracy of disorder predictors is generally limited by the existence of various kinds of disorder, which are represented unevenly in the various databases, and by the lack of a unique definition of disorder [30]. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions [36, 42] because long and short disordered regions have different sequence features. As a result, some predictors are specialized for predicting long disordered regions, such as POODLE-L [37], while predictors targeting all types of disordered regions usually have to sacrifice time efficiency to exploit heterogeneous sequence properties, especially the evolutionary information extracted from PSI-BLAST or protein secondary structure [25, 27, 33–36, 38].

In this paper, a new algorithm, IUPforest-L, is proposed for predicting long disordered regions based on the random forest learning model [43] and simple parameters extracted from the amino acid sequences and amino acid indices (AAIs) [44]. 10-fold cross validation tests and blind tests demonstrate that IUPforest-L can achieve significantly higher accuracy than many existing algorithms in predicting long disordered regions. The high efficiency of IUPforest-L makes it a suitable tool for high-throughput comparative proteomics studies.

Methods

Training and test datasets

To train IUPforest-L, a subset (positive training set) of disordered regions was constructed based on DisProt [20] (version 3.6), which includes 352 regions of 30 aa or more in length, and 47251 aa in total. The negative training set was extracted from PDBSelect25 [45] (Oct. 2004 version), from which 366 sequences (80,324 aa in total) of at least 80 aa were selected. Each of them has a high resolution crystal structure (< 2.0 Å), free from missing backbone or side chain coordinates and free from non-standard amino acid residues.

To assess the prediction performance of IUPforest-L, three datasets were used for blind tests. The first dataset was based on the dataset constructed by Hirose et al. (Hirose-ADS1) as a blind test dataset of POODLE-L [37]. Hirose-ADS1 contains 53 ordered regions of at least 40 aa (11431 aa in total) from the Protein Data Bank [46] and 63 disordered regions of at least 30 aa (8700 aa in total) from DisProt (version 3.0). The second test set (Han-ADS1) comprised the same 53 ordered regions as in Hirose-ADS1 and 33 long disordered regions (5959 aa in total) from the latest DisProt (version 4.8), after removing disordered regions homologous to those in DisProt (version 3.6) using the CD-HIT algorithm with a threshold of 0.9 sequence identity [47]. The third test set (Peng-DB) was constructed based on the blind test dataset of VSL2 [35], where 56 long disordered regions of at least 30 aa (2841 aa in total) and 1965 ordered regions (318431 aa in total) were used in the assessment. For an objective blind test of IUPforest-L on Hirose-ADS1 (as reported in Table 1), disordered and ordered regions homologous to those in Hirose-ADS1 were removed from our training set based on the CD-HIT algorithm with a threshold of 0.9 sequence identity [47], resulting in 293 disordered regions and 364 ordered regions for training the predictor. Similarly, for an objective blind test on Han-ADS1 (as reported in Table 2), ordered regions homologous to the 53 ordered regions in Hirose-ADS1 were also removed from the original training set for training the predictor. The final IUPforest-L was still trained on the whole training set. Han-ADS1 is listed in Additional file 1 and is also available online at http://dmg.cs.rmit.edu.au/IUPforest/Han-ADS1.fasta.

The random forest model

A random forest is an ensemble of unpruned decision trees (shown in Figure 1), where each tree is grown using a (bootstrap) subset of the training dataset [43]. Bootstrapping is a resampling technique where a number of bootstrap training sets are drawn randomly from the original training set with replacement. Each tree induced from bootstrap samples grows to full length and the number of trees in the forest is adjustable. To classify an instance of unknown class label, each tree casts a unit classification vote. The forest selects the classification having the most votes over all the trees in the forest. Compared with the decision tree classifier [48], random forests have better classification accuracy, are more tolerant to noise and are less dependent on the training datasets.
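The bootstrap-and-vote idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each "tree" is reduced to a one-feature threshold rule (a decision stump), the toy data and the names `bootstrap_sample`, `train_stump` and `forest_predict` are hypothetical, and the random selection of features at each split used in the full random forest procedure [43] is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Fit a one-feature threshold rule on (value, label) pairs.

    The stump predicts "D" (disordered) when value > threshold and "O"
    (ordered) otherwise. It stands in for a fully grown tree only to
    keep the sketch short.
    """
    best = None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x > threshold) != (label == "D") for x, label in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def forest_predict(thresholds, x):
    """Each tree casts one vote; the forest returns the majority class."""
    votes = Counter("D" if x > t else "O" for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional training data: (feature value, class label).
data = [(0.1, "O"), (0.2, "O"), (0.3, "O"), (0.4, "O"),
        (0.6, "D"), (0.7, "D"), (0.8, "D"), (0.9, "D")]

rng = random.Random(0)
# An odd number of trees avoids tied votes between the two classes.
forest = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(forest_predict(forest, 0.95))
print(forest_predict(forest, 0.05))
```

Because every stump sees a different bootstrap sample, the thresholds differ from tree to tree, and the majority vote smooths out the influence of any single resampled training set.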

Figure 1

A sample random forest. In the decision tree on the left, the node at the root tests an attribute, such as the first order auto-correlation function of the normalized flexibility parameters (see below). If it is higher than a given threshold then the residue is in a disordered state (the right branch labelled D); otherwise another input attribute is tested and a set of other tests are further performed until a decision is made. A random forest can comprise hundreds of decision trees.

Features used in training and test

When a window of w aa slides along a sequence, six types of features were derived from residues within the window, as defined and explained below.

1) Auto-correlation function of amino acid indices (AAIs)

Each residue in the training set was replaced with its value from a normalized amino acid index (AAI), a set of 20 numerical values representing a physicochemical or biological property of the 20 amino acids, chosen from the AAI Database (http://www.genome.ad.jp/dbget/aaindex.html) [44]. As such, a sequence of N amino acids in the training set was first transformed into a numerical sequence [49, 50], denoted as:

$P_1 P_2 \cdots P_i \cdots P_{i+w} \cdots P_N$ (1)

The numerical sequences were then smoothed with the Savitzky-Golay filter [51]. The Moreau-Broto auto-correlation function $F_d$ of an AAI was then calculated within a window, defined as:

$$F_d = \frac{1}{w-d} \sum_{i=1}^{w-d} p_i \times p_{i+d}, \quad (d = 1, 2, \ldots, w-1)$$
(2)

where w is the window size, and $p_i$ and $p_{i+d}$ are the AAI values at positions i and i+d, respectively [49, 50]. For example, when d = 1, the numerical value for each residue (i) in the window is multiplied by the value of the next residue (i+1), and $F_1$ is the average of these w-1 products. Similarly, $F_2$ is the average of the w-2 products generated from pairs of residues two positions apart. The value of d represents the order of the correlation and was tuned to optimize the prediction performance. $F_d$ (d = 1, 2, ..., 30) for the 40 sets of AAIs listed in Table A1 in Additional file 2 was calculated and evaluated in training IUPforest-L.

2) The mean hydrophobicity, defined as the average value of Kyte and Doolittle's hydrophobicity [52] in the window.

3) The modified hydrophobic cluster [31], calculated as the length of the longest hydrophobic cluster in the window divided by the window size.

4) The mean net charge within the window and the local mean net charge within a 13 aa fragment centered at the middle residue. Residues K and R were defined as +1; D and E were defined as -1; other residues were 0.

5) The mean contact number, defined as the mean expected number of contacts in the globular state of all residues within the window [53].

6) The composition of four reduced amino acid groups [48] and Shannon's entropy (K2) of the amino acid composition within the window.
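Three of the window features listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the AAI lookup, normalization and Savitzky-Golay smoothing steps are omitted (the auto-correlation takes an already numeric window), and the entropy is assumed here to be the base-2 Shannon entropy of the residue composition.

```python
import math
from collections import Counter


def autocorrelation(values, d):
    """Moreau-Broto auto-correlation F_d over one window of AAI values.

    Implements F_d = 1/(w-d) * sum_{i=1}^{w-d} p_i * p_{i+d}, where w is
    the window length and 1 <= d <= w-1.
    """
    w = len(values)
    if not 1 <= d <= w - 1:
        raise ValueError("d must satisfy 1 <= d <= w - 1")
    return sum(values[i] * values[i + d] for i in range(w - d)) / (w - d)


# Charge assignment from feature 4: K, R = +1; D, E = -1; all others 0.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def mean_net_charge(window):
    """Mean net charge of the residues in a window (string of one-letter codes)."""
    return sum(CHARGE.get(aa, 0) for aa in window) / len(window)


def shannon_entropy(window):
    """Shannon entropy (in bits) of the amino acid composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical numeric window, used only to show the call pattern.
window_values = [1.0, 2.0, 3.0, 4.0]
print(autocorrelation(window_values, 1))  # average of 1*2, 2*3 and 3*4
print(autocorrelation(window_values, 2))  # average of 1*3 and 2*4
print(mean_net_charge("KKAA"))
print(shannon_entropy("AABB"))
```

For the local mean net charge, the same `mean_net_charge` function could be applied to the 13 aa fragment centered at the middle residue of the window.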

IUPforest-L

A flow chart of IUPforest-L is shown in Figure 2. At the training stage, the features listed above were calculated as a window of w aa slid from the N-terminal end to the C-terminal end of a protein sequence. Each window was tagged with a label of disorder (Positive or P) or order (Negative or N) according to the label of its central residue. IUPforest-L models were trained from the six types of features, and each tree in the forest produced a prediction. The final score was the combination of the outcomes from all trees by voting and smoothing [51]. A threshold that best classifies the ordered or disordered state of a residue could be defined based on the scores and the optimal evaluated values in the 10-fold cross validation tests.
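The window labelling step can be sketched as below. The function name `labelled_windows` is hypothetical, the defaults of 31 aa for the window and 20 aa for the step are the values reported in the Results, and the sketch assumes that only windows lying fully inside the sequence are used, since the treatment of sequence ends is not described.

```python
def labelled_windows(sequence, labels, w=31, step=20):
    """Yield (window, label) pairs for one sequence.

    sequence: string of one-letter residue codes.
    labels:   string of the same length, "P" (disorder) or "N" (order)
              for each residue.
    Each window takes the label of its central residue.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    half = w // 2
    for start in range(0, len(sequence) - w + 1, step):
        yield sequence[start:start + w], labels[start + half]


# Hypothetical 100-residue example: first half ordered, second half disordered.
sequence = "A" * 100
labels = "N" * 50 + "P" * 50
for window, label in labelled_windows(sequence, labels):
    print(len(window), label)
```

With these settings the windows start at positions 0, 20, 40 and 60, so consecutive windows overlap by 11 residues.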

Figure 2

Flow chart of IUPforest-L. The sequence features were calculated when a window slides along a protein sequence. IUPforest-L models were trained from the six types of features and the prediction result could be obtained by each of the trees in the forest. The final score in the prediction was the combination of the outcomes from all trees by voting.

During the prediction stage, the features were first calculated as a window slid over a query sequence, and a probability score of each residue being disordered was then assigned by IUPforest-L. A region was annotated as disordered only when 30 or more consecutive amino acid residues were predicted to be disordered.
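The rule of 30 or more consecutive residues can be expressed as a simple run-length scan over per-residue scores. In this sketch the function name and the score threshold of 0.5 are assumptions made for illustration; the threshold actually used by IUPforest-L is derived from the cross validation tests and is not stated here.

```python
def long_disordered_regions(scores, threshold=0.5, min_length=30):
    """Return (start, end) index pairs, end exclusive, of predicted regions.

    A residue counts as disordered when its score is >= threshold
    (hypothetical cut-off). Only runs of at least min_length consecutive
    disordered residues are reported.
    """
    regions = []
    run_start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_length:
                regions.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(scores) - run_start >= min_length:
        regions.append((run_start, len(scores)))
    return regions


# A 29-residue run is discarded; the following 30-residue run is kept.
scores = [0.9] * 29 + [0.1] + [0.9] * 30 + [0.1] * 5
print(long_disordered_regions(scores))
```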

Evaluations

To estimate the generalization accuracy, 10-fold cross validation tests were conducted, where 90% of the sequences in the training set were randomly selected for training and the other 10% were used for testing. The process was repeated over the entire dataset and the final result was the average of the results from the 10 runs. In addition, independent tests were performed on Hirose-ADS1 [38], Han-ADS1 and Peng-DB [35].
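A sequence-level 10-fold split of this kind might look like the following sketch. The function name, the fixed random seed and the use of integer identifiers in place of sequences are assumptions for illustration; the key point is that whole sequences, not individual windows, are assigned to folds.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) lists so that each item is tested exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# 100 hypothetical sequence identifiers: 90 for training, 10 for testing per fold.
for train, test in k_fold_splits(range(100)):
    print(len(train), len(test))
```

The reported performance would then be the average of the measure of interest over the 10 train/test pairs.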

During the cross validation test, the confusion matrix, which comprises true positive (TP), false positive (FP), true negative (TN) and false negative (FN), was used to evaluate the prediction performance in terms of the following measures:

1) AUC, the area under the receiver operating characteristic (ROC) curve. Each point of a ROC curve was defined by a pair of values for the true positive rate ($\frac{TP}{TP+FN}$, or sensitivity) and the false positive rate ($\frac{FP}{TN+FP}$, or 1-specificity).

2) Balanced overall accuracy

$$\mathrm{Bacc} \equiv \frac{\mathrm{sensitivity} + \mathrm{specificity}}{2}$$
(3)

3) Sproduct

$$\mathrm{Sproduct} \equiv \mathrm{sensitivity} \times \mathrm{specificity}$$
(4)

4) Matthews correlation coefficient (MCC)

$$\mathrm{MCC} \equiv \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP) \times (TP+FN) \times (TN+FP) \times (TN+FN)}}$$
(5)

5) Sw

$$S_w \equiv \frac{w_{disorder} \times TP - w_{order} \times FP + w_{order} \times TN - w_{disorder} \times FN}{w_{disorder} \times (TP+FN) + w_{order} \times (TN+FP)}$$
(6)

where $w_{disorder}$ and $w_{order}$ are the weights for disorder and order, respectively, which are inversely proportional to the number of residues in the disordered and ordered state. Sw is also referred to as probability excess [34].
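As a consistency check on the definition of Sw, the weights can be substituted explicitly. The derivation below assumes the weights take the form $c/(TP+FN)$ and $c/(TN+FP)$ for an arbitrary constant $c$ (a symbol introduced here only for illustration), which is one way to read "inversely proportional to the number of residues" in each state.

```latex
\begin{align*}
w_{disorder} &= \frac{c}{TP+FN}, \qquad w_{order} = \frac{c}{TN+FP} \\[4pt]
S_w &= \frac{c\,\dfrac{TP-FN}{TP+FN} + c\,\dfrac{TN-FP}{TN+FP}}{c + c}
     = \frac{1}{2}\left[\frac{TP-FN}{TP+FN} + \frac{TN-FP}{TN+FP}\right] \\[4pt]
    &= \frac{1}{2}\bigl[(2\,\mathrm{sensitivity}-1) + (2\,\mathrm{specificity}-1)\bigr]
     = \mathrm{sensitivity} + \mathrm{specificity} - 1
\end{align*}
```

Under this assumption Sw equals $2 \times \mathrm{Bacc} - 1$, so it ranges from -1 to 1 and is 0 for a predictor that performs no better than chance.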

The Sproduct and Sw scores were used in assessing the prediction of disordered residues in the Critical Assessment of techniques for protein Structure Prediction (CASP6 and CASP7) [54].
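The confusion-matrix measures defined in this section can be computed together, as in the sketch below. The function name `evaluate` and the example counts are hypothetical; the weights are taken as the reciprocals of the class sizes, which is one choice consistent with "inversely proportional", and returning 0 for MCC when its denominator is zero is a common convention rather than something specified in the text.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Compute sensitivity, specificity, Bacc, Sproduct, MCC and Sw."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Weights inversely proportional to the number of disordered/ordered residues.
    w_disorder = 1.0 / (tp + fn)
    w_order = 1.0 / (tn + fp)
    sw = ((w_disorder * tp - w_order * fp + w_order * tn - w_disorder * fn)
          / (w_disorder * (tp + fn) + w_order * (tn + fp)))

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den > 0 else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "bacc": (sensitivity + specificity) / 2,
        "sproduct": sensitivity * specificity,
        "mcc": mcc,
        "sw": sw,
    }


# Hypothetical counts: 10 disordered and 20 ordered residues.
for name, value in evaluate(tp=8, fp=2, tn=18, fn=2).items():
    print(f"{name}: {value:.3f}")
```

For these counts the sensitivity is 0.8 and the specificity is 0.9, giving Bacc = 0.85, Sproduct = 0.72, and MCC and Sw both equal to 0.7.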

Results

10-fold cross validation

The 10-fold cross validation test results using a window of 31 aa are shown in Figure 3. With the type 1 features (the auto-correlation function of AAIs), a forest of more trees has better predictive ability. For example, the AUC increased by 2% when the number of trees increased from 10 to 50. However, the prediction accuracy increased only modestly when the number of trees increased further from 50 to 100, while the training and prediction times increased significantly. Detailed test results on the time consumption with number of trees from 10 to 300 are shown in Additional file 3. The default setting of IUPforest-L is a forest of 50 trees for large-scale application.

Figure 3

ROC curves of 10-fold cross validation tests. The ROC curves of IUPforest-L in 10-fold cross validation tests are shown. The IUPforest-L could reach a 76% true positive rate at a 10% false positive rate with MCC = 0.67, Sproduct = 0.64 and an area of 89.5% under the ROC curve on the training data set with a window of 31 aa.

With a forest of a fixed number of trees, the ROC curve trained with the auto-correlation function with d value between 1 and 15 almost overlaps with the ROC curve trained with d between 1 and 30. This result indicates that continuous correlations between nearby residues from 1 to 15 along the sequence could determine whether the fragment is involved in a long disordered region.

Figure 3 shows that training with either type 1 features or the combination of type 2–6 features reached a true positive rate of 70.5% or 70.0%, respectively, at a 10% false positive rate, while the combination of all type 1–6 features led to a higher true positive rate of 76% and an area of 89.5% under the ROC curve. This result indicates that type 1 and type 2–6 features carry partly redundant but complementary structural information. Type 2–6 features generated only nine parameters in total within a given window, while type 1 features could generate hundreds of parameters that take into account both order information and physicochemical properties. It has been shown that the random forest model does not overfit as more trees are added, even when the number of input parameters increases [43]. As such, using type 1 features to train the random forest could extract more sequence-structure information [55], and it was thus conjectured that better prediction accuracy could be achieved with the auto-correlation functions generated from AAIs combined with the other features of type 2–6.

The window size and the step size for sliding the window are additional parameters for tuning the performance of IUPforest-L models. The window should be of a reasonable size so that the AAI-based correlation can be of significance within a reasonable training or test time. Training with small windows increases training time and can introduce noise, whereas training with large windows can lose local information. Our test results indicated that, for window sizes from 19 aa to 47 aa, the random forest gave more stable results on the blind test set Han-ADS1, but the accuracy in the 10-fold cross validation test on the training set dropped with larger window sizes (details are listed in Additional file 4). To batch predict long disordered regions, a window size of 31 aa was set as the default to balance efficiency and accuracy. The step size for sliding the window can also affect the accuracy and overall time efficiency at both the training and test stages. If the step size is too small, successive windows overlap heavily, which introduces redundancy between windows and prolongs model training. Our experiments (details are listed in Additional file 4) show that with a sliding step of 20 aa (the default setting), models achieve stable sensitivity without significantly prolonging the training process.

Blind tests

Figure 4 depicts the ROC curves for IUPforest-L and nine other publicly available predictors on the blind test dataset Hirose-ADS1, including the most recently developed POODLE-L [37] and the well-established predictor VSL2 [35]. IUPforest-L outperforms most of the other predictors in terms of the AUC in predicting long disordered regions. At low false positive rates (< 10%), IUPforest-L achieves the highest sensitivity among all the predictors. In terms of the other performance measures listed in Table 1, IUPforest-L is also comparable to or better than the other predictors. Figure 5 and Table 2 show the results of comparing IUPforest-L with POODLE-L and other predictors on Han-ADS1, where IUPforest-L performs better than most of them. Figure 6 and Table 3 show the results of the same comparison on Peng-DB, where IUPforest-L again performs better than most of the other predictors.

Figure 4

ROC curves on test set Hirose-ADS1. The ROC curves for IUPforest-L and nine publicly available predictors on the blind test dataset Hirose-ADS1 are shown. IUPforest-L has the best performance in terms of the AUC.

Table 1 Comparison of IUPforest-L with other predictors on the test set Hirose-ADS1*
Figure 5

ROC curves on test set Han-ADS1. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Han-ADS1 are shown. IUPforest-L performs better in terms of the AUC than most of the predictors.

Table 2 Comparison of IUPforest-L with other predictors on test set Han-ADS1*
Figure 6

ROC curves on test set Peng-DB. The ROC curves for IUPforest-L and some publicly available predictors on the blind test dataset Peng-DB are shown. IUPforest-L performs better in terms of the AUC than most of the predictors.

Table 3 Comparison of IUPforest-L with other predictors on test set Peng-DB*

Discussion

Protein structures are stabilized by numerous intramolecular interactions, such as hydrophobic, electrostatic and van der Waals interactions and hydrogen bonds. The autocorrelation function tests whether the physicochemical property of one residue is independent of that of neighbouring residues. A group of residues involved in ordered structure and close in space to other groups of residues will be dynamically constrained by backbone or side chain interactions with these residues, and hence the residues in both groups will show higher density in the contact map or have higher pairwise correlation. On the other hand, a repetitive sequence of amino acids can also give significant positive correlation for all physicochemical properties. Therefore, residues within a fragment exhibiting a higher autocorrelation may either be structurally constrained or have low sequence complexity. The random forest learning model employed by the IUPforest-L disorder predictor combines the complementary contributions from the autocorrelation function (type 1 feature) and the other types of features, so that structural information is extracted with a high degree of prediction accuracy.

The random forest model is an ensemble learning model and is known to be more robust to noise than many non-ensemble learning models. However, as a classifier based on the random forest needs to load many decision trees into memory, it is relatively slow for a forest to predict a single instance at a time. As a result, the current web server of IUPforest-L is better suited to batch prediction of a large number of protein sequences, which provides a useful alternative tool for large-scale analysis of long disordered regions in proteomics. As an initial application, we have provided a server, IUPforest-L, for batch protein sequence analysis, which outputs an overall summary and details for each sequence. For convenience in proteomic comparisons, the prediction results for 62 eukaryotes linked to the European Bioinformatics Institute are also pre-calculated and can be downloaded from the server.

Conclusion

IUP studies are important because disordered regions are common and functionally important in proteins. The new features, the auto-correlation functions of AAIs within a protein fragment, reflect both residue contact information and sequence complexity. The random forest model based on this new type of feature and other physicochemical features could effectively detect long disordered regions in proteins. As a result, a new predictor, IUPforest-L, was developed to predict long disordered regions in proteins. Its high accuracy and high efficiency make it a useful tool in large-scale protein sequence analysis.

References

  1. Vucetic S, Brown CJ, Dunker AK, Obradovic Z: Flavors of protein disorder. Proteins. 2003, 52 (4): 573-584. 10.1002/prot.10437.

  2. Dyson H, Wright PE: Intrinsically Unstructured Proteins and their Functions. Nat Rev Mol Cell Biol. 2005, 6: 197-208. 10.1038/nrm1589.

  3. Tompa P, Szasz C, Buday L: Structural disorder throws new light on moonlighting. Trends Biochem Sci. 2005, 30 (9): 484-489. 10.1016/j.tibs.2005.07.008.

  4. Tompa P: Intrinsically unstructured proteins. Trends Biochem Sci. 2002, 27 (10): 527-533. 10.1016/S0968-0004(02)02169-2.

  5. Uversky VN, Oldfield CJ, Dunker AK: Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J Mol Recognit. 2005, 18 (5): 343-384. 10.1002/jmr.747.

  6. Wright PE, Dyson HJ: Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol. 1999, 293 (2): 321-331. 10.1006/jmbi.1999.3110.

  7. Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW: Intrinsically disordered protein. J Mol Graph Model. 2001, 19 (1): 26-59. 10.1016/S1093-3263(00)00138-8.

  8. Russell RB, Gibson TJ: A careful disorderliness in the proteome: Sites for interaction and targets for future therapies. FEBS Lett. 2008, 582 (8): 1271-1275. 10.1016/j.febslet.2008.02.027.

  9. Radivojac P, Iakoucheva LM, Oldfield CJ, Obradovic Z, Uversky VN, Dunker AK: Intrinsic disorder and functional proteomics. Biophys J. 2007, 92 (5): 1439-1456. 10.1529/biophysj.106.094045.

  10. Oldfield CJ, Cheng Y, Cortese MS, Romero P, Uversky VN, Dunker AK: Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry. 2005, 44 (37): 12454-12470. 10.1021/bi050736e.

  11. Tompa P: The interplay between structure and function in intrinsically unstructured proteins. FEBS Lett. 2005, 579 (15): 3346-3354. 10.1016/j.febslet.2005.03.072.

  12. Gunasekaran K, Tsai CJ, Kumar S, Zanuy D, Nussinov R: Extended disordered proteins: targeting function with less scaffold. Trends Biochem Sci. 2003, 28 (2): 81-85. 10.1016/S0968-0004(03)00003-3.

  13. Namba K: Roles of partly unfolded conformations in macromolecular self-assembly. Genes Cells. 2001, 6 (1): 1-12. 10.1046/j.1365-2443.2001.00384.x.

  14. Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN: Flexible nets. The roles of intrinsic disorder in protein interaction networks. Febs J. 2005, 272 (20): 5129-5148. 10.1111/j.1742-4658.2005.04948.x.

  15. Oldfield CJ, Ulrich EL, Cheng Y, Dunker AK, Markley JL: Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins. 2005, 59 (3): 444-453. 10.1002/prot.20446.

  16. Li X, Romero P, Rani M, Dunker AK, Obradovic Z: Predicting Protein Disorder for N-, C-, and Internal Regions. Genome Inform Ser Workshop Genome Inform. 1999, 10: 30-40.

  17. Yang ZR, Thomson R, McNeil P, Esnouf RM: RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics. 2005, 21 (16): 3369-3376. 10.1093/bioinformatics/bti534.

  18. Thomson R, Esnouf R: Prediction of natively disordered regions in proteins using a bio-basis function neural network. Lecture Notes in Computer Science. 2004, 3177: 108-116.

  19. Smith DK, Radivojac P, Obradovic Z, Dunker AK, Zhu G: Improved amino acid flexibility parameters. Protein Sci. 2003, 12 (5): 1060-1072. 10.1110/ps.0236203.

  20. Vucetic S, Obradovic Z, Vacic V, Radivojac P, Peng K, Iakoucheva LM, Cortese MS, Lawson JD, Brown CJ, Sikes JG: DisProt: a database of protein disorder. Bioinformatics. 2005, 21 (1): 137-140. 10.1093/bioinformatics/bth476.

  21. Liu J, Tan H, Rost B: Loopy proteins appear conserved in evolution. J Mol Biol. 2002, 322 (1): 53-64. 10.1016/S0022-2836(02)00736-2.

  22. Liu J, Rost B: NORSp: Predictions of long regions without regular secondary structure. Nucleic Acids Res. 2003, 31 (13): 3833-3835. 10.1093/nar/gkg515.

  23. Cheng J, Sweredoski MJ, Baldi P: Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery. 2005, 11: 213-222. 10.1007/s10618-005-0001-y.

  24. Prilusky J, Felder CE, Zeev-Ben-Mordehai T, Rydberg E, Man O, Beckmann JS, Silman I, Sussman JL: FoldIndex(C): a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics. 2005, 21 (16): 3435-3438. 10.1093/bioinformatics/bti537.

  25. Jones DT, Ward JJ: Prediction of disordered regions in proteins from position specific score matrices. Proteins. 2003, 53 (Suppl 6): 573-578. 10.1002/prot.10528.

  26. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004, 337: 635-645. 10.1016/j.jmb.2004.02.002.

  27. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server for the prediction of protein disorder. Bioinformatics. 2004, 20 (13): 2138-2139. 10.1093/bioinformatics/bth195.

  28. Linding R, Russell RB, Neduva V, Gibson TJ: GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res. 2003, 31 (13): 3701-3708. 10.1093/nar/gkg519.

  29. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB: Protein disorder prediction: implications for structural proteomics. Structure (Camb). 2003, 11 (11): 1453-1459. 10.1016/j.str.2003.10.002.

  30. Dosztanyi Z, Csizmok V, Tompa P, Simon I: The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol. 2005, 347 (4): 827-839. 10.1016/j.jmb.2005.01.071.

  31. Coeytaux K, Poupon A: Prediction of unfolded segments in a protein sequence based on amino acid composition. Bioinformatics. 2005, 21 (9): 1891-1900. 10.1093/bioinformatics/bti266.

  32. Galzitskaya OV, Garbuzynskiy SO, Lobanov MY: FoldUnfold: web server for the prediction of disordered regions in protein chain. Bioinformatics. 2006, 22 (23): 2948-2949. 10.1093/bioinformatics/btl504.

  33. Vullo A, Bortolami O, Pollastri G, Tosatto SC: Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines. Nucleic Acids Res. 2006, 34 (Web Server issue): W164-168. 10.1093/nar/gkl166.

  34. Su CT, Chen CY, Ou YY: Protein disorder prediction by condensed PSSM considering propensity for order or disorder. BMC Bioinformatics. 2006, 7: 319. 10.1186/1471-2105-7-319.

  35. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z: Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics. 2006, 7: 208. 10.1186/1471-2105-7-208.

  36. Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK: Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005, 61 (Suppl 7): 176-182. 10.1002/prot.20735.

  37. Hirose S, Shimizu K, Kanai S, Kuroda Y, Noguchi T: POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics. 2007, 23 (16): 2046-2053. 10.1093/bioinformatics/btm302.

  38. Shimizu K, Hirose S, Noguchi T: POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics. 2007, 23 (17): 2337-2338. 10.1093/bioinformatics/btm330.

  39. Schlessinger A, Punta M, Rost B: Natively unstructured regions in proteins identified from contact predictions. Bioinformatics. 2007, 23 (18): 2376-2384. 10.1093/bioinformatics/btm349.

  40. Ishida T, Kinoshita K: Prediction of disordered regions in proteins based on the meta approach. Bioinformatics. 2008, 24 (11): 1344-1348. 10.1093/bioinformatics/btn195.

  41. Ishida T, Kinoshita K: PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res. 2007, 35 (Web Server issue): W460-464. 10.1093/nar/gkm363.

  42. Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK, Obradovic Z: Optimizing Intrinsic Disorder Predictors with Protein Evolutionary Information. J Bioinform Comput Biol. 2005, 3: 35-60. 10.1142/S0219720005000886.

  43. Breiman L: Random Forests. Machine Learning. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.

  44. Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res. 2000, 28 (1): 374. 10.1093/nar/28.1.374.

  45. Hobohm U, Sander C: Enlarged representative set of protein structures. Protein Sci. 1994, 3 (3): 522-524.

  46. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.

  47. Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.

  48. Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P: Disorder and sequence repeats in hub proteins and their implications for network evolution. J Proteome Res. 2006, 5 (11): 2985-2995. 10.1021/pr060171o.

  49. Feng ZP, Zhang CT: Prediction of membrane protein types based on the hydrophobic index of amino acids. J Protein Chem. 2000, 19 (4): 269-275. 10.1023/A:1007091128394.

  50. Bu WS, Feng ZP, Zhang Z, Zhang CT: Prediction of protein (domain) structural classes based on amino-acid index. Eur J Biochem. 1999, 266 (3): 1043-1049. 10.1046/j.1432-1327.1999.00947.x.

  51. Savitzky A, Golay MJE: Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry. 1964, 36: 1627-1639. 10.1021/ac60214a047.

  52. Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157 (1): 105-132. 10.1016/0022-2836(82)90515-0.

  53. Garbuzynskiy SO, Lobanov MY, Galzitskaya OV: To be folded or to be unfolded?. Protein Sci. 2004, 13 (11): 2871-2877. 10.1110/ps.04881304.

  54. Jin Y, Dunbrack RL: Assessment of disorder predictions in CASP6. Proteins. 2005, 61 (Suppl 7): 167-175. 10.1002/prot.20734.

  55. Han P, Zhang X, Norton RS, Feng ZP: Predicting disordered regions in proteins using the profiles of amino acid indices. Supplement Issue of BMC Bioinformatics for APBC. 2009.

Acknowledgements

The authors thank Lefei Zhan for his help in conducting some experiments and Marc Cortese for his explanation of the DisProt database. ZPF is supported by an APD award from the Australian Research Council. XZ is supported in part by an RMIT Emerging Researcher Grant.

Author information

Corresponding author

Correspondence to Zhi-Ping Feng.

Additional information

Authors' contributions

PH wrote the computer program, carried out the calculations and developed the web interface; RSN participated in the design of the project and in drafting the manuscript; XZ and ZPF participated in the design of the project, development of the algorithm, interpretation of the results and drafting of the manuscript.

Electronic supplementary material

12859_2008_2738_MOESM1_ESM.txt

Additional file 1: Blind test set Han-ADS1. The 53 ordered regions and 33 disordered regions longer than 30 aa used in the blind test of IUPforest-L. (TXT 19 KB)

12859_2008_2738_MOESM2_ESM.pdf

Additional file 2: Table A1: The amino acid indices (AAIs) used in the study. The names of 20 disorder-correlated indices and 20 ordered-correlated indices. (PDF 33 KB)

12859_2008_2738_MOESM3_ESM.pdf

Additional file 3: Influence of the number of trees and time efficiency. Results and discussion on the influence of the number of trees and time efficiency. (PDF 245 KB)

12859_2008_2738_MOESM4_ESM.pdf

Additional file 4: Influence of the windows and sliding step for training IUPforest-L. Results and discussions on the influence of the windows and sliding step for training IUPforest-L. (PDF 613 KB)

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Han, P., Zhang, X., Norton, R.S. et al. Large-scale prediction of long disordered regions in proteins using random forests. BMC Bioinformatics 10, 8 (2009). https://doi.org/10.1186/1471-2105-10-8

  • DOI: https://doi.org/10.1186/1471-2105-10-8

Keywords