Predicting peptides binding to MHC class II molecules using multi-objective evolutionary algorithms

Rajapakse, Menaka; Schmidt, Bertil; Feng, Lin; Brusic, Vladimir

doi:10.1186/1471-2105-8-459

Research article
Open access
Published: 22 November 2007

BMC Bioinformatics volume 8, Article number: 459 (2007)

Abstract

Background

Peptides binding to Major Histocompatibility Complex (MHC) class II molecules are crucial for the initiation and regulation of immune responses. Predicting peptides that bind to a specific MHC molecule plays an important role in determining potential candidates for vaccines. The binding groove in class II MHC is open at both ends, allowing peptides longer than 9-mers to bind. Finding the consensus motif that facilitates the binding of peptides to an MHC class II molecule is difficult because the binding peptides differ in length and the location of the 9-mer binding core varies. The difficulty increases when the molecule is promiscuous and binds to a large number of low-affinity peptides.

In this paper, we propose two approaches using multi-objective evolutionary algorithms (MOEA) for predicting peptides binding to MHC class II molecules. One uses the information from both binders and non-binders for self-discovery of motifs. The other, in addition, uses information from experimentally determined motifs for guided-discovery of motifs.

Results

The proposed methods are intended for finding peptides binding to the MHC class II I-A^g7 molecule, a promiscuous binder to a large number of low-affinity peptides. Cross-validation results across experiments on two motifs derived for I-A^g7 datasets demonstrate better generalization abilities and accuracies of the present method over earlier approaches. Further, the proposed method was validated and compared on two publicly available benchmark datasets: (1) an ensemble of qualitative HLA-DRB1*0401 peptide data obtained from five different sources, and (2) quantitative peptide data obtained for sixteen different alleles comprising three mouse alleles and thirteen HLA alleles. The proposed method outperformed earlier methods on most datasets, indicating that it is well suited for finding peptides binding to MHC class II molecules.

Conclusion

We present two MOEA-based algorithms for finding motifs, one for self-discovery and the other for guided-discovery by experimentally determined motifs, and thereby predicting binding peptides to I-A^g7 molecule. Our experiments show that the proposed MOEA-based algorithms are better than earlier methods in predicting binding sites not only on I-A^g7 but also on most alleles of class II MHC benchmark datasets. This shows that our methods could be applicable to find binding motifs in a wide range of alleles.

Background

Major histocompatibility complex (MHC) molecules play a key role in initiating immune responses. They bind to and expose an antigen (or short peptides) to T cell receptors (TCR), triggering an immune response against the infected cell or foreign agent. MHC molecules make multiple contacts with the side-chains of binding peptides, which define the binding motif and determine the specificity of binding [1]. Prediction of peptides binding to an MHC class II molecule is difficult because of the different types of side chains and because the binding peptides are longer than 9aa (approximately 11 to 22aa) [1, 2]. It has been previously observed that a core of 9aa is sufficient for binding peptides to an MHC class II molecule [3]; however, the exact location of the binding core (or motif) within the peptide is usually unknown and varies from peptide to peptide.

A binding motif is usually represented either by a consensus sequence or as a weight matrix [4]. The presence or composition of a motif can be experimentally determined from a large pool of putative binding peptides [3, 5]. However, such wet-lab experiments are costly, time consuming, and cumbersome. Amino acids at specific sites of a motif that contribute significantly to the binding are referred to as primary anchor residues, and the corresponding sites as anchor positions. By using such position-specific information, earlier studies have found weight matrix models elaborating the nature and strength of binding motifs [6, 7]. These models offer binding strengths of every residue at specific sites in the form of a position specific scoring matrix (PSSM) [7].

In general, MHC class-II prediction methods are categorized into two main classes [8]: (1) quantitative prediction methods that predict inhibitory concentration (IC₅₀) values and (2) qualitative prediction methods that determine the binding status (binder or non-binder) based on the predictive score. Recent quantitative prediction approaches include SVRMHC [8], PLS-ISC [9], ARB [10], and SMM-align [11]. The ARB approach uses full length of the peptide whereas both SVRMHC and PLS-ISC approaches use a preprocessing step involving alignment of sequences, based on anchor position-specific residues. The underlying assumption of SMM-align is that amino acids occupying the 9-mer binding core motif are sufficient to determine the affinity of peptide-MHC binding. However, in some cases, the predictive performance could be improved by incorporating terminal residues known as peptide flanking residues (PFR) [11].

Qualitative prediction approaches use classifiers such as artificial neural networks [12–16], hidden Markov models [4, 17], support vector machines [18–21], and their hybrids [22], or profile analysis such as those using iterative learning [23–26], stochastic approaches (MEME) [27, 28], Gibbs motif sampler [29–32], profile motifs (RANKPEP) [33, 34], DNA microarrays and virtual matrices (TEPITOPE) [35], and evolutionary algorithms (EA) [36]. However, given a set of sequences of differing lengths with known binding affinities, the location of the binding core within each sequence must be first identified before classification of sequences. Classical multiple sequence alignment techniques often fail to detect binding cores in MHC class II binding peptides because of weak instances of binding motifs.

All methods predicting peptides binding to MHC molecules have their pros and cons; most show good performance only for datasets upon which they were developed. Therefore, there is a need for new algorithms that perform well on previously unseen data. We propose to use MOEA to align a set of experimentally determined binding peptides at their binding cores and subsequently derive the consensus motif. The methods are especially useful when molecules are promiscuous and bind to a large number of low affinity peptides. The preliminary results of our work have been presented in [37].

I-A^g7 is the MHC class II molecule of the NOD mouse, critical for the development of insulin-dependent diabetes mellitus (IDDM) and other autoimmune disorders [38–43]. Knowledge of peptides binding to I-A^g7 is important in understanding the molecular basis of development of IDDM in NOD mice. Experiments have demonstrated that I-A^g7 binding peptides are 9–30aa long [44]. Finding motifs in peptide binding to I-A^g7 is a non-trivial problem [45, 46]. Despite numerous attempts, no consensus has been reached on the rules of peptide binding to I-A^g7 molecule [38–48]. However, computational analyses on multiple datasets indicate that experimental motifs satisfy only a subset of rules describing the optimal motif.

To demonstrate its utility in predicting peptides binding to other MHC molecules, our method is tested on two benchmark datasets comprising peptides of a number of different HLA (human MHC) and mouse alleles. The first dataset, referred to as BM-Set1 from here onwards, consists of different combinations of peptides of the HLA-DRB1*0401 allele, and the second dataset, BM-Set2, consists of datasets from thirteen different HLA alleles and three mouse alleles.

Multi-Objective Evolutionary Algorithms (MOEA)

Evolutionary algorithms (EA) are based on the principles of biological evolution and have often been successful in solving complex search and optimization problems. The majority of bioinformatics applications of EA have been in the discovery of motifs such as transcription factor binding sites [49–53]. Yet, only a few researchers have used EA for the prediction of peptides binding to protein sequences [36].

An EA consists of (1) representing input variables as individuals or chromosomes (binary or real valued) in a population, (2) formulating the fitness (objective function) to evaluate individuals, (3) generating a new population by genetic operations (such as reproduction, crossover, and mutation) on the current population, and (4) determining if the population has reached the optimal fitness. The algorithm begins with an initial population and evolves over time. At a particular instance of evolution, every individual is evaluated by its fitness. New populations (offspring) are produced from highly fit individuals (parents) selected, which undergo genetic operations. Each offspring is paired and compared to its parents. Highly fit individuals are retained in the population while less fit individuals are discarded. Search mechanisms such as elitism, constraint-handling, and multi-objective optimization are available for finding a better spread of solutions, depending on the needs of the optimization problem [54–57].
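
The four steps above can be illustrated with a minimal single-objective sketch on binary chromosomes. This is a generic toy loop, not the implementation used in this work: the function name `evolve`, the binary tournament selection, the one-point crossover, and the bit-counting fitness are all assumptions chosen for brevity.

```python
import random
from typing import Callable, List

def evolve(fitness: Callable[[List[int]], float], n_bits: int,
           pop_size: int = 20, generations: int = 50,
           p_c: float = 0.9, p_m: float = 0.02, seed: int = 0) -> List[int]:
    """Toy EA: binary individuals, tournament selection, crossover, mutation, elitism."""
    rng = random.Random(seed)
    # (1) represent candidate solutions as binary chromosomes
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # (2) evaluate fitness inside a binary tournament to pick each parent
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            c1, c2 = p1[:], p2[:]
            # (3) genetic operations: one-point crossover, then bit-flip mutation
            if rng.random() < p_c:
                cut = rng.randrange(1, n_bits)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                offspring.append([b ^ 1 if rng.random() < p_m else b for b in child])
        # retain the fittest of parents and offspring (elitist survivor selection)
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
    # (4) here the stopping rule is simply a fixed number of generations
    return pop[0]

# Toy problem: maximize the number of ones in a 16-bit string.
best = evolve(fitness=sum, n_bits=16)
```

Because parents and offspring compete for survival, the best fitness in the population never decreases from one generation to the next.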

Multi-objective evolutionary algorithms (MOEA) are used to solve problems which require simultaneous optimization of a number of competing objective functions [58–61]. An MOEA maintains a set of solutions ranked by their dominance at a given instant of the evolution. A solution is said to dominate another if it is better or equal with respect to all objectives and strictly better in at least one objective [58]. Often, there is more than one non-dominated solution; these solutions represent the best ones and are collectively known as the Pareto front. MOEA algorithms result in a Pareto optimal set of solutions.
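
The dominance relation can be written down directly. The sketch below assumes that every objective is to be minimized; the helper names `dominates` and `pareto_front` and the toy objective vectors are illustrative only.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Four toy solutions with two objectives each (both minimized).
objectives = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = pareto_front(objectives)  # (3.0, 4.0) is dominated by (2.0, 3.0)
```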

Non-dominated Sorting Genetic Algorithm II (NSGA-II) was recently introduced to incorporate several new genetic mechanisms for better convergence, such as non-dominated sorting, elitism, diversity preservation, and constraint handling [58]. In NSGA-II, a population is subjected to several rounds of non-dominated sorting. That is, all the non-dominated individuals are identified and assigned the same fitness value until a new set of non-dominated solutions is found. The solutions found in subsequent rounds are assigned fitness values lower than those in the previous rounds. This process continues until the whole population is partitioned into non-dominated fronts with diverse fitness values. The elitism prevents the loss of fit individuals encountered in earlier generations by allowing earlier solutions to survive in the subsequent generations. The diversity of Pareto-optimal solutions is maintained by imposing a measure referred to as crowding distance. A solution that satisfies the constraints defined by the objective functions is called a feasible solution.
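
The two mechanisms described above, partitioning into non-dominated fronts and crowding distance, can be sketched as follows. This is a simplified illustration rather than the NSGA-II reference implementation: the sorting peels fronts off one at a time instead of using the faster bookkeeping of the original algorithm, all objectives are assumed to be minimized, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points: List[Sequence[float]]) -> List[List[int]]:
    """Partition point indices into successive non-dominated fronts."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(points[j], points[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(points: List[Sequence[float]], front: List[int]) -> Dict[int, float]:
    """Sum over objectives of the normalized gap between each point's neighbours."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        # boundary solutions are always preferred, so they get infinite distance
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (points[nxt][m] - points[prev][m]) / (hi - lo)
    return dist

points = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
fronts = non_dominated_sort(points)
dist = crowding_distance(points, fronts[0])
```

Within a front, solutions with a larger crowding distance lie in sparser regions of objective space and are preferred when the front has to be truncated.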

Peptide Binding to MHC Class II I-A^g7

In this paper, we attempt to find an optimal motif describing peptide binding to MHC class II molecules, using experimentally determined binding data. There are several factors that impede the derivation of such a consensus motif. The first is the strong resemblance among the peptides isolated in a single experiment and the second is the diversity among different datasets. A motif derived from a dataset lacking diversity indicates a bias towards the dataset used in deriving the motif. Such motifs are difficult to generalize on other experimental or previously unseen datasets. The MOEA based motif detection algorithm is designed to find a consensus motif on I-A^g7 datasets, which alleviates the influences arising from biased datasets and thereby predicts binding peptides more accurately in new datasets.

Results

Predicting Peptides Binding to MHC Class II

We use our approach to find a consensus motif on seven experimental datasets of peptides binding to I-A^g7 molecules, obtained from literature [40–43, 62–64]. The motif is validated using an independent testing set generated from the Stratmann dataset [46]. The overall quality of prediction was measured using area under curve (AUC) of the receiver operating characteristics (ROC) curve [65–67]. AUC values of all feasible solutions in the final population of EA were evaluated and the solution with the highest AUC was chosen as the consensus motif (see Additional file 1).

Table 1 shows the information on the datasets extracted from literature, which were used in the training. A dash ('-') indicates that a particular item of information is unavailable. As an example, the details of the experimental motif of Reizis et al. are given in Table 2. Table 3 shows the performance when an experimental motif is used to predict peptide binders in other datasets. As seen, a motif of a particular experiment does not characterize peptide binding of I-A^g7 molecules in other datasets. Table 4 shows the cross-validation performance of two motifs (by self-discovery and guided-discovery) derived using MOEA; in a particular cross-validation run, one experimental dataset was excluded and the motif was derived using the information of the remaining datasets. The motif was then tested for predicting binders and non-binders of the left-out dataset. The self-discovery approach uses only the binding information, whereas the guided-discovery approach uses both the binding information and the information associated with experimental motifs. As seen in Table 4, by achieving AUC values greater than 0.7 for all cross-validation runs, MOEA derived motifs demonstrate better generalization capabilities compared to experimentally determined motifs. The binding motifs derived from self-discovery and guided-discovery are illustrated as sequence logo plots [68] in Additional file 2.
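
The leave-one-dataset-out scheme used for Table 4 can be sketched as a generator of train/test splits. The dataset names and peptides below are placeholders invented for illustration, not entries from Table 1.

```python
from typing import Dict, Iterator, List, Tuple

Peptide = Tuple[str, str]  # (sequence, label), label is "b" or "nb"

def leave_one_dataset_out(
    datasets: Dict[str, List[Peptide]],
) -> Iterator[Tuple[str, List[Peptide], List[Peptide]]]:
    """Yield (held-out name, pooled training peptides, held-out test peptides)."""
    for held_out in datasets:
        train = [p for name, peps in datasets.items() if name != held_out for p in peps]
        yield held_out, train, list(datasets[held_out])

# Hypothetical toy data standing in for the experimental datasets.
toy = {
    "set_A": [("AAKLMQRST", "b")],
    "set_B": [("GGDDEPWWY", "nb"), ("KLMNPQRSA", "b")],
    "set_C": [("VVVLLLIII", "b")],
}
folds = list(leave_one_dataset_out(toy))
```

In each fold a motif would be derived from `train` and its AUC measured on the held-out peptides.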

Table 1 I-A^g7 datasets and experimental motifs

Table 2 Representation of an experimentally derived I-A^g7 motif

Table 3 Validation of I-A^g7 experimental motifs

Table 4 Performance of I-A^g7 MOEA derived motifs

To compare the performance of our method with earlier methods, a training dataset was created by combining all the experimental datasets given in Table 1. Motifs derived on the training dataset were tested on an independent test dataset, a balanced set generated from the Stratmann dataset. The Stratmann dataset was balanced by adding randomly generated non-binders. Twenty-five such balanced test datasets were assembled by generating random samples starting from different seeds and adding them to the Stratmann dataset. The results reported are based on the average AUC values over all balanced test sets. Figure 1 compares the performance of motifs derived by MOEA with that of earlier motif prediction approaches such as MEME and RANKPEP. An increase of 4–10% in predictive performance is observed with MOEA over the other approaches.

A comparison of the performance of MOEA derived motifs for BM-Set1 (see Table 5) with the enhanced Gibbs sampler [32], TEPITOPE [35], SVRMHC [8] and ARB [10] is given in Table 6. As seen, MOEA shows comparable or superior performance to the Gibbs sampler on all datasets except for the Southwood dataset. Out of the ten non-redundant (NR) datasets, MOEA outperformed the Gibbs sampler, TEPITOPE, SVRMHC and ARB on seven, nine, eight and ten datasets, respectively.

The performance of MOEA on BM-Set2 (see Table 7) was compared with the Gibbs sampler [32], TEPITOPE [35], SVRMHC [8], ARB [10] and NetMHCII [11]. Each allele dataset was subjected to five-fold cross-validation and the results are given in Table 8. The present method shows comparable or superior performance on the majority of allele datasets compared to the Gibbs sampler, SVRMHC, TEPITOPE, and NetMHCII. A fair comparison with the ARB method cannot be drawn because that method has been trained on quantitative data obtained from IEDB [10].

Table 5 Description of peptides in BM-Set1

Table 6 Comparison of performance on BM-Set1

Table 7 Description of peptides in BM-Set2

Table 8 Comparison of Performance on BM-Set2

Discussion

We proposed two approaches using MOEA for deriving motifs: (1) when only the information on binders and non-binders is known (i.e., self-discovery) and (2) when, in addition, the information from experimentally (wet-lab) determined motifs is available (i.e., guided-discovery).

Since the I-A^g7 molecule is known to bind to a large number of peptides of low affinity and appears to be a promiscuous binder, the prediction of peptides binding to the I-A^g7 molecule has been nontrivial. This has led to the definition of a number of suboptimal consensus motifs specific to the datasets. MOEA derived motifs had superior generalization capabilities to those derived with the MEME and RANKPEP techniques as well as to the experimentally determined motifs on other datasets. The performance evaluated on two benchmark datasets indicates that the present MOEA based algorithm is applicable to deriving motifs on other class II MHC alleles as well.

The likelihood of finding an optimal motif by MOEA is higher than by a local or greedy search because of the stochastic nature of EA. The proposed approach learns from the characteristics of both binders and non-binders in the training set whereas other methods use information only from binders to determine motifs [27, 32]. Moreover, ranges of the parameters involved in MOEA are known, so the parameters of the fitness functions are quickly estimated in a few cross-validation runs. Furthermore, unlike the earlier methods, the present method does not rely on any prior information such as anchor positions to obtain an alignment, prior distributions, etc., [8, 9]. Given sufficient data samples representing both binders and non-binders, the method could be applicable to find motifs in other types of molecules. A future direction of this research would be to integrate additional information such as peptide length [69] and PFR [70] as such information has been shown to have the potential to enhance motif detection [11, 69]. This would lead to further improvement of the performance of the present algorithm.

Even though EAs are generally known to be computationally intensive, training for derivation of scoring matrices can be performed off-line and the prediction engines can be provided through web services. As seen in Tables 6 and 8, no single method performs well on all types of allele datasets. Nevertheless, the present method showed higher accuracy in detecting motifs on the majority of MHC alleles in the benchmark datasets. Therefore, we believe that MOEA-based methods could provide a general framework for efficiently determining motifs in a wide range of MHC molecules.

In immunology, accuracy and speed in predicting binding peptides is of paramount importance. Computationally predicted binders do subsequently need to be validated with wet-lab experiments. By using computational predictions as an initial step, high cost involved in initial screening and time-consuming clinical testing can be significantly reduced. Towards this end, the proposed MOEA methods present a promising way to predict peptides that bind to MHC class II alleles including promiscuous and low affinity peptide binders.

Conclusion

We present two MOEA-based algorithms for finding motifs, one for self-discovery and the other for guided-discovery by experimentally determined motifs, and thereby predicting binding peptides to I-A^g7 molecule. Our experiments show that the proposed MOEA-based algorithms are better than earlier methods in predicting binding sites not only on I-A^g7 but also on most alleles of class II MHC benchmark datasets. This demonstrates the applicability of our methods to find binding motifs in a wide range of MHC alleles.

Methods

Datasets

Several I-A^g7 datasets were extracted from literature [40–43, 62–64] and from Brusic, V. (unpublished data). The numbers of binders and non-binders in each dataset are given in Table 1. The datasets consist of short peptides ranging from 9 to 30aa in length. Their binding affinities had been experimentally determined by independent studies and classified as binders or non-binders based on IC₅₀ values according to the following scheme [41]: good binder (IC₅₀ = 100 nM); weak binder (IC₅₀ = 2000 nM); non-binder (IC₅₀ = 50000 nM). The datasets in [40–43, 62–64] were combined into a single training dataset and curated by removing duplicates and redundancy as follows: if a binder is a subsequence of another binder sequence, the longer binder sequence is discarded; if a non-binder is a subsequence of another non-binder, the shorter subsequence is discarded. The curated dataset is referred to as the training dataset from here onwards and is denoted by D = {(x_i, v_i): i = 1, 2, ..., N}, where N is the total number of peptide sequences and x_i is the i-th peptide sequence with the label v_i ∈ {b, nb} indicating whether x_i is a binder (b) or a non-binder (nb). The training set contains N = 438 peptides, of which N_b = 304 are binders and N_nb = 134 are non-binders.
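
The curation rule can be sketched in a few lines. The sketch assumes that "subsequence" means a contiguous substring, which the text does not state explicitly; the function name `curate` and the toy peptides are illustrative.

```python
from typing import List, Tuple

def curate(binders: List[str], non_binders: List[str]) -> Tuple[List[str], List[str]]:
    """Remove duplicates, then apply the containment rules to each class."""
    binders = sorted(set(binders))
    non_binders = sorted(set(non_binders))
    # Binders: drop the longer peptide when it contains another binder.
    kept_b = [p for p in binders
              if not any(q != p and q in p for q in binders)]
    # Non-binders: drop the shorter peptide when it is contained in another non-binder.
    kept_nb = [p for p in non_binders
               if not any(q != p and p in q for q in non_binders)]
    return kept_b, kept_nb

kept_b, kept_nb = curate(
    binders=["AAKLM", "GAAKLMT", "QQQRS", "AAKLM"],
    non_binders=["DDE", "PDDEP", "WWW"],
)
```

The asymmetry follows from what each class reveals: a short binder already contains a binding core, whereas a long non-binder rules out every window inside it.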

The set of experimentally validated I-A^g7 motifs [38–44] derived largely from uncorrelated datasets [40–43] was extracted and is listed in Table 1 with the distribution of binders and non-binders in each dataset. Table 2 illustrates an experimentally validated motif of I-A^g7 reported by Reizis et al. [40]. Experimental motifs are described by the anchor positions and binding affinities of amino acids of the motif. The residues which contribute significantly to the peptide binding are called primary anchor residues and the positions at which they reside are called anchor positions. An amino acid occupying a specific position within a motif is characterized as well tolerated, weakly tolerated, or non-tolerated based on its involvement in the binding process.

An independent dataset was generated from the binders of the Stratmann dataset [46], which consists of a diverse set of I-A^g7 binding peptides with their binding affinities, to find the test accuracies in predicting binders and non-binders. The Stratmann dataset was balanced with randomly generated 9-mer non-binders so that, for the testing dataset, N_b = N_nb = 112.
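
A minimal sketch of the balancing step is given below. The text does not specify how the random 9-mers were drawn, so uniform sampling over the 20 standard amino acids is an assumption, as are the function names and the exclusion of sequences identical to a known binder.

```python
import random
from typing import Iterable, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_non_binders(n: int, length: int = 9, seed: int = 0,
                       exclude: Iterable[str] = ()) -> List[str]:
    """Draw n distinct random peptides (assumed non-binders) of the given length."""
    rng = random.Random(seed)
    seen = set(exclude)
    peptides = []
    while len(peptides) < n:
        pep = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if pep not in seen:
            seen.add(pep)
            peptides.append(pep)
    return peptides

def balanced_test_sets(binders: List[str], n_sets: int = 25) -> List[Tuple[List[str], List[str]]]:
    """One balanced (binders, non-binders) pair per random seed."""
    return [(list(binders),
             random_non_binders(len(binders), seed=s, exclude=binders))
            for s in range(n_sets)]

# Toy binders; the real test set has 112 binders and uses 25 seeds.
sets = balanced_test_sets(["AAAAAAAAA", "CCCCCCCCC"], n_sets=3)
```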

Binding Score Matrix

A k-mer motif of amino acids is characterized by a PSSM Q = {q_ia}_{k × 20}, where q_ia denotes the binding strength of site i when it is occupied by amino acid a. The binding score of a putative motif is computed by adding the binding scores assigned to each amino acid at the respective positions. The binding score indicates the likelihood of the motif binding to the molecule. The binding score s_i of sequence x_i = (x_{i,1}, x_{i,2}, ..., x_{i,n}) of length n is the maximum of the binding scores computed for all k-mer subsequences in x_i:

s_{i} = \max_{j} \{ s_{ij} : j = 1, 2, \dots, n - k + 1 \}

(1)

where s_ij denotes the binding score of the subsequence beginning at location j of sequence i, which is given by

s_{ij} = \sum_{l = 1}^{k} q_{l,\, x_{i,(j + l - 1)}}

(2)

and assuming that only one motif instance exists in every sequence, the location j* of the motif is given by

j^{*} = \arg \max_{j} \{ s_{ij} : j = 1, 2, \dots, n - k + 1 \}

(3)

That is, the most likely motif instance of sequence x_i, say m_i, is given by the subsequence m_i = (x_{i,j*}, x_{i,j*+1}, ..., x_{i,j*+k-1}).
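
Eqs. (1) to (3) amount to sliding a k-wide window along the peptide and keeping the best-scoring window. The sketch below stores the PSSM as a list of k dictionaries and uses 0-based offsets; this representation, the name `best_core`, and the toy matrix are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def best_core(peptide: str, pssm: List[Dict[str, float]]) -> Tuple[float, int, str]:
    """Return (s_i, j*, m_i): best window score, its 0-based start, and the core."""
    k = len(pssm)
    if len(peptide) < k:
        raise ValueError("peptide is shorter than the motif")
    # s_ij for every window start j: sum of site scores over the k positions
    scores = [sum(pssm[l][peptide[j + l]] for l in range(k))
              for j in range(len(peptide) - k + 1)]
    j_star = max(range(len(scores)), key=scores.__getitem__)
    return scores[j_star], j_star, peptide[j_star:j_star + k]

# Toy 3-mer motif: 'A' favoured at site 1, 'K' favoured at site 3.
pssm = [{a: 0.0 for a in AMINO_ACIDS} for _ in range(3)]
pssm[0]["A"] = 2.0
pssm[2]["K"] = 3.0

score, start, core = best_core("GAVKAG", pssm)
```

When several windows tie for the maximum, this sketch returns the leftmost one.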

Self-discovery of Motif

We derive a consensus motif from the training dataset, which consists of peptides of varying lengths from several experiments. The positions of binding cores within the peptides are unknown. The elements of the PSSM are represented as a 20k-tuple (q_ia : i = 1, ..., k; a ∈ Ω), where Ω represents the amino acid alphabet. Each element of the tuple is represented numerically by a binary word of size θ, so that q_ia ∈ [0, 2^θ-1]. The k-mer motif is therefore represented in the EA by an individual that is a string of length 20kθ. Let the population at the t-th iteration of the evolution be denoted by q(t) = {q₁(t), q₂(t), ..., q_M(t)}, where q_j(t) represents an individual in a population of size M.
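
As an illustration of this encoding, the sketch below decodes a bit string into a PSSM. The ordering of the words (site by site, then alphabetically by amino acid) and the use of plain binary rather than, say, Gray coding are assumptions; the text fixes only the total length 20kθ and the value range.

```python
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def decode(bits: Sequence[int], k: int, theta: int) -> List[Dict[str, int]]:
    """Decode a 20*k*theta-bit chromosome into k rows of 20 integer scores."""
    if len(bits) != k * len(AMINO_ACIDS) * theta:
        raise ValueError("chromosome length must be 20 * k * theta")
    pssm, pos = [], 0
    for _ in range(k):
        row = {}
        for a in AMINO_ACIDS:
            word = bits[pos:pos + theta]
            row[a] = int("".join(str(b) for b in word), 2)  # value in [0, 2**theta - 1]
            pos += theta
        pssm.append(row)
    return pssm

# One site, theta = 7: the first word is all ones, every other word all zeros.
chromosome = [1] * 7 + [0] * (7 * 19)
decoded = decode(chromosome, k=1, theta=7)
```

For a 9-mer motif with θ = 7, an individual is 20 × 9 × 7 = 1260 bits long.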

The fitness function is designed to arrive at an optimal consensus motif by using the training dataset. A solution is evaluated based on its ability to maximize the accuracies in identifying true binders (TP) and true non-binders (TN) as well as to widen the gap between the total scores for binders and non-binders. This is achieved by two fitness functions: f₁ to minimize the weighted sum of false positives (FP) and false negatives (FN), and f₂ to minimize the ratio between the average cumulative scores of non-binders and binders:

f_{1} = FN + κ_{1} FP

(4)

f_{2} = \frac{N_{b}}{N_{nb}} \frac{\sum_{i = 1}^{N} s (m_{i}) δ (v_{i} = nb)}{\sum_{i = 1}^{N} s (m_{i}) δ (v_{i} = b)}

(5)

Eqs. (4) and (5) are minimized subject to the following two constraints:

\frac{FP}{N_{nb}} \leq \frac{1}{α_{1}}

(6)

\frac{FN}{N_{b}} \leq \frac{1}{α_{2}}

(7)

where s(m_i) denotes the score computed for the most likely motif instance m_i of sequence x_i of the training dataset, and the Kronecker δ is one when its argument is satisfied and zero otherwise. N_b and N_nb are the total counts of binders and non-binders in the dataset. The constant κ₁ (> N_b/N_nb for N_b > N_nb, or vice versa) was empirically determined to minimize the number of false positives. The two parameters α₁ (<< N_nb) and α₂ (<< N_b) are set to limit the FP and FN rates, respectively. If no individual satisfies the above constraints, MOEA reports no feasible solution. Given the training set, a few trial runs with different initializations are necessary to determine the best values of α₁ and α₂.
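
The objectives and constraints of Eqs. (4) to (7) can be evaluated for one candidate motif as sketched below. Counting FP and FN requires a decision threshold on the binding score; the `threshold` argument is a hypothetical parameter introduced here, since this section does not state how the cut-off was chosen. The default values of κ₁, α₁ and α₂ are those reported under Parameters of MOEA.

```python
from typing import List, Tuple

def objectives(scores: List[float], labels: List[str], threshold: float,
               kappa1: float = 2.5, alpha1: float = 6.0,
               alpha2: float = 2.0) -> Tuple[float, float, bool]:
    """Return (f1, f2, feasible) for best-core scores s(m_i) and labels 'b'/'nb'."""
    n_b = sum(1 for v in labels if v == "b")
    n_nb = len(labels) - n_b
    # A peptide is predicted to bind when its score reaches the threshold.
    fp = sum(1 for s, v in zip(scores, labels) if v == "nb" and s >= threshold)
    fn = sum(1 for s, v in zip(scores, labels) if v == "b" and s < threshold)
    f1 = fn + kappa1 * fp                                   # Eq. (4)
    sum_b = sum(s for s, v in zip(scores, labels) if v == "b")
    sum_nb = sum(s for s, v in zip(scores, labels) if v == "nb")
    f2 = (n_b / n_nb) * (sum_nb / sum_b)                    # Eq. (5)
    feasible = (fp / n_nb <= 1 / alpha1) and (fn / n_b <= 1 / alpha2)  # Eqs. (6), (7)
    return f1, f2, feasible

labels = ["b", "b", "b", "b", "nb", "nb", "nb", "nb"]
f1_a, f2_a, ok_a = objectives([10, 8, 3, 9, 2, 4, 1, 7], labels, threshold=5)
f1_b, f2_b, ok_b = objectives([10, 8, 3, 9, 2, 4, 1, 4], labels, threshold=5)
```

In the first call one of four non-binders scores above the threshold, so the FP rate of 0.25 exceeds 1/α₁ and the solution is infeasible; in the second call no non-binder crosses the threshold and both constraints hold.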

Scoring of Experimental Motifs

The description of an experimental k-mer motif conveys three kinds of information at each site: (1) the amino acid occupying the site, (2) the tolerance level of the amino acid, and (3) the strength of binding. Let us denote a k-mer motif validated in experiment e by m(e) and the tolerance level of the residue at site j by ρ_j, where ρ_j ∈ {well, weak, unknown, non-tolerated}. The binding strength of site j is expressed by σ_j ∈ {primary-anchor, secondary-anchor, other}. Then, the binding score for a k-mer experimental motif is given by

s (m (e)) = \sum_{j = 1}^{k} ρ_{j} \cdot σ_{j}

(8)

Guided-discovery of Motif

In this algorithm, we assume that experimentally determined motifs are available along with the experimental datasets. An MOEA is proposed to determine a motif closer to experimental motifs. An objective function f₃ is proposed to best represent the characteristics of the motif that is close to the knowledge embedded in the experimental motifs:

f_{3} = \sum_{e} | \hat{Q} - Q (m (e)) |

(9)

where $\hat{Q}$ denotes the estimated PSSM of the motif. We use the same objective function as in Eq. (4) to accurately predict binders of the training dataset. The MOEA minimizes the objective functions given in Eqs. (4) and (9), subject to the two constraints given in Eqs. (6) and (7). The summation in Eq. (9) is taken over all the experimental motifs, and | $\hat{Q}$ - Q(m(e))| is the sum of squares of differences between the individual elements of the weight matrices $\hat{Q}$ and Q(m(e)). The knowledge of the experimental motifs is incorporated into the consensus motif adaptively through the distance function used in f₃. Further, the fitness f₁ optimizes the specificity and sensitivity of the prediction of binders.

The elements in the PSSM of experimental motifs are set to values within the same range [0, 2^θ-1] as before. The following procedure is adopted to determine the elements of Q(m(e)): a well tolerated amino acid at an anchor position of the motif receives the highest possible score of 2^θ-1; the lowest score of zero is assigned to a non-tolerated residue; weakly tolerated residues and residues at secondary anchor positions receive a score of (2^θ-1)/2; and all other (unknown) positions receive a score of (2^θ-1)/3.
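
The construction of Q(m(e)) and the distance in Eq. (9) can be sketched as follows. The way a motif is written down here (one dictionary per site mapping an amino acid to one of three category strings, with unlisted residues treated as unknown) is a hypothetical representation chosen for the example, as are the function names.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def experimental_pssm(motif: List[Dict[str, str]], theta: int = 7) -> List[Dict[str, float]]:
    """Turn per-site residue annotations into a PSSM with entries in [0, 2**theta - 1]."""
    top = 2 ** theta - 1
    value = {
        "well_anchor": float(top),      # well tolerated at an anchor position
        "weak_or_secondary": top / 2,   # weakly tolerated or secondary anchor
        "non_tolerated": 0.0,
    }
    # Residues without an annotation are unknown and receive top / 3.
    return [{a: value.get(site.get(a), top / 3) for a in AMINO_ACIDS} for site in motif]

def f3(q_hat: List[Dict[str, float]], experimental: List[List[Dict[str, float]]]) -> float:
    """Eq. (9): summed squared element-wise distance to every experimental PSSM."""
    return sum((q_hat[i][a] - q_e[i][a]) ** 2
               for q_e in experimental
               for i in range(len(q_hat))
               for a in q_hat[i])

# Toy 2-site motif with three annotated residues.
motif = [{"A": "well_anchor", "D": "non_tolerated"}, {"K": "weak_or_secondary"}]
q_e = experimental_pssm(motif)

# A candidate that differs from q_e in a single entry by 2 units.
q_hat = [dict(row) for row in q_e]
q_hat[0]["A"] = 125.0
distance = f3(q_hat, [q_e, q_e])
```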

Performance Comparison

The binding scores of I-A^g7 experimental motifs were computed using Eq. (8) by assigning the following values for binding strengths: primary = 4, secondary = 2, and others = 1, and for tolerance levels: well = 4, weak = 2, non-tolerated = -4, and unknown = 0. The experimentally determined motifs were used with peptide data in the guided-discovery of motifs.
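
With these values, Eq. (8) reduces to a sum of products. In the sketch below, the representation of a motif instance as a list of (tolerance, strength) pairs and the example sites are illustrative assumptions.

```python
from typing import List, Tuple

TOLERANCE = {"well": 4, "weak": 2, "non-tolerated": -4, "unknown": 0}  # rho_j
STRENGTH = {"primary": 4, "secondary": 2, "other": 1}                  # sigma_j

def motif_score(sites: List[Tuple[str, str]]) -> int:
    """Eq. (8): sum over sites of tolerance value times binding-strength value."""
    return sum(TOLERANCE[tol] * STRENGTH[strength] for tol, strength in sites)

# Toy 4-site instance: 4*4 + 2*2 + 0*1 + (-4)*1
example = [("well", "primary"), ("weak", "secondary"),
           ("unknown", "other"), ("non-tolerated", "other")]
total = motif_score(example)
```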

We used AUC to compare performance of the proposed methods with earlier approaches [28, 34] and experimental motifs [38–44]. Whether a peptide is a binder or a non-binder is determined by a threshold of the binding score. By varying this threshold, the ROC curve was plotted, from which AUC value was obtained. A comparison of performances of the methods is given in Figure 1.
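
The threshold sweep can be sketched in plain Python. The function names and the toy scores are illustrative; a peptide is called a binder when its score is at least the threshold, and the area is obtained with the trapezoidal rule.

```python
from typing import List, Tuple

def roc_points(scores: List[float], labels: List[str]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs obtained by lowering the threshold over all observed scores."""
    n_pos = labels.count("b")
    n_neg = labels.count("nb")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, v in zip(scores, labels) if s >= t and v == "b")
        fp = sum(1 for s, v in zip(scores, labels) if s >= t and v == "nb")
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = ["b", "b", "nb", "b", "nb", "nb"]
area = auc(roc_points(scores, labels))  # 8 of the 9 binder/non-binder pairs are ranked correctly
```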

For the comparison with the MEME method, only binders in the I-A^g7 training set were submitted to the MEME motif discovery tool at the prediction server [71]. The 9-mer motif was obtained with the following options: zero or one motif per sequence, minimum and maximum width = 9. The performance of the RANKPEP approach on the testing dataset was evaluated by uploading the dataset to the online prediction server at [72] with a 4% binding threshold [34].

Benchmark Datasets

The proposed self-discovery approach was tested on BM-Set1, i.e., HLA-DRB1*0401, which consists of one training set and 10 testing datasets and had been earlier used to benchmark a number of motif finding algorithms [25, 26, 32, 73]. The performance of MOEA was compared with earlier methods [8, 10, 32, 35].

The training set consisting of binders and non-binders was assembled as follows: an ensemble of 532 unique binding peptides was extracted from the SYFPEITHI [44] and MHCPEP [63] databases, and a set of 177 unique non-binders was extracted from the MHCBN database [20]. The datasets were pre-processed by removing peptides that did not allow a hydrophobic residue at the P1 position of all putative 9-mer binding cores and unnatural peptides containing more than 75% alanine [32]. The preprocessed binder set has 456 unique peptides with lengths ranging from 9 to 30 amino acid residues.

Of the 10 testing datasets, 8 datasets were taken from the MHC-bench as described in [74]. The other 2 datasets were extracted from experiments described by Southwood [75] and Geluk [76]. An affinity of (IC₅₀ = 1000 nM) was taken as the threshold for peptide binding as described in [75]. Homology reduction had been carried out on all datasets in order to reduce the chances of over-fitting due to the redundancy of datasets. The peptides in the non-redundant (NR) datasets had sequence similarities less than 90%. The number of binders and non-binders in the original and NR datasets are given in Table 5.

We tested our method on BM-Set2, comprising 3 mouse alleles and 13 HLA alleles, made available at [77]. These quantitative peptide datasets had been extracted from the IEDB at [78]. The number of binders and non-binders in each dataset is given in Table 7. The DRB3-0101 allele dataset was excluded from the benchmark dataset because of the significant imbalance between binders and non-binders (3 binders and 99 non-binders). With this dataset, we compared our method with [8, 10, 11, 32, 35].

Parameters of MOEA

The range of positional scores was set with θ = 7. For each run of MOEA, the population size M = 500, crossover probability p_c = 0.9, and mutation probability p_m = 0.005 were used. The process was terminated after 300 generations as no significant improvement in the convergence was observed during the experimental trial sessions. The parameters of the fitness functions were empirically determined for optimum performance within the following ranges: κ₁ = 1–2.5, α₁ = 5.0–6.0, and α₂ = 1.0–2.0. The parameters κ₁ = 2.5, α₁ = 6.0, and α₂ = 2.0 were found to work well empirically for both datasets.

References

Stern LJ, Wiley DC: Antigenic peptide binding by class I and class II histocompatibility proteins. Behring Inst Mitt. 1994, 94: 1-10.
Hammer J, Bono E, Gallazzi F, Belunis C, Nagy Z, Sinigaglia F: Precise prediction of major histocompatibility complex class II-peptide interaction based on peptide side chain scanning. J Exp Med. 1994, 180 (6): 2353-2358.
Rammensee HG, Friede T, Stevanovic S: MHC ligands and peptide motifs: first listing. Immunogenetics. 1995, 41 (4): 178-228.
Mamitsuka H: Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov models. Proteins. 1998, 33 (4): 460-474.
Falk K, Rotzschke O, Stevanovic S, Jung G, Rammensee HG: Allele-specific motifs revealed by sequencing of self-peptides eluted from MHC molecules. Nature. 1991, 351 (6324): 290-296.
Ruppert J, Sidney J, Celis E, Kubo RT, Grey HM, Sette A: Prominent role of secondary anchor residues in peptide binding to HLA-A2.1 molecules. Cell. 1993, 74 (5): 929-937.
Bouvier M, Wiley DC: Importance of peptide amino and carboxyl termini to the stability of MHC class I molecules. Science. 1994, 265 (5170): 398-402.
Wan J, Liu W, Xu Q, Ren Y, Flower DR, Li T: SVRMHC prediction server for MHC-binding peptides. BMC Bioinformatics. 2006, 7: 463.
Doytchinova IA, Flower DR: Towards the insilico identification of class II restricted T-cell epitopes: a partial least squares iterative self-consistent algorithm for affinity prediction. Bioinformatics. 2003, 19 (17): 2263-2270.
Bui H, Sidney J, Peters B, Sathiamurthy M, Sinichi A, Purton K, Mothé BR, Chisari FV, Watkins DI, Sette A: Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications. Immunogenetics. 2005, 57 (5): 304-314.
Nielsen M, Lundegaard C, Lund O: Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC Bioinformatics. 2007, 8: 238.
Bisset L, Fierz W: Using a neural network to identify potential HLA-DR1 binding sites within proteins. J Mol Recognition. 1994, 6: 41-48.
Brusic V, Rudy G, Harrison LC: Prediction of MHC binding peptides using artificial neural networks. Complex Systems: Mechanism of Adaptation. Edited by: Stonier R, Yu XS. 1994, Amsterdam: IOS Press, 253-260.
Adams HP, Koziol JA: Prediction of binding to MHC class I molecules. J Immunol Methods. 1995, 185 (2): 181-190.
Gulukota K, Sidney J, Sette A, DeLisi C: Two complementary methods for predicting peptide binding major histocompatibility complex molecules. J Mol Biol. 1997, 267: 1258-1267.
Burden FR, Winkler DA: Predictive Bayesian neural network models of MHC class II peptide binding. J Mol Graph Model. 2005, 23 (6): 481-489.
Noguchi H, Kato R, Hanai T, Matsubara Y, Honda H, Brusic V, Kobayashi T: Hidden Markov model-based prediction of antigenic peptides that interact with MHC class II molecules. J Biosci Bioeng. 2002, 94 (3): 264-270.
Donnes P, Elofsson A: Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics. 2002, 3: 25.
Zhao Y, Pinilla C, Valmori D, Martin R, Simon R: Application of support vector machines for T-cell epitopes prediction. Bioinformatics. 2003, 19 (15): 1978-1984.
Bhasin M, Singh H, Raghava GPS: MHCBN: A comprehensive database of MHC binding and non-binding peptides. Bioinformatics. 2003, 19: 665-666.
Salomon J, Flower DR: Predicting Class II MHC-Peptide binding: a kernel based approach using similarity scores. BMC Bioinformatics. 2006, 7: 551.
Takahashi H, Honda H: Prediction of peptide binding to major histocompatibility complex class II molecules through use of boosted fuzzy classifier with SWEEP operator method. Bioscience and Bioengineering. 2006, 101 (2): 137-141.
Mallios RR: Class II MHC quantitative binding motifs derived from a large molecular database with a versatile iterative stepwise discriminant analysis meta-algorithm. Bioinformatics. 1999, 15 (6): 432-439.
Mallios RR: Predicting class II MHC/peptide multi-level binding with an iterative stepwise discriminant analysis meta-algorithm. Bioinformatics. 2001, 17 (10): 942-948.
Murugan N, Dai Y: Prediction of MHC class II binding peptides based on an iterative learning model. Immunome Res. 2005, 1 (6): 10.
Karpenko O, Shi J, Dai Y: Prediction of MHC class II binders using the ant colony search strategy. Artif Intell Medicine. 2005, 35 (1–2): 47-56.
Bailey TL, Elkan C: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning. 1995, 21: 51-80.
Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Second International Conference on Intelligent Systems for Molecular Biology. 1994, AAAI Press, Menlo Park, California, 28-36.
Neuwald AF, Liu JS, Lawrence CE: Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci. 1995, 4 (8): 1618-1632.
Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouze P, Moreau Y: A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol. 2002, 9 (2): 447-464.
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993, 262 (5131): 208-214.
Nielsen M, Lundegaard C, Worning P, Hvid CS, Lamberth K, Buus S, Brunak S, Lund O: Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics. 2004, 20 (9): 1388-1397.
Reche PA, Glutting JP, Reinherz EL: Prediction of MHC class I binding peptides using profile motifs. Hum Immunol. 2002, 63 (9): 701-709.
Reche PA, Glutting JP, Zhang H, Reinherz EL: Enhancement to the RANKPEP resource for the prediction of peptide binding to MHC molecules using profiles. Immunogenetics. 2004, 56 (6): 405-419.
Sturniolo T, Bono E, Jiayi D, Raddrizzani L, Tuereci O, Sahin U, Braxenthaler M, Gallazzi F, Protti M, Sinigaglia F, Hammer J: Generation of tissue-specific and promiscuous HLA ligand databases using DNA microarrays and virtual HLA class II matrices. Nature Biotech. 1999, 17 (6): 555-561.
Brusic V, Schonbach C, Takiguchi M, Ciesielski V, Harrison LC: Application of genetic search in derivation of matrix models of peptide binding to MHC molecules. Proc Int Conf Intell Syst Mol Biol. 1997, 5: 75-83.
Rajapakse M, Schmidt B, Brusic V: Multi-Objective Evolutionary Algorithm for Discovering Peptide Binding Motifs. Applications of Evolutionary Computing. 2006, Lecture Notes in Computer Science, Springer, 3907: 149-158.
Reich EP, von Grafenstein H, Barlow A, Swenson KE, Williams K, Janeway CA: Self peptides isolated from MHC glycoproteins of non-obese diabetic mice. J Immunol. 1994, 152 (5): 2279-2288.
Amor S, O'Neill JK, Morris MM, Smith RM, Wraith DC, Groome N, Travers PJ, Baker D: Encephalitogenic epitopes of myelin basic protein, proteolipid protein, myelin oligodendrocyte glycoprotein for experimental allergic encephalomyelitis induction in Biozzi ABH (H-2Ag7) mice share an amino acid motif. J Immunol. 1996, 156 (8): 3000-3008.
Reizis B, Eisenstein M, Bockova J, Konen-Waisman S, Mor F, Elias D, Cohen IR: Molecular characterization of the diabetes-associated mouse MHC class II protein, I-Ag7. Int Immunol. 1997, 9 (1): 43-51.
Harrison LC, Honeyman MC, Trembleau S, Gregori S, Gallazzi F, Augstein P, Brusic V, Hammer J, Adorini L: A peptide-binding motif for I-A(g7), the class II major histocompatibility complex (MHC) molecule of NOD and Biozzi AB/H mice. J Exp Med. 1997, 185 (6): 1013-1021.
Latek RR, Suri A, Petzold SJ, Nelson CA, Kanagawa O, Unanue ER, Fremont DH: Structural basis of peptide binding and presentation by the type I diabetes-associated MHC class II molecule of NOD mice. Immunity. 2000, 12 (6): 699-710.
Gregori S, Bono E, Gallazzi F, Hammer J, Harrison LC, Adorini L: The motif for peptide binding to the insulin-dependent diabetes mellitus-associated class II MHC molecule I-Ag7 validated by phage display library. Int Immunol. 2000, 12 (4): 493-503.
Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics. 1999, 50 (3–4): 213-219.
Carrasco-Marin E, Kanagawa O, Unanue ER: The lack of consensus for I-A(g7)-peptide binding motifs: is there a requirement for anchor amino acid side chains?. Proc Natl Acad Sci USA. 1999, 96 (15): 8621-8626.
Stratmann T, Apostolopoulos V, Mallet-Designe V, Corper AL, Scott CA, Wilson IA, Kang AS, Teyton L: The I-Ag7 MHC class II molecule linked to murine diabetes is a promiscuous peptide binder. J Immunology. 2000, 165 (6): 3214-3225.
Carrasco-Marin E, Shimizu J, Kanagawa O, Unanue ER: The class II MHC I-Ag7 molecules from non-obese diabetic mice are poor peptide binders. J Immunol. 1996, 156 (2): 450-458.
Suri A, Vidavsky I, van der Drift K, Kanagawa O, Gross ML, Unanue ER: In APCs, the autologous peptides selected by the diabetogenic I-Ag7 molecule are unique and determined by the amino acid changes in the P9 pocket. J Immunol. 2002, 168 (3): 1235-1243.
Beiko RG, Charlebois RL: GANN: genetic algorithm neural networks for the detection of conserved combinations of features in DNA. BMC Bioinformatics. 2005, 6 (1): 36.
Fogel GB, Weekes DG, Varga G, Dow ER, Harlow HB, Onyia JE, Su C: Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucleic Acids Res. 2004, 32 (13): 3826-3835.
Liu F, Tsai J, Chen R, Chen S, Shih S: FMGA: Finding Motifs by Genetic Algorithm. IEEE BIBE. 2004
Lo N, Changchien S, Chang Y, Lu T: Human promoter prediction based on sorted consensus sequence patterns by genetic algorithms. Intl congr on Biological and Medical Engineering. 2002, 111-112.
Corne D, Meade A, Sibly R: Evolving Core Promoter Signal Motifs. IEEE Congress on Evolutionary Computation. 2001, 1162-1169.
Fogel G, Corne D: Evolutionary Computation in Bioinformatics. 2003, Morgan Kaufmann publishers
Mitchell M: An Introduction to Genetic Algorithms. 1999, MIT press
Deb K: Multi-Objective Optimization Using Evolutionary Algorithms. 2001, Wiley publishers
Holland J: Adaptation in Natural and Artificial Systems. 1975, Ann Arbor, MI: University of Michigan Press
Deb K, Pratap A, Agrawal S, Meyarivan T: A Fast and Elitist Multiobjective Genetic Algorithm:NSGA-II. IEEE Trans on Evolutionary Computation. 2002, 6 (2): 182-197.
Zitzler E, Thiele L: Multiobjective Evolutionary Algorithms: A Comparative Case Study and the Strength of Pareto Approach. IEEE Trans on Evolutionary Computation. 1999, 3: 257-271.
Knowles JD: Approximating the Nondominated Front Using the Pareto Archived Evolution Strategy. Evolutionary Computation. 2000, MIT Press, 8 (Summer): 149-172.
Fonseca C, Fleming PJ: Genetic Algorithms for Multiobjective Optimization: Formulation, discussion and generalization. Fifth Intl Conference on Genetic Algorithms. 1993, San Mateo, CA: Morgan Kaufmann, 416-423.
Corper AL, Stratmann T, Apostolopoulos V, Scott CA, Garcia KC, Kang AS, Wilson IA, Teyton L: A Structural Framework for Deciphering the Link Between I-Ag7 and Autoimmune Diabetes. Science. 2000, 288 (5465): 505-511.
Brusic V, Rudy G, Harrison LC: MHCPEP, a database of MHC-binding peptides: update 1997. Nucleic Acids Res. 1998, 26 (1): 368-371.
Yu B, Gauthier L, Hausmann DH, Wucherpfennig KW: Binding of conserved islet peptides by human and murine MHC class II molecules associated with susceptibility to type I diabetes. Eur J Immunol. 2000, 30 (9): 2497-2506.
Webb A: Statistical Pattern Recognition. 2002, John Wiley & Sons, 2nd edition.
Swets JA: Measuring the accuracy of diagnostic systems. Science. 1988, 240 (4857): 1285-1293.
Schueler-Furman O, Altuvia Y, Sette A, Margalit H: Structure-based prediction of binding peptides to MHC class I molecules: Application to a broad range of MHC alleles. Protein Sci. 2000, 9: 1838-1846.
Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Research. 1990, 18 (20): 6097-6100.
Chang ST, Ghosh D, Kirschner DE, Linderman JJ: Peptide length-based prediction of peptide-MHC class II binding. Bioinformatics. 2006, 22 (22): 2761-2767.
Godkin AJ, Smith KJ, Willis A, Tejada-Simon MV, Zhang J, Elliott T, Hill AVS: Naturally Processed HLA Class II Peptides Reveal Highly Conserved Immunogenic Flanking Region Sequence Preferences That Reflect Antigen Processing Rather Than Peptide-MHC Interactions. J Immunol. 2001, 166 (11): 6720-6727.
MEME. [http://meme.sdsc.edu/meme/]
RANKPEP. [http://bio.dfci.harvard.edu/Tools/rankpep.html]
Bhasin M, Raghava GP: SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence. Bioinformatics. 2004, 20 (3): 421-423.
MHCBench. [http://www.imtech.res.in/raghava/mhcbench]
Southwood S, Sidney J, Kondo A, del Guercio M, Appella E, Hoffman S, Kubo RT, Chestnut R, Grey HM, Sette A: Several common HLA-DR types share largely overlapping peptide binding repertoires. J Immunol. 1998, 160: 3363-3373.
Geluk A, van Meijgaarden K, Schloot N, Drijfhout J, Ottenhoff T, Roep B: HLA-DR binding analysis of peptides from islet antigens in IDDM. Diabetes. 1998, 47: 1584-1600.
NetMHCII. [http://www.cbs.dtu.dk/services/NetMHCII]
IEDB. [http://www.immuneepitope.org]
Weblogo. [http://weblogo.berkeley.edu/]

Acknowledgements

The authors would like to thank Dr. Tim Oliver for proofreading the manuscript. We are also grateful to the anonymous reviewers, whose comments significantly improved the paper.

Author information

Authors and Affiliations

Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613, Singapore
Menaka Rajapakse
NICTA VRL, University of Melbourne, Parkville, 3010, Australia
Bertil Schmidt
School of Computer Engineering, Nanyang Technological University, Block N4, Nanyang Avenue, 639798, Singapore
Menaka Rajapakse & Lin Feng
Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, 02115, USA
Vladimir Brusic

Corresponding author

Correspondence to Menaka Rajapakse.

Additional information

Authors' contributions

MR and VB conceived the study; MR designed experiments and performed computational analysis; MR, BS, VB and LF wrote the manuscript. All authors read and corrected the manuscript.

Electronic supplementary material

12859_2007_1831_MOESM1_ESM.pdf

Additional file 1: MOEA-derived matrices on the I-A^g7 dataset. The two PSSMs derived using the MOEA self-discovery and guided-discovery approaches are given in Additional file 1. (PDF 233 KB)

12859_2007_1831_MOESM2_ESM.pdf

Additional file 2: Motif logos obtained for I-A^g7 from MOEA derived matrices. Figure 1 and Figure 2 illustrate motif logos derived from the alignments obtained from the MOEA guided-discovery and self-discovery approaches. The web server [79] was used to generate the motif logos as described in [68]. (PDF 61 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Rajapakse, M., Schmidt, B., Feng, L. et al. Predicting peptides binding to MHC class II molecules using multi-objective evolutionary algorithms. BMC Bioinformatics 8, 459 (2007). https://doi.org/10.1186/1471-2105-8-459

Received: 07 May 2007
Accepted: 22 November 2007
Published: 22 November 2007
DOI: https://doi.org/10.1186/1471-2105-8-459


Peptide Binding to MHC Class II I-A^g7