Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

This article is part of the supplement: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics

Open Access Proceedings

B-cell epitope prediction through a graph model

Liang Zhao1, Limsoon Wong2, Lanyuan Lu3, Steven CH Hoi1* and Jinyan Li4*

Author Affiliations

1 Bioinformatics Research Center, School of Computer Engineering, Nanyang Technological University, Singapore

2 School of Computing, National University of Singapore, Singapore

3 School of Biological Science, Nanyang Technological University, Singapore

4 Advanced Analytics Institute, School of Software, Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, NSW 2007, Australia

For all author emails, please log on.

BMC Bioinformatics 2012, 13(Suppl 17):S20  doi:10.1186/1471-2105-13-S17-S20


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/13/S17/S20


Published:13 December 2012

© 2012 Zhao et al.; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Prediction of B-cell epitopes from antigens is useful to understand the immune basis of antibody-antigen recognition, and is helpful in vaccine design and drug development. Tremendous efforts have been devoted to this long-studied problem, however, existing methods have at least two common limitations. One is that they only favor prediction of those epitopes with protrusive conformations, but show poor performance in dealing with planar epitopes. The other limit is that they predict all of the antigenic residues of an antigen as belonging to one single epitope even when multiple non-overlapping epitopes of an antigen exist.

Results

In this paper, we propose to divide an antigen surface graph into subgraphs by using a Markov Clustering algorithm, and then we construct a classifier to distinguish these subgraphs as epitope or non-epitope subgraphs. This classifier is then taken to predict epitopes for a test antigen. On a big data set comprising 92 antigen-antibody PDB complexes, our method significantly outperforms the state-of-the-art epitope prediction methods, achieving 24.7% higher averaged f-score than the best existing models. In particular, our method can successfully identify those epitopes with a non-planarity which is too small to be addressed by the other models. Our method can also detect multiple epitopes whenever they exist.

Conclusions

Various protrusive and planar patches at the surface of antigens can be distinguishable by using graphical models combined with unsupervised clustering and supervised learning ideas. The difficult problem of identifying multiple epitopes from an antigen can be made easied by using our subgraph approach. The outstanding residue combinations found in the supervised learning will be useful for us to form new hypothesis in future studies.

Background

A B-cell epitope is a set of spatially proximate residues in an antigen that can be recognized by antibodies to activate immune response [1]. B-cell epitopes are of two types: about 10% of them are linear B-cell epitopes and about 90% are conformational B-cell epitopes [2-4]. Linear epitopes differ from conformational epitopes in the continuity of their residues in primary sequence--residues of a linear-epitope are contiguous in primary sequence while the residues in a conformational-epitope are not. B-cell epitope prediction is a long-studied problem of high complexity which aims to identify those residues in an antigen forming one or multiple epitopes.

This problem has attracted tremendous efforts over the last two decades because of its significance in prophylactic and therapeutic biomedical applications [5]. Various approaches have been proposed to identify conformational epitopes, for example, by clustering accessible surface area (ASA) [6], by combining residues' ASA and their spatial contact [7], by grouping surface residues under their protrusion index [8], by aggregating epitope-favorable triangular patches [9], or by using naïve Bayesian classifier on residues' physicochemical and geometrical properties [10]. Far more approaches have been developed for predicting linear epitopes. Some of these methods use just a single feature of residues--such as hydrophobicity, polarity, or flexibility only--to detect the crests or troughs of propensity values as epitopes [11,12]. The other methods take complicated machine learning approaches, including artificial neural network, Bayesian network, and kernel methods, to tackle this problem [13-19]. With these tremendous efforts, this field of research has been advanced significantly and the best AUC performance has reached to 0.644 [9]. However, there are still many limitations in existing methods, and huge room for performance improvement exists.

A limitation of those methods using geometrical properties [7,8,10] is that they only favor epitopes with protrusive shapes, not identifying epitopes in other formations such as planar shapes. In fact, many epitopes are shaped at plain areas of antigens. For example, the surface atoms of the epitope of paracoccus denitrificans cytochrome C oxidase is very at in 3-dimensional space with a root mean square deviation (rmsd, an index of non-planarity) of only 1.08Å (Figure 1). The second limitation of the conventional methods is that they do not separate or distinguish between any two epitopes in an antigen when multiple epitopes exist. They only tell which residue of the antigen is antigenic, but not tell to which epitope it belongs to. That is, only a union of all antigenic residues, irrespective to specific epitopes, are just predicted. This is a limitation because multiple epitopes are possible at the same antigen [20]. For instance, there exist two non-overlapping epitopes on the ubiquitin antigen: one of them has a very smooth surface with a non-planarity of 1.04Å, while the other stretches out remarkably with a non-planarity of 3.14Å. See Figure 2 for more details of their constituent resides. In this work, we propose a graph-based model to improve the prediction performance by identifying both protrusive and planar epitopes and by detecting multiple epitopes in an antigen if applicable (i.e., identifying all of the epitopes instead of the union of all epitope residues).

thumbnailFigure 1. A hydrophilic island with a hydrophobic core in the binding site of paracoccus denitrificans cytochrome C oxidase. Interacting with an antibody (PDB complex 1AR1). The hydrophilic residues are rendered by blue and the hydrophobic residues are colored by orange. The shade of colors represents hydrophobic intensity. This figure is produced by using Chimera [33].

thumbnailFigure 2. An example of a multiple-epitope antigen. The ten residues with color orange are the residues of the epitope on the ubiquitin antigen (chain X) in PDB complex 3DVG, while the fourteen residues colored in forest green are the residues of the epitope on the ubiquitin antigen (chain Y) in PDB complex 3DVG.

The use of graph model is motivated by the following biological observations. First, the tight packing of residues at each protein surface can be effectively represented by a graph. Second, epitope/non-epitope residues form particular patches separately on antigen surfaces, displaying distinct subgraphs of their own characteristics. As shown in Figure 1, the binding site shapes like a hydrophilic island (a hydrophilic subgraph) containing a hydrophobic core (a hydrophobic subgraph). It can be also seen that this island subgraph is surrounded by hydrophobic non-epitope residues which form a non-epitope subgraph. Third, the distinction between protrusive and planar eptiopes can be manifested by the change of weights in the connections between residues.

Our graph-based prediction method consists of three main steps: construct a weighted graph to represent an antigen surface, cluster the nodes of this weighted graph, and learn a label (epitope or non-epitope) for each cluster. Specifically, we take the idea of Delaunay tessellation and use Qhull [21] in the implementation of Delaunay tessellation to construct a protein surface graph. The weights of the edges in this graph are determined by <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M1">View MathML</a> test statistics combined with a log odds ratio of each edge type. An edge type is determined by the amino acid types of the interacting residue pair. Then, a Markov CLustering algorithm (MCL) [22] is used to partition the entire graph into subgraphs based on the weights of the edges and the graph topology. MCL simulates flows in a network with two operations: expansion and inflation. Expansion increases homogeneity of nodes within one subgraph, while inflation evaporates inter-flow between different subgraphs and amplifies flow within subgraphs. These ideas mimic properties of residues connecting within an epitope, within a non-epitope, or between an epitope and a non-epitope. Thus, we can divide the weighted antigen surface graph into a good set of subgraphs for the subsequent learning algorithms to predict these subgraphs as epitopes or non-epitopes accurately.

Experimental results on a set of 92 non-redundant antibody-antigen complexes compiled from the Protein Data Bank (PDB) [23] show that our proposed graph model improves the performance of B-cell epitope prediction significantly and, it is also able to identify multiple epitopes as well as predict epitopes with various geometrical formations. For ease of reference, we refer to the proposed B-Cell epiTope prediction model as BeTop. Our data and web server for B-cell epitope prediction are available at http://sunim1.sce.ntu.edu.sg/~s080011/betop/index.php webcite.

Materials and methods

Collection of antigen protein data

Protein complexes satisfying the following criteria were retrieved from the PDB on May 14th, 2011: (i) the macromolecular type is protein only, no DNA, RNA, or their hybrid complexes; (ii) the number of protein chains in an asymmetric unit of one complex is larger than two; (iii) the length of every chain is larger than or equal to 30; (iv) the X-ray resolution of one complex is less than 3Å; and (v) the structure title contains at least one of the following terms: antibody, Fab, Fv, or VHH. We obtained 622 antibody-antigen complexes. As transformed and redundant chains in the raw PDB complexes may cause noise effect on the results, we removed all of the transformed chains and duplicate chains. One antigen chain is considered as a duplicate if there exists one pair-wise chain similarity between this chain and one of the other in the data set larger than 80%, a threshold widely used to remove redundant antigens [24]. Removal of duplicate chains by pair-wise chain similarity may filter out multi-epitope antigens, but it can significantly reduce more noise data because the number of non-epitope residues is extremely larger than the number of epitope residue for an antigen. Asymmetric units in each complex that do not have structural difference were also excluded from our consideration. Finally, a non-redundant data set containing 92 antibody-antigen complexes were collected for our model training and testing. Epitope residues on antigen surfaces were determined by using the Euclidian distance of 4Å for every antigen-antibody PDB complex, following the traditional method for determining epitope residues [7].

Construction of epitope prediction model

The training phase of our prediction method has the following steps: (i) antigen surface triangulation, (ii) weight calculation for edges, (iii) clustering on the nodes of the graphs, and (iv) supervised learning for distinguishing between epitope subgraphs and non-epitope subgraphs. The details of each step are presented below.

Triangulation of an antigen surface

A surface graph of an antigen structure is built in two steps: determine the surface atoms of the antigen, and then build an atom-level graph for these surface atoms and upgrade into a residue-level graph. To obtain surface atoms of an antigen with 3D coordinates, we compute each atom's ASA by using NACCESS [25] with the default probe size. Those atoms with ASA ≥ 10Å2 are defined as surface atoms. A graph of these surface atoms is constructed as per Delaunay triangulation rule which has been commonly used to construct protein surface graph [26]. To upgrade an atom-level graph into a residue-level graph, we ignore connections of the atoms within the same residue, e.g., ignore connection between Cα and Cβ of Alanine; and then merge multiple atom connections between two different residues into one edge, e.g. merge the connection between OD1 of Aspartate and CG1 of Isoleucine and the connection between OD2 of Aspartate and CG2 of Isoleucine into one edge. Atom connections that have Euclidian distances larger than 6Å are also removed. Then, residues are distinguished by their positions--i.e., two residues are considered different if they have different positions even when they are of the same amino acid type. Figure 3 shows a graph of an antigen after triangulation, in which nodes are surface residues and edges represent residues' spatial relations.

thumbnailFigure 3. Diagram of an antigen surface graph. Nodes are residues and edges represent residues' spatial relation. Blue nodes are epitope residues, and orange ones are non-epitope residues. Dash lines are boundary edges between two clusters, while solid lines are edges within a cluster.

Weight calculation for edges

The weight between two residue types x and y within an epitope subgraph or within a non-epitope subgraph in our graph database is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M2">View MathML</a>

(1)

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M3">View MathML</a> is the normalized value of W, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M4">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M5">View MathML</a> are the <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M1">View MathML</a> test and the log odds ratio of the frequencies of xy (edge between x and y) between epitope clusters and non-epitope clusters, respectively. <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M4">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M5">View MathML</a> are calculated by using

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M6">View MathML</a>

(1a)

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M7">View MathML</a>

(1b)

where c ∈ {epitope, non-epitope}, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M8">View MathML</a> is the number of edges with type xy and label c in our training data, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M9">View MathML</a> is the number of expected edges with type xy and label c, Pxy is the probability that a pair of residues xy in epitopes, and Qxy is the probability that a pair of residues xy in non-epitopes. Pxy is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M10">View MathML</a>

where, Nxy is the number of residue pairs xy in a cluster, i.e., the number of edges connecting two nodes with one node labeled as x and the other as y. Qxy is calculated by the same way of computing Pxy.

The weight calculation for boundary edges is very innovative. A boundary edge is an edge connecting an epitope residue and a non-epitope residue. We group all of the boundary edges (e.g. dashed black lines in Figure 3) in our graph database as a class, and take all epitope edges (e.g. solid blue lines in Figure 3) as the other class. Then, we apply Equation (1a) and (1b) to calculate the weights <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M11">View MathML</a> for the boundary edges by setting c ∈ {boundary_class, epitope_classg}. This process is also applied with regard to the boundary class and non-epitope class (e.g., edges with solid orange lines in Figure 3) to determine weights <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M12">View MathML</a> for the boundary edges. In other words, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M11">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M12">View MathML</a> are determined by using the exactly same equations as computing Wxy, with substitution of the relevant class label c. The weight of a boundary edge xy is finally set as <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M11">View MathML</a> or <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M12">View MathML</a> whichever is larger. Those boundary edges with heavy weights (larger than a threshold W0) are definitely boundary edges between epitope and non-epitope subgraphs. We remove them to sharpen the distinction in the later clustering step and supervised learning. Boundary edges might change to another set when different computational methods are used to define epitope residues, such as using accessible surface area larger than 1Å2 upon binding with an antibody [6,27] and distance threshold of 4Å [7,28], 5Å [29] or 6Å [30]. However, Ponomarenko et. al. have shown that epitope residues have no significant difference when various criteria are used to capture epitope residues [24].

As a few number of large weights can pull all weight values towards zero after normalization, we further contrast normalized weights Wxy to amplify important weights and suppress trifling weights by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M13">View MathML</a>

where θ and are γ optimized as 3 and 3 in this study.

Since there are only 20 different standard residue types, the total number of different weights between two residue types is <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M14">View MathML</a>.

Clustering on nodes in an antigen surface graph

Antigen surface graphs are constructed by Qhull with weights W on edges determined by the procedure above. We then use mcl [22] (an implementation of the MCL algorithm with inflation coefficient r of 1.8) to cluster the nodes and edges of every antigen graph into subgraphs. In the MCL algorithm, the graph of an antigen surface residues is represented by a square matrix M, where each row/column represents a surface residue and the value of each entry is the weight of these two residue types. In the expansion stage of MCL, M is expanded as the normal product of itself; during the inflation stage, the matrix M undertakes Hadamard power with coefficient r followed by normalization. This two steps keep on in iteration until an equilibrium state is reached, i.e., when expansion and inflation do not alter the matrix any more.

The subgraphs of an antigen surface clustered by MCL are not always clean and some subgraphs may contain a mixture of epitope residues and non-epitope residues. To clean up the training data, we consider a subgraph as an epitope subgraph if the number of epitope residues in this subgraph is larger than the number of non-epitope residues and, as a non-epitope subgraph if no or very few (say, at most two) epitope residues show up. Subgraphs with other situations are considered as noise data which are overlooked during model training. We adopt this strategy because of the small number of epitope residues. We note that this approach is tolerant to false positives, but is sensitive to false negative.

Supervised learning for distinguishing epitope subgraphs and non-epitope subgraphs

By using mcl, each antigen surface graph is clustered into a number of subgraphs. To distinguish between epitope subgraphs and non-epitope subgraphs, we design a feature vector to represent all of these subgraphs in our training data. Each subgraph is transformed into a feature vector with 1770 dimensions, which comprises <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M15">View MathML</a> dimensions of single residues, <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M16">View MathML</a> dimensions of residue pairs, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M17">View MathML</a> dimensions of residue triangles. A single-residue feature takes the weighted summation of <a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M1">View MathML</a> test and log odds ratio on the frequencies of the residue type between epitope clusters and non-epitope clusters, which is similar to the calculation of the weight of a pair of residue types shown in Equation (1). A residue-pair feature takes the weight of this edge in the subgraph as its value, and a triplet feature takes the average weight of the three edges in the subgraph as its value.

The number of nodes in a subgraph is very small (15 on average); but the dimension of each vector is very large (1770). Therefore, each vector is very sparse and, some features even have no differentiability between epitope subgraphs and non-epitope subgraphs. Hence, feature selection is conducted to maximize classification performance. The feature selection was done by using the LIBSVM [31] feature-selection module targeting at maximizing classification f-score. As a result, 144 from the 1770 features are chosen for classifying epitope subgraphs from non-epitopes subgraphs.

Due to the extreme imbalance between the epitope residue number and non-epitope residue number for an antigen surface graph (15 and 120 on average in our data set), the number of non-epitope subgraphs far exceeds the number of epitope subgraphs as generated by mcl. To address this imbalance problem, a two-stage supervised learning, multi-SVM classification and trust-reliable voting, is taken to accomplish the distinction between epitope and non-epitope subgraphs. The number of SVM classifiers is automatically determined by the proportion between non-epitope subgraphs and epitope subgraphs. Based on our data set in this work, the number of SVM classifiers is nine. For each classifier in the first stage, a parameter grid-search is carried out on a balanced training data set to maximize model performance, while in the second stage the final decision is voted and determined by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M18">View MathML</a>

(2)

where

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M19">View MathML</a>

and the symbol annotations are as follows:

y: epitope/non-epitope label for a sample predicted by the model;

wi: weight of classifier i computed by its performance;

f(xi): label for a sample x determined by classifier i in the first level;

<a onClick="popup('http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/13/S17/S20/mathml/M20">View MathML</a>: probability of classifier i that predicts sample x as non-epitope;

δi: determinant of classifier i. δi is 0 when the classifier i is dubious and other confident classifiers exist.

θ0 is a threshold to filter out non-epitope residues, and τ0 is used to control to what extent we trust the classifier.

Prediction of epitopes for an unknown antigen

Given an antigen with 3D coordinate information, the following steps are taken to identify one or multiple epitope for this antigen: (i) calculate each atom's ASA by using NACCESS, and filter out those atoms with ASA less than 10Å2; (ii) construct an atom-level graph by using Qhull and upgrade it to a residue-level graph; (iii) assign weights to all edges of this residue graph, where the weights are those determined during the training; (iv) cluster this undirected and weighted graph into exclusive subgraphs using mcl; and (v) transform every subgraph into a feature vector, and predict its label by the well-trained two-stage classification model. Epitope residues are the residues within those subgraphs which are predicted as epitope. Two epitope subgraphs can be merged together if they are connected in the original surface graph.

Results and discussions

Our graph-based method BeTop made remarkable improvement on B-cell epitope prediction in comparison to the state-of-the-art methods. First, BeTop shows significant improvement on overall prediction accuracy. Second, BeTop is capable of predicting epitopes located at both protrusive and planar surface areas. Third, BeTop is able to identify multiple epitopes if an antigen contains them. The detailed results of all these are presented below together with highlights of those features that distinguish epitope subgraphs from non-epitope subgraphs.

Significant improvement of prediction accuracy

Four performance metrics are adopted to evaluate model performance--viz., sensitivity (sen), specificity (spe), f-score, and accuracy (acc). They are defined as sen = TP/(TP + FN), spe = TN/(TN + FP), f-score = 2*pre*sen/(pre + sen), and acc = (TP + TN)/(TP + FP + TN + FN), where TP, TN, FP, and FN represent the number of predicted true positive, true negative, false positive and false negative samples, respectively. Due to the imbalance nature in the composition of non-epitope residues and epitope residues in an antigen, accuracy is not competent to measure model performance. Instead, f-score is more appropriate to evaluate BeTop's performance and to compare with other models.

Ten fold cross validation is applied to measure the overall performance of BeTop on the 92 non-redundant antigen-antibody PDB complexes. The f-score comparison between BeTop and the state-of-the-art epitope prediction methods DiscoTope [7], SEPPA [9] and ElliPro [8] are shown in Figure 4. We note that ElliPro can produce a short list of candidate epitopes. Its performance reported here is summarized based on its best result among all of the predicted candidates for each antigen. In the case that BeTop identifies multiple epitopes for an antigen, its performance is reported in the same way as ElliPro for a fair comparison. From Figure 4, it can be seen that BeTop outperforms all existing models significantly. The f-score t-test p-values between BeTop and the other models are shown in Table 1 to illustrate the significance level that BeTop is better.

thumbnailFigure 4. B-cell epitope prediction performance by BeTop, DiscoTope, SEPPA, and ElliPro. The optimal parameters α, θ0 and τ0 are set to 0.3, 0.3 and 0.05, respectively.

Table 1. F-score t-test p-values between BeTop, DiscoTope, SEPPA and ElliPro.

The averaged sensitivity, specificity, accuracy and AUC values for DiscoTope, SEPPA, ElliPro and BeTop are shown in Table 2. It is clear that BeTop is remarkably better than other models in terms of sensitivity, accuracy and AUC. The specificity of BeTop is slightly lower than that of ElliPro, but this value is much better than the other two models. More detailed results for each antigen in terms of sensitivity, specificity, f-score and accuracy can be found in the supplementary material Additional File 1: Table S1.

Table 2. The averaged performances comparison between BeTop, DiscoTope, SEPPA and ElliPro on sensitivity, specificity, accuracy and AUC.

Additional File 1. Additional Table S1 -- The performance of BeTop on 92 antibody-antigen PDB complexes.

Format: PDF Size: 61KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

One of the novel ideas used in this study is reducing the weight of boundary edges to distinguish epitope from non-epitope. Thus, we further compare the performances of BeTop with suppressing weights of boundary edges and without suppressing weights of boundary edges. Experimental results show that the averaged f-scores are 0.45 and 0.41 for the two situations, with increment of f-score by 8.9%. The t-test p-value of 0.11 between the two sets of f-scores also demonstrates the improvement of performance by decreasing weights of edges enriched in boundary class.

Locating epitopes with planar formations

Existing conformational epitope prediction methods such as [7,8,10] heavily rely on the spatial structure information and non-planarity properties of antigens. They usually have a good performance on epitopes that have a protrusive surface, otherwise the performance becomes poor. To understand the effect of non-planarity of epitopes on epitope prediction, we divide all of the epitopes in our database into groups based on a non-planarity index. The non-planarity of a residue cluster is measured by the root-mean-square deviation of all the surface atoms of this cluster of residues. It is expected that those structure-based prediction models favor epitopes with large non-planarity but not at epitopes.

Our experimental result is shown in Figure 5. It is clear that BeTop works very well with an average f-score 0.432 on at epitopes, namely on those epitopes having a non-planarity less than 2Å. However, DiscoTope, SEPPA and ElliPro all had difficulties to detect such epitopes with f-scores of only 0.214, 0.207, and 0.337 respectively.

thumbnailFigure 5. Performance comparison at different levels of epitope non-planarity.

Taking PDB entry 1AR1 as example again (Figure 1), its epitope consists of 19 residues, and the non-planarity of this epitope is as small as 1.08Å, indicating a very flat surface area. The f-score achieved by BeTop is 0.88 (with 16 true positives and 1 false positive). However, ElliPro, SEPPA, DiscoTope made an f-score of 0.273 (with 7 true positives and 22 false positives), 0.000, and 0.000, respectively. As another example, the prediction performance by BeTop, ElliPro, SEPPA and DiscoTope on the epitope residues of PDB entry 1N8Z are 0.667, 0.194, 0.198 and 0.07, respectively. This epitope also has a very planar surface with non-planarity of 1.88Å.

For epitopes having a large non-planarity bigger than or equal to 3Å, BeTop also performs better than the other models. The f-score is improved by 65.6%, 55.7% and 11.8% over DiscoTope, SEPPA and ElliPro, respectively. In particular, in comparison to ElliPro, which detects twisted epitopes based on residues' protrusion index, BeTop still achieved a better performance.

In summary, the f-score of the 3 existing methods becomes poor when the non-planarity of epitopes becomes flat. However, BeTop performs equally well under both protrusive and planar conditions, demonstrating that our proposed BeTop graph model is invariant to the change of epitope non-planarity.

Identifying multiple epitopes from an antigen

Although BeTop is trained on single-epitope antigen-antibody complexes, the framework has no limitation on the number of predicted epitopes. To evaluate BeTop's capability in identifying multiple epitopes in an antigen, we tested it on a data set of epitopes that are comprehensively explored in [20].

The multiple epitopes of these antigens are determined by the following steps: (i) determine epitope residues for each complex by using the 4Å Euclidian distance criteria between the antigen and antibody; (ii) calculate a pair-wise epitope similarity between two complexes X and Y of the same antigen by using SXY = |X Y |/min(|X|, |Y|); (iii) cluster epitopes based on their similarities for each antigen; (iv) select representative epitopes for each cluster with the best resolution, and map all representative epitopes to one of them with the finest resolution. Finally 9 antigens with a total of 20 epitopes are obtained.

BeTop can identify 8 out of the 9 antigens with multiple epitopes; and for all of the 20 epitopes, BeTop can detect 19 of them. However, the conventional approaches would take the union of all the epitope residues on an antigen as a single epitope. The average performance of sensitivity, specificity, f-score and accuracy of applying BeTop to multiple epitope prediction are 0.393, 0.907, 0.321, and 0.858, respectively. As an example, BeTop achieves an averaged f-score as high as of 0.611 in identifying the two epitopes on the prion protein (Figure 6). Detailed performance is available at Additional File 2: Table S2.

thumbnailFigure 6. Two epitopes with 3 overlapping residues on the prion protein. The 17 epitope residues within PDB entry 1TQB are colored in orange, while the 15 epitope residues in PDB entry 2W9E are rendered in blue. The residues with green color are the three overlapping epitope residues.

Additional File 2. Additional Table S2 -- BeTop performance on multiple epitopes prediction.

Format: PDF Size: 44KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

As expected, BeTop can identify as many epitopes as possible when they exist on an antigen. For instance, there are four epitopes on the antigen hen egg white lysozyme. BeTop can detect all of the four epitopes with an average f-score and accuracy of 0.376 and 0.849. These experimental results show that multiple epitopes predicted by BeTop are not false positives, and it does not mix up multiple epitopes either.

Graphical triplet patterns for epitopes

We are interested in outstanding features that distinguish epitopes from non-epitopes. By transforming epitope and non-epitope subgraphs into feature vectors and selecting distinct features by LIBSVM, we obtained 144 from the 1770 features. See the full details in Additional File 3: Table S3. Features that favor epitopes are shown in Figure 7. Interestingly, residue triangles of the pattern XXY (no order constraint), where X is a polar residue and Y is a hydrophobic or polar residue, are more likely to be epitope residues, but the pattern XX itself has no such differentiability. This type X of residues include Glutamine (Q), Aspartate (D), Tyrosine (Y) and Leucine (L). For example, residue pair Glutamine-Glutamine (QQ) interacting with residue Arginine (R), Tyrosine (Y), Asparagine (N), Lysine (K), Serine (S), Glycine (G), or Proline (P) are rich in epitopes. But Glutamine-Glutamine itself cannot be used to distinguish epitopes from non-epitopes. Furthermore, these meaningful features indicate some general patterns including polar and hydrophilic homogeneous residue pair surrounded by hydrophobic or polar residues as shown in Figure 8(a), polar and hydrophobic homogeneous residue pair encircled by polar residues as shown in Figure 8(b), and hydrophobic homogenous residue pairs accompanied by hydrophobic residues as shown in Figure 8(c). In contrast, such phenomena are not observed in the features enriched in the non-epitope clusters; see Additional File 4: Figure S1.

Additional File 3. Additional Table S3 -- 144 features selected to separate epitope clusters from non-epitope clusters.

Format: PDF Size: 39KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

thumbnailFigure 7. Features that favor the epitope class. Here a feature is represented by two nodes connecting with a solid line. Two nodes with patterns XY and YZ form a complete triangle XYZ, while patterns X and XY form only a contacting residue pair XY. The blue-filled nodes are single residues in favor of the epitopes while the gray-filled nodes are not.

thumbnailFigure 8. Examples of triplet feature patterns rich in the epitopes. The single letters in each colored rectangle is a residue and the double-letter circles stand for contacting residue pairs. Different colors represent distinctive properties of residues. The edges connecting a residue pair and a single residue form interacting residue triangles. For instance, (a) shows a set of residue triangles including QQR, QQN, QQS, QQY, QQK, QQG and QQP that all favor the epitope class.

Additional File 4. Additional Figure S1 -- Negative features of non-epitope clusters that distinct from epitope clusters.

Format: PDF Size: 525KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

To test the statistical significance of these features, we calculated their G-test values [32]. The top ten features that are in favor of the epitopes and the top ten features that are enriched in the non-epitopes in terms of G-test are shown in Table 3. Intriguingly, the top ten features for the epitope class almost all have the form XXY (no order constraint); this observation consolidates the feature patterns we have identified. However, no similar patterns, such as XXY, can be found in non-epitope preferred features; see Table 3 and Additional File 3: Table S3.

Table 3. Top ten features that are in favor of the epitope class and also those that are enriched in the non-epitopes in terms of G-test.

Conclusion

Epitope prediction is an important way to understanding the immune basis of antibody-antigen interactions and is beneficial to prophylactic and therapeutic solutions. In this study, we proposed a novel graph-based model ("BeTop") to predict B-cell epitope by incorporating statistical ideas, graph clustering algorithms and supervised learning approaches. Our experimental results conducted on two data sets of non-redundant antigen-antibody complexes show that BeTop makes great improvements for identifying those planar epitopes and for distinguishing multiple epitopes in an antigen. We have also presented interesting features and triplet feature patterns for the epitopes which will be useful for us to form new hypothesis in the future studies.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

LZ designed the study and drafted the manuscript; LL helped to polish some of the biological ideas; LW, SH and JL supervised the design of the study and revised the manuscript; All authors read and approved the final manuscript.

Acknowledgements

We thank Mr. Zhenhua Li for helping us developing the web site. This work was supported by Nanyang Technological University [RG66/07].

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.

References

  1. Abbas AK, Lichtman AH, Pillai S: Cellular and Molecular Immunology. 6th edition. W.B. Saunders Company; 2009. OpenURL

  2. Atassi M: Antigenic structure of myoglobin: The complete immunochemical anatomy of a protein and conclusions relating to antigenic structures of proteins.

    Immunochemistry 1975, 12(5):423-438. PubMed Abstract | Publisher Full Text OpenURL

  3. Benjamin DC, Berzofsky JA, East IJ, Gurd FRN, Hannum C, Leach SJ, Margoliash E, Michaels JG, Miller A, Prager EM, Reichlin M, Sercarz EE, Smith-Gill SJ, Todd PE, Wilson A: The antigenic structure of proteins - a reappraisal.

    Annu Rev Immunol 1984, 2:67-101. PubMed Abstract | Publisher Full Text OpenURL

  4. Pellequer JL, Westhof E, Van Regenmortel MHV: Predicting location of continuous epitopes in proteins from their primary structures. In Molecular Design and Modeling: Concepts and Applications Part B: Antibodies and Antigens, Nucleic Acids, Polysaccharides, and Drugs, Volume 203 of Methods in Enzymology. Edited by Langone JJ. Academic Press; 1991:176-201. PubMed Abstract | Publisher Full Text OpenURL

  5. Irving MB, Pan O, Scott JK: Random-peptide libraries and antigen-fragment libraries for epitope mapping and the development of vaccines and diagnostics.

    Curr Opin Chem Biol 2001, 5(3):314-324. PubMed Abstract | Publisher Full Text OpenURL

  6. Kulkarni-Kale U, Bhosle S, Kolaskar AS: CEP: a conformational epitope prediction server.

    Nucleic Acids Res 2005, 33:168-171. OpenURL

  7. Andersen PH, Morten N, Ole L: Prediction of residues in discontinuous B-cell epitopes using protein 3D structures.

    Protein Sci 2006, 15(11):2558-2567. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Ponomarenko J, Bui HHH, Li W, Fusseder N, Bourne PE, Sette A, Peters B: ElliPro: a new structure-based tool for the prediction of antibody epitopes.

    BMC Bioinf 2008, 9:514+. BioMed Central Full Text OpenURL

  9. Sun J, Wu D, Xu T, Wang X, Xu X, Tao L, Li YX, Cao ZW: SEPPA: a computational server for spatial epitope prediction of protein antigens.

    Nucleic Acids Res 2009, 37(suppl 2):W612-W616. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Rubinstein N, Mayrose I, Martz E, Pupko T: Epitopia: a web-server for predicting B-cell epitopes.

    BMC Bioinformatics 2009, 10:287+. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  11. Hopp TP, Woods KR: Prediction of protein antigenic determinants from amino acid sequences.

    Proc Natl Acad Sci USA 1981, 78(6):3824-3828. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. Karplus P, Schulz G: Prediction of chain flexibility in proteins: a tool for the selection of peptide antigen.

    Naturwissenschaften 1985, 72(4):212-213. Publisher Full Text OpenURL

  13. Larsen JE, Lund O, Nielsen M: Improved method for predicting linear B-cell epitopes.

    Immunome Res 2006., 2(2) PubMed Abstract | PubMed Central Full Text OpenURL

  14. Söllner J, Mayer B: Machine learning approaches for prediction of linear B-cell epitopes on proteins.

    J Mol Recognit 2006, 19:200-208. PubMed Abstract | Publisher Full Text OpenURL

  15. Saha S, Raghava GPS: Prediction of continuous B-cell epitopes in an antigen using recurrent neural network.

    Proteins: Struct, Funct, Bioinf 2006, 65:40-48. Publisher Full Text OpenURL

  16. El-Manzalawy Y, Dobbs D, Honavar V: Predicting linear B-cell epitopes using string kernels.

    J Mol Recognit 2008, 21(4):243-55. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Rubinstein ND, Mayrose I, Pupko T: A machine-learning approach for predicting B-cell epitopes.

    Mol Immunol 2008, 46(5):840-847. PubMed Abstract | Publisher Full Text OpenURL

  18. Reimer U: Prediction of linear B-cell epitopes.

    Methods Mol Biol 2009, 524:335-44. PubMed Abstract | Publisher Full Text OpenURL

  19. Sweredoski MJ, Baldi P: COBEpro: a novel system for predicting continuous B-cell epitopes.

    Protein Eng Des Sel 2009, 22(3):113-120. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  20. Zhao L, Wong L, Li J: Antibody-Specified B-Cell Epitope Prediction in Line with the Principle of Context-Awareness.

    IEEE/ACM Trans Comput Biol Bioinf 2011, 8(6):1483-1494. PubMed Abstract | Publisher Full Text OpenURL

  21. Barber CB, Dobkin DP, Huhdanpaa H: The Quickhull algorithm for convex hulls.

    ACM T. Math. Software 1996, 22(4):469-483. Publisher Full Text OpenURL

  22. van Dongen S: Graph Clustering by Flow Simulation.

    PhD thesis, University of Utrecht 2000. OpenURL

  23. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank.

    Nucleic Acids Res 2000, 28:235-242. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Ponomarenko JV, Bourne PE: Antibody-protein interactions: benchmark datasets and prediction tools evaluation.

    BMC Struct Biol 2007, 7:64. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  25. Hubbard SJ, Thornton JM: Naccess V2.1.1 - Solvent accessible area calculations. [http://www.bioinf.manchester.ac.uk/naccess/] webcite

    1992.

  26. Huan J, Wang W, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A: Mining Protein Family Specific Residue Packing Patterns from Protein Structure.

    Eighth Annual International Conference on Research in Computational Molecular Biology (RECOMB) 2004, 308-315. OpenURL

  27. Rapberger R, Lukas A, Mayer B: Identification of discontinuous antigenic determinants on proteins based on shape complementarities.

    J Mol Recognit 2007, 20(2):113-121. PubMed Abstract | Publisher Full Text OpenURL

  28. Sweredoski MJ, Baldi P: PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure.

    Bioinformatics 2008, 24(12):1459-1460. PubMed Abstract | Publisher Full Text OpenURL

  29. Chen H, Zhou HX: Prediction of interface residues in protein-protein complexes by a consensus neural network method: Test against NMR data.

    Proteins: Struct, Funct, Bioinf 2005, 61:21-35. PubMed Abstract | Publisher Full Text OpenURL

  30. Schlessinger A, Ofran Y, Yachdav G, Rost B: Epitome: database of structure-inferred antigenic epitopes.

    Nucleic Acids Res 2006, 34:777-780. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  31. Chang CC, Lin CJ: LIBSVM: A library for support vector machines.

    ACM Transactions on Intelligent Systems and Technology 2011, 2:27:1-27:27. OpenURL

  32. Sokal RR, Rohlf FJ: Biometry: The Principles and Practices of Statistics in Biological Research. third edition. W. H. Freeman; 1994. OpenURL

  33. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE: UCSF Chimera-a visualization system for exploratory research and analysis.

    J Comput Chem 2004, 25(13):1605-12. PubMed Abstract | Publisher Full Text OpenURL