Abstract
Background
Determining the disulfide (S-S) bond pattern in a protein is often crucial for understanding its structure and function. In recent research, mass spectrometry (MS) based analysis has been applied to this problem following protein digestion under both partial reduction and non-reduction conditions. However, this paradigm still awaits solutions to certain algorithmic problems, fundamental among which is the efficient matching of an exponentially growing set of putative S-S bonded structural alternatives to the large amounts of experimental spectrometric data. Current methods circumvent this challenge primarily through simplifications, such as assuming the occurrence of only certain ion types (b-ions and y-ions) that predominate in the more popular dissociation methods, such as collision-induced dissociation (CID). Unfortunately, this can adversely impact the quality of results.
Method
We present an algorithmic approach to this problem that can, with high computational efficiency, analyze multiple ion types (a, b, b^{o}, b^{*}, c, x, y, y^{o}, y^{*}, and z) and deal with complex bonding topologies, such as inter/intra bonding involving more than two peptides. The proposed approach combines an approximation-algorithm-based search formulation with data-driven parameter estimation. This formulation considers only those regions of the search space where the correct solution resides with a high likelihood. Putative disulfide bonds thus obtained are finally combined in a globally consistent pattern to yield the overall disulfide bonding topology of the molecule. Additionally, each bond is associated with a confidence score, which aids in interpretation and assimilation of the results.
Results
The method was tested on nine different eukaryotic glycosyltransferases possessing disulfide bonding topologies of varying complexity. Its performance was found to be characterized by high efficiency (in terms of time and the fraction of search space considered), sensitivity, specificity, and accuracy. The method was also compared with other state-of-the-art techniques and was found to perform as well as or better than them. An implementation is available at: http://tintin.sfsu.edu/~whemurad/disulfidebond.
Conclusions
This research addresses some of the significant challenges in MS-based disulfide bond determination. To the best of our knowledge, this is the first algorithmic work that can consider multiple ion types in this problem setting while simultaneously ensuring polynomial-time complexity and high accuracy of results.
Background
Disulfide (S-S) bonds are known to play an important role in protein structure and function. Their roles include influencing protein folding and stabilization, forming characteristic structural motifs such as the cysteine knot, mediating thiol-disulfide interchange reactions, and regulating enzymatic activity. Early computational approaches for S-S bond determination focused on two learning-driven formulations based on the protein primary structure [1]: residue classification (distinguishing bonded and free cysteines) and connectivity prediction (determining the S-S connectivity pattern). In recent times, the increasing availability and accuracy of mass spectrometry (MS) [2] has opened up an alternative approach; its essence lies in matching the theoretical spectra of ionized peptide fragments with experimentally obtained spectra to identify the presence of specific S-S bonds. A diagrammatic representation of the key steps of an MS-based approach is presented in Figure 1, along with the different types of fragment ions that can be generated as an outcome of this process.
Figure 1. Diagrammatic representation of the MS-based approach. (A) Once a protein is digested, the theoretically possible disulfide-bonded peptides are compared with experimentally obtained precursor ions. In order to confirm each correspondence, the possible disulfide-bonded fragment ions are next compared with experimentally generated MS/MS spectra. (B) Most of the different fragment ions (and their nomenclature) that can be observed. Ion types not represented here include b and y ions that have lost either a water molecule (b^{o}, y^{o}) or an ammonia molecule (b^{*}, y^{*}).
MS-based methods generally outperform methods using sequence-based learning formulations, as shown by Lee and Singh [3]. However, a number of algorithmic challenges remain outstanding in realizing the potential of MS-based approaches. Salient among these are: (1) Accounting for multiple ion types in the data [4,5]: To avoid an exponential increase in the search space, a common simplification is to limit the analysis to the spectra of b-ions and y-ions only [3,6,7]. However, this simplification may erroneously ignore the occurrence of other ions, such as a, b^{o}, b^{*}, c, x, y^{o}, y^{*}, and z. While the occurrence of non-b/y ions is minimized (though not eliminated) in collision-induced dissociation (CID), some of these ions can be present with greater likelihood in dissociation methods such as electron capture dissociation (ECD), electron transfer dissociation (ETD), and electron-detachment dissociation (EDD). In fact, these ion types should be considered even in CID, as illustrated by the example in Figure 2. (2) Design of efficient search and matching algorithms: The search space of possible disulfide topologies increases rapidly not only with the number of ion types being analyzed but also with the number of cysteines as well as the types of connectivity patterns. Thus, it is imperative to have algorithms that can accommodate the richness of the entire problem domain. (3) Automated data-driven determination of parameters: Many advanced algorithms in this area are intrinsically parametric. Often, determining the optimal value of these parameters automatically is in itself a complex problem, which places the practitioner at a significant disadvantage. Support for automated and data-driven strategies for estimating crucial parameters is therefore essential to the real-world success of a method in this problem domain.
Figure 2. Multiple-ion spectra analysis. This figure illustrates the presence of multiple ion types (in green) after CID. In the first spectrum, note the presence of b^{o} and y^{o} ions with high intensity in the fragmentation of the precursor ion with sequence: FFLQGIQLNTILPDAR, for the protein Lysozyme [SwissProt: P11279]. In the second spectrum, a, b^{o}, b^{*}, and y^{o} ions (all with high intensity) can be observed after the fragmentation of a precursor ion existing in the protein Platelet glycoprotein 4 [SwissProt P16671].
The contributions of this paper in the context of the aforementioned challenges include: (1) Development of a highly efficient strategy for multi-ion disulfide bond analysis by considering a, b, b^{o}, b^{*}, c, x, y, y^{o}, y^{*}, and z ion types. To the best of our knowledge, this is the first algorithmic work that has considered all these ion types in S-S bond determination. (2) A fully polynomial-time algorithm that selectively generates only those regions of the search space where the correct solutions reside with a high likelihood. (3) A multiple-regression-based, data-driven method to calculate the critical parameters modulating the search, so as to ensure that the correct bonding topologies are not missed due to the truncation of the search space. At the same time, the parameter selection ensures that the search is focused on the most promising regions of the search space. (4) A local-to-global strategy that builds a globally consistent bonding pattern based on MS data at the level of individual bonds.
The proposed approach also implements the probability-based scoring model proposed in [8] for each specific disulfide bond based on the number of MS/MS matches and their respective abundance. These scores reflect the significance of the specific disulfide bond and can form the basis of analysis, such as that conducted in [9], to estimate the accuracy of peptide assignment to tandem mass spectra.
At a high level, the proposed approach can be thought of as a two-stage database-based matching technique (see Figure 3). From this perspective, it shares similarities with [10], where cross-linked peptides were also identified using a two-level method. During the first stage of such two-stage methods, the mass values of the theoretically possible disulfide-bonded peptide structures are compared with precursor ion mass values derived from the MS spectra. In the second (confirmatory) stage, the theoretical spectra from the disulfide-bonded peptide structures are compared with MS/MS experimental spectra. The confirmatory step is necessary since a disulfide-bonded peptide may not actually correspond to a precursor ion, even if their mass values are similar. Our approach can be used to conduct this entire search process in low-degree polynomial time. This paper significantly extends our prior research, where we had proposed efficient indexing strategies to speed up the search [11,12], as well as our more recent work [13], where a polynomial-time approximation algorithm using handcrafted parameters was proposed for the first-stage matching.
Figure 3. Two-stage matching spectra for protein ST8SiaIV. (A) In the first stage (DMS vs. PMS), the theoretical disulfide-bonded structure is matched with the doubly charged precursor ion with highest intensity, whose m/z = 1082.9. (B) For this initial match, the disulfide-bonded peptide pair is fragmented and the fragments are matched with the MS/MS spectrum for the precursor ion (FMS vs. TMS), generating a list of validation matches.
Methods
We start the description of our method by providing, in Table 1, the key abbreviations used in the ensuing description and their respective definitions. In the first stage of the method, an Initial Match (IM) is said to be obtained when the difference between the detected mass of a targeted ion from the PMS and the calculated mass of a possible disulfide-bonded peptide structure from the DMS is found to be less than a threshold T_{IM}. The second stage validates (or rejects) the initial matches. For each Initial Match, the validation occurs by searching for matches between product ions from the TMS and the theoretical spectra FMS. A Validation Match (VM) is said to occur when the difference between a precursor ion fragment mass from TMS and a disulfide-bonded fragment structure mass from FMS falls below a validation match threshold T_{VM}.
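The Initial Match test can be illustrated with a small sketch. It assumes that PMS masses are obtained by converting an observed m/z and charge state to a neutral mass in the usual way (subtracting one proton mass per charge); the function names `neutral_mass` and `is_initial_match` are hypothetical and are not taken from the authors' implementation. The example m/z of 1082.9 for a doubly charged precursor is the one quoted in the Figure 3 caption, and the default threshold of 1.0 Da is the T_{IM} value reported later in the text.

```python
PROTON = 1.00728  # mass of a proton in Da


def neutral_mass(mz, charge):
    """Convert an observed m/z at a given positive charge state to a neutral mass."""
    return charge * (mz - PROTON)


def is_initial_match(dms_mass, pms_mass, t_im=1.0):
    """Initial Match (IM): theoretical and observed masses differ by less than T_IM."""
    return abs(dms_mass - pms_mass) < t_im


# Doubly charged precursor at m/z 1082.9 (Figure 3)
observed = neutral_mass(1082.9, 2)
```

A Validation Match would follow the same pattern, with fragment masses from the FMS and TMS and the threshold T_{VM} in place of T_{IM}.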
Table 1. Abbreviations and their definitions
Unfortunately, the sizes of both FMS and DMS grow exponentially. For a disulfide-bonded peptide structure consisting of k peptides, considering that there are f different fragment ion types possible, up to f^{k} types of fragment arrangements may occur in the FMS. If the i-th fragment ion consists of p_{i} amino acid residues, then the complexity to compute the entire FMS for a disulfide-bonded peptide structure is using a brute-force approach. The DMS also grows exponentially. To understand this, let P = {p_{1}, p_{2}, …, p_{k}} be the list of cysteine-containing peptides in a polypeptide chain. Further, let C = {c_{1}, c_{2}, …, c_{k}} be the list of the number of cysteines per cysteine-containing peptide p_{i}. If is the total number of cysteines in a protein, the number of possible disulfide connectivity patterns (DMS size) is [1,14]: .
The subset-sum formulation: towards polynomial-time matching
Given the growth characteristics of the DMS and the FMS, an exhaustive search-and-match strategy is clearly infeasible in the general case. This is especially true if multiple ion types are considered. Indexing [11,12] and filtering [15] are two possible approaches that have been considered for ameliorating this problem. In this paper we explore an alternative strategy that is based on the key insight that the entire search space (DMS or FMS) does not need to be generated to determine the matches. That is, we only want to generate the few disulfide-bonded peptides whose mass is close to the (given) experimental spectra, rather than generating all possible peptide combinations and subsequently testing and discarding most of them. This insight allows us to recast the DMS and FMS generation as instances of the subset-sum problem [16]. Recall that, given the pair (S, t), where S is a set of positive integers and t ∈ Z^{+}, the subset-sum problem asks whether there exists a subset of S that adds up to t. While the subset-sum problem is itself NP-complete, it can be solved using approximation strategies to obtain near-optimal solutions in polynomial time [16].
Polynomial time DMS mass list construction
Our strategy lies in obtaining an approximate solution to the subsetsum problem by trimming as many elements from DMS as possible based on a parameter ε. To trim the DMS set by ε means to remove as many elements from DMS as possible such that if DMS^{*} is the resultant trimmed set, then for every element DMS_{i} removed from DMS, there will remain an element DMS_{i}^{*} in DMS^{*} which is “sufficiently” close in terms of its mass to the deleted element DMS_{i}. Specifically,
The approximation algorithm for creating the partial DMS is described by the APPROX-DMS and TRIM routines (Figure 4). APPROX-DMS takes the following parameters: (1) a sorted list of the mass values of the cysteine-containing peptides (CCP), (2) a target mass value from the PMS list (PMS_{val}), (3) the trimming parameter ε, and (4) the Initial Match threshold (T_{IM}). In lines 2-8 of Figure 4, all the variables and data structures are initialized. In lines 9-11, the theoretical disulfide-bonded peptide structures are formed and stored in a temporary set called TempSet. Line 10 excludes values greater than PMS_{val} plus a constant corresponding to the Initial Match threshold. The rationale behind this threshold is explained in the following section. Line 12 extends the DMS by invoking the routine MERGE, which returns a sorted set formed by merging the two sorted input sets DMS and TempSet, with duplicate values removed. In line 13, the TRIM routine is called to shorten the DMS set. Lines 14-15 examine whether the largest mass value in the constructed DMS set is sufficiently close to the targeted mass PMS_{val}. If so, an Initial Match occurs.
Figure 4. Pseudocode for the APPROX-DMS and TRIM routines
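Since the pseudocode of Figure 4 is not reproduced in the text, the following is only a minimal sketch of the routines as they are described above, modeled on the textbook approximate subset-sum scheme of [16]. Several points are assumptions: the trimming factor is taken as ε/(2n) as in that textbook scheme, the masses are treated as plain additive values (the mass lost on forming each S-S bond is ignored), and the names `trim`, `merge`, and `approx_dms` are illustrative rather than the authors' code.

```python
def trim(sorted_masses, delta):
    """Drop every value that lies within a factor (1 + delta) of the last kept value."""
    if not sorted_masses:
        return []
    kept = [sorted_masses[0]]
    for mass in sorted_masses[1:]:
        if mass > kept[-1] * (1 + delta):
            kept.append(mass)
    return kept


def merge(left, right):
    """Sorted union of two mass lists with duplicates removed."""
    return sorted(set(left) | set(right))


def approx_dms(ccp_masses, pms_val, eps, t_im):
    """Return (largest retained mass, whether it is an Initial Match for pms_val)."""
    dms = [0.0]
    n = len(ccp_masses)
    for peptide_mass in sorted(ccp_masses):
        # Form new candidate structures, discarding those above the target window
        temp_set = [m + peptide_mass for m in dms if m + peptide_mass <= pms_val + t_im]
        dms = trim(merge(dms, temp_set), eps / (2 * n))
    best = dms[-1]
    return best, abs(best - pms_val) <= t_im


# Toy numbers (not experimental data): the textbook subset-sum instance
best, matched = approx_dms([104, 102, 201, 101], pms_val=308, eps=0.40, t_im=1.0)
```

On this toy instance the trimmed search retains 302 as its largest value, which is within the guaranteed factor of the optimum (307) but outside a 1.0 Da window around 308, so no Initial Match is declared. This illustrates why the choice of ε matters: too aggressive a trim can discard the structure that would have matched.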
Table 2 presents an example showing the effectiveness of APPROX-DMS. In this specific case, 37.5% of the entire search space (all feasible combinations of cysteine-containing peptides) was successfully trimmed, while ensuring that the correct IM was not missed. Another example illustrating the action of APPROX-DMS on the Beta-LG protein is available as supplemental information (see Additional File 1).
Table 2. Running APPROX-DMS on the ST8SiaIV C^{142}-C^{292} bond
Additional File 1. Action of APPROX-DMS on the protein Beta-LG. This example shows the effectiveness of the APPROX-DMS algorithm while trimming a DMS set generated for the protein Beta-LG using MS/MS data.
The complexities of the MERGE and TRIM routines are O(|DMS| + |TempSet|) and O(|DMS|), respectively. Further, for any fixed ε > 0, our algorithm is a (1 + ε)-approximation scheme that runs in polynomial time. The proof of the polynomial-time complexity of APPROX-DMS can be obtained by direct analogy to the proof of the polynomial-time complexity of the subset-sum approximation algorithm from [16] and is outlined in Appendix A.
Parameters estimation
APPROX-DMS depends on two important parameters, namely, the match threshold T_{IM} and the trimming parameter ε. The match threshold is responsible for defining a “matching window”. This is necessary due to practical considerations such as the sensitivity of the instrument (e.g., 0.01 Da, 0.1 Da, or 1.0 Da) and experimental noise, due to which an exact match is a rarity. We conducted an empirical study by using different values of T_{IM} for all our datasets. Based on the results, the T_{IM} value of ±1.0 Da was found to minimize missing matches as well as the occurrence of false positives. Considering the smallest precursor ion mass involved in these studies, the above value of T_{IM} guaranteed a matching accuracy of 99.86%.
The second parameter, ε, is much more important, as it is crucial to both the running time of the algorithm and its accuracy, as is evident from Eq. (1). To determine ε, we note that it is inversely proportional to the algorithm’s running time. However, a large value of ε would cause meaningful fragments to be left out of the DMS. At the same time, a small value of ε will lead to few data points being trimmed. Thus “guessing” appropriate values of ε can be complicated, and suboptimal choices can significantly impact the quality of the results. We address the problem of data-driven estimation of ε using a regression framework where ε is treated as a dependent variable and, based on the data, a functional relationship is obtained between it and the other (independent) variables. We model this functional relationship using the following independent variables: (1) The cysteine-containing peptide (CCP) mass range, defined by CCP_{max} and CCP_{min}, corresponding to the peptides with the highest and lowest mass respectively. (2) The number of cysteine-containing peptides k. A large k implies that the average difference in the mass of any two peptide fragments is small. Conversely, a small k implies fewer fragments with putatively larger differences in their masses. (3) The average mass value of the cysteine-containing peptides, CCP_{average}. The relationship between ε and these other variables is then obtained using multiple-variable regression. In our studies, the data for the regression was obtained using bootstrapping, where groups of four proteins were randomly picked from the set of nine proteins available to us. The functional relationship defining ε was obtained to be:
Polynomial time FMS construction
In creating the FMS, a strategy similar to the one used for generating the DMS can be used. This involves using an approximation algorithm, this time to generate the theoretical spectra for all the IMs found during the first-stage matching. We define another trimming parameter δ to trim the FMS mass list. It can be expected that the functional form of δ depends on the mass range of the fragments, as well as their granularity (the extent to which fragments are broken down into smaller ions). In a manner similar to the case for estimating ε, we used regression to obtain the specific functional form for the dependent variable δ in terms of the variables AA_{max} (the largest amino acid residue mass), AA_{min} (the smallest amino acid residue mass), AA_{average} (the average amino acid residue mass), and p (the average number of amino acid residues per fragment). Bootstrapping was once again utilized, resulting in the relationship shown in Eq. (3).
The pseudocode of the APPROX-FMS procedure used for generating the FMS is shown in Figure 5. The function GENFRAGS(.), in line 7, generates multiple fragment ions (a, b, b^{o}, b^{*}, c, x, y, y^{o}, y^{*}, and z) for peptide sequences in Pep_{sequences}, which contains the disulfide-bonded peptides involved in the IM being analyzed. Next, for each element in the FMS and for each fragment in the FragSet (lines 8-11), new disulfide-bonded peptide fragment structures are formed. Line 10 rejects values greater than TMS_{val}, considering the Validation Match threshold. In line 12, the current FMS set is combined with the disulfide-bonded peptide fragment set TempSet using MERGE. In line 13, the FMS is trimmed using the TRIM routine. Lastly, a Validation Match VM is declared (lines 14-15) when a correspondence is found between the largest mass value in the FMS and an experimentally determined mass value TMS_{val}, given a Validation Match threshold.
Figure 5. Pseudocode for the APPROX-FMS routine
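The internals of GENFRAGS are not given in the text, so the following is only an illustrative sketch of how singly charged masses for the ten ion types could be derived from a peptide sequence. It assumes the standard monoisotopic mass relations between ion series (for example, an a ion is a b ion minus CO, and a c ion is a b ion plus NH3), includes only a handful of residues, and does not model the combination of fragments from different peptides across a disulfide bond. The name `gen_frags` and the label format are hypothetical.

```python
# Monoisotopic residue masses (Da) for a few amino acids; extend as needed.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "C": 103.00919,
           "L": 113.08406, "K": 128.09496, "R": 156.10111}
PROTON, WATER, AMMONIA, CO, H2 = 1.00728, 18.01056, 17.02655, 27.99491, 2.01565


def gen_frags(peptide):
    """Return {ion label: singly charged m/z} for a, b, bo, b*, c, x, y, yo, y*, z ions."""
    frags = {}
    n = len(peptide)
    for i in range(1, n):
        # N-terminal (b-type) and C-terminal (y-type) fragments of length i
        b = sum(RESIDUE[aa] for aa in peptide[:i]) + PROTON
        y = sum(RESIDUE[aa] for aa in peptide[n - i:]) + WATER + PROTON
        frags.update({
            f"a{i}": b - CO, f"b{i}": b, f"bo{i}": b - WATER, f"b*{i}": b - AMMONIA,
            f"c{i}": b + AMMONIA,
            f"x{i}": y + CO - H2, f"y{i}": y, f"yo{i}": y - WATER, f"y*{i}": y - AMMONIA,
            # z is taken as y minus NH3, so here it is isobaric with y*
            f"z{i}": y - AMMONIA,
        })
    return frags


frags = gen_frags("GA")
```

Each cleavage site contributes ten candidate masses per peptide, which is what drives the f^{k} growth of the FMS when k peptides are linked and motivates trimming the FMS in the same way as the DMS.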
Determining the globally consistent bond topology
Once all the Initial Matches and Validation Matches are calculated, we have a “local” (putative bond-level) view of the possible disulfide connectivity. This local information needs to be integrated to obtain a globally consistent view. Our approach to this problem is motivated by Fariselli and Casadio [14]. Specifically, we model the location of the putative disulfide bonds by edges in an undirected graph G(V, E), where the set of vertices V corresponds to the set of cysteines. To each edge, we assign a match score. This score represents the combined importance of each single peak match within two spectra. Each specific peak match is weighted according to its intensity. The match score is given by:
In Eq. (4), the numerator corresponds to the sum of each validation match for a disulfide bond multiplied by the matched MS/MS fragment normalized intensity value (I_{N}). Here, VM_{i} is a binary value which is set to 1 if a confirmatory match was found for fragment i. The denominator similarly contains the sum of each experimental MS/MS fragment ion from TMS multiplied by I_{N}. Here, TMS_{i} is a binary variable which indicates the presence of a fragment i in the MS/MS spectrum.
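The verbal description of Eq. (4) can be sketched as an intensity-weighted fraction of matched MS/MS peaks. The sketch below assumes the score is expressed on a 0-100 scale, which is consistent with the thresholds (30 and 80) and scores (up to 100) quoted in the Results but is not stated explicitly with the equation; the function name `match_score` and the toy intensities are illustrative only.

```python
def match_score(intensities, validated):
    """Intensity-weighted share of TMS peaks confirmed by a Validation Match.

    intensities: normalized intensity I_N of each MS/MS peak present in the TMS
    validated:   VM_i flags, True where a confirmatory match was found for peak i
    """
    total = sum(intensities)  # denominator: every observed peak has TMS_i = 1
    if total == 0:
        return 0.0
    matched = sum(i_n for i_n, vm in zip(intensities, validated) if vm)
    return 100.0 * matched / total


# Toy spectrum (not experimental data): three peaks, two of them validated
score = match_score([0.5, 1.0, 0.5], [True, False, True])
```

Because every peak is weighted by its intensity, a few strong confirmed peaks contribute more to the score than many weak ones.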
Next, the globally consistent bond topology is found by solving the maximum weight matching problem for the graph G. A matching M in the graph G is a set of pairwise non-adjacent edges; that is, no two edges share a common vertex. A maximum weight matching is a matching M for which the sum of the weights (match scores) of its edges (disulfide bonds) is the largest possible. We use the Gabow algorithm [17], as implemented in [18], for computing the maximum weight matching.
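To make the local-to-global step concrete, the sketch below finds a maximum weight matching by exhaustive recursion over the candidate bonds. This is deliberately not the Gabow algorithm used in the paper: it is exponential in the number of edges and only suitable for the handful of candidate bonds in a small example. The cysteine labels and scores are made up for illustration.

```python
def max_weight_matching(scores):
    """scores: {(cys_a, cys_b): match score}. Returns (total score, list of chosen bonds)."""
    edges = sorted(scores.items())

    def solve(idx, used):
        if idx == len(edges):
            return 0.0, []
        (a, b), weight = edges[idx]
        best = solve(idx + 1, used)  # option 1: leave this candidate bond out
        if a not in used and b not in used:
            # option 2: accept the bond, which blocks both cysteines
            total, bonds = solve(idx + 1, used | {a, b})
            if total + weight > best[0]:
                best = (total + weight, [(a, b)] + bonds)
        return best

    return solve(0, frozenset())


# Hypothetical candidate bonds between four cysteines labelled 10, 20, 30, 40
candidates = {(10, 20): 90, (10, 30): 40, (30, 40): 85, (20, 40): 30}
total, bonds = max_weight_matching(candidates)
```

Here the two high-scoring bonds are mutually compatible and are selected together, while the two weaker candidates, each of which would reuse an already bonded cysteine, are rejected.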
Results
The proposed method was validated utilizing experimental data obtained using a capillary liquid chromatography system coupled with a ThermoFisher LCQ ion trap mass spectrometer (LC/ESI-MS/MS system). Details of the experimental protocols can be found in [19,20]. We used data from nine eukaryotic glycosyltransferases. These molecules and their SwissProt IDs were: ST8Sia IV [Q92187], Beta-lactoglobulin [P02754], FucT VII [Q11130], C2GnTI [Q09324], Lysozyme [P00698], FT III [P21217], β14GalT [P08037], Aldolase [P00883], and Aspa [Q9R1T5].
We conducted five sets of experiments to investigate the proposed method and its efficacy. These experiments included: (1) Analysis of the method’s efficiency, showing how the method successfully reduced the DMS and FMS search spaces. (2) Analysis of the effect of incorporating multiple ion types, demonstrating the importance of considering non-b/y ions in the determination of disulfide bonds. (3) Comparative analysis of the proposed method with established predictive techniques. (4) Comparative analysis of the method with MassMatrix, an established MS-based approach which can be used for determining S-S bonds. In both experiment 3 and experiment 4, the aforementioned set of glycosyltransferases and their known S-S bond topology provided us with the ground truth. (5) Analysis of the method in terms of established performance measures: Accuracy (Q_{2}), Sensitivity (Q_{c}), Specificity (Q_{nc}), and Matthews correlation coefficient (c).
Analysis of efficiency of the search
One of the most important characteristics of the proposed method is its efficiency in terms of excluding significant portions of a large and rapidly expanding search space. In Table 3 we compare the size of the complete DMS (containing all the disulfide-bonded peptide structures generated for each protein) and the complete FMS (containing all the disulfide-bonded fragment ions) with the truncated DMS and FMS obtained using the proposed approach.
Table 3. DMS and FMS mass space sizes comparison
It may be noted that across the molecules, on average, the proposed approach required examining about 78% of the entire DMS and only about 14% of the entire FMS. It is crucial to note that this reduction in search was achieved without impacting the accuracy and while considering all ten fragment ion types (a, b, b^{o}, b^{*}, c, x, y, y^{o}, y^{*}, and z). The DMS decrease was smaller than the FMS decrease because the disulfide-bonded structures in the DMS were bigger and fewer in number and consequently dispersed across the spectral mass range. In Figure 6, we show the actual time taken to obtain a solution by generating the complete DMS and FMS, as well as their truncated counterparts, for each of the molecules.
Figure 6. Comparison of the computational time (in seconds) for the exhaustive and partial generation of DMS and FMS of the proteins from Table 3. On average there was a 49.5% decrease in time to compute the DMS and an 88.7% decrease in time to compute the FMS. The computations were carried out on an Intel T2390 1.86 GHz single-core processor with 1 GB RAM.
Effects of incorporating multiple ion types: a case study
In this experiment, we investigated the effect of incorporating multiple ion types (a, b, b^{o}, b^{*}, c, x, y, y^{o}, y^{*}, and z) in determining the S-S bonds as opposed to considering only b/y-ions. By analyzing the confirmatory matches for the different disulfide bonds, we found multiple instances of combinations between b/y ions and other ion types. These combinations are available as supplemental information (see Additional File 2).
Additional File 2. Combination between b/y ions and other ion types on MS/MS data. This example shows that combinations involving ion types other than just b and/or y ions do occur, even for proteins that underwent CID (a dissociation method which produces mainly b/y ions).
The consideration of multiple ion types also contributed to the method’s accuracy in terms of determining specific S-S bonds. Disulfide bonds previously missed due to their low match score could be identified when all ten different ion types were considered. The tryptic-digested protein FucT VII (which underwent CID) constituted one such example. In FucT VII the bond C^{318}-C^{321} was missed when considering only b/y ions (match score 29, pp = 11, pp2 = 15). However, as shown in Figure 7, this bond was identified when multiple ion types were included (match score 100, pp = 31, pp2 = 70). The confidence measures pp and pp2 are described in the following section. To explain this improvement we note that C^{318}-C^{321} was an intra-bond involving cysteines that were close together. Consequently, CID-based fragmentation was poor and the consideration of other ion types essentially improved the signal-to-background contrast. In this particular case, five other ions (a_{4}, a_{5}, a_{6}, b^{o}_{7}, and y^{*}_{7}) were present in the FucT VII MS/MS data besides the b ions represented in the spectrum on the right in Figure 7. In the following, we present details of how these ions contribute to the match score V_{s} (from Eq. (4)). We present the two cases: consideration of only b/y-ions (Eq. (5)) and consideration of multiple ion types (Eq. (6)). In the numerator we specify the contribution of each spectrum peak from Figure 7 (the ion corresponding to each VM_{i} × I_{N} term is shown in brackets).
Figure 7. Spectra samples from tryptic-digested protein FucT VII. Spectra (m/z vs. normalized intensity) illustrating the confirmatory matches (whose intensity values were at least 10% of the maximum intensity) found for the disulfide bond between cysteines C^{318}-C^{321} in protein FucT VII. The spectrum on the left shows the matches found when multiple ions were considered. The spectrum on the right shows the matches when only b/y-ions were considered.
We also observed that consideration of multiple ion types led to a significant increase in the match scores of the true disulfide bonds, whereas only a modest increase was noticed for false positives. This allowed us to increase the threshold we use on the match score V_{s} to identify high-quality matches from 30 to 80 (an increase of approximately 167%). The positive effect of this increment on the specificity of the method can be illustrated by considering the protein Aldolase. In this molecule, consideration of only b/y ions led to a false positive S-S bond identification between cysteines C^{135}-C^{202} (V_{s} = 30.8, with the original threshold of 30). However, when multiple ion types were considered with the increased threshold on the match score, no S-S bond was found between C^{135}-C^{202} (V_{s} = 53.2, with the incremented threshold of 80).
Comparative studies with predictive techniques
In this experiment we compared the proposed method with three well-known predictive methods: DiANNA [21], DISULFIND [22], and PreCys [23]. The results from each of the methods are shown in Table 4 along with the known disulfide bond linkages according to the SwissProt knowledgebase. As can be seen, in terms of correct identifications (as well as minimizing false positives), the proposed approach outperformed all the predictive techniques.
Table 4. Comparison with predictive methods
Comparative studies with MassMatrix
MS2Assign [6] and MassMatrix [7] are two state-of-the-art MS-based methods that can be applied to the problem of determining S-S bond connectivity. In our previous work [3], the MS2DB system developed by us was found to be comparable to MS2Assign [6], albeit in limited testing. Since the proposed method improves upon MS2DB, and due to space limitations, we only present detailed comparative results with MassMatrix [7] in Table 5. As part of this experiment, for each S-S bond, in addition to the empirical match score (Eq. (4)), a probability-based scoring model proposed in [8] was implemented. This model provided two scores, called the pp and pp2 scores. The pp score helps to evaluate whether the number of VMs could be random. The pp2 score evaluates whether the total abundance (intensity) of VMs could be random. We refer the reader to [8] for a detailed description and formulae of the pp and pp2 scores. The reader may note that the proposed method had better pp and pp2 scores when compared to MassMatrix (higher pp and pp2 scores are better, indicating smaller p-values). While the match scores (V_{s}) obtained with the proposed method were also higher than those obtained with MassMatrix (V^{*}_{s}), no inferences should be drawn, as these scores are calculated differently in each of these methods. As can be seen from Table 5, every bond correctly determined by MassMatrix was also found by us. However, there were S-S bonds in C2GnTI and Lysozyme that were found by the proposed method but not by MassMatrix.
Table 5. Comparison with MassMatrix
Quantitative assessment and analysis of the method’s performance
If the set of disulfide bonds is denoted by P and the set of cysteines not forming disulfide bonds by N, then true positive (TP) predictions occur when disulfide bonds that exist are correctly predicted. False negative (FN) predictions occur when bonds that exist are not predicted as such. Similarly, a true negative (TN) prediction correctly identifies cysteine pairs that do not form a bond. Finally, a false positive (FP) prediction incorrectly assigns a disulfide link to a pair of cysteines that are not actually bonded. Based on these definitions, we use the following four standard measures to analyze the proposed method.
Sensitivity (Q_{c}) = TP/P (7)
Specificity (Q_{nc}) = TN/N (8)
Accuracy (Q_{2}) = (TP + TN)/(P + N) (9)
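As an illustration of how these measures are computed from raw counts, the sketch below evaluates Eqs. (7)-(9) together with the Matthews correlation coefficient. The defining equation for the correlation coefficient does not appear in the text above, so its standard textbook definition is assumed here; the function name `performance` and the example counts are made up and are not results from the paper.

```python
from math import sqrt


def performance(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and Matthews correlation from raw counts."""
    p = tp + fn  # bonds that actually exist
    n = tn + fp  # cysteine pairs that are not bonded
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / p,
        "specificity": tn / n,
        "accuracy": (tp + tn) / (p + n),
        # Standard MCC definition (assumed); defined as 0 when the denominator vanishes
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts: 4 of 5 true bonds found, no false positives
measures = performance(tp=4, tn=10, fp=0, fn=1)
```

With no false positives the specificity is 1 regardless of how many true bonds are missed, which is why sensitivity and the correlation coefficient are reported alongside it.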
In Table 6 we present the results obtained for our framework. With maximum specificity and high accuracy (98% on average), the method correctly reported the connectivity for most of the proteins. The method failed to identify only three disulfide bonds. One intra-bond in the Beta-LG protein could not be found due to a blind spot caused by the same intra-bond, which made the protein’s fragmentation difficult. A blind spot occurs when the precursor ion fragmentation produces different fragments only at the outside boundaries of the intra-disulfide bond. This can cause too few product ions to be generated; the limited information can prevent accurate determination of disulfide bonds using MS-based methods. One cross-linked bond in the FT III protein also could not be identified because this particular connectivity configuration creates a large disulfide-bonded structure, which is poorly fragmented by tandem mass spectrometry. One bond in the C2GnTI protein could not be found, since the precursor ion cannot be formed by chymotryptic digestion, which was the digestion carried out for C2GnTI. It is important to note that neither MassMatrix nor MS2Assign was able to identify these bonds.
Table 6. Sensitivity, specificity, accuracy and Matthews correlation coefficient results for all nine proteins analyzed
Conclusions
We have presented an algorithmic framework for determining the SS bond topologies of molecules using MS/MS data. The proposed approach is computationally efficient and data driven, and it achieves high accuracy, sensitivity, and specificity. It is limited neither by the connectivity pattern nor by the variability of product ion types generated during the fragmentation of precursor ions. Furthermore, the approach does not require user intervention and can form the basis for high-throughput SS bond determination.
Authors' contributions
The algorithmic solution framework was designed by RS and implemented by WM. Computational studies and experiments were carried out by WM and RS. TYY developed the experimental protocols and generated the data. The paper was written by RS and WM.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
WM and RS were supported by funding from NSF grant IIS0644418 (CAREER). TYY was supported by NSF grant CHE0619163 and NIH grant P20MD000544.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S1.
References

1. Singh R: A review of algorithmic techniques for disulfide-bond determination. Brief Funct Genomic Proteomic 2008, 7(2):157-172.
2. Nesvizhskii AI, Vitek O, Aebersold R: Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods 2007, 4(10):787-797.
3. Lee T, Singh R: Comparative Analysis of Disulfide Bond Determination Using Computational-Predictive Methods and Mass Spectrometry-Based Algorithmic Approach. In Proc, BIRD. CCIS 13; 2008:140-153.
4. Steen H, Mann M: The abc’s (and xyz’s) of peptide sequencing. Nat Rev Mol Cell Biol 2004, 5:699-711.
5. Johnson RS, Martin SA, Biemann K, Stults JT, Watson JT: Novel fragmentation process of peptides by collision-induced decomposition in a tandem mass spectrometer: differentiation of leucine and isoleucine. Analytical Chemistry 1987, 59:2621-2625.
6. Schilling B, et al.: MS2Assign, automated assignment and nomenclature of tandem mass spectra of chemically crosslinked peptides. J Am Soc Mass Spectrom 2003, 14(8):834-850.
7. Xu H, Zhang L, Freitas MA: Identification and Characterization of Disulfide Bonds in Proteins and Peptides from Tandem MS Data by Use of the MassMatrix MS/MS Search Engine. J Proteome Res 2008, 7:138-144.
8. Xu H, Freitas MA: A mass accuracy sensitive probability based scoring algorithm for database searching of tandem mass spectrometry data.
9. Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identification made by MS/MS database search. Analytical Chemistry 2002, 74(20):5383-5392.
10. Chen T, Jaffe JD, Church GM: Algorithms for Identifying Protein Crosslinks via Tandem Mass Spectrometry.
11. Lee T, Singh R, Yen TY, Macher B: An Algorithmic approach to Automated High-Throughput Identification of Disulfide Connectivity in Proteins Using Tandem Mass Spectrometry. Proc. Computational Systems Bioinformatics, CSB 2007, 41-51.
12. Lee T, Singh R, Yen R, Macher B: A mass-based hashing algorithm for the identification of disulfide linkage patterns in protein utilizing mass spectrometry data. Proc. IEEE International Symposium on Computer-Based Medical Systems, CBMS 2007, 397-402.
13. Murad W, Singh R, Yen TY: Polynomial-Time Disulfide Bond Determination Using Mass Spectrometry Data. Proc. IEEE Computational Structural Bioinformatics Workshop, CSBW 2009, 79-86.
14. Fariselli P, Casadio R: Prediction of disulfide connectivity in proteins. Bioinformatics 2001, 17:957-64.
15. Frank A, Tanner S, Pevzner P: Peptide Sequence Tags for Fast Database Search in Mass-Spectrometry. In Proc. RECOMB. LNBI 3500; 2005:26341.
16. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. 2nd edition. MIT Press, Cambridge, MA, U.S.A; 2005.
17. Gabow HN: An efficient implementation of Edmonds’ Algorithm for Maximum Matching on Graphs.
18. Rothberg E: MATHPROG: Solver for the maximum weight matching problem. [http://elib.zib.de/pub/Packages/mathprog/matching/weighted/]
19. Thomas S, Yen TY, Macher BA: Eukaryotic glycosyltransferases: cysteines and disulfides. Glycobiology 2002, 12:4G-7G.
20. Yen TY, Macher BA: Determination of glycosylation sites and disulfide bond structures using LC/ESI-MS/MS analysis. Methods in enzymology 2006, 415:103-113.
21. Ferre F, Clote P: DiANNA: A Web Server for Disulfide Connectivity Prediction. Nucleic Acids Res 2005, 33(Web Server issue):W230-W232.
22. Ceroni A, et al.: DISULFIND: A Disulfide Bonding State and Cysteine Connectivity Prediction Server. Nucleic Acids Res 2006, 34(Web Server issue):W177-W181.
23. Tsai CH, et al.: Improving disulfide connectivity prediction with sequential distance between oxidized cysteines. Bioinformatics 2005, 21:4416-4419.
Appendix A – Etudes of the proof of polynomial complexity
The proof that the proposed method is a fully polynomial-time approximation scheme consists of two parts. First, we need to show that each value returned by the APPROXDMS function is within a factor of 1 + ε of the optimal solution. Second, we need to show that the running time of the method is fully polynomial. We refer the reader to [16] for the proof of the first part and focus in the following on analyzing the complexity of the method. To show that the method is a fully polynomial-time approximation scheme, we derive a bound on the length of a DMS set. After trimming, successive elements DMS_{i} and DMS_{i+1} of DMS must have a relationship DMS_{i+1}/DMS_{i} > 1 + ε. Therefore, each possible DMS set contains up to log_{1+ε}PML_{val} values. Since (x/(1 + x)) ≤ ln(1 + x) ≤ x and 0 < ε < 1, it can be shown that:
log_{1+ε}PML_{val} = ln PML_{val}/ln(1 + ε) ≤ (1 + ε) ln PML_{val}/ε < 2 ln PML_{val}/ε (11)
As can be seen from Eq. (11), this bound is (explicitly) polynomial in the size of the input PMS_{val}. It is also (implicitly) polynomial in the size of the set DMS, since ε is directly proportional to the number of cysteine-containing peptides k (per Eq. (2)) and these peptides are in turn combined to form each element of the DMS. A similar argument can be made for the APPROXFMS routine, thereby completing the proof that the proposed method is a fully polynomial-time approximation scheme.
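The trimming argument can be illustrated with the generic subset-sum approximation scheme described in [16]. The Python sketch below is not the APPROXDMS or APPROXFMS routine itself. The function names, the illustrative integer masses, and the trimming parameter δ = ε/(2n) are assumptions taken from the textbook formulation, and the parameter choice in the proposed method (per Eq. (2)) may differ. The sketch shows that, after trimming, successive retained values differ by a factor greater than 1 + δ, so the list length grows only logarithmically with the target value.

```python
import math


def trim(sorted_vals, delta):
    """Keep a value only if it exceeds the last kept value by a factor > 1 + delta."""
    kept = [sorted_vals[0]]
    last = sorted_vals[0]
    for v in sorted_vals[1:]:
        if v > last * (1 + delta):
            kept.append(v)
            last = v
    return kept


def approx_subset_sums(masses, target, eps):
    """Trimmed list of achievable sums not exceeding target (textbook scheme)."""
    delta = eps / (2 * len(masses))  # assumed textbook trimming parameter
    sums = [0.0]
    for m in masses:
        # Merge the current sums with the sums extended by the new mass.
        merged = sorted(set(sums + [s + m for s in sums]))
        # Trim near-duplicates, then discard sums above the target.
        sums = [s for s in trim(merged, delta) if s <= target]
    return sums


# Illustrative values only; they are not experimental masses.
masses = [104.0, 102.0, 201.0, 101.0]
target = 308.0
eps = 0.4
sums = approx_subset_sums(masses, target, eps)
delta = eps / (2 * len(masses))
print("retained sums:", sums)
print("best sum:", max(sums))
print("length bound:", math.log(target, 1 + delta) + 2)
```

Without trimming, the list of candidate sums can double with every added mass. With trimming, its length stays below log_{1+δ}(target) + 2 when all masses are at least 1, which is the property the complexity argument relies on.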