Abstract
Background
Protein identification is one of the most challenging problems in proteomics. Tandem mass spectrometry provides an important tool to handle the protein identification problem.
Results
We developed a workefficient parallel algorithm for the peptide sequence tag problem. The algorithm runs on the concurrentread, exclusivewrite PRAM in O(n) time using log n processors, where n is the number of mass peaks in the spectrum. The algorithm is able to find all the sequence tags having score greater than a parameter or all the sequence tags of maximum length. Our tests on 1507 spectra in the Open Proteomics Database shown that our algorithm is efficient and effective since achieves comparable results to other methods.
Conclusions
The proposed algorithm can be used to speed up the database searching or to identify posttranslational modifications, comparing the homology of the sequence tags found with the sequences in the biological database.
Background
Protein identification is one of the most important goals of drug discovery research and in proteomics. Nowadays, the leading technique to identify a protein or a peptide is the tandem mass spectrometry (MS/MS). The basic idea of the identification of a peptide using tandem mass spectrometry is simple: a peptide is ionized and broken, at the peptide bond, in charge fragments (ions). The mass/charge ratio of the resulted fragments are visualized in a graphic called tandem mass spectrum or ms/ms spectrum (Figure 1). A ms/ms spectrum contains: the mass of the whole peptide, and a pattern of fragments that can be associated to a given sequence. Each fragment is characterized by mass/charge ratio and intensity: we refer to this pair as peak. A good quality spectrum should contain the complete series of yions (the fragment ions containing the carboxyl terminal), and the complete series of the bions (the fragment ions containing the amino terminus) (Figure 2). In this case it is very easy to reconstruct the amino acidic sequence for the peptide, since it is sufficient to compute the mass differences between two adjacent peaks in each of the two series. Unfortunately, in practice, many factors contribute to complicate the problem. Indeed, many bion or yion peaks might be absent from the spectrum, or it might be an imperfect fragmentation that causes a different types of ions, or the sample might be contaminated, or it might be present posttranslational modifications or many other type of peaks might unexpected appear in the spectrum. There are three different computational approaches motivated by peptide sequencing: the peptide identification searching in a database, the de novo peptide sequencing, and the peptide sequence tag. The first method finds the best matching peptide from a sequence database using a scoring function based on the likelihood that an identified peptide is actually the peptide of the spectrum [1,2]. This method is the mostly used but it is able only to identify peptide stored in a database. On the contrary the de novo method allows to identify a peptide using only the spectrum without any other previous knowledge and hence even if the peptide is not in a database. Many algorithms using this approach have been developed [310] having different time complexity, but the accuracy of them depends on the quality of the input spectrum: usually these algorithms cannot find the complete sequence due to missing peaks and hence the applicability of them is limited in practice. The peptide sequence tag approach combines the two previous methods: first the de novo method is applied to find a partial solution, so called sequence tag, and then a database search is applied to identify the complete sequence. The idea of using peptide sequence tags is not novel [11,12] and it is recently reproposed to increase the speed of the database searching [13,14] or to find posttranslation modifications [15]. Most of these algorithms are using a previous developed de novo approach and their time complexity to find the optimal solution according to any scoring function is at least O(n^{2}), where n is the number of the mass spectrum peaks. A simple scoring function can be defined as the sum of the correspondent mass peak abundances found in the spectrum [3] or it can be based on the iontype, favouring the b and y ions over the other types [7,10]. In this paper we propose a parallel algorithm to determine all the peptide sequence tags longer than an input number of amino acids or all those scoring more than an input number, according to any scoring function. The parallel approach is motivated by demand of efficiency, since the interpretation of mass spectra is a high throughput process. The algorithm is workefficient running in O(n) time on a concurrentread, exclusivewrite (CREW) PRAM [16] with log n processors, and it is a variation of the algorithm proposed in [5] to find peptide sequence tags. We simulate the parallelism by an implementation in Java on threads using barriers for synchronization. Our tests on 1507 spectra in the Open Proteomics Database shown that our algorithm is efficient and effective since achieves comparable results to other approaches.
Figure 1. A ms/ms spectrum. In the xaxe we have the mass/charge ratio of each fragment and in the yaxe the abundance of it.
Figure 2. A peptide fragmentation. In a mass spectrometer a whole peptide (below) is broken into two fragments, one of them charged. Usually the breaking point is the peptide bond: in this case we could obtain a bion that maintains the amino termini or a yion that maintains the carboxyl termini.
Methods
Let an experimental spectrum be given related to an unknown peptide P of mass m_{P}. A peptide sequence tag is a short string of amino acid mass differences deduced from the fragment spectrum. Let any scoring function and any number δ be given. Our task is to determine all the sequence tags scoring at least δ. If the score reduces to count the length of the string, the output consists in the sequence tags of lengths at least δ. We refer to this problem as the peptide sequence tag problem.
Although a spectrum may contain a few different types of ions, for simplicity, we consider bions and yions only. Therefore we assume M = {m_{1}, m_{2},…,m_{n}} to represent a spectrum where the real numbers m_{i} correspond to the m/z ratios of the peaks in the spectrum augmented with the numbers 1, 19, m_{P}−17, and m_{P}+1 that represent the “empty” bion, the “empty” yion, the weightiest bion, and the weightiest yion, respectively. Let us denote the set of the masses of the twenty amino acids by A. The peptide sequence tag problem can be reformulated in terms of paths in a graph. We build a labelled directed acyclic graph G = (V,E) such that
■ every node ν_{i} is associated to a m_{i} ∈ M (1 ≤ i ≤ n);
■ (ν_{i}, ν_{j}) ∈ E if and only if (m_{j} − m_{i}) ∈ A (1 ≤ i <j ≤ n).
The peptide sequence tag problem consists in determining any path between two nodes in the graph G with score greater than δ. The introduction of any scoring function corresponds to assign weights to the edges of the graph: the score of a path is the sum of the scores of the edges on the path.
The algorithm
The elements of A and M are stored in two sorted arrays in the shared global memory of the PRAM. We divide M into groups of log n consecutive elements, and we assign a “responsible” processor to each mass in each group so that the ith processor is responsible for the ith mass inside the group. We divide the algorithm in three procedures:
■ precomputation procedure
■ propagation procedure
■ determination procedure.
We repeat each procedure for every group of log n elements, from the first one to the last one, and in reverse order. The procedures presented in this section are CREW since they require concurrent access to A and M in reading, but only exclusive access to the global memory in writing. In order to simplify the description that follows we give value one to the weight of each edge. This assumption corresponds to determine the longest feasible path for the de novo peptide sequencing problem or feasible paths longer than any given value as solutions for the peptide sequencing tag problem. In the next paragraph we describe the three procedures.
Precomputation procedure
The first procedure consists in building the graph. Considering each group of log n masses, we associate a node to each mass, and so the ith processor is responsible for the ith node ν_{i} in the group. Processor i, for each element in A, checks if there exists a node ν_{j} in M that differs from it to the mass of the element in A. In this case we put an edge between ν_{i} and ν_{j} in the graph. We store this edge in two different adjacency lists, the so called predecessor (pred) and successor (succ) list:
Note that any node can have at most twenty predecessors or successors, or none. Since M is a sorted array, using a binary search algorithm to determine the predecessors and successors, the precomputation takes O(log n) time for each group and hence O(n) totally.
Propagation procedure
The second procedure of the algorithm permits, for each node, to compute the maximum length path passing through it. This goal is reached by iterating the search of the predecessor (successor) of every node using the pointer jumping technique [16] in every group. In order to handle the propagation, processor i stores and updates the current predecessor in the start_path pointer:
start_path[i] = h ∈ {1, …, n} ⇔ at least one path from ν_{i} to ν_{h} exists.
Note that all the predecessors of ν_{i} belong to the ν_{i}'s group or to any group that precedes it, being M sorted. Hence all the start_path pointers of the predecessors of ν_{i} are known, when ν_{i}'s group is processed, due to the order in which groups are handled. The same is done to update the current successor in end_path. We also calculate the distances d_{s} and d_{e} from any node ν_{i} to the ν_{h} node pointed by start_path[i] and end_path[i] pointers. At the end of the propagation procedure these two pointers, related to each node ν_{i}, will point to the termini nodes of the longer path passing through ν_{i}. Algorithm “Propagation” (Figure 3) describes only the computation of d_{s} and start_path for any group, while d_{e} and end_path can be obtained by replacing pred and start_path with succ and end_path, respectively. At first, for each node i, we initialize the start_path[i] pointer:
Figure 3. Algorithm of the Propagation procedure. Propagation procedure permits, for each node, to compute the maximum length path passing through it. Pseudocode is presented for the computation of start_path and d_{s} for any group, while d _{e} and end_path can be obtained by replacing pred and start_path with succ and end_path, respectively.
a) if i has only one predecessor j, then start_path[i] = j and d_{s}[i] = 1 (Fig. 3, stat. 3–5);
b) otherwise, start_path[i] = i and d_{s}[i] = 0 (Fig. 3, stat. 6–8).
Then, for each node i, we repeat the following steps until there are no changes in any start_path of the group (that is start_path[i] = start_path[start_path[i]], for all i in the group):
a) if start_path[i] points to a node different from i and different from start_path[start_path[i]], then we assign d_{s}[i] = d_{s}[i] + d_{s}[start_path[i]] as the new distance of the node, and start_path[i] = start_path[start_path[i]] for the pointer (Figure 4a; Figure 3 stat. 14–16);
Figure 4. Propagation. Propagation procedure after the initialization, updates the start_path pointers. pred edges are represented as continued arrows, start_path pointers are represented as broken arrows.(a) If start_path[i] points to a node different from i (e.g. a predecessor of i) and different from start_path[start_path[i]] = j, then we assign start_path[i] = j. (b) If i has all the predecessors with start_path pointers pointing to themselves or predecessors with start_path pointing to the same node j, then we assign start_path[i] = j. (c) If all the predecessors of i have the start_path pointers cycling on themselves or start_path pointing to a node without predecessors, we consider the predecessor j having the maximum d _{s} distance and we assign start_path[i]= j.
b) otherwise, if i has all the predecessors with start_path pointers pointing to themselves or predecessors with start_path pointing to the same node j, then we assign start_path[i] = j. d_{s}[i] becomes the maximum distance d_{s} of predecessors pointing to j, plus 1 (Figure 4b; Figure 3, stat. 18–22);
c) otherwise, if all the predecessors of i have the start_path pointers cycling on themselves or start_path pointing to a node without predecessors, we consider the predecessor j having the maximum d_{s} distance and we assign start_path[i] = j and d_{s}[i] = 1 + d_{s}[j] (Figure 4c; Figure 3, stat. 24–29);
d) otherwise, the node waits for some changes in the start_path pointers of its predecessors.
At the end of the propagation procedures, each node i knows the maximum distance d_{s}[i] + d_{e}[i] of a path passing through it, the starting node start_path[i], and the ending node end_path[i] of this path. The computational complexity of this procedure is O(log n) time for each group. Indeed, in the worst condition only one node at time is unlocked and it can upload the start_path. At the beginning, we have a set of pointer trees. We are interested in the sum of their heights. This sum is obviously less than log n. Pointer jumping and merging operations decrease the total height since a tree of height h is transformed into a star by applying pointer jumping in O(log(h)) steps, and the root of a star “hooks” to the parent of any of the root's predecessor in G. Therefore, in the worst case, if h_{max} is the maximum height of the initial set of, say k, pointer trees, all these trees degenerate into stars in O(log(h_{max})) time, and finally they are merged in a list ranking of roots, and the algorithm stops in O(k) time. Since h_{max}, k ≤ log n in every group, and we apply Algorithm “Propagation” to all the n/log n groups, the time complexity is O(n).
Determination procedure
This procedure allows to retrieve the solutions of the peptide sequence tag problem. Some change to the procedure permits to compute all the feasible paths of maximum length or all the feasible paths with length more than δ. We describe the latter case. At the end of the previous section, each node i having d_{s}[i] + d_{e}[i] > δ belongs to a solution of the peptide sequence tag problem. Moreover the set of the nodes i such that start_path[i] = i and d_{e}[i] > δ are the starting nodes of any solution. In order to print all the solutions we can use a sequential procedure, taking at most O(ns), where s is the number of the possible sequence tags. Indeed, beginning from each starting node i we print all the possible solutions visiting only the successors j such that d_{s}[i]+1+d_{e}[j] > δ, and so forth for the successors of these nodes.
Results
In order to understand the performance of our algorithm and to compare it with other existing software, we simulated the processes by using the multithreading in Java, addressing the synchronization by means of barriers. We tested our program on a four 2 GHz dualcore Intel processors 8GB RAM machine.
Our first dataset consists in 1363 annotated Escherichia Coli ion trap tandem mass spectra from the Open Proteomics Database (OPD) [17] with different Xcorr (97 spectra with Xcorr ≥ 2.5, 246 spectra with Xcorr ≥ 2.0 and 1363 spectra Xcorr ≥ 1.5), and our second dataset consists of the 280 spectra of [13]. We tested the program over all these spectra after running a data preprocessing to remove tiny noise peaks as in Mascot (personal communication).
For the first dataset, the algorithm looks for peptide sequence tags of maximum length. We evaluated the percentage of cases when the algorithm finds at least one correct sequence tag at different lengths k. We obtained the following percentage:
■ 99.6%, for k = 3;
■ 96.1%, for k = 4;
■ 96.1%, for k = 3;
■ 59.5%, for k = 3.
The average running time required to generate the sequence tags is 0.15 seconds. We compared our program with the public available program PepNovo on the same dataset of 280 spectra as in [13]. We evaluated the occurrence of at least one correct sequence tag in the generated sequence tag of maximum length found by our algorithm. Since in general the generated sequence tag is not unique, we used the scoring function defined in [7] to assign a score to the sequences. We evaluated the percentage of cases where any correct tag is contained in the highest scoring solution at different lengths. Additionally, we reported on the occurrence of any correct tag in the set of size three of the top scoring solutions. The results are listed in Table 1. We note that, since the percentages grow substantially if we consider the occurrence of correct sequence tags in the generated maximum length sequence tags, selectivity of the scoring function is low. Therefore better results could be obtained by using a different scoring function.
Table 1. Experimental results. Comparison of five tag generating methods on 280 spectra: for each tag length, algorithm and number of solution tags, the table displays the proportion of test spectra with least one correct tag.
The average running time required to generate the sequence tags is 0.11 seconds.
Conclusions
The problem of identifying modified or variant peptide sequences is a challenging one, especially when the spectrum for unmodified sequence is not present as a standard for comparison. By joining the best partial sequences of the de novo interpretation and the database search algorithms, sequence tag can increase the speed and the effectiveness of the identification, and the discovery of unknown modifications, sequence variations and possibly alternate splice sites in proteins. Here, we have proposed a new workefficient parallel algorithm to find peptide sequence tags. Our tests shown that our algorithm is efficient and accurate since achieves comparable results to other methods. Therefore, at least in theory, the proposed algorithm could be used to identify posttranslational modifications, comparing the homology of the sequence tags found with the sequences in the biological database.
List of abbreviations used
CREW – concurrent read, exclusive write
MS/MS – tandem mass (or mass/mass)
OPD – Open Proteomics Database
PRAM – Parallel Random Access Memory
RAM – Random Access Memory
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SB participated in the design of the algorithm, in the coordination and analysis of the tests and in revising the content of the paper. EL conceived of the initial idea to use the parallelism in this context and supervised all the phases of the developed work. EM carried out the design of the algorithm and its implementation, participated in testing the program and analysis of the experimental results and drafted the manuscript. MS carried out the testing and debugging of the program, participated in the analysis of the experimental results, and helped in revising the paper. All authors read and approved the final manuscript.
Acknowledgements
We wish to thank Sonia Campa for help and useful discussions on the implementation of the algorithm.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 4, 2008: A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/9?issue=S4.
References

Eng J, McCormack A, Yates J: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
Journal of American Society of Mass Spectrometry 1994, 5:976989.

Perkins D, Pappin D, Creasy D, Cottrell J: Probabilitybased protein identification by searching sequence databases using mass spectrometry data.
Electrophoresis 1999, 20:35513567. PubMed Abstract  Publisher Full Text

Bafna V, Edwards N: On de novo interpretation of tandem mass spectra for peptide identification. In Proceedings of the seventh annual international conference on Computational molecular biology: 1030 April 2003; Berlin. Edited by Vingron M, Istrail S, Pevzner P, Waterman M. ACM Press; 2003:918.

Brunetti S, Dutta D, Liberatori L, Mori E, Varrazzo D: An efficient algorithm for de novo peptide sequencing. In Proceeding of the seventh international conference on Adaptive and Natural Computing Algorithms: 2123 March 2005; Coimbra. Edited by Ribeiro B, Albrecht RF, Dobnikar A, Pearson DW, Steele NC. Springer Verlag; 2005:327342.

Brunetti S, Lodi E, Mori E: De novo peptide sequencing: a workefficient parallel algorithm. In Proceeding of the First International Conference on Bioinformatics Research and Development: 1214 March 2007; Berlin. Edited by Hochreiter S, Küng J, Palkoska J, Wagner R. OCG; 2007:6667.

Chen T, Kao MK, Tepel M, Rush J, Church GM: A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry.
In Proceedings of the eleventh annual ACMSIAM symposium on Discrete algorithms: 0911 January 2000; San Francisco Edited by Shmoys D: Society for Industrial and Applied Mathematics. 2000, 389398.

Danĉík V, Addona TA, Clauser KR, Vath JE, Pevzner PA: De novo peptide sequencing via tandem mass spectrometry.
Journal of Computational Biology 1999, 6:327342. PubMed Abstract  Publisher Full Text

Frank A, Pevzner PA: Pepnovo: de novo peptide sequencing via probabilistic network modeling.
Anal Chem 2005, 77:964973. PubMed Abstract

Lu B, Chen T: A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry.
Journal of Computational Biology 2003, 10:112. PubMed Abstract  Publisher Full Text

Ma B, Zhang K, Hendrie C, Liang C, Li M, DohertyKirby A, Lajoie G: PEAKS: Powerful Software for Peptide De Novo Sequencing by MS/MS.
Rapid Communications in Mass Spectrometry 2003, 20:23372342.

Mann M, Wilm M: Errortolerant identification of peptides in sequence databases by peptide sequence tags.
Analytical Chemistry 1994, 66(24):43904399. PubMed Abstract

Han Y, Ma B, Zhang K: Spider: software for protein identification from sequence tags with de novo sequencing error.

Frank A, Tanner S, Bafna V, Pevzner P: Peptide sequence tags for fast database search in massspectrometry.
Journal of Proteome Research 2005, 4:12871295. PubMed Abstract  Publisher Full Text

Tabb DL, Saraf A, Yates JR: Gutentag: highthroughput sequence tagging via an empirically derived fragmentation model.

Liu C, Yan B, Song Y, Xu Y, Cai L: Peptide sequence tagbased blind identification of posttranslational modifications with point process model.
Bioinformatics 2006, 22:e307e313. PubMed Abstract  Publisher Full Text

Jájá J: An introduction to parallel algorithms. Addison Wesley Longman Publishing Co.: Redwood City; 1992.

Prince JT, Carlson MW, Wang R, Lu P, Marcotte EM: The need for a public proteomics repository.
Nat Biotechnol 2004, 22(4):471472. PubMed Abstract