Abstract
Background
Mass spectrometry based peptide mass fingerprints (PMFs) offer a fast, efficient, and robust method for protein identification. A protein is digested (usually by trypsin) and its mass spectrum is compared to simulated spectra for protein sequences in a database. However, existing tools for analyzing PMFs often suffer from missing or heuristic analysis of the significance of search results and insufficient handling of missing and additional peaks.
Results
We present an unified framework for analyzing Peptide Mass Fingerprints that offers a number of advantages over existing methods: First, comparison of mass spectra is based on a scoring function that can be customdesigned for certain applications and explicitly takes missing and additional peaks into account. The method is able to simulate almost every additive scoring scheme. Second, we present an efficient deterministic method for assessing the significance of a protein hit, independent of the underlying scoring function and sequence database. We prove the applicability of our approach using biological mass spectrometry data and compare our results to the standard software Mascot.
Conclusion
The proposed framework for analyzing Peptide Mass Fingerprints shows performance comparable to Mascot on small peak lists. Introducing more noise peaks, we are able to keep identification rates at a similar level by using the flexibility introduced by scoring schemes.
Background
Protein identification using mass spectrometry has become one of the central tools in proteomics and systems biology [1]: With growing protein sequence databases such as SwissProt [2], fast and accurate identification of a sample protein remains a central problem. There are two common strategies for protein identification using mass spectrometry: Peptide Mass Fingerprints [3] and protein identification from peptide sequence information using tandem mass spectrometry [4].
Peptide mass fingerprinting (PMF) is preceded by a protein separation step using gel or chromatographic separation. The separated protein is digested by specific enzymatic cleavage such as tryptic digestion, followed by mass spectrometric measurement of the resulting peptides. The resulting mass spectrum has to be preprocessed into a list of signal peaks that form the input to identification algorithms. In our approach, we concentrate on Matrix Assisted Laser Desorption/Ionization (MALDI) [5], the predominant ionization technique for PMF. This technique produces mainly singly charged ions, allowing us to talk of the mass m of a molecule, instead of its masstocharge ratio m/z.
To identify a measured protein from a sequence database, the database sequences are digested insilico and each predicted peak list is matched and scored with the measured peak list. Usually, computation of the statistical significance should follow, using a statistical background model. Software routinely used for identification of proteins using PMF includes the commercial systems Mascot [6] which uses peak counting together with heuristic information and ProFound [7] which relies on a bayesian scoring scheme. These systems have a comparable performance [8].
In [9], we presented a new approach for PMF protein identification. The approach is based on a reformulation of the identification problem as a global alignment problem. Further, pvalues of identifications are computed using a combinatorial algorithm using uniform character frequencies.
The statistics for pvalue computation is extended to a broader class of digestion enzymes and to arbitrary protein sequence models of independently and identically distributed (i.i.d.) amino acids in [10], where this model is also shown to be consistent with corresponding empirical SwissProt data.
Here, we validate the theoretical approach of peak list alignments as introduced in [9] and show the applicability of this approach on real proteomics data. We discuss several aspects of general scoring schemes to be used in peak list alignments; such schemes provide a unified framework that allows emulation and combination of already existing methods and ideas. We demonstrate how missing and additional peaks can explicitly be taken into account and peak intensities can be consistently added into the scoring procedure. We evaluate our method, called SAMPI (SAMPI: aligning mass spectra for protein identification), on real proteomics mass spectrometry data and compare the method to PMF identification using the standard software Mascot.
Results and discussion
To evaluate our method, 375 PMF tryptic mass fingerprints of charge state [M + H]^{+ }from an inhouse proteomics experiment on the organism Corynebacterium glutamicum (Cg) are measured on a Bruker Ultraflex mass spectrometer. The proteins are separated using SDSPAGE gel electrophoresis before mass measurement. Wellseparated spots are digested with trypsin and peptide masses measured by mass spectrometry. For identification, trypsin is set as a cleavage enzyme, Carbamidomethyl is set as a fixed mass modification of ≈ +57 Da for Cysteine, a mass tolerance of 1 Da is set, and no missed cleavages are allowed.
Processing the raw spectra
To assess robustness and flexibility of the method, two different peak lists are derived for each raw spectrum. First, the peak list from the manufacturers peak detection software: It is conservative in picking only the highest abundant peaks; with about 0–90 peaks, 20 on average, these peak lists were comparatively small. Second, we apply a peak detection algorithm developed in our group that derives much larger peak lists of 34–729 peaks, 277 on average. A comparison of the peak list lengths is shown in Figure 1. For unknown reasons, the manufacturers software only delivers 325 peak lists, 9 of which were empty. The other algorithm delivers 375 valid peak lists. For better comparison, we differentiate the peak lists delivered by the algorithm of our group in the following by "PL" and "PL_{316}", denoting the whole set of peak lists and the set of peak lists where the manufacturers peak detection also delivers a corresponding nonempty peak list. Due to the different peak detection, the mass ranges for the measured and predicted peak lists were set to 500–3000 Da for the Bruker software, and 800–3000 Da for our peak lists. All peaks outside this range are discarded.
Figure 1. Peaklist sizes: Number of detected peaks using the Bruker software (solid) and our peak detection method (dashed).
Databases
Both sets of peak lists are identified using Mascot versions v1.9 and v2.1 and the Gaussian scoring scheme described below with different parameters.
For estimating the false positive rate, we proceed as follows: In a first step, all peak lists are identified using the inhouse Cg protein sequence database with 3,510 sequences. The sequence identifiers for each identification are recorded for later comparison.
In a second step, all peak lists are again identified using a concatenation of the SwissProt database, release 48 with 155,824 entries, to the above Cg database. An identification is assumed to be correct if it is the same Cg sequence as recorded before. Conversely, an identification is assumed to be incorrect, if it is not a Cg sequence.
This approach makes it necessary to discard protein sequences that are equal to or highly similar to any Cg sequence from the SwissProt database beforehand. We therefore do an allagainstall comparison of the SwissProt database with 194,317 sequences and the Cg database with BLAST [11]. Sequences from the SwissProt database with an evalue of 10^{30 }or better are discarded and the remaining 155,824 sequences are appended to the Cg database.
The performance of PMF identification algorithms can thus be evaluated and compared independently on the set of sample mass spectra.
Results
The runtimes of both SAMPI and Mascot on the modified SwissProt database (FASTA format, no database indexing) and each of the two peak lists including all preprocessing times are listed in Table 1. The results for both versions of Mascot and the Gaussian scheme with several additional/missing scores are listed in Table 2. All parameter combinations were tested with and without use of peak intensities. Not surprising, different parameter sets lead to different numbers of correct identifications. Nevertheless, these numbers do not change rapidly with changing parameters, indicating a robust behavior of the alignment identification procedure. Using the small manufacturer's peak lists, a small penalty of additional and missing peaks yields a comparable number of correct identifications as Mascot. Using peak intensities in the scoring, this number drops considerably. This is most likely due to the fact that these peak lists already consist of the highest abundant peaks, which are now scaled to 1/3 to 1, distorting the relevance of peaks. This problem might be resolved by using a full rank statistic to scale peak intensities as used, e.g., in [12], instead of the implemented robust linear rescaling. Since we describe a proofofconcept of the peak list alignment framework, we did not investigate this issue further. Using the larger, noisy peak lists results in the complete opposite behavior: Now, without using intensities to discriminate important and nonimportant peaks, the identification rate drops to about 1/2 for both the Gaussian schemes and Mascot. Additionally using the robust intensities leads to a good identification rate again. Note that now, higher penalties for additional and missing peaks are also helpful.
We found the score separation of correct and incorrect identifications to be comparable to Mascot (Figures 2 and 3).
Figure 2. Score distributions. Mascot: Distribution of Mascot scores of correct (solid line) and incorrect (dashed line) identifications using the Cg+SwissProt database, 1 Da mass tolerance, no missed cleavages, and 316 spectra.
Figure 3. Score distributions. SAMPI: Distribution of SAMPI scores of correct (solid line) and incorrect (dashed line) identifications using the Cg+SwissProt database, parameter set B, no missed cleavages, and 316 spectra.
Conclusion
We propose a new formulation of protein identification using Peptide Mass Fingerprinting as an alignment problem. We introduce general peakwise scoring schemes and show how these can be used to score two peak lists by dynamic programming. The scoring schemes provide a large amount of flexibility by allowing the user to independently set matching, additional and missing scores. They also allow consistent inclusion of additional features such as peak intensities into the identification process.
A mathematical model based on random weighted strings is used to efficiently estimate the statistical significance of an alignment score. This model only needs character frequencies as parameters, which can be estimated even with small sequence databases, allowing the use of speciesspecific protein data. The significance score is then computed to get comparable scores independent of the underlying database and sequence lengths. We also propose a first example of an alignment scoring scheme, called Gaussian score, using mass difference and robust intensities. We tested our approach on biological PMF data and compared our results to the standard software Mascot. We were able to correctly identify a comparable number of proteins using peak lists produced by the machine vendor's peak detection software. Using our own peak detection with about 8–10 times as many peaks, we showed the flexibility of our approach by correctly identifying approximately the same number of proteins whereas the performance of Mascot dropped considerably to about half the number of correct identifications.
For the near future, we plan to incorporate missed cleavages into the program. They can easily be handled by the alignment algorithm, but the statistical model has to be extended slightly. Further, we plan to incorporate the method into the ProDB proteomics platform [13] and to compare it to other protein identification tools besides Mascot. As already discussed in the Results and Discussion section, the incorporation of intensity information is not optimal. This is not due to the framework but rather due to the use of scaled intensity values. Nevertheless, using full rank statistics or other probabilistic intensity incorporations are to be investigated in the future. As a last point, the score normalization is not as good as expected, especially on smaller peak lists; the reasons and possible improvements are to be investigated. The flexible alignment framework together with the deterministic, modelbased significance computation seems promising, although some improvements are clearly necessary.
Methods
To identify a measured peak list using peak lists predicted from database sequences, a measure is needed for the similarity of two peak lists. Our scoring of similarity is based on a peakwise scoring function to score a pair of peaks, one of them possibly being a "gap" peak. The optimal matching of two peak lists can then be computed in a way similar to global sequence alignment.
For computing a statistical significance of an alignment score, we introduce a nullmodel based on a random protein model and estimate the alignment score distribution.
The general method works as follows, where the individual steps are explained in detail below: In a preprocessing step, a peak list is computed from each entry in the particular protein sequence database; it is called the predicted peak list of the sequence. Further, several statistics are computed for later use in the identification's significance estimation. These statistics are the length distribution and the joint lengthmass distribution of cleavage fragments. From these, the occurrence probabilities are computed for each possible fragment mass and each protein length contained in the database. Now, the highest scoring protein sequence is computed for each measured spectrum by aligning this spectrum to each predicted spectrum, computing the alignment score and returning the sequence with the highest scoring predicted spectrum. Further, a statistical significance is computed for each such alignment score.
Spectra alignments
Peaks and peak lists
Every peak p_{i }has a mass m_{i }∈ and possibly other attributes (a_{i,1}, ..., a_{i,k}) ∈ , k ≥ 0. A peak list of length n is a list = {p_{1}, ..., p_{n}} of peaks p_{i }∈ × . A peak list is sorted by mass, thus m_{i }<m_{j }if i <j.
Note that we allow the set of peak attributes to be empty. The simplest type of peak is a peak having only its mass m ∈ ℝ. A peak with mass and relative intensity could be represented as an element of ℝ × [0,1].
Scoring peaks and spectra
Let _{p }= {p_{1}, ..., p_{n}} and _{m }= {p'_{1}, ..., p'_{n'}} be two peak lists of length n and n', respectively. For 1 ≤ i ≤ n and 1 ≤ j ≤ n', let p_{i }∈ × and p'_{j }∈ × ' where we allow the sets of additional peak attributes and ' to be different. An example would be a measured peak list _{m}, with relative intensity and a predicted peak list _{p }with the generating string fragment as additional attributes. To compute an optimal matching between peak list _{p }and peak list _{m}, we first define a scoring function that scores two individual peaks.
A peakwise scoring function score is a function
score: (_{p }∪ {ε}) × (_{m }∪ {ε}) → ℝ
mapping a predicted and a measured peak to a real value. Here, ε denotes a special "gap" peak. For two peaks p ∈ _{p }and p' ∈ _{m}, we say that score(p, p') is a matching score. We call score(p, ε) a missing score and p a missing peak, as it is not matched to any peak in _{m}. Similarly, score(ε, p') is called an additional score for an additional peak p'. For completeness, we define score(ε, ε) := ∞.
We want to stress that missing and additional peaks are likely to be seen even if the measured spectrum stems from a measurement of a known sequence. Additional peaks, peaks that are seen in the measurement but cannot be explained by the sequence, may simply be chemical noise from the biochemical sample preparation. Missing peaks may occur due to incorrect peak detection or failed ionization of the corresponding fragment. Of course, missing and additional peaks also occur if the measured spectrum does not stem from the sequence under investigation.
Example 1 (Peak counting)
Using only peak mass as attribute, a peak counting score could ignore missing and additional peaks, i.e. set score(p, ε) = score(ε, p') = 0, and give a positive score whenever the difference of the two peak masses m and m' is not too large:
for some positive constant δ.
Noting that it would be meaningless to match two pairs of peaks that overcross in mass, we compute the optimal matching between two spectra, i.e. the matching yielding the highest sum of peakwise scores, as a global alignment, using the wellknown dynamic programming recurrence. Let E[i, j] denote the score for the optimal matching between the two spectra up to peaks p_{i }and p'_{j}, respectively. Then the alignment table is computed as
The score score(_{p}, _{m}) of the optimal matching, given in E[n, n'], is called the alignment score of the spectra _{p }and _{m}.
As in the case of sequence alignment, the optimal matching itself can be recovered by backtracking in the dynamic programming table E[·, ·]. The alignment score can be computed in time O(n·n'), but faster implementations are possible, using only a diagonal band in E[·, ·].
This approach is a standard technique [14,15], and has been successfully applied to such diverse problems as tree ring and liquid chromatography matching [16,17]. A more formal model of peak list alignment can be found in [9].
Scoring schemes
Although a peak in a measured peak list is described at least by its mass and absolute intensity, most identification algorithms only make use of its mass [18]. This is partly because mass is the most discriminative parameter measured and partly because intensity depends heavily on the actual parameter settings of the machine. The basis for many schemes is the observation that a measurement error between the "real" mass of a molecule and the measured mass can be described by a Gaussian distribution with mean 0 and a standard deviation sd dependent on the machine settings and experiment type. The mean might also deviate from 0 if the machine is not calibrated correctly. We will now introduce a family of scoring schemes, the Gaussian scores, that will be used in further sections to demonstrate the applicability of our approach. Note however, that the approach is by no means limited to this mass measurement error distribution.
Mass difference
The matching score score(p, p') for two peaks p and p' with masses m and m', respectively, is the probability of a Gaussian distributed random variable Z with mean 0 and standard deviation sd to exceed ±m  m' in the respective direction, i.e., score(p, p') = ℙ(Z ≥ m  m'). This score drops exponentially from 1 to 0 with increasing mass difference. As it is always positive, we set the score to ∞ if it falls below 0.05, that is, if the mass difference exceeds ≈ 2·sd. A similar approach is taken in ProFound and the tandem MS software SCOPE [19], whereas Mascot uses a constant positive matching score similar to that of Example 1.
Robust incorporation of intensities
In order to incorporate intensities of measured peaks into the scoring, we applied methods from robust statistics successfully used in tandem MS scoring [20]: All peaks in the peak list were ranked according to their absolute intensity. The intensity of the 10% highest abundant peaks were set to 1, the intensity of the 10% lowest abundant peaks to 0. The intensities of the remaining peaks were scaled linearly between 0 and 1. Thus, a very high abundant peak of chemical noise or a small number of wrongly detected, low abundant peaks cannot spoil the interpretation of the whole peak list. Up to this point, we only use intensity values of measured peaks, resulting in an asymmetric scoring scheme. Given an appropriate prediction model [18,21], it would also be possible to incorporate predicted intensities into the scoring. Writing int(p') for a peak's scaled intensity, the matching score is multiplied with (1 + 2 int(p'))/3, yielding a factor of 1/3 for lowest and 1 for highest abundant peaks. Again, the approach is suitable for using any other incorporation of intensity information, such as logarithmic transforms as proposed, e.g., in [22].
Scoring gap peaks
Using peakwise scoring schemes allows us to explicitly take additional and missing peaks into account.
For additional peaks, a constant penalty c_{1 }is given. If intensities are used in the scoring, this penalty is again multiplied by the scaled intensity of the peak: score(ε, p') = c_{1}·int(p'). Very low abundant peaks are then penalized by 0 and thus simply ignored, and very high abundant peaks that are not explained are highly penalized. For missing peaks, the Gaussian score always gives a constant penalty c_{2}, but as with matching scores, predicted peak intensities could also be used.
Background model and significance of alignment scores
To estimate the significance of a score of a measured spectrum and a sequence of certain length L, we compute a table of mass occurrence probabilities in random weighted strings in a preprocessing step. Using these probabilities, we get a background model for predicted spectra allowing us to estimate the contribution of each measured peak to the overall alignment score under a welldefined nullmodel without sampling. We proceed as follows: After introducing a formal model of random protein sequences and their digestion, we compute the joint lengthmass distribution of cleavage fragments in such random proteins. We then compute the probability that in a random protein of given length, at least one fragment of certain mass m occurs and will thus give rise to a corresponding peak in the predicted spectrum. All these quantities can be computed once in a preprocessing step. Using the mass occurrence probabilities, we estimate the expectation and variance of an alignment score for computing pvalues of such the score.
Weighted strings
A weighted alphabet is a finite alphabet Σ together with a weight or mass function μ: Σ → ∨_{>0}, assigning a mass to each of its characters. Its domain can be extended to strings s ∈ Σ* by setting . Such strings are called weighted strings.
If each character σ ∈ Σ occurs with probability ℙ(σ), we call an i.i.d. sequence of such characters a random weighted string. The parameters of this model, i.e., the character frequencies, can be robustly estimated from a sequence database. As they are the only parameters needed for subsequent significance computations, we can use speciesspecific models where only small sequence databases are available.
Here, we use the alphabet of amino acids of size 20 together with the molecular mass of the amino acids in Dalton (Da), with 1 Da approximately the weight of a neutron. For the computations, we require the masses to be integers. As measured masses are only known to some precision, we can simply scale the real mass by an appropriate precision factor (0.1 or 0.01 for PMF/MALDI) and denote the resulting integer masses by μ*(σ). For a precision of 0.1 and character frequencies estimated from SwissProt, release 48, a sample of the weighted amino acid alphabet is given in Table 3:
Table 3. Example weighted amino acids
This model is readily extendible to capture distributions of masses for each character, such that copies of the same amino acid in a protein may have different masses. This allows to model isotopic mass distributions and different amino acid masses due to posttranslational modifications such as phosphorylation or methylation.
Cleavage schemes
Most proteases cleave a peptide right after the occurrence of a specific cleavage character in the amino acid sequence, except in the presence of a prohibition character directly following the cleavage character. In the case of trypsin, the set of cleavage characters is Γ = {K, R} and the set of prohibition characters is Π = {P}. Together, Γ and Π form a cleavage scheme.
Applying a cleavage scheme on a weighted string results in a fragmentation of this string, a set of successive, nonoverlapping substrings, the fragments.
Example 2 (Fragmentation of a string)
Let Σ := {A, B, C}, be a weighted alphabet with weights μ(A) = 1, μ(B) = 2, μ(C) = 3, let Γ := {B}, Π := {A} be a cleavage scheme on Σ. Then the string s = ABBACCBACBBB is fragmented into the fragments AB, BACCBACB, B, B of weights μ(AB) = 3, μ(BACCBACB) = 17 and μ(B) = 2.
For the sake of brevity, we concentrate on cleavage schemes without prohibition characters. Then, a fragment is simply a string of noncleavage characters followed by a cleavage character. Generalizations to arbitrary cleavage schemes and more details on the stochastic models and efficient computation can be found in [10].
Mass occurrence probabilities
Let f^{L}[l, m*] denote the probability that the first fragment of a random weighted string of length L has length l and integer mass m*. The main recurrence is given for the lengthmass distribution of the inner part of a fragment, consisting solely of noncleavage characters:
with initial condition f'[0, 0] = 1.
The fragment lengthmass distribution can be computed by adding the cleavage character to the right and taking care of the finite string length L.
Lemma 1 (Fragment probabilities)
The fragment probability f^{L}[l, m*] is given for l <L by
and for the boundary l = L by
The probability that a fragment of length l does not have mass m is computed as ^{L}[l, m*] = u[l]  f^{L}[l, m*], where u[l] denotes the probability that a fragment has length l; it is a geometric distribution.
Taking the complementary probability ^{L}[l, m*], we can compute the mass occurrence probability p[L, m*] that at least one fragment of mass m* occurs in the fragmentation of a random weighted string of length L.
Lemma 2
The occurrence probability p[L, m*] = 1  [L, m*] of mass m* in a random weighted string of length L is given by [0, m*] = 1 and
Both tables have to be computed up to the largest sequence length L_{max }in the sequence database and up to the largest integer fragment mass . For PMF using MALDI, m_{max }≈ 3,000 Da and for SwissProt as sequence database, L_{max }≈ 10,000. For a mass precision of one decimal, using doubles, we would need about 30,000·10,000·8 ≈ 2.24 GB of main memory for each table. As ^{L}[·, ·] is only needed during computation of [·, ·], only a very small part of about 3–4 MB is required at any time. To efficiently compute the significance of an alignment score, however, the occurrence probability table p[·, ·] needs to be kept in memory. Its columns can be computed independently and entries of each column depend smoothly on L (the occurrence probability will not change abruptly if the sequence length grows), it is thus sufficient to store only the first 100 entries of each column completely and then store every 25th row, performing a linear interpolation to get intermediate values. Comparing the exact values in each column to the values computed by the described interpolation scheme, we found the interpolation error to be smaller than 10^{9 }in every case. Note that the interpolation nodes are the exact values, so the interpolation error does not accumulate with growing string length. The mass occurrence probability p[L, m] is given for masses m = 1000.0, 1500.0, 2000.0, 2500.0, 3000.0 Da and a precision of 0.1 Da in Figures 4 and 5, for string length up to 50 and 1000, respectively, showing the continuous behavior of the function for L > 40. The "hump" at small string lengths can be explained by the fact that for these lengths, the only possible fragment of mass m is whole the string itself. For greater string length, the corresponding fragment(s) must be "real" fragments, subject to tighter constraints on their combinatorial character composition, e.g. they must have a cleavage character at the end. This "hump" is located around L ≈ m/μ_{avg}, where μ_{avg }denotes the average character mass. For average molecular masses and SwissProt frequencies we have μ_{avg }≈ 111.2 Da. By further exploiting the fact that f^{L}[l, m*] = 0 for l > m*/, where is the smallest integer character mass in Σ, both ^{L}[l, m*] and [L, m*] can be computed in time O(L_{max}·). We would like to refer the interested reader to [10] for details and proofs on the memory and time efficient implementation.
Figure 4. Occurrence probabilities. The mass occurrence probabilities p[L, m] for masses. m = 1000.0, 1500.0, 2000.0, 2500.0, 3000.0 Da and string length L = 1 ... 30. Precision 0.1 Da.
Figure 5. Occurrence probabilities. The mass occurrence probabilities p[L, m] for masses. m = 1000.0, 1500.0, 2000.0, 2500.0, 3000.0 Da and string length L = 1 ... 1000. Precision 0.1 Da.
Alignment score distribution
The alignment score distribution is efficiently and deterministically estimated by adding the contribution of each peak to the overall alignment score. To compute the contribution of each measured peak p'_{j}, ∈ _{m}, let _{j }denote the support of peak p'_{j}, that is the set of integer masses m* for which a predicted peak p having integer mass m* would contribute a positive matching score score(p, p'_{j}) > 0. Let be the random variable that contains the sum of the matching scores over all peaks p in any spectrum _{p }generated by a random weighted string of length L that have masses in . The expectation and variance of this random variable are then given by
Similarly, we define random variables for the additional scores. Assuming independence of peaks, the overall matching and additional scores are simply the sum of these scores:
The missing scores are given for all masses that are not inside the support of any measured peak:
We omit the details for additional and missing peaks and again refer the interested reader to [9]. The alignment score score(_{p}, _{m}) for a measured spectrum _{m }and a random predicted spectrum _{p }generated by a random weighted string of length L is finally given by
score(_{p}, _{m}) = X^{match}(L) + X^{add}(L) + X^{miss}.
As score(_{p}, _{m}) is the sum of nearly independent random variables, we can expect it to have a normal distribution for reasonable scoring schemes and peak lists. This distribution is completely determined by its expectation and variance.
Note that the alignment algorithm computes an optimal onetoone peak matching score, whereas the estimation procedure corresponds to a manytoone matching of peaks. As shown below, neither the peak independence assumptions nor the onetoone peak matching are a problem in practice, as violations of either assumption do not contribute enough to alter the score distribution noticably.
We tested the two assumptions using two different parameter sets A and B for the Gaussian score given in Table 4, and computing the alignment scores of 10,000 random amino acid sequences of length 250 and a randomly chosen measured spectrum from our dataset. We found the estimated distributions in good agreement with their empirical counterparts, as shown in Figures 6 and 7.
Table 4. Parameter sets for evaluation
Figure 6. Alignment score distribution. A: Solid line: Densities of empirical alignment score distribution using 10,000 randomly generated protein sequences of length 250 with SwissProt amino acid frequencies. Dashed line: Density of approximating normal distribution with parameters computed as described in the text. Both alignments for one measured spectrum and SAMPI score with parameter set A.
Figure 7. Alignment score distribution. B: Solid line: Densities of empirical alignment score distribution using 10,000 randomly generated protein sequences of length 250 with SwissProt amino acid frequencies. Dashed line: Density of approximating normal distribution with parameters computed as described in the text. Both alignments for one measured spectrum and SAMPI score with parameter set B.
Using significance as score
As the alignment score is an additive score, its value and distribution is dependent on the number of peaks in the measured and predicted spectra. This makes it difficult to compare alignment scores for different measured spectra and sequence lengths.
To avoid these problems, we will not use the alignment score itself, but its significance to rank the candidate sequences, a method previously shown to be effective for tandem MS data [20]. For each pair of measured and predicted spectra, the alignment score is computed, its distribution is estimated using the method described, and the pvalue – the probability that a random sequence of the same length as the aligned sequence gives an alignment score at least as good as the computed one – is computed from this distribution. We then take – log_{10}(pval.) as the significance score to rank the candidates. In the evaluation, we always used the significance score unless explicitly stated otherwise.
Authors' contributions
HMK and SB developed the alignment formulation and score statistics. HMK performed the SAMPI experiments and evaluation and wrote the manuscript. AW provided the mass spectrometry data and performed the Mascot experiments. All authors proofread and approved the manuscript.
Acknowledgements
HansMichael Kaltenbach is supported by the "International NRW Graduate School in Bioinformatics and Genome Research." Implementation of the presented method was provided by Tobias Marschall, Marcel Martin, Matthias Steinrücken, and Henner Sudek. The authors want to thank Sven Rahmann for helpful discussions and Martina Mahne for providing the original mass spectrometry data.
References

Aebersold R, Mann M: Mass spectrometrybased proteomics.
Nature 2003, (422):198207. PubMed Abstract  Publisher Full Text

Bairoch A, Boeckmann B: The SWISSPROT protein sequence data bank.
Nucleic Acids Res 1992, 20:20192022. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Henzel WJ, Watanabe C, Stults JT: Protein Identification: The Origins of Peptide Mass Fingerprints.
J Am Soc Mass Spectrometry 2003, 14:931942. Publisher Full Text

Sadygov RG, Cociorva D, Yates JR: Largescale database searching using tandem mass spectra: looking up the answer in the back of the book. [http://dx.doi.org/10.1038/nmeth725] webcite
Nat Methods 2004, 1(3):195202. PubMed Abstract  Publisher Full Text

Karas M, Hillenkamp F: Laser desorption ionization of proteins with molecular masses exceeding 10,000 Daltons.
Anal Chem 1988, 60:22992301. PubMed Abstract  Publisher Full Text

Perkins D, Pappin D, Creasy D, Cottrell J: Probabilitybased protein identification by searching sequence databases using mass spectrometry data.
Electrophoresis 1999, 20:35513567. PubMed Abstract  Publisher Full Text

Zhang W, Chait BT: ProFound: an expert system for protein identification using mass spectrometric peptide mapping information.
Anal Chem 2000, 72(11):24822489. PubMed Abstract  Publisher Full Text

Chamrad DC, Körting G, Stühler K, Meyer HE, Klose J, Blüggel M: Evaluation of algorithms for protein identification from sequence databases using mass spectrometry data.
Proteomics 2004, 4:619628. PubMed Abstract  Publisher Full Text

Böcker S, Kaltenbach HM: Mass Spectra Alignments and Their Significance. In Combinatorial Pattern Matching. Volume 3537. Edited by Apostolico A, Crochemore M, Park K. Springer; 2005::429441.

Kaltenbach HM, Sudek H, Böcker S, Rahmann S: Statistics of cleavage fragments in random weighted strings. [http://bieson.ub.unibielefeld.de/volltexte/2006/900/] webcite
Tech. Rep. TR200506, Technische Fakultät der Universität Bielefeld, Abteilung Informationstechnik 2005.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.
J Mol Biol 1990, 215:403410. PubMed Abstract  Publisher Full Text

Havilio M, Haddad Y, Smilansky Z: Intensitybased statistical scorer for tandem mass spectrometry.
Anal Chem 2003, 75:435444. PubMed Abstract  Publisher Full Text

Wilke A, Rückert C, Bartels D, Dondrup M, Goesmann A, Hüser AT, Kespohl S, Linke B, Mahne M, McHardy AC, Pühler A, Meyer F: Bioinformatics support for highthroughput proteomics.
J Biotechnol 2003, 106(2–3):14756. PubMed Abstract  Publisher Full Text

Gusfield D: Algorithms on Strings. Trees, and Sequences. Cambridge University Press; 1997.

Huang X, Waterman MS: Dynamic programming algorithms for restriction map comparison.
Comput Appl Biosci 1992, 8(5):511520. PubMed Abstract

Wenk C: Applying an Edit Distance to the Matching of Tree Ring Sequences in Dendrochronology.
In Proceedings of Combinatorial Pattern Matching (CPM99) Edited by Crochemore M, Paterson M. 1999, 1645:223242.

Bylund D, Danielsson R, Malmquist G, Markides KE: Chromatographic alignment by warping and dynamic programming as a preprocessing tool for PARAFAC modelling of liquid chromatographymass spectrometry data.
Journal of Chromatography A 2002, 961:237244. PubMed Abstract  Publisher Full Text

Gay S, Binz PA, Hochstrasser DF, Appel RD: Peptide mass fingerprinting peak intensity prediction: Extracting knowledge from spectra.
Proteomics 2002, 2:13741391. PubMed Abstract  Publisher Full Text

Bafna V, Edwards N: SCOPE: A probabilistic model for scoring tandem mass spectra against a peptide database.
Bioinformatics 2001, 17:S13S21. PubMed Abstract  Publisher Full Text

Wan Y, Yang A, Chen T: PepHMM: A Hidden Markov Model Based Scoring Function for Mass Spectrometry Database Search. In Proc of RECOMB 2005. Volume 3500. Springer; 2005::342356.

Schütz F, Kapp EA, Simpson RJ, Speed TP: Deriving statistical models for predicting peptide tandem MS product ion intensities.
Biochem Soc Trans 2003, 31(Pt 6):14791483. PubMed Abstract  Publisher Full Text

Wolski WE, Lalowski M, Martus P, Herwig R, Giavalisco P, Gobom J, Sickmann A, Lehrach H, Reinert K: Transformation and other factors of the peptide mass spectrometry pairwise peaklist comparison process.
BMC Bioinformatics 2005, 6:285. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text