Abstract
Background
In mass spectrometrybased proteomics, the statistical significance of a peptidespectrum or proteinspectrum match is an important indicator of the correctness of the peptide or protein identification. In bottomup mass spectrometry, probabilistic models, such as the generating function method, have been successfully applied to compute the statistical significance of peptidespectrum matches for short peptides containing no posttranslational modifications. As topdown mass spectrometry, which often identifies intact proteins with posttranslational modifications, becomes available in many laboratories, the estimation of statistical significance of topdown protein identification results has come into great demand.
Results
In this paper, we study an extended generating function method for accurately computing the statistical significance of proteinspectrum matches with posttranslational modifications. Experiments show that the extended generating function method achieves high accuracy in computing spectral probabilities and false discovery rates.
Conclusions
The extended generating function method is a nontrivial extension of the generating function method for bottomup mass spectrometry. It can be used to choose the correct proteinspectrum match from several candidate proteinspectrum matches for a spectrum, as well as separate correct proteinspectrum matches from incorrect ones identified from a large number of tandem mass spectra.
Keywords:
mass spectrometry; spectral probabilities; dynamic programmingBackground
Peptide and protein identification in mass spectrometry (MS)based proteomics involves searching tandem mass spectrometry (MS/MS) spectra against a protein database using a search engine. In bottomup MS, most search engines calculate a similarity score between a spectrum and a peptide and report a bestscoring peptidespectrum match (PSM) for each spectrum [15]. A PSM is correct if the spectrum is generated from the matched peptide. It is vital to distinguish correct PSMs from those incorrect ones.
Two main approaches have been proposed to address this problem. In the first approach, a large data set of MS/MS spectra is searched against a concatenated targetdecoy protein database to find a bestscoring PSM for each spectrum, and the PSM is reported if its score exceeds a prespecified threshold. The false discovery rate (FDR) of the reported PSMs is estimated based on the fact that the number of decoy hits and the number of incorrect target hits are approximately the same [6]. This approach is simple and powerful when a large population of PSMs is reported. However, it fails to decide the correctness of single PSMs. In addition, it is unable to compute accurate FDRs when the target protein database is small (e.g., a database with only one protein) or when only a small number of PSMs are reported [7].
In the second approach, the statistical significance (Evalue or pvalue) of a PSM is computed for determining the correctness of the PSM. Due to the complexity of MS/MS spectra, many statistical models have limited accuracy. By contrast, Kim et al. proposed a probabilistic method for computing spectral probabilities and statistical significance of PSMs [8]. This method achieves high accuracy, but it is not obvious how to extend it to PSMs with posttranslational modifications (PTMs).
With the rapid developments in instrumentation, topdown MS, which analyzes intact proteins or long peptides, has become available in many laboratories. More than a thousand proteins can be identified in a single topdown MS experiment [9] and many methods have been proposed for identification of proteoforms using topdown tandem mass spectra [1017]. Although the evaluation of PSMs in bottomup MS has been intensively studied, no systematic studies have been carried out for evaluating proteinspectrum matches (PrSMs) in topdown MS. Similar to bottomup MS, there is now an increasing demand to accurately estimate the statistical significance of single PrSMs. For instance, a topdown MS/MS spectrum can be matched to two different proteins: one contains a PTM; the other does not. Comparing the Evalues of the two PrSMs can determine which one is better. Meng et al. developed a Poisson model for the problem, but the model does not include PTMs [18]. As topdown MS/MS spectra are often mapped to proteoforms with PTMs, accurate estimation of statistical significance of PrSMs with PTMs is useful and challenging. We proposed a method for Evalue computation of PrSMs by breaking a protein into several subproteins without PTMs, but it is extremely time consuming [17]. In this paper, we study an extended generating function method for accurately computing spectral probabilities and statistical significance of PrSMs in topdown MS. Our method naturally extends the generating function method in bottomup MS [8]. Spectral probabilities reported by the extended generating function method are further utilized for estimating FDRs of identified PrSMs using a method proposed in [7], in which decoy databases are not needed. Experiments show that the extended generating function method achieves high accuracy in computing spectral probabilities and FDRs.
Methods
A topdown MS/MS spectrum generated from a protein consists of a precursor mass, corresponding to the molecular mass of the protein, and a list of peaks, corresponding to fragment ions of the protein. Each peak represents the masstocharge ratio and the abundance of the fragment ion. The residue mass of a spectrum S is defined as M(S) = PrecursorMass − WaterMass, where PrecursorMass is the precursor mass of the spectrum, and WaterMass is the mass of a water molecule. Because topdown MS/MS spectra are very complex, and the charge states of most fragment ions are high, high mass resolution and high mass accuracy spectra are absolutely required. The first step in topdown spectral interpretation is usually spectral deconvolution, which converts a complex topdown spectrum to a list of monoisotopic neutral masses (a deconvoluted spectrum) [19,20]. The neutral masses are further converted to a list of prefix residue masses (PRMs) corresponding to the masses of protein prefixes [21]. For a collisioninduced dissociation (CID) spectrum, the PRM spectrum is generated as follows: (1) the residue mass of the experimental spectrum is added to the PRM spectrum; (2) for each neutral mass x extracted from the experimental spectrum, two masses x and PrecursorMass − × are added to the PRM spectrum. If mass x corresponds to a protein suffix (prefix), then mass PrecursorMass − × corresponds to a protein prefix (suffix) [22]. The proposed extended generating function method can be applied to all types of spectra, such as CID and electrontransfer dissociation (ETD) spectra, because all these types of spectra can be converted to PRM spectra. All masses in PRM spectra are discretized by scaling the masses with a constant and rounding the values to integers [23]. For highly accurate topdown spectra, a scaling constant 274.335215 is used (e.g. mass(G) = 57.021464 × 274.335215 = 15642.995586 ≈ 15643) to reduce the rounding error to 2.5 parts per million (ppm) [22]. In the following analysis, we assume that only PRM spectra with integer masses are studied and peak intensities are ignored.
Scores of PrSMs
A PRM spectrum S is represented as an ordered list of integer masses, in which the largest one is M(S). Let be the set of the 20 standard amino acids with integer residue masses M(r) for (the residue masses of amino acids are discretized using the same discretization method for PRM spectra). The residue mass of r is also denoted as ‖r‖. The residue mass M(P) of a protein P is the sum of the residue masses of all amino acids of the protein. It differs from the molecular mass of the protein by the mass of a water molecule. A protein P with m amino acids is associated with an ordered list of integer masses p_{1 }< p_{2 }<. . . < p_{m}, where p_{i }is the sum of the residue masses of the first i amino acids and p_{m }= M(P).
If the residue masses of spectrum S and protein P are the same value N , the mass count score of S and P is the number of shared masses (except for the residue mass N) in S and P, denoted by CScore(S, P). The mass shift of a PTM is the mass difference between the modified form (with the PTM) and the unmodified form of an amino acid residue. When a PTM occurs at the ith amino acid of P and the mass shift d of the PTM is positive, the modified form of P is denoted by Q_{i}_{,}_{d}(P) = {p_{1}, p_{2}, . . ., p_{i−}_{1}, p_{i }+ d, . . ., p_{m }+ d}. When the mass shift of the PTM is a negative value −d, Q_{i}_{,}_{−d}(P) = {p_{1}, p_{2}, . . ., p_{i−}_{1}, p_{i }− d, . . ., p_{m }− d}. In addition, if a mass in p_{i }− d, . . ., p_{m }− d is negative or not greater than p_{i−}_{1}, the mass is removed from Q_{i}_{,}_{−d}(P). Let be the set of all modified proteins of P with a PTM of mass shift d. When the protein is not ambiguous, we use shortened notations . To identify an experimental PRM spectrum S generated from protein P with a PTM, one can find the mass shift d of the PTM by comparing the residue masses of S and P , and compute the mass count score between S and each of the modified proteins in to find the best match. The PTM mass count score of S and P is defined as CScore(S, Q), where d = M(S) − M(P).
Random proteins
Let Pr(r) be the probability that an amino acid is observed at a position in a random protein. In practice, the frequencies of amino acids in the SwissProt database [24] can be used to estimate Pr(r). The probability that a random protein P with amino acids r_{1}r_{2 }. . . r_{m }is observed is
where L represents the length of the random protein. To simplify computation, a uniform probability Pr(L = m) = 1/MaxLength is chosen, where MaxLength is the length of the longest protein in the SwissProt database. Despite the difference between the uniform distribution and the distribution of protein length in the target protein database, experimental results showed the uniform distribution does not introduce large errors into the computation of spectral probabilities.
Spectral probabilities
Let be the set of negative/positive mass shifts of allowed PTMs. Any number in is a valid mass shift. Let S be an experimental PRM spectrum and P a random protein. The residue mass difference between S and P is a random variable D = M(S) − M(P ). The spectral probability of S with respect to a threshold t and one PTM is the probability that the residue mass difference and PScore(S, P ) ≥ t:
where 1 in SpecProb(S, t, 1) represents one PTM. From the definition of PScore(S, P),
Computing SpecProb(S, t, 1) accurately and efficiently is a problem that has not been solved. In the following subsections, we propose two upper bounds of SpecProb(S, t, 1). The two upper bounds can be calculated accurately and efficiently using dynamic programming algorithms. The second upper bound is better than the first one and is used for estimating SpecProb(S, t, 1). Since the second upper bound is larger than SpecProb(S, t, 1), a constant K is introduced for correcting errors in estimated spectral probabilities. In practice, the value of K can be estimated from training data sets.
The first upper bound of spectral probabilities
Based on Equation (2) and the union bound of probabilities,
Let q denote the right hand part of the above inequality. The value of q is an upper bound of SpecProb(S, t, 1). Next, we describe a dynamic programming algorithm for computing the value of q. The algorithm is an extension of the generating function method in [8]. In this algorithm, a spectrum S with a residue mass N is represented as a 0/1 vector S = s_{1}s_{2 }. . . s_{N}, where s_{i }= 1 if the spectrum has a prefix residue mass i and 0 otherwise. For example, a spectrum with a PRM list {2, 5, 8, 10} (10 is the residue mass of the spectrum) is represented as 0100100100. We first study the case where all mass shifts are positive; negative mass shifts will be discussed at the end of this subsection. A three dimensional table T (i, j, k) is computed to acquire the upper bound, where i is the number of PTMs in modified proteins. Let S[1 : j] be the subspectrum s_{1}s_{2 }. . . s_{j }of S. The residue mass of S[1 : j] is j. The value T (0, j, k) is the probability that M(P) = j and the mass count score CScore(S[1 : j], P) = k. Let be set of all proteins with a residue mass j. We define a function: f(S, P, k) = 1 if CScore(S, P) = k; 0 otherwise. Then,
Suppose P contains m amino acids and the residue mass of P is j. If the last amino acid of P is r, then j − ‖r‖ is the prefix residue mass of the first m − 1 amino acids of P, where ‖r‖ is the residue mass of r. In the vector representation of S, if S contains a prefix residue mass j − ‖r‖, s_{j−‖r‖ }= 1; otherwise, s_{j−‖r‖ }= 0. The recurrence function for computing T(0, j, k) was given in [8]:
Let D_{j }= j − M(P ), the random variable representing the difference between j and the residue mass of random protein P. The value T(1, j, k) is the sum of probabilities
Suppose the residue mass of protein P is j − d, that is, . Let m be the number of amino acids in P and Q_{m}_{,}_{d }the modified protein of P whose PTM is on the last amino acid. Because the first m−1 masses of Q_{m}_{,}_{d }are unchanged compared with P,
Combined with Equation (4),
Let r be the last amino acid of P and P' the protein containing the first m − 1 amino acids of P. All the m − 1 masses in Q_{l}_{,}_{d}(P'), 1 ≤ l ≤ m − 1, are the same to the first m − 1 masses in Q_{l}_{,}_{d}(P). While the m − 1th mass j − ‖r‖ in Q_{l}_{,}_{d}(P) is included in the computation of mass count scores, the mass j − ‖r‖ in Q_{l}_{,}_{d}(P') is not included because it is the residue mass. Thus,
It follows
Combining the fact that Pr(P) = Pr(P')Pr(r) and Equations (6) and (8),
With Equations (6), (7) and (9), the recurrence function for T(1, j, k) is
When PTMs with negative mass shifts d are allowed, j − d in Equation (10) is larger than j. The value T(1, j − d, k) has not been computed when T(1, j, k) is computed, making Equation (10) invalid. To avoid this problem, a short amino acid sequence g is introduced to guarantee that j − d − M(g) < j. Let be the set of all amino acid sequences g = r_{1}r_{2 }. . . r_{l }satisfying M(g) > −d and M(r_{1}r_{2 }. . . r_{l−}_{1}) ≤ −d (d is negative). Equation (10) is modified to
where ‖g‖ is the residue mass of g, and for a sequence g = r_{1}r_{2 }. . . r_{l}. The value of q is , where N and n are the residue mass and the number of masses of S, respectively. The time complexity for computing T(0, j, k) and T(1, j, k) is O(N · t · z), where z is the sum of the sizes of and all , .
The second upper bound of spectral probabilities
The only difference between two modified proteins Q_{i}_{,}_{d }and Q_{i}_{+1,}_{d }is the ith mass. If p_{i }in P (which is not changed in Q_{i}_{+1,}_{d}) does not equal any mass in S, then CScore(S, Q_{i}_{+1,}_{d}) ≤ CScore(S, Q_{i},_{d}). Based on this observation, if p_{i }does not equal any mass in S, Q_{i}_{+1,}_{d }is removed from . In this way, a new set is acquired containing Q_{1,d }and all Q_{i},_{d }satisfying that p_{i−}_{1 }equals a mass in S. It follows that . From Equation (1) and the union bound of probabilities,
Let q' denote the right hand part of the above inequality. Compared with q, the value of q' is a better upper bound for SpecProb(S, t, 1). Similar to the method for computing q, we fill out a three dimensional array T(i, j, k) for computing q'. The recurrence function for filling out T(0, j, k) is the same to Equation (5). We change the definition of T(1, j, k) by replacing with in Equation (6). To compute T(1, j, k), the second and third terms of the righthand part of Equation (11) should be changed so that only the probabilities for the modified proteins in are summed up.
Similar to the proof of Equation (7), consider a random protein . Let Q_{m},_{d }be the modified protein of P whose PTM is on the last amino acid, and r the last amino acid of P. If , then j − d − ‖r‖ is a mass in S or j − d − ‖r‖ = 0 (in the extreme case that P contains only one amino acid, j − d − ‖r‖ = 0), and vice versa. Therefore, if j − d − ‖r‖ = 0 or s_{j−d−‖r‖ }= 1, then Pr(P ) · f(S[1 : j], Q_{m},_{d}, k) is included in the computation of T(1, j, k).
For a positive mass shift d, we define as the set of amino acids r ∈ R satisfying that j − d − ‖r‖ = 0 or the element s_{j−d−‖r‖ }= 1. For a negative mass shift d, we introduce a set of amino acid sequences g = r_{1}r_{2 }. . . r_{l }satisfying: (1) M(g) > −d, (2) M(r_{1}r_{2 }. . . r_{l−}_{1}) ≤ −d, and (3) j − d − ‖g‖ = 0 or the element s_{j−d−‖g‖ }= 1. Then Equation (11) is changed to:
and . The time complexity for computing T(0, j, k) and T(1, j, k) is similar to the method in the previous subsection.
Since the scores CScore(S, Q) for are not independent, q' is usually larger than the spectral probability SpecProb(S, t, 1). To estimate SpecProb(S, t, 1) more accurately, q' is multiplied by a constant value K for correction:
Pvalues and Evalues
Let , where N is the residue mass of S. From table T(0, j, k) described in the previous subsection, we can compute the probability that the residue mass difference D between S and P is in :
Using Equations (13) and (14), the conditional spectral probability of S with respect to threshold t and one PTM is
Intact proteins may have N or Cterminal truncations, e.g., the removal of a signal peptide. If a topdown MS/MS spectrum is matched to an intact protein without N or Cterminal truncations, the PrSM is called a complete PrSM. A PrSM matched to an intact protein with an N/Cterminal truncation is called a suffix/prefix PrSM. An internal PrSM corresponds to an intact protein with both N and Cterminal truncations.
Similar to the Evalues defined in BLAST [25], the Evalue of a PrSM describes the number of hits one can "expect" to see by chance when searching the spectrum against a protein database of a particular size. Suppose a complete PrSM contains one mass shift (PTM) in and its PTM mass count score is t. We count the number Z of proteins in the target database with a residue mass in . The Evalue of the complete PrSM is estimated as Z · CSP(S, t, 1). The pvalue of the PrSM is estimated as 1 − (1 − CSP(S, t, 1))^{Z}.
For prefix, suffix and internal PrSMs, we count the numbers Z_{p}, Z_{s}, and Z_{i }of prefixes/ suffixes/internal subproteins in the target database with a residue mass in . Because some prefixes/suffixes/internal subproteins are not independent, a constant factor C_{p}/C_{s}/C_{i }is multiplied in the computation of Evalues of prefix/suffix/internal PrSMs for correction. For example, if a prefix PrSM contains one mass shift (PTM) in and its PTM mass count score is t, the Evalue of the PrSM is estimated as C_{p }· Z_{p }· CSP(S, t, 1).
Multiple PTMs
The dynamic programming algorithm for computing the second upper bound can be extended to estimate Evalues of PrSMs with multiple PTMs. When multiple PTMs are allowed, we replace T(0, j, k) and T(1, j, k) in Equation (12) by T(i, j, k) and T(i − 1, j, k) to estimate spectral probabilities with respect to i PTMs:
Results
The extended generating function method, TDGF (TopDown Generating Function), was implemented in JAVA and tested on a desktop with a 3.3GHz (AMD Opteron 6204) CPU and 16 GB memory.
Data sets
A Salmonella typhimurium (ST) data set [13] was used to test TDGF. A protein mixture of ST was analyzed using an LTQOrbitrap (Thermo Fisher Scientific). MS and MS/MS spectra were collected at a resolution of 60,000 and 30,000, respectively. The experiment was repeated using gasphase fractionation. A total of 14,041 collisioninduced dissociation (CID) MS/MS spectra were acquired. The detailed experiment procedure can be found in [13].
The performance of TDGF on proteoform identification was tested on an Escherichia coli (EC) data set. An EC cell lysate was separated by an intact protein reversed phase liquidchromatography (RPLC) system and analyzed by an LTQOrbitrap Velos (Thermo Fisher Scientific). MS and MS/MS spectra was collected at a resolution of 60,000. A total of 3,704 higherenergy Ctrap dissociation (HCD) MS/MS spectra were obtained.
Spectral probabilities for PrSMs with one PTM
The accuracy of TDGF was evaluated using two approaches based on conditional spectral probabilities (defined in Equation (15)) and FDRs.
Evaluation based on conditional spectral probabilities
To evaluate TDGF, we generated a set of PrSMs with "correct" conditional spectral probabilities and compared the "correct" conditional spectral probabilities with those reported by TDGF.
Selection of PrSMs Previous analysis results in [17] showed that most PrSMs identified in the ST data set contained no PTMs. To increase the number of identified PrSMs with one PTM, a mutated ST protein database was generated by adding a glycine residue to the middle of each protein sequence in the ST proteome. When the mutated ST protein database is used, a PrSM without PTMs can be identified as a PrSM with one PTM.
All MS/MS spectra in the ST data set were deconvoluted using MSDeconv [20] and searched against the mutated ST proteome using MSAlign+ [17]. The error tolerances for precursor masses and fragment masses were set to 15 ppm, and carbamidomethylation was set as the fixed PTM. By restricting the search space to only complete PrSMs with one PTM, MSAlign+ identified 4,291 PrSMs. For each of 4,291 PrSMs, TDGF was employed to compute the conditional spectral probability, which was used only for selecting PrSMs, not for evaluating TDGF. The parameter K in Equation (13) was set to 1. Since blind PTM search was used in MSAlign+, the allowed mass shifts were set to and , where α is the mass of a tryptophan (W) residue. The running time for computing conditional spectral probabilities was 726 minutes (about 12 hours). For 203 of the 4,291 complete PrSMs, the conditional spectral probabilities reported by TDGF were in [10^{−5}, 10^{−4}]. The 203 PrSMs were selected for computing "correct" conditional spectral probabilities.
Computation of "correct" conditional spectral probabilities For each of the 203 PrSMs (spectra), a random database of 10^{6} proteins was generated. In the random database, the residue masses of all proteins are in , where N is the residue mass of the spectrum. The PTM mass count score between the spectrum and each protein in the database was computed; and the number x of proteins satisfying that the PTM mass count score ≥ t was counted. The conditional spectral probability of the PrSM with respect to one PTM and threshold t was estimated as x/10^{6}. Since the above method follows the definition of conditional spectral probabilities, the results are treated as "correct" conditional spectral probabilities. Finally, one PrSM was removed from the list of 203 PrSMs because the estimated conditional spectral probability (using a random database) was 0.
Evaluation of TDGF The 202 PrSMs were randomly divided into a training data set (101 PrSMs) and a test data set (101 PrSMs). The training data set was used to estimated the value of K in Equation (13). We set K = 1 (the value of K will be determined later) and used TDGF to compute the conditional spectral probabilities for the training PrSMs. Let p_{1 }and p_{2 }be the conditional spectral probabilities of a PrSM estimated by the random databasebased method and TDGF, respectively. The error of p_{2 }is defined as e =  log(p_{1}) − log(p_{2}) (base 10). To minimize the average error of the conditional spectral probabilities reported by TDGF, the best value of log(K) is the average of the log ratio . Using the training data set, K was set to the best value 0.55. In practice, the default values of K are learned from various types of training data, such as CID and ETD data, and are provided so that the users do not need to estimate K for their own data sets. With K = 0.55, TDGF was employed to compute the conditional spectral probabilities for the test PrSMs. The errors of these conditional spectral probabilities were obtained by comparing them with the "correct" ones (Figure 1). The errors for 98 test PrSMs (97%) were ≤ 0.5. When the error is 0.5, there is about a three fold difference between the conditional spectral probabilities reported by the two methods. The results show that the spectral probabilities estimated by TDGF are accurate for most of the test PrSMs.
Figure 1. A comparison of the conditional spectral probabilities (for PrSMs with one PTM) estimated by the random databasebased method and TDGF. For each of the 101 test PrSMs, the error of the conditional spectral probability reported by TDGF is computed. For each cutoff of e, the number of PrSMs with an error < e is counted.
Evaluation based on FDRs
With the spectral probabilities reported by TDGF, the "estimated" FDR of a set of identified PrSMs for a cutoff pvalue can be computed using the functions in [7]. For the same cutoff pvalue, the "correct" FDR can be obtained by the targetdecoy approach. Because the "estimated" FDR is based on the spectral probabilities reported by TDGF, if the "estimated" FDR is similar to the "correct" FDR, then the spectral probabilities reported by TDGF are accurate.
Using all the 4,291 complete PrSMs with one PTM, we computed "estimated" FDRs for cutoff pvalues in {0.0001, 0.0002, . . ., 1.0000} based on spectral probabilities. In the targetdecoy approach, all spectra were searched against a concatenated target and shuffled decoy protein database. Because the FDR reported by the targetdecoy approach was 0 when the cutoff pvalue was smaller than 8.27 × 10^{−4}, we compared only the FDRs for cutoff pvalues greater than 8.27 × 10^{−4 }(Figure 2). The FDRs estimated based spectral probabilities were consistent with those reported by the targetdecoy approach. For example, the targetdecoy approach and the spectral probability approach reported cutoff pvalues 0.0327 and 0.0262 for 1% FDR, respectively. The difference between the two pvalues is only 0.0065, which is evidence that the spectral probabilities reported by TDGF are accurate.
Figure 2. A comparison of the FDRs of PrSMs with one PTM estimated by the targetdecoy approach and computed based on spectral probabilities. For a given cutoff pvalue, the two reported FDRs are compared, and − log(FDR) (base 10) is plotted against − log(cutoff pvalue) (base 10).
Prefix, suffix and internal PrSMs
In this subsection, we describe the methods for estimating parameters C_{p}, C_{s }and C_{i }introduced in Section Methods. A substring a_{i}a_{i}_{+1 }. . . a_{j }of a string S = a_{1}a_{2 }. . . a_{n }is denoted by S[i : j]. To estimate the parameter C_{p }for prefix PrSMs, a new random protein database was generated for each of the selected 202 PrSMs: (1) a total of 1,000 long random protein sequences with 1,200 amino acids each were generated, and (2) prefixes S[1 : 201], . . ., S[1 : 1200] were extracted from each of the 1,000 long protein sequences. In total, 10^{6 }prefixes were added to the random protein database. The conditional spectral probabilities estimated using the new random databases are different from those using the random databases in Subsection "Computation of correct conditional spectral probabilities" because the protein sequences in the new random databases are not independent. Parameter C_{p }was estimated as the average ratio 0.693 between the probabilities computed based on the new databases and the random databases in Subsection "Computation of correct conditional spectral probabilities" for the 202 PrSMs. Parameter C_{s }can be set to the same to C_{p}.
To estimated the parameter C_{i }for internal PrSMs, a third type of random protein databases were used: (1) a total of 4 long random protein sequences with 1200 amino acids each were generated, and (2) 2.5 × 10^{5 }substrings S[i : j] (1 ≤ i ≤ 500, i+201 ≤ j ≤ i + 700) of the each long protein sequence were added to the random database. Using the same method for computing C_{p}, parameter C_{i }was estimated as 0.508.
Spectral probabilities for PrSMs with two PTMs
Similar to PrSMs with one PTM, a mutated protein database was created to increase the number of identified PrSMs with two PTMs. Two glycine residues were added each protein in the ST protein database: one is at the onethird position of the protein; the other at the twothirds position. MSAlign+ identified 2,404 complete PrSMs with two PTMs, and TDGF was used to compute the spectral probabilities for the 2,404 PrSMs. The running time for computing spectral probabilities was 1,317 minutes (about 22 hours). Because it is extreme slow to find the best PrSM with two PTMs by searching a spectrum against a large random protein database with 10^{6 }proteins, the evaluation method based on conditional spectral probabilities was not used. Only the evaluation method based on FDRs was applied. With all the 2,404 identified PrSMs, FDRs based on spectral probabilities and based on the targetdecoy approach were computed for cutoff pvalues in {0.0001, 0.0002, . . ., 1.0000}. When the cutoff pvalue is smaller than 0.016 (− log pvalue >1.80), the FDRs estimated by the two methods are similar. For 1% FDR, the targetdecoy approach and the spectral probability approach estimated similar cutoff pvalues 0.0164 and 0.0116, respectively. However, the FDRs based on spectral probabilities are not consistent with the "correct" FDRs (reported by the targetdecoy approach) when the cutoff pvalue is larger than 0.016 (Figure 3). One possible reason is that the filtering method implemented in MSAlign+ fails to keep the best PrSMs when their pvalues are not small enough. From the above analysis, the spectral probabilities estimated by TDGF are accurate when they are smaller than 0.016.
Figure 3. A comparison of the FDRs of PrSMs with two PTMs estimated by the targetdecoy approach and computed based on spectral probabilities. For a given cutoff pvalue, the two reported FDRs are compared, and − log(FDR) (base 10) is plotted against − log(cutoff pvalue) (base 10).
Comparison with ProSightPC
All MS/MS spectra in the EC data set were deconvoluted by MSDeconv [20]. The EC proteome database was downloaded from the SwissProt database; a combined protein database was generated by concatenating the EC proteome database and a shuffled decoy database. To test the performance of TDGF on proteoform identification, MSAlign+ coupled with TDGF was applied to search the deconvoluted spectra against the combined database. The error tolerances for precursor masses and fragment masses were set as 15 ppm and two unknown PTMs were allowed. Using 1% spectrumlevel FDR, 1,478 spectra were identified.
ProSightPC [10] was also applied to analyze the EC data set. ProSightPC provides several search modes for topdown spectral identification, including the absolute mass mode and the biomarker mode. Since some spectra in the EC data set were generated from truncated proteins, the biomarker mode was chosen for the analysis of the EC data set. The error tolerances for precursor masses and fragment masses were set as 15 ppm. ProSightPC identified 627 spectra with 1% spectrumlevel FDR. All the 627 spectra were identified by MSAlign+ coupled with TDGF. The test results show that MSAlign+ coupled with TDGF outperformed the biomarker mode of ProSightPC.
Conclusions
The experiments showed that the extended generating function method achieves high accuracy in computing spectral probabilities of PrSMs with PTMs. It is a nontrivial extension of the generating function method proposed in [8]. With accurate spectral probabilities and Evalues, one can easily choose the correct PrSM from several candidate PrSMs for a spectrum, as well as separate correct PrSMs from incorrect ones identified from a large number of spectra. In addition, it provides a way to evaluate single PrSMs efficiently.
Competing interests
The authors declare that there are no competing interests.
Authors' contributions
XL, SL, and SK designed the TDGF method. XL implemented the TDGF method in JAVA. XL and MS did the experiments on tandem mass spectrometry data sets. XL and SL wrote the manuscript. All authors have read and approved the final manuscript.
Acknowledgements
This work was supported by a startup fund provided by Indiana UniversityPurdue University Indianapolis.
Declarations
Publication of this article was funded by a startup fund provided by Indiana UniversityPurdue University Indianapolis.
This article has been published as part of BMC Genomics Volume 15 Supplement 1, 2014: Selected articles from the Twelfth Asia Pacific Bioinformatics Conference (APBC 2014): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S1.
References

Eng JK, McCormack AL, Yates JR: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
Journal of the American Society for Mass Spectrometry 1994, 5:976989. PubMed Abstract  Publisher Full Text

Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probabilitybased protein identification by searching sequence databases using mass spectrometry data.
Electrophoresis 1999, 20:35513567. PubMed Abstract  Publisher Full Text

Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH: Open mass spectrometry search algorithm.
Journal of Proteome Research 2004, 3:958964. PubMed Abstract  Publisher Full Text

Craig R, Beavis RC: A method for reducing the time required to match protein sequences with tandem mass spectra.
Rapid Communication of Mass Spectrometry 2003, 17:23106. Publisher Full Text

Kim S, Mischerikow N, Bandeira N, Navarro JD, Wich L, Mohammed S, Heck AJR, Pevzner PA: The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: Applications to database search.
Molecular & Cellular Proteomics 2010, 9:28402852. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Elias JE, Gygi SP: Targetdecoy search strategy for mass spectrometrybased proteomics.
Methods in Molecular Biology 2010, 604:5571. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gupta N, Bandeira N, Keich U, Pevzner PA: Targetdecoy approach and false discovery rate: when things may go wrong.
Journal of the American Society for Mass Spectrometry 2011, 22:111120. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kim S, Gupta N, Pevzner PA: Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases.
Journal of Proteome Research 2008, 7:33543363. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tran JC, Zamdborg L, Ahlf DR, Lee JE, Catherman AD, Durbin KR, Tipton JD, Vellaichamy A, Kellie JF, Li M, Wu C, Sweet SMM, Early BP, Siuti N, LeDuc RD, Compton PD, Thomas PM, Kelleher NL: Mapping intact protein isoforms in discovery mode using topdown proteomics.
Nature 2011, 480:254258. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zamdborg L, LeDuc RD, Glowacz KJ, Kim YB, Viswanathan V, Spaulding IT, Early BP, Bluhm EJ, Babai S, Kelleher NL: ProSight PTM 2.0: improved protein identification and characterization for top down mass spectrometry.
Nucleic Acids Research 2007, 35:701706. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shen Y, Tolić N, Hixson KK, Purvine SO, Anderson GA, Smith RD: De novo sequencing of unique sequence tags for discovery of posttranslational modifications of proteins.
Analytical Chemistry 2008, 80:77427754. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Frank AM, Pesavento JJ, Mizzen CA, Kelleher NL, Pevzner PA: Interpreting topdown mass spectra using spectral alignment.
Analytical Chemistry 2008, 80:24992505. PubMed Abstract  Publisher Full Text

Tsai YS, Scherl A, Shaw JL, MacKay CL, Shaffer SA, LangridgeSmith PRR, Goodlett DR: Precursor ion independent algorithm for topdown shotgun proteomics.
Journal of the American Society for Mass Spectrometry 2009, 20:21542166. PubMed Abstract  Publisher Full Text

Karabacak NM, Li L, Tiwari A, Hayward LJ, Hong P, Easterling ML, Agar JN: Sensitive and specific identification of wild type and variant proteins from 8 to 669 kDa using topdown mass spectrometry.
Molecular & Cellular Proteomics 2009, 8:846856. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fagerquist CK, Garbus BR, Williams KE, Bates AH, Boyle S, Harden LA: Webbased software for rapid topdown proteomic identification of protein biomarkers, with implications for bacterial identification.
Applied and Environmental Microbiology 2009, 75:434153. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kellie JF, Tran JC, Lee JE, Ahlf DR, Thomas HM, Ntai I, Catherman AD, Durbin KR, Zamdborg L, Vellaichamy A, Thomas PM, Kelleher NL: The emerging process of top down mass spectrometry for protein analysis: biomarkers, proteintherapeutics, and achieving high throughput.
Molecular BioSystems 2010, 6:15329. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu X, Sirotkin Y, Shen Y, Anderson G, Tsai YS, Ting YS, Goodlett DR, Smith RD, Bafna V, Pevzner PA: Protein identification using topdown spectra.

Meng F, Cargile BJ, Miller LM, Forbes AJ, Johnson JR, Kelleher NL: Informatics and multiplexing of intact protein identification in bacteria and the archaea.
Nature Biotechnology 2001, 19:9527. PubMed Abstract  Publisher Full Text

Horn DM, Zubarev RA, McLafferty FW: Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules.
Journal of the American Society for Mass Spectrometry 2000, 11:330332.

Liu X, Inbar Y, Dorrestein PC, Wynne C, Edwards N, Souda P, Whitelegge JP, Bafna V, Pevzner PA: Deconvolution and database search of complex tandem mass spectra of intact proteins: A combinatorial approach.
Molecular & Cellular Proteomics 2010, 9:27722782. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tanner S, Shu H, Frank A, Wang LC, Zandi E, Mumby M, Pevzner PA, Bafna V: InsPecT: identification of posttranslationally modified peptides from tandem mass spectra.
Analytical Chemistry 2005, 77:46264639. PubMed Abstract  Publisher Full Text

Liu X, Hengel S, Wu S, Tolić N, PasaTolić L, Pevzner PA: Identification of ultramodified proteins using topdown spectra.
Journal of Proteome Research 2013, 12:58305838. PubMed Abstract  Publisher Full Text

Liu X, Mammana A, Bafna V: Speeding up tandem mass spectral identification using indexes.
Bioinformatics 2012, 28:16927. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.
Journal of Molecular Biology 1990, 215:40310. PubMed Abstract