Abstract
PeptideProphet is a postprocessing algorithm designed to evaluate the confidence in identifications of MS/MS spectra returned by a database search. In this manuscript we describe the "what and how" of PeptideProphet in a manner aimed at statisticians and life scientists who would like to gain a more indepth understanding of the underlying statistical modeling. The theory and rationale behind the mixturemodeling approach taken by PeptideProphet is discussed from a statistical modelbuilding perspective followed by a description of how a model can be used to express confidence in the identification of individual peptides or sets of peptides. We also demonstrate how to evaluate the quality of model fit and select an appropriate model from several available alternatives. We illustrate the use of PeptideProphet in association with the TransProteomic Pipeline, a free suite of software used for protein identification.
Introduction
In massspectrometry shotgun proteomics, the first phase of analysis is the identification of peptides in complex biological mixtures digested by enzymes such as trypsin. Dependent on the peptides in the biological mixture, an experiment will produce a certain number of spectra (call it N). MS/MS spectra are individually matched to peptides by searching through a database of peptides predicted from the genome of the organism. The way the searches are performed can be constrained using different search parameters, such as the number of tryptic termini (NTT), number of missed cleavages (NMC) or the mass difference of the observed precursor ion mass and the weight of the theoretical peptide (ΔM).
We will discuss PeptideProphet in the context of two database search algorithms: SEQUEST [1] and Tandem with the kscore plugin [2,3]. SEQUEST attempts to determine a direct correlation between an observed spectrum and sequences of amino acids in a protein sequence database. Typical quantities associated with SEQUEST include: XCorr, ΔCn, SpRank. Typical quantities associated with Tandem with the kscore plugin include: logDot (logarithm of dot product between observed and theoretical spectrum) and ΔDot. PeptideProphet can be used with any database search algorithm that returns a quantitative score.
Given a database search algorithm, every spectrum that is observed will be scored against the peptides in the database. For each spectrum, the highest scoring peptide (depending on the scoring criterion) is typically chosen as the best match. The best match is the potential peptide sequence that generated its corresponding observed spectrum. Thus, we have N spectra that have been matched to a peptide and we will refer to these spectra as identified spectra.
The necessity of PeptideProphet arises because the spectra are subject to noise making it difficult to determine if the peptide that it is matched to is correct. The spectrum itself is generated from a peptide sequence and peaks can be missing or reduced in intensity. Because the spectrum that is being generated is subject to noise the databasebased criterion will vary when comparing theoretical spectra to observed spectra. Additionally, when searching the database, the correct peptide sequence may be absent. Because of this noise, how do we determine confidence in an identified spectrum? Traditional standards (such as just accepting all above XCorr > 2.5) does not reflect the quality of the identification. Such a rule may accept too many incorrectly identified spectra. Thus, statistical inference is needed to model the presence of noise.
PeptideProphet [4] is a postprocessing and rescoring algorithm for determining confidence in identified spectra found using a database search. PeptideProphet is one of the first methods for the assessment of confidence. It is based on a probability model and an Empirical Bayesian approach to model fitting. It is now not a single model, but a family of models [5].
The overview of PeptideProphet is as follows:
1. Rescoring: produce a score which reflects the quality of an identified spectrum, while summarizing multiple quantities, such as XCorr and ΔCn or logDot and ΔDot. The rescoring separates incorrectly and correctly identified spectra scores as much as possible.
2. Modeling: produce a probabilitybased model for the distribution of correctly and incorrectly identified spectra. The model must be then fit to the scores of all identified spectra.
3. Evaluation of the Quality of Fit: determine how well the scores fit the probabilitybased model.
4. Inference
(a) Evaluation of confidence in individual identified spectra using the posterior probability.
(b) Evaluation of confidence in sets of identified spectra: produce a cutoff on the scores to determine a set of correctly identified spectra while controlling the FalseDiscovery Rate, defined as the expected proportion of false positives.
We will first discuss the basic version of PeptideProphet and then discuss the three extensions.
Materials
Human plasma dataset
This dataset uses the first LCMS/MS replicate file from the Western Consortium of the National Cancer Institute's Mouse Models of Human Cancer [6]. The data was obtained using the Multiple Affinity Removal System and was matched using a semitryptic SEQUEST search against an IPI human protein database allowing a 3 Dalton mass tolerance and 01 missed cleavage sites. More details on the spectra can be seen in [7].
Controlled mixture
This dataset uses spectra generated from a linear ion trap Fourier transform instrument that was published in [8]. In particular the spectra from Mixture 3 was used, where 16 known trypsindigested proteins from different mammals were analyzed. Spectra were also matched using a semitryptic SEQUEST search against a database file with the 16 known proteins concatenated with human influenza proteins allowing a 3 Dalton mass tolerance and 02 missed cleavage sites. Matches to human influenza proteins are known to be incorrect. More details on the dataset can be seen in [8].
Methods
Statement of the problem from a statistical perspective, and terminology
Every statistical approach requires the definition of the following components in the problem:
1. PeptideProphet works with the observed spectra as the experimental unit where we have N observed spectra with N being generally large (in the thousands or more). Since the number of spectra N is typically very large, the identified spectra can be viewed as the underlying population.
2. An observed score is interpreted as a test statistic. In statistics the summarized score S is called a test statistic because it is the function of the observed experimental unit that is being used to answer our hypotheses.
3. PeptideProphet assumes that the test statistic comes from a mixture of two distributions: one from the distribution of correct identifications, and the other from the distribution of the incorrect identifications. The distributions may be characterized by a few parameters (parametric) or many parameters (semi or nonparametric).
4. The goal of PeptideProphet is to test two competing hypotheses for each identified spectrum. Let T_{i }be the true status of identified spectrum i where T_{i }= 0 indicates that the identified spectrum was incorrectly identified and where T_{i }= 1 indicates that the identified spectrum was correctly identified. We then wish to compare:
5. Inference: confidence is determined for individual spectra or sets of spectra.
• If the researcher is interested in a set of spectrum identifications, the False Discovery Rate should be controlled.
We determine the confidence in a set of spectra by controlling the False Discovery Rate. The False Discovery Rate, given a cutoff δ, is the expected proportion of all scores S_{i }>δ that are truly incorrect (the proportion of accepted identified spectra that are false positives). This situation is synonymous to performing N multiple hypothesis tests where using the values in Table 1. P (R > 0) is assumed to be 1 when we perform many tests (N is large). The False Discovery Rate is the expected proportion of incorrectly rejected null hypotheses out of the total rejected hypotheses. For a given cutoff if we were to repeat the experiment an infinite number of times and use the same cutoff each time the expected False Discovery Rate is the average proportion of incorrectly identified and accepted spectra out of the total number of incorrectly identified spectra.
Table 1. Table of multiple hypothesis testing quantities
An alternative confidence rate that is rarely used is the False Positive Rate (FPR). The False Positive Rate, given a cutoff δ is the expected proportion of all truly incorrectly identified spectra that are considered to be correctly identified. From the terms in Table 1 it is represented by
Many users prefer the qvalue which is the minimum False Discovery Rate required for a score s_{i }to be considered significant. It is represented by , where Γ represents the set of all possible cutoff scores [9]. This confidence measure is used to describe a score s_{i }at a single point but examines the False Discovery Rate of all possible scores. Unlike the False Discovery Rate, the qvalue is a monotonic quantity with respect to the score cutoff.
• If the researcher is interested in specific spectrum identifications the posterior error probability is most commonly used as it quantifies the confidence of a single identified spectrum.
The posterior error probability represents which we also denote as PEP. In other words using a probability model for S_{i}, we can find the probability of an identified spectrum being incorrect given its test statistic. Note that we can also calculate which is the probability of an identified spectrum being correct given its test statistic. The posterior error probability is also called the local false discovery rate (locfdr) [10,11].
Alternatively the pvalue can be used. If s_{i }is the ith observed score then the pvalue represents , or the probability of observing a score equal to or greater than s_{i }assuming that the ith identified spectrum was incorrectly identified. The pvalue is similar to the FPR in that the pvalue is the probability of observing a score equal to or greater than s_{i }assuming that it is one of the N_{0 }truly null hypotheses.
For each spectrum, PeptideProphet establishes a score reflecting the quality of an identified spectrum
First each spectrum (experimental unit) is observed and potentially identified using a databasebased criterion (XCorr, ΔCn, logDot, Δdot, etc.), PeptideProphet rescores the identified peptide with a discriminant function, using the databasebased criterion as the covariates for fitting the discriminant function. The goal is to fit a function that separates correct scores from incorrect scores. If S_{i }is the summarized score for the ith identified spectrum from a SEQUEST search result, a discriminant function produces a linear function f:
such that S > 0 for correctly identified spectra and S < 0 for incorrectly identified spectra.
If S_{i }is the summarized score for the ith identified spectrum from a Tandem search result, a linear discriminant function is used but with different coefficients:
In the basic version of PeptideProphet the β's are estimated empirically from a controlled mixture and are dependent on the precursor ion charge (i.e. a separate discriminant function was trained for 1+, 2+, 3+ precursor ion charges).
PeptideProphet relates observable and unobservable quantities via a joint probability distribution
PeptideProphet relates scores S_{i }to parameters via a sampling distribution of the test statistic under H_{0i }and H_{ai}. All scores S_{i}'s are independent and identically distributed (iid). The sampling distribution of S_{i }is assumed to follow a Normal(μ, σ) distribution if the identified spectrum is correct (T_{i }= 1) and Gamma(α, β, γ) distribution if the identified spectrum is incorrect (T = 0). Notationally we have that and that . Note that other forms of the distribution of scores for incorrect identifications such as the Gumbel distribution are often used with no effect on the theory presented here. Among all identified spectra an additional parameter π_{0 }is used to represent the overall proportion of incorrect identifications of identified spectra in the population. This formulation results in a 2group mixture model similar to what is established by Efron [10] where we may write that
The last equality is due to the fact that all scores are independent and identically distributed (iid). Due to different discriminant functions being used for each charge, a different sampling distribution and set of parameters are produced for each precursor ion charge (we will refer to this simply as the charge).
There may be additional information available, such as the NTT (number of tryptic termini), NMC (number of missed cleavages), and ΔM (delta mass) that can be used to improve the estimation of the sampling distribution of the identified spectra [7,12,13]. For example, the use of NTT = 0 in unconstrained searches often leads to improved estimation of the parameters even in lower quality datasets [5]. This is incorporated into the model above by assuming the existence of additional distributions for incorrect and correct identifications:
Note that the density functions of f_{T}_{=0}, _{NTT}, f_{T}_{=0, NMC}, f_{T}_{=0}, _{ΔM}, f_{T}_{=1, NTT}, f_{T}_{=1}, _{NMC}, and f_{T}_{=1}, _{ΔM }are discrete. It is assumed, conditional on the identified spectrum being incorrect or correct, that the members of (S_{i}, NTT_{i}, NMC_{i}, δM_{i}) are independent, as shown above.
PeptideProphet estimates parameters of interest in an Empirical Bayesian approach
PeptideProphet is considered an Empirical Bayesian approach because it uses each identified spectrum twice: once to estimate via the ExpectationMaximimzation [14] algorithm the parameters of the sampling distribution (π_{0}, μ, σ, α, β, and γ) and second to estimate the confidence in the correctness of an identified spectrum. The EMalgorithm iterates between two steps, called the Estep and the Mstep in order to estimate the value of model parameters. With a large enough set of identified spectra (say 100), the EMalgorithm will always converge [14]. The algorithm starts with initial values of model parameters π_{0}, μ, σ, α, β, and γ.
In the Estep, given the estimated values of the model parameters, the probability of each score being correct (or incorrect) is calculated. Given a single observed score s_{i }and its correctness status T_{i}, usage of Bayes Theorem yields which corresponds to the ratio of the Gamma density scaled by π_{0 }over the sum of the Gamma and Normal densities scaled by π_{0 }and 1  π_{0 }at score s_{i}.
In the Mstep, given estimated membership probabilities for each score s_{i}, the model parameters are reestimated by finding the values with the maximum likelihood. The estimate of π_{0 }is . For the Normal distribution the estimates of μ and σ^{2 }are:
For the Gamma distribution, the estimate of γ is simply the minimum of the scores s_{i}, i = 1, ..., N. In order to estimate α and β let and . Then the estimates of α and β are
Due to the speed of the algorithm in working with only two mixture components, the process of the E and Mstep can be iterated repeatedly until the model parameters do not change by a specified ε where ε is a small number, such as 0.0001. The algorithm then outputs estimated parameters of α, β, δ, μ and σ, as well as the estimate of π_{0 }(denoted with hats when estimates). The algorithm is detailed in Figure 1. Figures 2b and 2a shows two fits of PeptideProphet to the Human Plasma dataset of charges 2 and 3. Note that the EM algorithm can be substituted for alternative algorithms such as the Method of Moments.
Figure 1. Pseudocode of the EMalgorithm for iteratively estimating model parameters and membership probabilities.
Figure 2. PeptideProphet fits on the Human Plasma Dataset. PeptideProphet fits on the Human Plasma Dataset with Tandem Scores on charges 2 (left) and 3 (right). The blue and red curves correspond to the fitted frequency curves of the correct (Normal) and incorrect (Gamma) distributions. The Charge 2 fits yields a mixture distribution with a much stronger separation than the fit to Charge 3.
Evaluation of the quality of fit of PeptideProphet
Deviations of the assumptions, or a low number of identified spectra can lead to an inadequate or unstable model fit and incorrect conclusions. This can be diagnosed by visual inspection, and also by the bootstrap. We recommend using visual inspection over goodness of fit tests as tests do not explore the specific fitting issues that may influence subsequent inference of the identified spectra. In fact goodness of fit tests simply attempt to summarize the goodness of fit into one summary statistic whereas we are typically interested in the fit at certain locations of the mixture distribution. There are several visual attributes of the mixture distribution that researchers should be aware of and some remedies for them.
Do the empirical scores follow the fitted curves well? Particular attention needs to be paid to the tails of the distributions, especially the right tail of the distribution of scores of incorrect identifications (red) and the left tail of the distribution of scores of correct identifications (blue). This is often of most interest to researchers as the identified spectra in these regions are considered to be borderline correct or incorrect. In the case of Figure 2b the curves fit the histogram well but in Figure 2a there are many mismatches in the bars and the fitted curves. The culprit of these mismatches is likely due to the small number of spectra. The right portion of the Normal distribution is fit with approximately only 30 spectra. If the data is comprised of a large number of spectra but is deviating from the fitted curves, robust procedures can also be considered and will be discussed later.
Do the curves highly overlap? Although high overlap does not necessarily indicate a poor fit it will lead to smaller sets of confidently identified spectra. Overlaps that occur in situations of highly constrained searches can be remedied with techniques in later sections. Overlap in the case of a small number of spectra (Figure 2a) may be remedied by artificially adding observations using decoys which will also be subsequently demonstrated.
An issue that is not commonly addressed however is the number of identified spectra available to fit the mixture model. The number of identified spectra required to fit a reliable model depends highly on the separation and the form of the observed scores. A statistical approach to examine the stability of the fitted model can be done via the bootstrap.
Bootstrapping can be performed by sampling with replacement B samples (spectra) where each is of size N from the original dataset. At least 100 to 500 bootstrapped samples are recommended. For each bootstrapped sample b, we can refit the PeptideProphet model to receive bootstrapped estimates of , , , , , and . The bias, variance, and mean squared error (MSE) of the procedure used to estimate a parameter can be found using the bootstrapped estimates. In the case of μ, the bootstrap bias estimate is . Large biases imply that the estimation procedure is systematically over or underestimating the true value of a parameter. Note that as B increases the bias does not move towards 0. The bootstrap variance estimate is defined as . Smaller variability is desired. The bias and variability of an estimation procedure is often summarized using the mean squared error, which is .
Three hundred bootstrapped samples for the Human Plasma data for charges 2 and 3 were performed and the bootstrapped estimates for π_{0}, μ, and σ are shown in Figure 3. Although the means of the bootstrapped distribution are close to the original estimates (marked in red) the bootstrapped distributions for these parameters are more skewed for Charge 2 than for Charge 3. Additionally the variance of the bootstrapped estimates is significantly greater in the Charge 2 case for μ and σ showing how unstable the estimates for the Charge 2 distribution given the small number of identified spectra.
Figure 3. Bootstrapped samples of π_{0}, μ , and σ for Charges 2 and 3 of the Human Plasma data. The original estimates are marked by the vertical line. The length of the horizontal axes are equal for the plots of a particular parameter. The Charge 2 distributions are slightly skewed compared to Charge 3 distributions and the mean squared errors are much greater in Charge 2 distributions. The variability of the Charge 2 distributions are visibly much greater indicating unstable estimates.
The mean squared error summarizes the overall deviation of parameter estimates from B bootstrapped samples to the original estimates. The experimenter may also view the deviations that occur between the original sample and a single bootstrapped sample. Although a histogram of both samples would suffice, a quantiletoquantile plot is an easytoread plot that exemplifies the deviations between the two plots. The quantiletoquantile plot plots the quantiles of one distribution versus the matched quantiles of the other. For example if there are 10 values in two datasets the quantiletoquantile plot would display the 10, 20, 30,..., and 100th percentiles of one distribution matched with the respective 10, 20, 30, ..., and 100th percentiles of the second distribution. Distributions that are alike should result in a quantiletoquantile plot that is linear. Deviations from linearity at different quantiles in the plot imply differences between the two distributions at those associated quantiles. Although no quantiletoquantile plot will be perfectly linear the plot should not deviate much at the center and right portions of the plot as the accuracy of the estimated confidence of identified spectra relies heavily upon a good fit at these locations. The quantiletoquantile plot for Charge 2 in Figure 4 displays the deviation in quantiles of the original mixture distribution and the quantiles of a random bootstrapped sample. The deviations noticeably occur in the right half of the plot which corresponds to the right portion of the axis in Figure 2a indicating that the instability of the estimate is due to the right half of the plot. More specifically, it is due to the low number of identified spectra in this area of the plot.
Figure 4. Quantiletoquantile plot comparing the quantiles of the original mixture distribution of the Human Plasma data. Quantile to quantile plot comparing the quantiles of the original mixture distribution of the Human Plasma data for Charges 2 (left) and 3 (right) compared to the quantiles of randomly bootstrapped samples. The quantiletoquantile plot for Charge 2 shows more deviation in quantiles due to the low number of identified spectra in the score range between 2 and 8.
Estimating the confidence of spectrum identifications
Estimating the confidence of a set of spectrum identifications
In order to determine the correctness of the spectrum identifications, a decision rule is defined where any spectrum identification with a score above δ is concluded to be correct. In many experiments we are interested in the statistical properties of the list of spectrum identifications with scores above δ.
In order to estimate the False Discovery Rate given a decision rule cutoff two approaches may be used. Because all scores are assumed to follow the same fitted distribution the False Discovery Rate can be estimated with [15]. This can be seen by using the areas under the colored curves in Figure 5. In a second approach, PeptideProphet traditionally estimates the False Discovery Rate by interpreting the posterior error probabilities as local false discovery rates [10,11]. The estimated overall False Discovery Rate at point t is the average of the estimated local false discovery rates of identified spectra with scores greater than .
Figure 5. The PeptideProphet fit to the Human Plasma dataset of Tandem scores of Charge 2. The PeptideProphet fit to the Human Plasma dataset of Tandem scores of Charge 2 with fitted frequency curves from Figure 2b. The four confidence measures of the Posterior Error Probability (PEP), pvalue, False Discovery Rate (FDR), and False Positive Rate (FPR) are shown at a score of 1. The Posterior Error Probability at 1 is 0.156 and the estimated False Discovery Rate is 0.083. The pvalue and FPR are equivalent and equal to 0.004. In the formula for the estimated FDR, red is the estimate for V from Table 1 while blue combined with red is an estimate for R from Table 1.
The False Positive Rate for a cutoff t can also be estimated using the area under the fitted frequency curve of the distribution of scores for incorrect identifications as seen in Figure 5. Mathematically this is equivalent to the pvalue, or since each incorrect score follows the same distribution. Note that the False Positive Rate ignores the distribution of scores for correct identifications.
The estimation of the qvalue at a specific point ρ requires the estimation of the False Discovery Rate at every point s_{i }from i = 1, 2, ..., N. The qvalue for a point ρ is the minimum False Discovery Rate among all points s_{i }such that s_{i }≤ ρ. The estimated False Discovery Rate can be found using the modelbased estimates or by interpreting each posterior error probability as a local false discovery rate. The qvalue is often useful if a monotonically increasing error rate is desired for decreasing cutoff values. For example, in the case of Figure 5 suppose the experimenter was only interested in scores around 4. Using modelbased estimates, the estimated False Discovery Rate with a cutoff at 4 is 0.01503874 but the estimate of the False Discovery Rate with a cutoff at 3.8 is 0.01489971 suggesting that the error rate is lower for a lower cutoff value. To avoid this issue, the qvalue can be used as it finds the minimum False Discovery Rate at each cutoff value. The qvalue at 4 is 0.01489939 (found using increments of 0.01 searching all FDR values from 4 to 4).
Estimating the confidence of an individual spectrum identification
We now discuss the estimation of the posterior error probability and the pvalue. These measures are properties of a single spectrum and are synonymous to performing a single hypothesis test. In Figure 5 the posterior error probability and pvalue only apply to spectra at a single point.
According to Bayes Theorem the posterior probability of T_{i }= 0 (our hypotheses of interest) given its test statistic is . Following the Empirical Bayesian step where parameters are estimated we have that . Because the posterior error probability is equivalent to the local false discovery rate we also have that .
The pvalue is estimated as which is the right tailend of the Gamma density past s_{i}.
The posterior error probability may be preferred over the pvalue because it also yields an estimate for the probability of an identified spectrum to being correct (1  PEP). The advantage of the pvalue is that it only requires the use of the distribution of scores for incorrect identifications as it ignores the distribution of scores for correct identifications. Notice that in Figure 5 the pvalue at a score of 1 is a low value of 0.004 but that the Posterior Error Probability at 1 is a much higher value at 0.156.
PeptideProphet can use a decoy database to estimate the parameters of the distributions of scores for incorrect identifications
When there is significant overlap between the two density functions or a low number of identified spectra it is difficult for the EMalgorithm to estimate π_{0 }and the parameters of the Gamma and Normal distributions. In this case PeptideProphet employs the TargetDecoy approach to better estimate the Gamma distribution. We first describe the two forms of TargetDecoy: the concatenated strategy and the separate strategy [16,17]. The objective of both strategies is to introduce decoys in order to estimate the error rate since decoys are known to be incorrectly identified spectra. Reversed sequences (decoy sequences) are commonly generated by taking the target database and reversing each target sequence. Alternative methods are to use randomized sequences where amino acid sequences are generated using a prespecified probability distribution [16].
In the concatenated TargetDecoy strategy each spectrum is searched in a single database that is composed of both target and decoy sequences. This involves competition between the best correct peptide sequence, the best incorrect forward peptide sequence, and the best (incorrect) decoy peptide sequence. Hits where the best incorrect decoy peptide sequence is found to be the match are used to estimate the FDR.
In the separate TargetDecoy strategy each spectrum is searched once in the forward database and searched again independently in the decoy database. The distribution of scores from the peptides identified via the decoy database is used to estimate the form of the distribution of incorrectly identified spectra. This approach ignores competition between forward and decoy sequences.
The semisupervised version of PeptideProphet utilizes the concatenated TargetDecoy strategy by simply combining the target and decoy sequences into the same database. The decoy scores are forced to only contribute to the estimation of α, β, and γ of the Gamma distribution. PeptideProphet accomplishes this by assuming any decoy match has a posterior error probability of 1. In the EMalgorithm as described earlier, p_{i }for any decoy is assumed to be 1 at every iteration. The semisupervised version of PeptideProphet helps estimate the parameters of the Gamma distribution better and thus indirectly improves the estimation of π_{0}, μ, and σ as well. As seen in Figure 6a for the case of the Human Plasma dataset the improved estimation of the distributions also increased the separation between the distributions. As seen in Figure 6b the use of decoys helped prevent the possible mistake of having high confidence in scores around the 0 to 1 range.
Figure 6. Semisupervised estimation of parameters. Semisupervised estimation of parameters of the same distribution of scores as in Figures 2b and 2a. For Charge 3 the slight rightward shift of the Gamma distribution (distribution of scores for incorrect identifications) also encouraged a large rightward shift of the Normal Distribution (distribution of scores for correct identifications). The two vertical lines indicate the means of the Normal distributions. The addition of decoys for Charge 2 allowed to algorithm to learn that most of the identified spectra with scores from 0 to 1 are likely to be incorrect. Without decoys this may have been overlooked.
PeptideProphet can use a decoy database for semiparametric estimation of the probability distribution
The quality of fit of the Gamma and Normal distributions may rely on how the database is searched (constrained versus unconstrained search) or the search algorithm that is used [12]. As is the case in many statistical modelings, there is no guarantee that the scores of the identified spectra necessarily follow the Gamma and Normal distributions. Previously, decoys were used to estimate parameters of prespecified distributions. Now we will use decoys for datadependent estimation of the distributions themselves.
One approach is to estimate the distributional forms using a kernel density (semiparametric) approach [12] as opposed to maximum likelihood estimation. Kernel density estimates first discretizes the horizontal axis into bins. For a specified bandwidth h, the distribution of scores for incorrect identifications is estimated using where n_{0 }is the number of decoys, K is the Normal density function, and S_{i }is the score of decoy i. The greater the h the smoother the function while the smaller the h the more rough the function. The parameter h can be specified using any method such as using the mean integrated square error. Crossvalidation can be used as well. The distribution of scores for correct identifications is estimated in the same fashion as well but using only forward scores. Pseudocode of the semiparametric approach can be seen in Figure 8.
Figure 7. The Controlled Mixture dataset fit with the basic PeptideProphet and the semiparametric version. The Controlled Mixture dataset fit with the basic PeptideProphet and the semiparametric version of PeptideProphet utilizing the kernel density estimator. The smoothed estimator allowed for a more finetuned fit to the estimated (asymmetric) distribution of the correctly identified spectra.
Figure 8. Pseudocode of the semiparametric version of PeptideProphet.
An example of this approach can be seen in Figure 7. The parametric fit of the distribution of scores for correct identifications clearly deviates from the Normal curve as the mode of the correct hits is shifted to the right. The semiparametric approach produces a curve that more robustly fits the leftskewed distribution of scores for correct identifications.
To avoid overfitting, this approach should only be used in the cases of strong deviations between the fitted distributions and the observed scores, such as the parametric fit (dashedlines) in Figure 7. Overfitting typically occurs in experiments with a small number of spectra, such as in Figure 2a. Overfitting can be checked via bootstrapping by seeing if bootstrapped samples do not reflect the same need for a semiparametric fit at certain score values. This can be done via quantile to quantile plots or by checking mean squared errors. If users anticipate good separation, parametric PeptideProphet is often sufficient for practical purposes.
PeptideProphet can be extended to dynamically estimate the coefficients of the discriminant function from the data
Overlap in the distributions of scores of correct and incorrect identifications can be due to a suboptimal scoring function, which does not discriminate well between the properties of correct and incorrect identifications. This often occurs in cases of constrained searches where the database that is searched is much smaller than the unconstrained search space that was used to find the coefficients in the fixed discriminant function. For additional information on constrained versus unconstrained searches, see [5]. A solution to this is to adapt the discriminant function to each experiment or search approach which can improve the separation between the distribution of scores for incorrect and correct identifications [13].
Pseudocode of the adaptive version of PeptideProphet can be seen in Figure 10. The main step in the algorithm is to update β's from Equations 1 or 2 by extracting identified spectra with high posterior error probabilities and identified spectra with low posterior error probabilities. When retraining the β's the algorithm will randomly sample identified spectra with low posterior error probabilities I times and produce I different estimates. The average of these I β's is the updated β. This entire step is repeated by reestimating posterior error probabilities and updating β until the β do not change by a small ε.
Figure 9. Semiparametric fits with dynamically estimated coefficients. Semiparametric fits of the distributions of scores for correct and incorrect identifications on the Controlled Mixture Dataset from a constrained search (tryptic peptides, narrow mass tolerance) using fixed discriminant coefficients (left) versus adaptive discriminant coefficients (center). The right tail of the distribution of scores for incorrect identifications can be seen penetrating the distribution of scores for correct identifications more deeply in the fixed case implying greater discriminative ability when using the adaptive discriminant function. The improved performance of adaptive coefficients can be seen in the plot of the estimated FDR versus the estimated number of significant correctly identified spectra (right). Recall that in this dataset, target scores are assumed correct. The estimated FDR here was estimated by the ratio of the number of decoys to target scores.
Figure 10. Pseudocode of the adaptive version of PeptideProphet.
The improvement of the adaptive discriminant function over the fixed discriminant function for the Controlled Mixture dataset in a constrained search space is displayed in Figure 9. Only tryptic peptides with a narrow mass tolerance were searched.
This approach is also useful for incorporating lower ranked peptide matches (i.e. for a given spectrum, instead of only considering the best peptide sequence match, use the new discriminant function to also rescore peptide sequence matches that ranked close to the best peptide sequence match). Every time a new discriminant function is estimated (when the I are averaged) a new summarized score is calculated for the top 5 (can be changed of course) Peptide Matches for every spectrum. The highest scoring peptidespectrummatch is used in the training of the next discriminant function.
Implementation of the PeptideProphet in the TransProteomic Pipeline
The TransProteomic Pipieline (TPP) is an open source program developed at the Institute for Systems Biology designed for complete proteomic analysis starting from spectrum identification to protein identification and quantification and can be downloaded from http://sourceforge.net/projects/sashimi/ webcite[18]. In this section we assume that search results have already been converted to pepXML files, which is the standard input for PeptideProphet. A discussion of this can be found at (http://tools.proteomecenter.org/wiki/index.php?title=TPP_Tutorial webcite).
We present an example using the Human Plasma dataset where the spectra are searched through Tandem with the kscore plugin with TPP version 4.4. PeptideProphet automatically models all precursor ion charges and outputs the probability of correct identification. A mixture model using the Normal for the distribution of correct scores and a Gumbel distribution for the distribution of incorrect scores.
In Figure 11 of the 17543 identified spectra are listed. The first column lists the probability of correct identification (1  PEP), so numbers close to 1 here are desirable. The remaining columns list, in order, the spectrum label, Tandem expect score, the fraction of ions matched, the peptide sequence match, the protein match, and the calculated neutral peptide mass. In this example any protein label with a "rev" is a decoy. Each hyperlink will lead to additional information. For example, clicking on a peptide sequence will lead to a BLAST search or clicking on the fraction of ions matched will display the observed spectrum.
Figure 11. pepXML viewer from TPP. The output of PeptideProphet is stored in pepXML format. The pepXML viewer visualizes the content of pepXML and posterior probabilities associated with each identified spectrum.
Clicking on 0.7664, or the ninth entry "2b_plasma_0mM_C1.00024.00024.1" on the identified spectra list, results in information of the model fit by PeptideProphet in Figure 12 and the estimated parameter values for charge 2 in Figure 13.
Figure 12. Scoring results for identified spectra from a PeptideProphet fit in TPP. PeptideProphet output of sensitivity error analysis and figures of estimated mixture models. The bottom portion shows the fitted curves for different charges. The light blue curves represent the distribution of scores for incorrect identification, purple for correct identifications, and black the sum of the two distributions. The red vertical line also indicates the score for the identified spectra that we clicked on with its additional information at the bottom of the figure.
Figure 13. Parameter estimates for a PeptideProphet fit in TPP. Estimated parameter values of the PeptideProphet mixture model for charge 2. The parameters of accurate mass difference (ΔM) are not fully displayed.
We will now discuss how to use the information in Figures 12 and 13 to estimate the confidence measures discussed previously:
1. False Discovery Rate: estimates of the False Discovery Rate can be obtained three ways. In the upperright hand corner of Figure 12 estimated False Discovery Rates under the "Error" column is given for 1  PEP values under the "Min Prob" column. In other words, "Min Prob" represents the minimum posterior probability of being correct in order to conclude that an identified spectrum is correct. For example, a "Min Prob" of 0.95 implies that only identified spectra with PEP's lesser than 0.05 are considered correct or that (1  PEP) must be greater than 0.95 to be considered correct.
A second approach is to use the estimated model parameters in Figure 13 to estimate the False Discovery Rate for identified spectra of charge 2. The estimate for (1  π_{0}) is 0.04 which yields an estimate of π_{0 }as 0.96. The Normal's (Gaussian) estimated mean μ is 2.6 with an estimated standard deviation σ of 1.90. The Gumbel's estimated μ_{G }parameter is 1.16 with an estimated β parameter as 0.76. Alternatively, the expected value (mean) of the Gumbel is 1.16 with a standard deviation of 0.98. If the experimenter is not interested in NTT, NMC, and ΔM, for a cutoff score t, the estimated FDR can then be estimated by where P(S >tT = 0) is found using the Normal distribution and P(S >tT = 1) is found using the Gumbel distribution. Suppose the experimenter wanted to restrict the FDR calculation to identified spectra with only 0 missed cleavages. According to the output in the distribution of correct scores, a randomly selected correctly identified spectra has a 0.926 probability of having 0 missed cleavages and a 0.074 probability of having 1 to 2 missed cleavages. For the distribution of incorrect scores, probabilities are 0.404 and 0.596 for 0 and 1 missed cleavages respectively. The estimated FDR would then be
The calculation takes into account that among the correctly identified spectra, it is estimated that a majority of the identified spectra have 0 missed cleavages.
A third approach of estimating the False Discovery Rate is to download all posterior probabilities, convert them to posterior error probabilities (local false discovery rates) by taking the complement, define a cutoff point t and then to calculate .
2. False Positive Rate or pvalue: using the Gumbel's estimated parameters, the false positive rate can be found by looking at the tail area.
3. qvalue: the qvalue at a specific point δ can be calculated by estimating the False Discovery Rate at the score value of every identified spectra and then by finding the minimum False Discovery Rate among all scores .
4. Posterior Error Probability and Local False Discovery Rate: these are most easily found by finding the complement of the values in the first column of Figure 11 or by looking at the complement of "prob" at the bottom center of Figure 12. Note that these probabilities automatically incorporate NTT, NMC, and ΔM. If the experimenter was interested in the posterior error probability of a score independent of NTT, NMC, and ΔM, this can still be calculated using the estimated model parameters.
All inference for semisupervised and semiparametric PeptideProphet cases are identical. Inference would be identical for the adaptive version of PeptideProphet but it is not implemented in TPP at this time but is available from the authors upon request.
Following the execution of PeptideProphet the next step in analysis is often the identification of proteins present in the sample. In this different analysis, the experimental unit changes from being a spectrum to a peptide. TPP can be used to run ProteinProphet, a computational algorithm that can utilize PeptideProphet's estimated probabilities to determine the probability for the presence of proteins in two steps [19]. In the first step the posterior probability of a peptide being correctly identified from PeptideProphet is decreased for peptides that are the only peptide linked to a protein and increased for peptides that are linked to proteins explained by many peptides. In the second step the probability of a protein being in the sample is calculated as the probability that at least one of its associated peptides were identified in the sample. This is if is the adjusted probability of a peptide being in the sample where i is indexed from 1 to the number of peptides linked to the protein in question.
Discussion
PeptideProphet is available for use on the TransProteomic Pipeline with many other database search tools (X!Tandem, MASCOT, OMSSA, Phenyx, ProbID, InsPecT, MyriMatch). The statistical approach of PeptideProphet is generalizable to any database search algorithm that returns a quantitative score for each identified spectrum.
Although we used the Gamma and Normal distributions to model the components of the PeptideProphet model, there are no limitation to the choice of parametric distribution for describing the distributions of scores for incorrect and correct identifications in PeptideProphet. The Gumbel distribution, with parameters μ and β is another common distribution used for the distribution of scores of incorrect identifications. A generalization of the Gumbel distribution is the Extreme Value Distribution. Additional information, such as the NTT, may be incorporated into the summarized score by using a different machine learning approach instead of a discriminant function. Quantities like the NTT were left out of the summarized discriminant score due to its discrete nature. For example, the logistic regression function would allow discrete and continuous covariates to be transformed into a single summarized score while separating identified spectra with T = 0 from identified spectra with T = 1.
The TargetDecoy approach used in this manuscript is an approach that pioneered the use of decoys for the estimation of the False Discovery Rate and its results are often compared to other techniques [16]. For the estimation of the False Discovery Rate PeptideProphet and TargetDecoy methods in our experience produce similar results especially when the semisupervised version of PeptideProphet is used as its search approach is similar to the concatenated version of TargetDecoy. In fact, PeptideProphet can be considered as an extension of the concatenated version of TargetProphet because of its additional modeling objectives. PeptideProphet simply has distributional assumptions and can be used to estimate confidence of individual spectrum identifications or sets of spectrum identifications (local and global FDR estimates) whereas targetdecoy is limited to sets (global FDR estimate only). Also, if there is heavy overlap TargetDecoy will outperform basic PeptideProphet but Semisupervised PeptideProphet and TargetDecoy should be similar.
An alternative approach which relaxes the parametric assumptions is the variable component approach which uses an unknown mixture of Gaussians to represent the incorrect and correct distributions of scores [12]. The correct distribution is represented by a mixture distribution of k_{0 }normal distributions (that may have different means and variances) and the incorrect distribution is represented by a separate mixture distribution of k_{1 }normal distributions. Parameters k_{0 }and k_{1 }are unknown. Each score s_{i }is a member of either the overall correct or incorrect distributions, but are then further assigned as a member to one of the subcomponents of the mixture representing the correct or incorrect distribution. Gibbs sampling is used to estimate the forms of the subcomponents (which also suggests the complexity of this approach). Although the variable component and kernel methods perform similarly there are minor computational and modeling issues to consider [12]. The advantages to the variable component method are that: (1) The model is still parametric, which may help reduce the chance of overfitting, (2) Kernel estimation may over fit, especially if the bandwidth is too low, and (3) It does not completely rely on decoys for the negative whereas kernel density estimation uses decoys only for estimating the negative distribution. The advantages of the kernel approach are that: (1) The variable component method is much more computationally intensive and more complicated (and thus the Kernel Estimation is less intensive), (2) The variable component method requires the specification of priors, and (3) Kernel estimation is very well known and commonly used.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
K.M. implemented the statistical analysis framework, analyzed the datasets and wrote the manuscript. O.V. supervised the statistical aspects of the work, and wrote the manuscript. A.N. supervised the statistical and the mass spectrometrybased aspects of the work.
Acknowledgements
The authors would like to thank Hyungwon Choi for providing Rcode for the PeptideProphet model fits. The work was supported in part by the NSF CAREER award DBI1054826 to OV, and by NIH grants R01GM094231 and R01CA126239 to AN.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 16, 2012: Statistical mass spectrometrybased proteomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S16.
References

Eng J, McCormack A, Yates J: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.
American Society for Mass Spectrometry 1994, 5:976989. Publisher Full Text

Craig R, Beavis R: TANDEM: matching proteins with tandem mass spectra.
Bioinformatics 2004, 20(9):14661467. PubMed Abstract  Publisher Full Text

MacLean B, Eng J, Beavis R, McIntosh M: General framework for developing and evaluating database scoring algorithms using the TANDEM search engine.
Bioinformatics 2006, 22(22):28302832. PubMed Abstract  Publisher Full Text

Keller A, Nesvizhskii A, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.
Analytical Chemistry 2002, 74:53835392. PubMed Abstract  Publisher Full Text

Nesvizhskii A: A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics.
Journal of Proteomics 2010, 73:20922123. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Whiteaker J, Zhang H, Eng J, Fang R, Piening B, Feng L, Lorentzen T, Schoenherr R, Keane J, Holzman T, Fitzgibbon M, Lin C, Zhang H, Cooke K, Liu T, II DC, Anderson L, Watts J, Smith R, McIntosh M, Paulovich A: Headtohead comparison of serum fractionation techniques.
Journal of Proteome Research 2007, 6:828836. PubMed Abstract  Publisher Full Text

Choi H, Nesvizhskii A: Semisupervised modelbased validation of peptide identifications in mass spectrometrybased proteomics.
Journal of Proteome Research 2008, 7:254265. PubMed Abstract  Publisher Full Text

Klimek J, Eddes J, Hohmann L, Jackson J, Peterson A, Letarte S, Gafken P, Katz J, Mallick P, Lee H, Schmidt A, Ossola R, Eng J, Aebersold R, Martin D: The standard protein mix database: a diverse data set to assist in the production of improved peptide and protein identification software tools.
Journal of proteome research 2007, 7:96103. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Storey J: A direct approach to false discovery rates.
Journal of the Royal Statistical Society. Series B 2002, 64(3):479498. Publisher Full Text

Efron B: Microarrays, empirical Bayes and the twogroups model.
Statistical Science 2008, 23:122. Publisher Full Text

Kall L, Storey J, MacCoss M: Posterior error probabilities and false discovery rates: two sides of the same coin.
Journal of Proteome Research 2008, 7:4044. PubMed Abstract  Publisher Full Text

Choi H, Ghosh D, Nesvizhskii A: Statistical validation of peptide identifications in largescale proteomics using the targetdecoy database search strategy and flexible mixture modeling.
Journal of Proteome Research 2008, 7:286292. PubMed Abstract  Publisher Full Text

Ding Y, Choi H, Nesvizhskii A: Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics.
Journal of Proteome Research 2008, 7:48784889. PubMed Abstract  Publisher Full Text

Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. [http:/ / www.jstor.org/ discover/ 10.2307/ 2984875?uid=3738032&uid=2&uid=4&sid =21101269442551] webcite
Journal of the Royal Statistical Society. Series B 1977, 39:138.

Storey J: The positive false discovery rate: a Bayesian interpretation and the qvalue.
Annals of Statistics 2003, 31(6):20132035. Publisher Full Text

Elias J, Gygi S: Targetdecoy search strategy for increased confidence in largescale protein identifications by mass spectrometry.
Nature Methods 2007, 4(3):207214. PubMed Abstract  Publisher Full Text

Käll L, Storey J, MacCoss M, Noble W: Assigning significance to peptides identified by tandem mass spectrometry using decoy databases.
Journal of Proteome Research 2008, 7:2934. PubMed Abstract  Publisher Full Text

Deutsch E, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii AI, Aebersold R: A guided tour of the Trans Proteomic Pipeline.
Proteomics 2010, 10:11501159. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Nesvizhskii A, Keller A, Kolker E, Aebersold R: A statistical model for identifying proteins by tandem mass spectrometry. [http://pubs.acs.org/doi/abs/10.1021/ac0341261] webcite
Analytical Chemistry 2003., 75 PubMed Abstract  Publisher Full Text