Department of Statistics, Purdue University, 250 N. University Street, West Lafayette, Indiana, USA

Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, Indiana, USA

Department of Pathology, University of Michigan, 4237 Medical Science I, Ann Arbor, Michigan, USA

Abstract

PeptideProphet is a post-processing algorithm designed to evaluate the confidence in identifications of MS/MS spectra returned by a database search. In this manuscript we describe the "what and how" of PeptideProphet in a manner aimed at statisticians and life scientists who would like to gain a more in-depth understanding of the underlying statistical modeling. The theory and rationale behind the mixture-modeling approach taken by PeptideProphet are discussed from a statistical model-building perspective, followed by a description of how the model can be used to express confidence in the identification of individual peptides or sets of peptides. We also demonstrate how to evaluate the quality of model fit and select an appropriate model from several available alternatives. We illustrate the use of PeptideProphet in association with the Trans-Proteomic Pipeline, a free suite of software used for protein identification.

Introduction

In mass-spectrometry shotgun proteomics, the first phase of analysis is the identification of peptides in complex biological mixtures digested by enzymes such as trypsin. Depending on the peptides in the biological mixture, an experiment will produce a certain number of spectra.

We will discuss PeptideProphet in the context of two database search algorithms: SEQUEST and X!Tandem.

Given a database search algorithm, every observed spectrum is scored against the peptides in the database. For each spectrum, the highest scoring peptide (according to the scoring criterion) is typically chosen as the best match, i.e. the candidate peptide sequence that generated the observed spectrum. Thus, each observed spectrum is paired with a best-matching peptide, and we refer to the pair as an identified spectrum.

The necessity of PeptideProphet arises because the spectra are subject to noise, making it difficult to determine whether the matched peptide is correct. The spectrum is generated from a peptide sequence, and peaks can be missing or reduced in intensity. Because the observed spectrum is subject to noise, the database-based criterion will vary when comparing theoretical spectra to observed spectra. Additionally, the correct peptide sequence may be absent from the database. Given this noise, how do we determine confidence in an identified spectrum? Traditional standards, such as accepting all identifications above a fixed score threshold, do not quantify this confidence.

PeptideProphet

The overview of PeptideProphet is as follows:

1. Rescoring: produce a score which reflects the quality of an identified spectrum, while summarizing multiple quantities returned by the database search.

2. Modeling: produce a probability-based model for the distribution of scores of correctly and incorrectly identified spectra. The model is then fit to the scores of all identified spectra.

3. Evaluation of the Quality of Fit: determine how well the scores fit the probability-based model.

4. Inference

(a) Evaluation of confidence in individual identified spectra using the posterior probability.

(b) Evaluation of confidence in sets of identified spectra: produce a cutoff on the scores to determine a set of correctly identified spectra, while controlling the False Discovery Rate, defined as the expected proportion of false positives among the identifications declared correct.

We will first discuss the basic version of PeptideProphet and then discuss the three extensions.

Materials

Human plasma dataset

This dataset uses the first LC-MS/MS replicate file from the Western Consortium of the National Cancer Institute's Mouse Models of Human Cancer Consortium.

Controlled mixture

This dataset uses spectra generated on a linear ion trap Fourier transform instrument, as previously published.

Methods

Statement of the problem from a statistical perspective, and terminology

Every statistical approach requires the definition of the following components in the problem:

1. PeptideProphet works with the observed spectra as the experimental units.

2. An observed score is interpreted as a test statistic. In statistics, the summarized score is viewed as a random draw from an underlying distribution.

3. PeptideProphet assumes that the test statistic comes from a mixture of two distributions: one from the distribution of correct identifications, and the other from the distribution of the incorrect identifications. The distributions may be characterized by a few parameters (parametric) or many parameters (semi or non-parametric).

4. The goal of PeptideProphet is to test, for each identified spectrum i, two competing hypotheses H_{0i} and H_{ai}: under the null hypothesis H_{0i} the identification is incorrect, and under the alternative H_{ai} it is correct.

5. Inference: confidence is determined for individual spectra or sets of spectra.

• If the researcher is interested in a set of spectrum identifications, the False Discovery Rate should be controlled.

We determine the confidence in a set of spectra by controlling the False Discovery Rate. The False Discovery Rate, given a score cutoff, is the expected proportion of incorrect identifications among all identifications with scores exceeding the cutoff.

Table of multiple hypothesis testing quantities

|                     | **# Not Rejected** | **# Rejected**  | **Total** |
|---------------------|--------------------|-----------------|-----------|
| # True Nulls        | true negatives     | false positives | m_0       |
| # True Alternatives | false negatives    | true positives  | m - m_0   |
| Total               |                    |                 | m         |

Table 1: Quantities in multiple hypothesis testing, cross-classifying the m identified spectra by whether the null hypothesis (the identification is incorrect) is true and whether it is rejected by the decision rule.

An alternative confidence measure that is rarely used is the False Positive Rate (FPR). The False Positive Rate, given a cutoff, is the expected proportion of the truly incorrect identifications whose scores exceed the cutoff.

Many users prefer the q-value, which is the minimum False Discovery Rate attained over all cutoffs at which a given identification would still be accepted.

• If the researcher is interested in specific spectrum identifications, the posterior error probability is most commonly used, as it quantifies the confidence of a single identified spectrum.

The posterior error probability represents the probability that an individual identification is incorrect, given its observed score.

Alternatively, the p-value can be used: the p-value of an observed score is the probability of obtaining a score at least as extreme when the identification is in fact incorrect, i.e. computed under the distribution of the truly null hypotheses.

For each spectrum, PeptideProphet establishes a score reflecting the quality of an identified spectrum

First, each spectrum (experimental unit) is observed and potentially identified using a database-based criterion.


In the basic version of PeptideProphet the discriminant function uses fixed coefficients, determined in advance from training data.

PeptideProphet relates observable and unobservable quantities via a joint probability distribution

PeptideProphet relates each score x_i to the competing hypotheses H_{0i} and H_{ai} through the null density f_0 (the distribution of scores of incorrect identifications) and the alternative density f_1 (the distribution of scores of correct identifications). The parameter π_0 is used to represent the overall proportion of incorrect identifications of identified spectra in the population. This formulation results in a 2-group mixture model similar to what is established by Efron, with marginal density f(x_i) = π_0 f_0(x_i) + (1 - π_0) f_1(x_i).

The joint distribution of all the scores factors into the product of such marginal densities because the scores are assumed independent and identically distributed (iid). Due to different discriminant functions being used for each charge, a separate sampling distribution and set of parameters are produced for each precursor ion charge (we will refer to this simply as the charge).
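As a concrete sketch of the 2-group model, the marginal density of a score can be evaluated as π_0 f_0(x) + (1 - π_0) f_1(x). The following Python sketch uses a Gamma null and a Normal alternative with purely hypothetical parameter values:

```python
import math

def gamma_pdf(x, shape, scale):
    # Gamma density: x^(shape-1) e^(-x/scale) / (Gamma(shape) scale^shape)
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

def normal_pdf(x, mu, sigma):
    # Normal density with mean mu and standard deviation sigma
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, pi0, null_params, alt_params):
    # 2-group marginal density: pi0 * f0(x) + (1 - pi0) * f1(x)
    return (pi0 * gamma_pdf(x, *null_params)
            + (1.0 - pi0) * normal_pdf(x, *alt_params))

# Hypothetical parameter values, for illustration only
pi0 = 0.8                 # proportion of incorrect identifications
null_params = (2.0, 1.0)  # Gamma shape and scale (incorrect scores)
alt_params = (5.0, 1.5)   # Normal mean and sd (correct scores)

f_at_3 = mixture_pdf(3.0, pi0, null_params, alt_params)
```

Low scores sit mostly under the Gamma component and high scores under the Normal component; these same two densities enter the E-step and all of the confidence measures discussed later.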

There may be additional information available, such as the NTT (number of tryptic termini), NMC (number of missed cleavages), and ΔM (the difference between the observed and theoretical peptide mass).

Note that the density functions f_{T=0,NTT}, f_{T=0,NMC}, f_{T=0,ΔM}, f_{T=1,NTT}, f_{T=1,NMC}, and f_{T=1,ΔM} are discrete. It is assumed, conditional on the identified spectrum being incorrect or correct (T = 0 or T = 1), that the components of (x_i, NTT_i, NMC_i, ΔM_i) are independent of each other.

PeptideProphet estimates parameters of interest in an Empirical Bayesian approach

PeptideProphet is considered an Empirical Bayesian approach because it uses each identified spectrum twice: once to estimate, via the Expectation-Maximization (EM) algorithm, the mixture proportion π_0 and the parameters of the distributions f_0 and f_1, and once to evaluate the posterior probability that its own identification is correct.

In the E-step, given the estimated values of the model parameters, the probability of each score being correct (or incorrect) is calculated. Given a single observed score x_i, the probability of being incorrect is the Gamma density scaled by π_0, divided by the sum of the Gamma and Normal densities scaled by π_0 and 1 - π_0, all evaluated at x_i.

In the M-step, given the estimated membership probabilities, π_0 is re-estimated as their average, and the Normal parameters μ and σ² are re-estimated as the membership-weighted sample mean and variance of the scores.

For the Gamma distribution, the parameter estimates are obtained analogously from the membership-weighted moments of the scores.

Because only two mixture components are involved, the algorithm is fast, and the E- and M-steps can be iterated until the model parameters, including π_0 (denoted with hats when estimated), change by less than a specified tolerance. The algorithm is detailed in the pseudocode figure.
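The E- and M-steps can be sketched end to end. This is a minimal, self-contained illustration rather than the TPP implementation: the Gamma parameters are updated by membership-weighted method of moments, and the initial split of the scores is an arbitrary choice:

```python
import math
import random

def gamma_pdf(x, shape, scale):
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def weighted_moments(xs, ws):
    total = sum(ws)
    mean = sum(w * x for x, w in zip(xs, ws)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(xs, ws)) / total
    return mean, max(var, 1e-6)

def em_fit(scores, max_iter=500, tol=1e-8):
    xs = sorted(scores)
    n = len(xs)
    # Initialize: bottom 3/4 of scores -> Gamma (incorrect), top 1/4 -> Normal
    cut = 3 * n // 4
    m0, v0 = weighted_moments(xs[:cut], [1.0] * cut)
    mu, var = weighted_moments(xs[cut:], [1.0] * (n - cut))
    shape, scale = m0 * m0 / v0, v0 / m0
    pi0 = 0.5
    for _ in range(max_iter):
        # E-step: posterior probability that each score is incorrect
        z = []
        for x in xs:
            a = pi0 * gamma_pdf(x, shape, scale)
            b = (1.0 - pi0) * normal_pdf(x, mu, math.sqrt(var))
            z.append(a / (a + b + 1e-300))
        # M-step: re-estimate pi0 and both sets of parameters
        new_pi0 = sum(z) / n
        m0, v0 = weighted_moments(xs, z)
        shape, scale = m0 * m0 / v0, v0 / max(m0, 1e-9)
        mu, var = weighted_moments(xs, [1.0 - zi for zi in z])
        if abs(new_pi0 - pi0) < tol:
            pi0 = new_pi0
            break
        pi0 = new_pi0
    return pi0, (shape, scale), (mu, math.sqrt(var))

# Simulated scores: 80% Gamma(2, 1) "incorrect", 20% Normal(6, 1) "correct"
random.seed(7)
scores = ([random.gammavariate(2.0, 1.0) for _ in range(800)]
          + [random.gauss(6.0, 1.0) for _ in range(200)])
pi0_hat, gamma_hat, normal_hat = em_fit(scores)
```

On simulated data with this degree of separation, the recovered π_0 should land near the true proportion of 0.8, while the Normal mean should land near 6.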

Pseudocode of the EM-algorithm for iteratively estimating model parameters and membership probabilities

**Pseudocode of the EM-algorithm for iteratively estimating model parameters and membership probabilities**.

PeptideProphet fits on the Human Plasma Dataset

**PeptideProphet fits on the Human Plasma Dataset**. PeptideProphet fits on the Human Plasma Dataset with Tandem Scores on charges 2 (left) and 3 (right). The blue and red curves correspond to the fitted frequency curves of the correct (Normal) and incorrect (Gamma) distributions. The Charge 2 fit yields a mixture distribution with a much stronger separation than the fit to Charge 3.

Evaluation of the quality of fit of PeptideProphet

Deviations from the assumptions, or a low number of identified spectra, can lead to an inadequate or unstable model fit and incorrect conclusions. This can be diagnosed by visual inspection, and also by the bootstrap. We recommend visual inspection over goodness-of-fit tests, as tests do not explore the specific fitting issues that may influence subsequent inference for the identified spectra. In fact, goodness-of-fit tests simply attempt to summarize the fit into one summary statistic, whereas we are typically interested in the fit at particular locations of the mixture distribution. There are several visual attributes of the mixture distribution that researchers should be aware of, and some remedies for them.

An issue that is not commonly addressed, however, is the number of identified spectra available to fit the mixture model. The number of identified spectra required to fit a reliable model depends strongly on the separation and the form of the observed scores. A statistical approach to examining the stability of the fitted model is the bootstrap.

Bootstrapping can be performed by sampling the identified spectra with replacement and refitting the mixture model to each bootstrapped sample; the variability of the resulting parameter estimates reflects the stability of the fit.
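A generic bootstrap of any estimator can be sketched as follows. Here a simple median stands in for the full EM fit; in practice the estimator passed in would return the fitted (π_0, Gamma, Normal) parameters. The data and estimator are illustrative, not part of PeptideProphet:

```python
import random
import statistics

def bootstrap(data, estimator, n_boot=300, seed=0):
    # Refit the estimator on n_boot resamples drawn with replacement
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return estimates

random.seed(1)
scores = [random.gauss(3.0, 1.0) for _ in range(500)]

# The median is a stand-in; PeptideProphet's estimator would be the EM fit
boot = bootstrap(scores, statistics.median, n_boot=200)
original = statistics.median(scores)
mse = sum((b - original) ** 2 for b in boot) / len(boot)
```

A wide or skewed histogram of the bootstrapped estimates, or a large mean squared error around the original estimate, signals an unstable fit, which is exactly the diagnostic applied to the Human Plasma charges.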

Three hundred bootstrapped samples of the Human Plasma data for charges 2 and 3 were drawn, and the bootstrapped estimates of π_0, μ, and σ were recorded.

Bootstrapped samples of **π_0, μ, and σ**

**Bootstrapped samples of π_0, μ, and σ**. The original estimates are marked by the vertical lines. The horizontal axes are equal in length across the plots of a particular parameter. The Charge 2 distributions are slightly skewed compared to the Charge 3 distributions, and the mean squared errors are much greater for the Charge 2 distributions. The variability of the Charge 2 distributions is visibly much greater, indicating unstable estimates.

The mean squared error summarizes the overall deviation of the bootstrapped parameter estimates from the estimates obtained on the original data.

Quantile-to-quantile plot comparing the quantiles of the original mixture distribution of the Human Plasma data

**Quantile-to-quantile plot comparing the quantiles of the original mixture distribution of the Human Plasma data**. Quantiles of the original mixture distribution of the Human Plasma data for Charges 2 (left) and 3 (right), compared to the quantiles of the bootstrapped samples. The plot for Charge 2 shows more deviation in quantiles due to the low number of identified spectra in the score range between 2 and 8.

Estimating the confidence of spectrum identifications

Estimating the confidence of a set of spectrum identifications

In order to determine the correctness of the spectrum identifications, a decision rule is defined whereby any spectrum identification with a score above a chosen cutoff is declared correct.

In order to estimate the False Discovery Rate given a decision rule cutoff, two approaches may be used. Because all scores are assumed to follow the same fitted mixture distribution, the False Discovery Rate at a cutoff can be estimated as the π_0-scaled area under the fitted distribution of incorrect identifications above the cutoff, divided by the area under the entire fitted mixture above the cutoff.
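Under this first approach, the estimated FDR at a cutoff is the π_0-weighted tail area of the fitted null distribution divided by the tail area of the whole mixture. A sketch with hypothetical Gamma/Normal parameters, using simple numerical integration for the tail areas:

```python
import math

def gamma_pdf(x, shape, scale):
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def tail_area(pdf, cutoff, upper=60.0, n=20000):
    # Trapezoidal integration of the density from cutoff to a large upper bound
    h = (upper - cutoff) / n
    total = 0.5 * (pdf(cutoff) + pdf(upper))
    for i in range(1, n):
        total += pdf(cutoff + i * h)
    return total * h

def fdr_at(cutoff, pi0, null_params, alt_params):
    s0 = tail_area(lambda x: gamma_pdf(x, *null_params), cutoff)
    s1 = tail_area(lambda x: normal_pdf(x, *alt_params), cutoff)
    accepted = pi0 * s0 + (1.0 - pi0) * s1
    return pi0 * s0 / accepted if accepted > 0 else 0.0

# Hypothetical fitted values, for illustration only
fdr_1 = fdr_at(1.0, 0.8, (2.0, 1.0), (5.0, 1.5))
fdr_5 = fdr_at(5.0, 0.8, (2.0, 1.0), (5.0, 1.5))
```

Raising the cutoff shrinks the null tail much faster than the alternative tail, so the estimated FDR drops.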

The PeptideProphet fit to the Human Plasma dataset of Tandem scores of Charge 2

**The PeptideProphet fit to the Human Plasma dataset of Tandem scores of Charge 2**. The PeptideProphet fit to the Human Plasma dataset of Tandem scores of Charge 2, with fitted frequency curves from Figure 2b. The four confidence measures of the Posterior Error Probability (PEP), p-value, False Discovery Rate (FDR), and False Positive Rate (FPR) are shown at a score of 1. The Posterior Error Probability at 1 is 0.156 and the estimated False Discovery Rate is 0.083. The p-value and FPR are equivalent and equal to 0.004.

The False Positive Rate for a cutoff is estimated as the area under the fitted distribution of scores of incorrect identifications above the cutoff; it coincides with the p-value at the cutoff.

The estimation of the q-value at a specific score x_i proceeds by minimizing the estimated False Discovery Rate over all cutoffs that would still include x_i among the accepted identifications, i.e. all cutoffs at or below x_i.
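Given estimated FDRs at each observed score treated as a cutoff, the q-values are obtained by a running minimum over cutoffs, which also enforces monotonicity. A small sketch with made-up FDR values:

```python
def q_values(scores, fdrs):
    """q-value of each score: the minimum estimated FDR over all cutoffs
    at or below that score (i.e. all cutoffs that still accept it)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    q = [0.0] * len(scores)
    best = float("inf")
    for i in order:            # walk the scores in ascending order
        best = min(best, fdrs[i])
        q[i] = best
    return q

# Hypothetical scores and per-cutoff FDR estimates, for illustration only
scores = [1.0, 2.0, 3.0, 4.0, 5.0]
fdrs = [0.50, 0.30, 0.35, 0.10, 0.12]
qs = q_values(scores, fdrs)
```

Note how the non-monotone bump at score 3 (FDR 0.35) is flattened to 0.30, and how the score of 5 inherits the smaller FDR achievable at cutoff 4.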

Estimating the confidence of an individual spectrum identification

We now discuss the estimation of the posterior error probability and the p-value. These measures are properties of a single spectrum and are analogous to performing a single hypothesis test; both are illustrated in the figure above.

According to Bayes' Theorem, the posterior probability that identification i is incorrect, given its score x_i, is the prior-weighted null density π_0 f_0(x_i) divided by the mixture density π_0 f_0(x_i) + (1 - π_0) f_1(x_i); this is the posterior error probability.

The p-value is estimated as the area under the fitted distribution of scores of incorrect identifications above x_i.

The posterior error probability may be preferred over the p-value because it also directly yields an estimate of the probability that an identified spectrum is correct (1 minus the posterior error probability).
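Both single-spectrum measures can be sketched with a hypothetical Gamma null and Normal alternative; for a Gamma with integer shape 2 the null tail area (the p-value) happens to have a closed form:

```python
import math

def gamma_pdf(x, shape, scale):
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def pep(x, pi0, null_params, alt_params):
    # Posterior error probability via Bayes' theorem
    a = pi0 * gamma_pdf(x, *null_params)
    b = (1.0 - pi0) * normal_pdf(x, *alt_params)
    return a / (a + b)

def p_value_gamma2(x, scale=1.0):
    # Null tail area for a Gamma with shape 2 (closed form for this shape)
    t = x / scale
    return math.exp(-t) * (1.0 + t)

# Hypothetical fitted values, for illustration only
pep_low = pep(1.0, 0.8, (2.0, 1.0), (5.0, 1.5))   # low score
pep_high = pep(6.0, 0.8, (2.0, 1.0), (5.0, 1.5))  # high score
```

A low score is almost certainly incorrect (PEP near 1), while a score near the Normal mean carries a much smaller posterior error probability.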

PeptideProphet can use a decoy database to estimate the parameters of the distributions of scores for incorrect identifications

When there is significant overlap between the two density functions, or a low number of identified spectra, it is difficult for the EM-algorithm to estimate π_0 and the parameters of the Gamma and Normal distributions. In this case PeptideProphet employs the Target-Decoy approach to better estimate the Gamma distribution. We first describe the two forms of Target-Decoy: the concatenated strategy and the separate strategy.

In the concatenated Target-Decoy strategy each spectrum is searched against a single database composed of both target and decoy sequences. This sets up a competition between the best correct peptide sequence, the best incorrect target (forward) peptide sequence, and the best (necessarily incorrect) decoy peptide sequence. Hits where a decoy peptide sequence is the best match are used to estimate the FDR.

In the separate Target-Decoy strategy each spectrum is searched once in the forward database and searched again independently in the decoy database. The distribution of scores from the peptides identified via the decoy database is used to estimate the form of the distribution of incorrectly identified spectra. This approach ignores competition between forward and decoy sequences.
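With a separate decoy search of an equal-size database, the number of decoy hits above a cutoff estimates the number of incorrect target hits above it, giving a simple counting estimate of the FDR. The scores below are made up for illustration:

```python
def decoy_fdr(target_scores, decoy_scores, cutoff):
    # Decoy hits above the cutoff proxy for incorrect target hits above it
    n_target = sum(1 for s in target_scores if s >= cutoff)
    n_decoy = sum(1 for s in decoy_scores if s >= cutoff)
    return n_decoy / n_target if n_target else 0.0

# Hypothetical search scores, for illustration only
target = [0.5, 1.2, 2.5, 3.1, 4.0, 5.2, 6.3, 7.1]
decoy = [0.4, 0.7, 0.9, 1.1, 1.5, 2.2, 2.8, 3.3]

fdr_3 = decoy_fdr(target, decoy, 3.0)  # 1 decoy vs 5 targets above 3.0
```

This counting estimate assumes the decoy database matches the target database in size and composition; otherwise the ratio must be rescaled accordingly.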

The semisupervised version of PeptideProphet utilizes the concatenated Target-Decoy strategy by simply combining the target and decoy sequences into a single searched database. Identifications that match decoy sequences are known to be incorrect, and this knowledge is incorporated into the EM estimation of π_0 and the distribution parameters.

Semisupervised estimation of parameters

**Semisupervised estimation of parameters**. Semisupervised estimation of parameters of the same distribution of scores as in Figures 2b and 2a. For Charge 3 the slight rightward shift of the Gamma distribution (distribution of scores for incorrect identifications) also encouraged a large rightward shift of the Normal distribution (distribution of scores for correct identifications). The two vertical lines indicate the means of the Normal distributions. The addition of decoys for Charge 2 allowed the algorithm to learn that most of the identified spectra with scores from 0 to 1 are likely to be incorrect. Without decoys this may have been overlooked.

PeptideProphet can use a decoy database for semiparametric estimation of the probability distribution

The quality of fit of the Gamma and Normal distributions may depend on how the database is searched (constrained versus unconstrained search) or on the search algorithm that is used.

One approach is to estimate the distributional forms using a kernel density (semi-parametric) approach, in which the distribution of scores of incorrect identifications is estimated by smoothing over the decoy scores, and the distribution of scores of correct identifications by a kernel density over the target scores weighted by their membership probabilities.
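A weighted Gaussian kernel density estimator is the building block of such a semiparametric fit: the null density could be smoothed from decoy scores, and the alternative from target scores weighted by membership probabilities. A minimal sketch, where the bandwidth rule and weighting scheme are illustrative choices:

```python
import math

def gaussian_kde(data, weights=None, bandwidth=None):
    """Return a (weighted) Gaussian kernel density estimate as a function."""
    n = len(data)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    if bandwidth is None:
        # Silverman's rule of thumb from the unweighted sample sd
        mean = sum(data) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
        bandwidth = 1.06 * sd * n ** (-0.2) if sd > 0 else 1.0
    norm = total * bandwidth * math.sqrt(2.0 * math.pi)
    def pdf(x):
        s = sum(w * math.exp(-0.5 * ((x - d) / bandwidth) ** 2)
                for d, w in zip(data, weights))
        return s / norm
    return pdf

# A zero weight removes a point entirely: only the score at 0.0 contributes
f0 = gaussian_kde([0.0, 10.0], weights=[1.0, 0.0], bandwidth=1.0)
```

Within the EM iterations, the weights for the alternative density would be the current membership probabilities, so the estimated shape of the correct-score distribution is free of parametric assumptions.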

The Controlled Mixture dataset fit with the basic PeptideProphet and the semiparametric version

**The Controlled Mixture dataset fit with the basic PeptideProphet and the semiparametric version**. The Controlled Mixture dataset fit with the basic PeptideProphet and the semiparametric version of PeptideProphet utilizing the kernel density estimator. The smoothed estimator allowed for a more fine-tuned fit to the estimated (asymmetric) distribution of the correctly identified spectra.

Pseudocode of the semiparametric version of PeptideProphet

**Pseudocode of the semiparametric version of PeptideProphet**.

An example of this approach can be seen in Figure

To avoid overfitting, this approach should only be used in cases of strong deviation between the fitted distributions and the observed scores, such as the parametric fit (dashed lines) in Figure

PeptideProphet can be extended to dynamically estimate the coefficients of the discriminant function from the data

Overlap in the distributions of scores of correct and incorrect identifications can be due to a suboptimal scoring function, which does not discriminate well between the properties of correct and incorrect identifications. This often occurs in cases of constrained searches where the database that is searched is much smaller than the unconstrained search space that was used to find the coefficients in the fixed discriminant function. For additional information on constrained versus unconstrained searches, see

Pseudocode of the adaptive version of PeptideProphet can be seen in Figure

Semiparametric fits with dynamically estimated coefficients

**Semiparametric fits with dynamically estimated coefficients**. Semiparametric fits of the distributions of scores for correct and incorrect identifications on the Controlled Mixture Dataset from a constrained search (tryptic peptides, narrow mass tolerance) using fixed discriminant coefficients (left) versus adaptive discriminant coefficients (center). The right tail of the distribution of scores for incorrect identifications can be seen penetrating the distribution of scores for correct identifications more deeply in the fixed case implying greater discriminative ability when using the adaptive discriminant function. The improved performance of adaptive coefficients can be seen in the plot of the estimated FDR versus the estimated number of significant correctly identified spectra (right). Recall that in this dataset, target scores are assumed correct. The estimated FDR here was estimated by the ratio of the number of decoys to target scores.

Pseudocode of the adaptive version of PeptideProphet

**Pseudocode of the adaptive version of PeptideProphet**.

The improvement of the adaptive discriminant function over the fixed discriminant function for the Controlled Mixture dataset in a constrained search space is displayed in Figure

This approach is also useful for incorporating lower-ranked peptide matches (i.e. for a given spectrum, instead of only considering the best match, the second- and lower-ranked candidate peptides are also evaluated).

Implementation of PeptideProphet in the Trans-Proteomic Pipeline

The Trans-Proteomic Pipeline (TPP) is an open-source suite developed at the Institute for Systems Biology, designed for complete proteomic analysis from spectrum identification to protein identification and quantification; it can be downloaded from the project website.

We present an example using the Human Plasma dataset, where the spectra are searched through Tandem with the k-score plugin in TPP version 4.4. PeptideProphet automatically models all precursor ion charges and outputs the probability of correct identification. A mixture model is fit using a Normal distribution for the scores of correct identifications and a Gumbel distribution for the scores of incorrect identifications.

In Figure

pepXML viewer from TPP

**pepXML viewer from TPP**. The output of PeptideProphet is stored in pepXML format. The pepXML viewer visualizes the content of pepXML and posterior probabilities associated with each identified spectrum.

Clicking on 0.7664, or the ninth entry "2b_plasma_0mM_C1.00024.00024.1" in the identified spectra list, brings up information on the model fit by PeptideProphet in Figure

Scoring results for identified spectra from a PeptideProphet fit in TPP

**Scoring results for identified spectra from a PeptideProphet fit in TPP**. PeptideProphet output of sensitivity-error analysis and figures of the estimated mixture models. The bottom portion shows the fitted curves for different charges. The light blue curves represent the distribution of scores for incorrect identifications, purple for correct identifications, and black the sum of the two distributions. The red vertical line indicates the score of the identified spectrum that we clicked on, with its additional information at the bottom of the figure.

Parameter estimates for a PeptideProphet fit in TPP

**Parameter estimates for a PeptideProphet fit in TPP**. Estimated parameter values of the PeptideProphet mixture model for charge 2. The parameters of the accurate mass difference (ΔM) model are also shown.

We will now discuss how to use the information in Figures

1. False Discovery Rate: estimates of the False Discovery Rate can be obtained in three ways. In the upper-right hand corner of Figure

A second approach is to use the estimated model parameters. The estimated proportion of correct identifications (1 - π_0) is 0.04, which yields an estimate of π_0 of 0.96. The Normal (Gaussian) estimated mean μ_G and the remaining parameters can then be plugged into the FDR formula above.

The calculation takes into account that among the correctly identified spectra, it is estimated that a majority of the identified spectra have 0 missed cleavages.

A third approach to estimating the False Discovery Rate is to download all posterior probabilities, convert them to posterior error probabilities (local false discovery rates) by taking the complement, define a cutoff point, and average the posterior error probabilities of the identifications that pass the cutoff.

2. False Positive Rate or p-value: using the estimated Gumbel parameters, the false positive rate can be found from the tail area of the Gumbel distribution above the score.

3. q-value: the q-value at a specific point is obtained by minimizing the estimated False Discovery Rate over all cutoffs that would still accept that point.

4. Posterior Error Probability and Local False Discovery Rate: these are most easily found by finding the complement of the values in the first column of Figure
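The third FDR approach listed above can be sketched directly: convert posterior probabilities to posterior error probabilities and average them over the accepted set. The probability values below are made up for illustration:

```python
def fdr_from_posteriors(posterior_probs, prob_cutoff):
    # FDR of the accepted set = average posterior error probability (1 - p)
    accepted = [p for p in posterior_probs if p >= prob_cutoff]
    if not accepted:
        return 0.0
    return sum(1.0 - p for p in accepted) / len(accepted)

# Hypothetical posterior probabilities of correctness, for illustration only
probs = [0.99, 0.95, 0.90, 0.40, 0.10]
fdr = fdr_from_posteriors(probs, 0.90)  # accepts the first three
```

This is the same FDR-as-average-PEP relationship that underlies the sensitivity/error tables reported by TPP.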

All inference for the semisupervised and semiparametric versions of PeptideProphet is identical. Inference would also be identical for the adaptive version of PeptideProphet, which is not implemented in TPP at this time but is available from the authors upon request.

Following the execution of PeptideProphet, the next step in the analysis is often the identification of the proteins present in the sample. In this subsequent analysis, the experimental unit changes from a spectrum to a peptide. TPP can be used to run ProteinProphet, a computational algorithm that utilizes PeptideProphet's estimated probabilities to determine the probability of the presence of proteins in two steps

Discussion

PeptideProphet is available for use on the Trans-Proteomic Pipeline with many other database search tools (X!Tandem, MASCOT, OMSSA, Phenyx, ProbID, InsPecT, MyriMatch). The statistical approach of PeptideProphet is generalizable to any database search algorithm that returns a quantitative score for each identified spectrum.

Although we used the Gamma and Normal distributions to model the components of the PeptideProphet model, there is no limitation on the choice of parametric distributions for describing the scores of incorrect and correct identifications. The Gumbel distribution, with location and scale parameters, is one alternative used in practice for the scores of incorrect identifications.

The Target-Decoy approach used in this manuscript pioneered the use of decoys for the estimation of the False Discovery Rate, and its results are often compared to those of other techniques.

An alternative approach which relaxes the parametric assumptions is the variable-component approach, which uses unknown mixtures of Gaussians to represent the distributions of correct and incorrect scores: the correct distribution is represented by a mixture of k_0 normal distributions (which may have different means and variances), and the incorrect distribution by a separate mixture of k_1 normal distributions. The numbers of components k_0 and k_1 are unknown and estimated from the data. Each score x_i is then modeled as arising from the combination of these mixtures.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

K.M. implemented the statistical analysis framework, analyzed the datasets and wrote the manuscript. O.V. supervised the statistical aspects of the work, and wrote the manuscript. A.N. supervised the statistical and the mass spectrometry-based aspects of the work.

Acknowledgements

The authors would like to thank Hyungwon Choi for providing R-code for the PeptideProphet model fits. The work was supported in part by the NSF CAREER award DBI-1054826 to OV, and by NIH grants R01-GM-094231 and R01-CA-126239 to AN.

This article has been published as part of