PPINGUIN: Peptide Profiling Guided Identification of Proteins improves quantitation of iTRAQ ratios

Bauer, Chris; Kleinjung, Frank; Rutishauser, Dorothea; Panse, Christian; Chadt, Alexandra; Dreja, Tanja; Al-Hasani, Hadi; Reinert, Knut; Schlapbach, Ralph; Schuchhardt, Johannes

doi:10.1186/1471-2105-13-34

Methodology article
Open access
Published: 16 February 2012


BMC Bioinformatics volume 13, Article number: 34 (2012)


Abstract

Background

Recent development of novel technologies paved the way for quantitative proteomics. One of the most important among them is iTRAQ, employing isobaric tags for relative or absolute quantitation. Despite considerable progress in technology development, many challenges remain in the derivation and interpretation of quantitative results. One of these challenges is the consistent assignment of peptides to proteins.

Results

We have developed Peptide Profiling Guided Identification of Proteins (PPINGUIN), a statistical analysis workflow for iTRAQ data addressing the problem of ambiguous peptide quantitations. Motivated by the assumption that peptides uniquely derived from the same protein are correlated, our method employs clustering as a very early step in data processing prior to protein inference. Our method increases experimental reproducibility and decreases the variability of quantitations of peptides assigned to the same protein. Giving further support to our method, application to a type 2 diabetes dataset identifies a list of protein candidates that is in very good agreement with a previously performed transcriptomics meta-analysis. Making use of quantitative properties of the signal patterns identified, PPINGUIN can reveal new isoform candidates.

Conclusions

Given the increasing importance of quantitative proteomics, we think that this method will be useful in practical applications such as model fitting or functional enrichment analysis. We recommend using this method if quantitation is a major objective of research.

Background

Quantitative proteomics is becoming increasingly important, and in recent years many efforts have been made to develop and improve methods for protein quantification. Besides gel-based approaches [1, 2], mass spectrometric techniques encompassing labeling techniques such as iTRAQ [3], ICAT [4] and SILAC [5, 6] as well as label-free approaches are widely used for quantitative proteomics. iTRAQ (isobaric tags for relative and absolute quantitation) in particular has gained much popularity as it allows multiplexed quantitation of up to 8 samples. This flexibility has recently been used in several studies investigating various objectives [7–11].

Complementing these experimental technologies, a wide range of quantification algorithms can be found in the literature. The most common algorithms are included in software packages such as MASCOT, ProQUANT, i-TRACKER [12, 13], Multi-Q [14] or virtual expert mass spectrometrist (VEMS) [15]. In 2008 Lacerda et al. [16] compared the two software packages MASCOT and Peaks (Bioinformatics Solutions Inc., Waterloo, ON, Canada) [17] using a six-protein mixture as well as a complex protein sample. They revealed significant differences between the two packages: for a complex protein mixture, only 26% of the proteins agreed within 20% error of quantitation ratios. The highest fold changes measured with iTRAQ differ widely among laboratories but rarely seem to exceed ten-fold, as reported by Casado-Vela et al. [18] in a technical survey examining more than 200 articles.

The continuing popularity of iTRAQ makes an evaluation of the technique in terms of accuracy and precision a valuable task [19]. Accuracy assesses the closeness to the real quantification value. Precision in this context refers to the reproducibility of experiments. Since accuracy is difficult to evaluate, precision is the most frequently applied measure of experimental quality [20, 21]. Gan et al. [22] assessed the precision of iTRAQ data by analyzing technical (different channels of the same MS run), experimental (same channel but different runs) and biological variation (different biological samples). They designed iTRAQ experiments covering the different types of replication and found technical variation to be small (11%), whereas experimental and biological variations were more than twice as high. For iTRAQ, as for the majority of MS-based quantitation approaches, quantitation measurements are performed at the peptide level. Since multiple peptides, potentially with different modifications, are often measured for the same protein, some kind of summarizing strategy is needed. Different strategies for calculating a protein quantitation from multiple peptides have been applied, including mean or median calculation [23, 24] and error-weighted means [25]. Because of the fixed stoichiometric ratio, quantitation measurements for peptides uniquely assigned to the same protein should be strictly correlated [26]. Often, however, this presumption is not fulfilled and the quantitation values exhibit substantial heterogeneity. The heterogeneity is also observed for quantitation ratios and z-transformed values and is not due to different ionization or fragmentation efficiency. This is illustrated in Figure 1, which presents the quantitation ratios of unique peptides for an example protein: 40S ribosomal protein S30. In particular, the 117/116 ratio (rightmost bar in Figure 1) varies from 1.4-fold down-regulation to 2-fold up-regulation. An obvious reason for heterogeneous quantitation values is non-unique peptides shared by different proteins.

To correct for heterogeneity of peptides for the same protein, many approaches make use of outlier detection methods such as Grubbs' test [23] or Dixon's test [25] prior to averaging. However, outlier filters are problematic for several reasons. First, outlier filtering can be applied only to proteins with a certain minimum number of peptides, a requirement often not fulfilled in iTRAQ datasets [27]. Second, if heterogeneity is due to differentially regulated protein isoforms, the less frequent isoform may be regarded as an outlier and removed, leading to loss of information. Third, if outlier detection is applied after protein inference, the removed peptides have already contributed to the protein identification score, and hence the score is distorted a posteriori.

Here we present a statistical analysis workflow for iTRAQ data employing clustering prior to protein inference with the aim to reduce peptide heterogeneity (see Figure 2).

Methods

Dataset

Liver tissue samples from two different inbred mouse strains were investigated. The New Zealand Obese (NZO) mouse strain exhibits a polygenic obesity associated with hyperinsulinaemia and hyperglycaemia and presents additional features of a metabolic syndrome, including hypertension, and elevated levels of serum cholesterol and serum triglycerides [28]. In contrast, the Swiss Jim Lambert (SJL) mouse strain is lean and resistant to diet-induced obesity [29]. SJL mice carry a naturally occurring loss-of-function mutation in the TBC1D1 gene that confers leanness and protects from diet-induced obesity and diabetes [30]. In fact, deletion of TBC1D1 leads to elevated lipid oxidation in skeletal muscle that prevents weight gain in response to a high fat diet. Conversely, NZO mice are highly susceptible to weight gain when fed a high-fat diet, resulting in the development of morbid obesity, with fat depots exceeding 40% of total body weight [31].

After weaning at week 3, male NZO and SJL mice were raised on two different diets, a low fat diet (SD; 16% calories from fat) and a high fat diet (HF; 35% calories from fat). At week 12, mean body weight of SJL mice was 18.99 g (+/- 1.41 g) on SD and 20.1 g (+/- 1.42 g) on HF. In contrast, mean values for NZO mice were 38.81 g (+/- 1.85 g) on SD and 56.52 g (+/- 3.45 g) on HF, respectively. The mice were then sacrificed and liver tissue samples were analyzed.

Animals were kept in accordance with the NIH guidelines for the care and use of laboratory animals and all experiments were approved by the Ethics Committee of the State Ministry of Agriculture, Nutrition and Forestry, State of Brandenburg, Germany (23-2347-8-19-2008). Three to six mice per cage (macrolon type III) were housed at a temperature of 22°C and a 12 h light-dark cycle (lights on at 6 a.m.). Throughout the study the animals had free access to food and water.

Experimental design and iTRAQ labeling strategy are shown in Table 1. Three experimental replications were performed for each combination of genotype and diet. The experimental replications comprise a total of twelve different mouse individuals and four different iTRAQ channels (see Table 1). Due to this experimental design, the measured variance for each combination of genotype and diet is a superposition of technical error (different iTRAQ channels) and biological error (different mouse individuals). This design was chosen to match real-life experiments where these errors are important.

Table 1 Experimental Design


The dataset (Mascot Generic Files - mgf) was uploaded to PRIDE [32] - Accession number: 20140.

MASCOT

Peptide identification and quantitation were performed using MASCOT search engine (version 2.2.04 Matrix Science, London). Peptides identified with a MASCOT score < 50 and a significance threshold of p > 0.05 were neglected. Searches were performed using the following parameter set: Enzyme: Trypsin; maximum missed cleavages: 2; fixed modifications: Methylthio (C), iTRAQ4plex (N-term), iTRAQ4plex (K); quantitation method: iTRAQ 4 plex with weighted protein ratio and median normalization of ratios; variable modifications: Oxidation (M), iTRAQ4plex (Y); peptide mass tolerance: 10 ppm; fragment mass tolerance: 0.8 Da; mass values: monoisotopic; instrument type: ESI-FTICR; Isotope error mode: 0; minimum of 1 peptide per protein identification.

The database used was a SwissProt derived FGCZ in-house mouse database from 2009 containing 43636 mouse protein sequences (OS = Mus musculus) and 259 additional FGCZ specific entries. All proteins are present in normal/forward sequences and decoy/reverse sequences. Randomized decoy database (reversed sequences) was used for controlling false discovery rate (FDR) [33, 34]. For calculation of FDR the list of proteins ordered by MASCOT ProtScore was cut when given FDR level was reached. Because we intend to achieve reliable quantitation results rather than provide a comprehensive protein list, the false discovery rate was chosen restrictively: FDR = 0.1%.
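The cut-off procedure described above can be illustrated with a minimal Python sketch. It assumes that the FDR is estimated as the ratio of accepted decoy proteins to accepted target proteins, which is one common convention and not necessarily the exact estimator used in this study. The function name `cut_at_fdr`, the tuple layout and the example accessions are hypothetical.

```python
def cut_at_fdr(proteins, max_fdr):
    """Return target accessions above the score cut-off for a given FDR level.

    proteins: list of (accession, score, is_decoy) tuples (hypothetical layout).
    max_fdr:  tolerated decoy/target ratio, e.g. 0.001 for FDR = 0.1%.
    """
    # Rank all hits (targets and decoys together) by descending score.
    ranked = sorted(proteins, key=lambda p: p[1], reverse=True)
    accepted = []
    decoys = 0
    for accession, _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            # The estimated FDR only grows when a decoy is encountered,
            # so the cut-off needs to be checked only here.
            if not accepted or decoys / len(accepted) > max_fdr:
                break
        else:
            accepted.append(accession)
    return accepted


# Toy list with one reversed (decoy) entry; all values are made up.
example_proteins = [
    ("P1", 90.0, False),
    ("P2", 80.0, False),
    ("REV_P9", 70.0, True),
    ("P3", 60.0, False),
]
strict = cut_at_fdr(example_proteins, 0.001)   # list is cut at the first decoy
relaxed = cut_at_fdr(example_proteins, 0.5)    # one decoy per two targets is tolerated
```

With a restrictive level such as 0.1%, the list is effectively cut at the first decoy hit unless at least 1000 target proteins rank above it.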

X!Tandem and OpenMS

Peptide identification was performed using X!Tandem software (http://www.thegpm.org/tandem) [35] version 2009.04.01.1. X!Tandem search was performed using the following parameter set: cleavage site: '[RK]|P'; precursor-charge: 2; missed-cleavages: 2; fragment-mass-tolerance: 0.8 Da; precursor-mass-tolerance: 10 ppm; fixed-modifications: iTRAQ4plex (N-term), iTRAQ4plex (K), Methylthio (C); variable-modifications: Oxidation (M), iTRAQ4plex (Y); refinement of unanticipated cleavages.

Extraction of 4-plex iTRAQ quantitation data and isotope correction was performed using OpenMS (http://open-ms.sourceforge.net) [36, 37] svn revision 6265. The same decoy database as for MASCOT analysis was used and again false discovery rate was chosen restrictively: FDR = 0.1%. For calculation of FDR the list of proteins ordered by X!Tandem protein identification score was cut when a given FDR level was reached.

Peptide Profiling Guided Identification of Proteins - PPINGUIN

We define an iTRAQ quantitation profile of a spectrum as the ordered list of the raw quantitation values, in our case the raw intensities of the four iTRAQ channels 114 to 117. PPINGUIN builds on the presumption that profiles of peptides derived from the same protein are highly correlated as they have a common source. As a first step, and thus without regard to protein inference, iTRAQ quantitation profiles of the spectra are calculated by extracting the four quantitation values using OpenMS. In this proof-of-concept study, we want to show that clustering based on quantitation profiles representing different experimental conditions can help to correctly quantify proteins. In order to avoid distortions by missing values, we restrict the analysis to spectra with complete quantitation profiles and remove spectra with incomplete profiles. The recommended isotope correction is performed according to the manufacturer's specifications (Applied Biosystems, Foster City, CA) using OpenMS. Isotope correction aims at correcting for trace levels of isotopic impurities and is done by solving a system of equations. In addition, a complementary normalization of the four quantitation values is performed as described below.
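To make the idea of isotope correction as a system of equations concrete, the following Python sketch models the observed channel intensities as a linear mixture of the true intensities and solves for the latter. It is only an illustration: the mixing matrix contains made-up placeholder values (not the manufacturer's impurity specifications), the function names are hypothetical, and the study itself used the OpenMS implementation.

```python
# Hypothetical mixing matrix: entry [i][j] is the fraction of reagent j's
# reporter signal observed in channel i (channels 114, 115, 116, 117).
# These numbers are made-up placeholders, NOT manufacturer values.
MIXING = [
    [0.93, 0.02, 0.00, 0.00],
    [0.06, 0.92, 0.03, 0.00],
    [0.01, 0.06, 0.92, 0.04],
    [0.00, 0.00, 0.05, 0.96],
]


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs stay untouched.
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("mixing matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def apply_mixing(true_intensities, mixing=MIXING):
    """Forward model: observed = mixing * true."""
    return [sum(m * t for m, t in zip(row, true_intensities)) for row in mixing]


def correct_isotopes(observed, mixing=MIXING):
    """Recover the true channel intensities from the observed ones."""
    return solve_linear_system(mixing, observed)


true_profile = [100.0, 200.0, 50.0, 400.0]      # toy intensities
observed_profile = apply_mixing(true_profile)   # what the instrument would report
corrected_profile = correct_isotopes(observed_profile)
```

In this toy round trip the corrected profile reproduces the true profile up to floating-point error; with real data the quality of the correction depends on how well the mixing matrix reflects the reagent batch.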

Logarithmic quantitation profiles of the spectra are clustered in a coarse-grained manner using the k-means algorithm [38] based on Euclidean distance and randomly selected starting points. We use k-means clustering (k = 5) as it is computationally fast and well suited to demonstrate the benefit of the pre-selection. The number of clusters k = 5 was chosen according to two internal cluster validation measures (see Section 'Number of Clusters'). To analyze the stability of the clustering, it was repeated 1000 times, each time with different randomly chosen starting points. Of these 1000 replications, 999 resulted in the same or a very similar partitioning of the quantitation profiles of the spectra.

Clustering is intended to create groups of peptides with similar biological profiles (e.g. up-regulation for a certain combination of genotype and diet). As the subsequent analysis focuses on relative iTRAQ ratios rather than absolute quantitation values, and Euclidean distance is not scale independent, the profiles are centered prior to clustering (the mean is set to zero). In order to preserve differences between relative iTRAQ ratios, no additional scaling was performed (the standard deviation is preserved). This procedure is equivalent to clustering with Euclidean distances on centered logarithmic quantitation profiles. With this procedure an explicit choice of a reference channel is not necessary. Every spectrum is assigned to exactly one group, and within every group the corresponding spectra show similar quantitation profiles. Quantitation and identification are then performed independently for every group with settings identical to the X!Tandem and OpenMS approach. As in the X!Tandem/OpenMS approach, FDR was calculated by cutting the list of proteins ordered by X!Tandem protein identification score when a given FDR level was reached. The FDR is calibrated for each group individually and, in effect, the X!Tandem threshold for protein identification differs between groups. Finally, log₂ ratio profiles are calculated using the SJL genotype with standard diet (SD) as reference. Following the definition of iTRAQ quantitation profiles, ratio profiles are defined as the list of the 3 possible iTRAQ ratios (e.g. for Exp 1: 114/116, 115/116 and 117/116; see Figure 1).
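The centering and clustering step can be sketched in a few lines of Python. This is a simplified illustration, not the R script provided as Additional File 1: it uses toy intensities, k = 2 instead of k = 5, and an optional `initial_centers` argument (a hypothetical convenience for reproducible examples; the study used randomly selected starting points).

```python
import math
import random


def centered_log_profile(intensities):
    """Log2-transform the channel intensities and subtract their mean."""
    logs = [math.log2(v) for v in intensities]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, initial_centers=None, seed=0, max_iter=100):
    """Plain k-means with Euclidean distance; returns (labels, centers)."""
    if initial_centers is None:
        rng = random.Random(seed)
        initial_centers = rng.sample(profiles, k)   # random starting points
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        if new_labels == labels:    # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:             # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy spectra: two regulation patterns at very different absolute intensities.
raw_spectra = [
    [100.0, 200.0, 100.0, 200.0],
    [1000.0, 2000.0, 1000.0, 2000.0],
    [200.0, 100.0, 200.0, 100.0],
    [4000.0, 2000.0, 4000.0, 2000.0],
]
profiles = [centered_log_profile(s) for s in raw_spectra]
labels, centers = kmeans(profiles, k=2, initial_centers=[profiles[0], profiles[2]])
```

Because a common intensity factor becomes an additive offset after the log transform, centering removes it: the first two toy spectra share one cluster despite a ten-fold difference in absolute intensity, and no reference channel has to be chosen.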

All calculations (normalization and clustering) were performed using R statistical programming language (R version 2.7.0 - 2008-04-22). The R-script of our implementation of PPINGUIN is provided as Additional File 1. Protein inference and extraction of quantitation values was performed using X!Tandem and OpenMS as described previously.

Normalizing iTRAQ quantitations

Additional normalization of the 4 quantitation values is required to correct for technical bias [19]. Karp et al. [27] observed a heterogeneity of variance for iTRAQ ratios where the width of the distribution is significantly larger at low intensities. They proposed a variance stabilizing normalization based on VSN software [39]. We compared three different normalization strategies: VSN, multi lowess algorithm - a multi dimensional extension of lowess normalization strategy [40] and median correction. In our dataset we see heterogeneity of variance for unnormalized data as well as for median corrected data. The other two normalization approaches lead to an almost constant variance (see Additional File 2 for more details). We selected multi-lowess as our preferred normalization strategy.

Number of Clusters

The number of clusters is an important parameter for clustering. The preferable number of clusters was determined using two different internal measures: gap statistic [41, 42] and Xie-Beni index [43]. Both measures were calculated for 25 repetitions of runs. The preferable number of clusters was determined to be in the range between 3 and 7. Therefore, we selected 5 as a reasonable number of clusters.

Calculation of CV values for Peptide Homogeneity

Let $y_{j,r}$ be the relative quantitation ratio for a peptide j and ratio r ∈ R = {NZO_SD/SJL_SD, NZO_HFD/SJL_SD, SJL_HFD/SJL_SD}. To assess peptide homogeneity, we calculate the coefficient of variation of a protein p using all unique peptides of that protein:

CV_{p} = \frac{1}{3 \cdot n_{p}} \sum_{j \in p} \sum_{r \in R} \frac{\sigma_{j, r}}{\mu_{j, r}}

where $n_{p}$ is the number of unique peptides for protein p and $\sigma_{j,r}$ and $\mu_{j,r}$ are the standard deviation and mean of the relative quantitation ratios $y_{j,r}$ of all peptides uniquely assigned to protein p. The final coefficient of variation is calculated by averaging $CV_{p}$ over all proteins.

Calculation of CV values for Experimental Reproducibility

Let $y_{e,i,r}$ be the relative quantitation ratio for experiment e ∈ E = {Exp1, Exp2, Exp3}, protein i ∈ I = {1, ..., n} and ratio r ∈ R = {NZO_SD/SJL_SD, NZO_HFD/SJL_SD, SJL_HFD/SJL_SD}. In order to assess the experimental reproducibility of r, we calculate the average CV of all proteins occurring in all three experiments:

CV_{r} = \frac{1}{n} \sum_{i \in I} \frac{\sigma_{i, r}}{\mu_{i, r}}

where $\sigma_{i,r}$ and $\mu_{i,r}$ are the standard deviation and mean of the relative quantitation ratios $y_{e,i,r}$ for protein i and ratio r over all three experiments:

\begin{array}{l} \mu_{i, r} = \frac{1}{3} \sum_{e \in E} y_{e, i, r} \\ \sigma_{i, r} = \sqrt{\frac{1}{2} \sum_{e \in E} {(y_{e, i, r} - \mu_{i, r})}^{2}} \end{array}

This value is reported in Table 2 together with the mean standard deviation of log₂ ratios:

StDev_{r} = \frac{1}{n} \sum_{i \in I} {\hat{\sigma}}_{i, r}

where ${\hat{\sigma}}_{i, r}$ is the standard deviation of log₂ ratios:

\begin{array}{l} {\hat{\mu}}_{i, r} = \frac{1}{3} \sum_{e \in E} \log_{2} (y_{e, i, r}) \\ {\hat{\sigma}}_{i, r} = \sqrt{\frac{1}{2} \sum_{e \in E} {(\log_{2} (y_{e, i, r}) - {\hat{\mu}}_{i, r})}^{2}} \end{array}

Table 2 Experimental Reproducibility
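The two reproducibility measures for a single ratio r can be computed as in the following Python sketch. The function and variable names are hypothetical and the ratios are toy values; the sample standard deviation with denominator (number of experiments − 1) corresponds to the factor 1/2 in the formulas for three experiments.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (denominator n - 1, i.e. 2 for three experiments)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def reproducibility(ratios_by_protein):
    """Return (CV_r, StDev_r) for one ratio r.

    ratios_by_protein maps a protein accession to its ratios, one value per
    experiment (only proteins quantified in all experiments are included).
    """
    cvs = []
    log_sds = []
    for ratios in ratios_by_protein.values():
        cvs.append(sample_sd(ratios) / mean(ratios))                 # sigma / mu
        log_sds.append(sample_sd([math.log2(y) for y in ratios]))    # sd of log2 ratios
    return mean(cvs), mean(log_sds)


# Toy ratios for two proteins over three experiments (made-up values).
toy_ratios = {
    "A": [1.0, 1.0, 1.0],   # perfectly reproducible
    "B": [1.0, 2.0, 4.0],   # log2 ratios 0, 1, 2
}
cv_r, stdev_r = reproducibility(toy_ratios)
```

For the toy input, protein B has log₂ ratios 0, 1 and 2 with a standard deviation of exactly 1, so the mean over both proteins is StDev_r = 0.5.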

Results

We present a novel workflow termed Peptide Profiling Guided Identification of Proteins (PPINGUIN). PPINGUIN proceeds by first clustering spectra based on their quantitation values and then inferring proteins for each cluster independently (see Methods). The results of our approach are compared with standard evaluation approaches using MASCOT and X!Tandem/OpenMS software (see Methods).

Proteins identified

The numbers of protein accessions identified with the same FDR (see Methods) differ for each method: 225 for MASCOT, 177 for X!Tandem and OpenMS based approach and 176 for PPINGUIN. Ambiguous protein groups (e.g. H2B1B, H2B1C, H2B1F,...) identified with exclusively non-unique peptides, were not counted here. Therefore, the actual number of proteins and the overlaps of the three methods may be underestimated. Most of the representative accessions received from PPINGUIN analysis were also detected using X!Tandem (83%). Both methods have their set of unique proteins: 32 for PPINGUIN and 33 for X!Tandem. The overlap between MASCOT and the other two approaches is good: 70% of the X!Tandem IDs and 62% of PPINGUIN IDs were found with MASCOT (see Venn diagram in Figure 3). Explanations for these differences are provided in the discussion below.

The quality of the three methods is assessed in terms of three different criteria: (i) homogeneity of peptide profiles (ii) reproducibility in independent experiments and (iii) accordance with prior knowledge.

Homogeneity of peptide profiles

As described above, a protein represented by multiple unique peptides should result in strictly correlated quantitation ratios for the peptides. Often, however, heterogeneous ratio profiles are observed using MASCOT as well as X!Tandem, which leads to difficulties in quantitative interpretation. This situation is illustrated in the first and second row of Figure 4 for three example proteins. An obvious reason for heterogeneous quantitation values is non-unique peptides shared by different proteins. To avoid this effect, non-unique peptides are excluded from all plots and statistical assessments. Using our approach, peptide profiles are more homogeneous, supporting a consistent quantitative interpretation (see bottom row of Figure 4). A distinctive feature of PPINGUIN is demonstrated by the ribosomal protein RS_30: inconsistent quantitation profiles are resolved by splitting the peptides into two groups, each with homogeneous profiles. This effect is illustrated in more detail in Figure 5 (additional examples are given in Additional File 3). The protein is identified in two different clusters (1 and 4) with different peptide profiles. The peptides in cluster 1 show low relative concentration for NZO_SD (114) and high relative concentration for NZO_HF (117), while peptides in cluster 4 show the opposite behavior. The peptides belonging to each cluster are located at different sites of the protein. As discussed below, this finding is a hint towards two variants of the RS_30 protein.

For a quantitative assessment of overall peptide profile homogeneity we performed a comprehensive statistical evaluation. We calculated CV values for each of the three approaches (see Methods section). We observed a CV of peptides assigned to a protein of 20% for MASCOT and 26% for X!Tandem, while PPINGUIN has a smaller CV of 14%.

Reduced variance in PPINGUIN is an expected effect since the peptides within each cluster are rather similar by construction. For an independent assessment of our method we now proceed to investigate experimental reproducibility (precision) and accordance with prior knowledge.

Precision - Experimental Reproducibility

In order to test reproducibility we performed three independent experimental replications and three independent evaluations (see Dataset section in Methods). We investigated two different mouse genotypes and two diets resulting in 4 distinct combinations. The 4 combinations define 3 ratios using SJL genotype with standard diet (SD) as reference. Quantitation ratios for a protein are calculated by averaging the log ratios of the corresponding peptides. To facilitate comparability we restrict the analysis to the set of 99 proteins identified in all three experimental replications.
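Averaging log ratios, as described above, amounts to taking the geometric mean of the peptide ratios. The short Python sketch below (hypothetical function name, toy values) shows why this differs from averaging the raw ratios.

```python
import math


def protein_log2_ratio(peptide_ratios):
    """Protein-level log2 ratio as the mean of the peptide log2 ratios."""
    logs = [math.log2(r) for r in peptide_ratios]
    return sum(logs) / len(logs)


# Two toy peptides: one 2-fold down, one 2-fold up relative to the reference.
peptide_ratios = [0.5, 2.0]
log2_ratio = protein_log2_ratio(peptide_ratios)               # log2 values -1 and +1 cancel
geometric_ratio = 2.0 ** log2_ratio                            # back-transformed protein ratio
arithmetic_ratio = sum(peptide_ratios) / len(peptide_ratios)   # plain mean of raw ratios
```

On the log scale, a 2-fold decrease and a 2-fold increase cancel to a protein ratio of 1, whereas the arithmetic mean of the raw ratios (1.25) would suggest a net up-regulation.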

We calculated mean coefficient of variation and mean standard deviation of log quantitation ratios (CV and StDev - see Methods section) of all proteins (see Table 2).

The analysis was performed separately for each of the 3 experimental ratios: NZO_SD/SJL_SD, NZO_HFD/SJL_SD and SJL_HFD/SJL_SD.

Experimental variation of the MASCOT based evaluation is characterized by CV values ranging from 0.13 to 0.18 (see first column in Table 2). X!Tandem/OpenMS results in CV values ranging from 0.12 to 0.17 (second column in Table 2). Experimental variation is reduced using PPINGUIN with CV values ranging from 0.10 to 0.15 (third column in Table 2).

Different from the improved homogeneity in the previous section, the lower error of PPINGUIN is not a trivial effect since the complete analysis workflow is performed for each experiment independently. These results demonstrate that applying the proposed method for data evaluation leads to more stable quantitation values.

Accordance with prior knowledge

The identification of differentially expressed proteins is a major goal of quantitative proteomics. We now compare the sets of differentially expressed biomarker candidates obtained with the three different methods. To assess the results of the differential analysis, we use a set of 'gold standard' genes identified in the context of type 2 diabetes [44]. This meta-analysis reports top gene candidates for a mixture of genotypic and dietary effects. To achieve comparability with the meta-analysis, the differential analysis is performed by comparing NZO mice on high fat diet with SJL mice on standard diet (see Dataset section).

Top lists of differentially expressed proteins are created by selecting proteins with mean absolute log₂ fold changes above an arbitrarily chosen threshold of 0.5 (1.4-fold change). Due to the low number of replicates we use the fold change instead of the p-value as the criterion to judge differential expression.
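The selection rule can be sketched in Python as follows. The sketch assumes that "mean absolute log₂ fold change" means the absolute value of the mean log₂ ratio over the experimental replications, which is one possible reading; the function name and the input values are hypothetical.

```python
def differential_proteins(log2_ratios_by_protein, threshold=0.5):
    """Select proteins whose mean log2 fold change exceeds the threshold in magnitude.

    log2_ratios_by_protein maps an accession to its log2 ratios, one per
    experimental replication. Returns (accession, mean log2 ratio) pairs,
    sorted by decreasing absolute fold change.
    """
    selected = []
    for accession, log2_ratios in log2_ratios_by_protein.items():
        mean_log2 = sum(log2_ratios) / len(log2_ratios)
        if abs(mean_log2) > threshold:      # assumption: |mean|, not mean of |.|
            selected.append((accession, mean_log2))
    return sorted(selected, key=lambda item: abs(item[1]), reverse=True)


# Toy log2 ratios for three proteins over three replications (made-up values).
toy_log2 = {
    "A": [0.6, 0.7, 0.8],      # up-regulated
    "B": [0.1, -0.2, 0.3],     # unchanged
    "C": [-1.0, -0.9, -1.1],   # down-regulated
}
top_list = differential_proteins(toy_log2)
fold_at_threshold = 2.0 ** 0.5   # a log2 threshold of 0.5 equals about 1.41-fold
```

Lowering or raising `threshold` (e.g. to 0.3 or 0.7) changes the length of the top list but not the ranking of the proteins that remain.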

Evaluation based on MASCOT identifies a total of 14 differentially regulated proteins, of which 29% (4) are found in the reference set. Using X!Tandem and OpenMS we identified only 8 differential proteins, of which 37% (3) are found in the reference set. PPINGUIN results in 14 differentially expressed proteins, of which 50% (7) are part of the reference set. Table 3 presents the statistics of the differentially regulated proteins identified using PPINGUIN (proteins of the reference set are marked with asterisks). Of the three methods, PPINGUIN shows the highest agreement with the reference list. This remains true for alterations of the threshold value (e.g. 0.3 or 0.7).

Table 3 Accordance with prior knowledge


Discussion

Typically, data mining techniques are applied after protein inference and quantitation. In contrast to the standard workflow, our approach employs clustering prior to protein inference as a very early step in data processing (see workflow comparison in Figure 2). Recently different approaches have been proposed to improve protein identification using peak intensities [26, 45]. In contrast to these works, our major goal is to improve quantitation itself based on a set of proven and tested identification tools.

A key feature of our approach is shown in Figure 5: the separation of unique peptides for a protein in multiple clusters. Non-unique peptides shared by different proteins are not considered. The peptides in each cluster exhibit distinct quantitation profiles which are most likely corresponding to protein isoforms. Typical reasons for isoforms are protein modification, splice variants or degradation effects.

For further investigation of protein modifications, we first identified most frequent modifications in our dataset. For each of the 800 modifications listed in Unimod [46] we re-performed protein inference searching for single variable modifications. The most frequent modification found was oxidation of methionine which increased the number of found peptide-spectrum-matches by almost 10%. Oxidation of methionine, whose impact on iTRAQ has been reported previously [47], can be caused by an enzymatic reaction but can also be due to sample preparation in the presence of reactive oxygen species. Other frequent modifications were 'Oxidation (D)', 'Oxidation (N)', 'Deamidated (Q)'. Subsequently we re-performed the analysis allowing for these 4 variable modifications simultaneously. However, in this second identification step we did not find further evidence for protein modification regarding the RS_30 protein isoforms.

Investigating splice variants as a possible explanation, we found that the RS_30 protein is transcribed from exons 4 and 5 of the FAU gene (Ensembl-ID: ENSMUSG00000038274). The peptides from different clusters are located in different regions of the protein, which also correspond to the different exons of the FAU gene, but there was no indication of differential splicing in the database. However, the FAU gene may have two variants: the RS_30 protein with 59 amino acids and the completely transcribed protein with 133 amino acids. PPINGUIN finds two variants of RS_30. The two isoforms found by PPINGUIN may correspond to these two potential variants, which of course would require further experimental investigation. If PPINGUIN can detect potential novel splice variants, it may help to improve protein or even nucleotide databases.

PPINGUIN is not designed to exploit known protein variants, but it may indirectly re-detect known variants as well. Incorporating knowledge of known protein variants during protein inference should further improve protein quantitation.

The set of identified proteins differs between PPINGUIN and X!Tandem. Assigning peptides to different groups and subsequently identifying proteins for each group individually is expected to lower identification significance and thus to reduce the number of proteins. Indeed, a random grouping identifies only 138 (± 10) proteins in all experiments. However, the biologically motivated clustering used by PPINGUIN leads to a total of 176 proteins, 32 of which are found only by PPINGUIN and not by X!Tandem. This is due to two combined effects. First, by exploiting quantitation profile information, our clustering leads to a relative enrichment of peptides belonging to the same protein in a cluster. Second, by splitting spectra into groups, clustering decreases the total number of spectra in each identification process. The reduced number of spectra per cluster alters the identification threshold used for calibration of the false discovery rate and, in effect, new proteins are identified. The largest set of uniquely identified proteins was found for MASCOT. Most of these 93 unique MASCOT proteins are also found using X!Tandem, but they remain below the significance threshold. This is mostly due to differences in the assessment of short peptides, since MASCOT appears to include many small peptides for identification that are excluded by X!Tandem. The set of quantified protein accessions obtained with PPINGUIN is characterized by an increased experimental reproducibility compared to the other methods. This implies that when PPINGUIN is used for evaluation, one experimental outcome is a more reliable predictor of the outcome of a similar experiment. Finally, the comparison with prior knowledge showed a surprisingly high agreement of our top proteins with a reference set, which we deem representative for diabetes and obesity. This hints at the practical benefit of our method.

Conclusion

We proposed a novel method for evaluation of iTRAQ data motivated by the observation that relative concentrations of peptides derived from the same protein often show unexpectedly heterogeneous correlation patterns. Exploiting correlations of quantitation ratios achieves more consistent quantitation ratios than the standard approaches. This is demonstrated by an increased reproducibility of independent experiments. Besides leading to a more reliable quantitation, the method can reveal new isoform candidates.

We see our work as a promising step towards quantitation guided identification. In general, we recommend using our method when accurate quantitation is a major objective of research. Given the increasing importance of quantitative proteomics, we think that this method will be useful in practical applications such as model fitting or functional enrichment analysis.

We expect that our approach will become still more valuable with an increasing number of samples quantified in parallel (e.g. 8-plex iTRAQ), since the importance of the clustering increases. The proposed approach can also be very useful for other quantitative proteomics technologies such as SILAC. A next step will be to extend the algorithms to include spectra with incomplete iTRAQ quantitations. Future versions of PPINGUIN will aim at further refinement of protein quantitation by incorporating the rapidly growing public knowledge on splice variants and protein isoforms.

References

Lee KH: Proteomics: a technology-driven and technology-limited discovery science. Trends Biotechnol 2001, 19: 217–222. 10.1016/S0167-7799(01)01639-0
Fievet J, Dillmann C, Lagniel G, Davanture M, Negroni L, Labarre J, de Vienne D: Assessing factors for reliable quantitative proteomics based on two-dimensional gel electrophoresis. Proteomics 2004, 4: 1939–1949. 10.1002/pmic.200300731
Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, Purkayastha S, Juhasz P, Martin S, Bartlet-Jones M, He F, Jacobson A, Pappin DJ: Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics 2004, 3: 1154–1169. 10.1074/mcp.M400129-MCP200
Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 1999, 17: 994–999. 10.1038/13690
Ong SE, Blagoev B, Kratchmarova I, Kristensen DB, Steen H, Pandey A, Mann M: Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 2002, 1: 376–386. 10.1074/mcp.M200025-MCP200
Mann M: Functional and quantitative proteomics using SILAC. Nat Rev Mol Cell Biol 2006, 7: 952–958. 10.1038/nrm2067
Zhou L, Beuerman RW, Chan CM, Zhao SZ, Li XR, Yang H, Tong L, Liu S, Stern ME, Tan D: Identification of tear fluid biomarkers in dry eye syndrome using iTRAQ quantitative proteomics. J Proteome Res 2009, 8: 4889–4905. 10.1021/pr900686s
Serada S, Fujimoto M, Ogata A, Terabe F, Hirano T, Iijima H, Shinzaki S, Nishikawa T, Ohkawara T, Iwahori K, Ohguro N, Kishimoto T, Naka T: iTRAQ-based proteomic identification of leucine-rich alpha-2 glycoprotein as a novel inflammatory biomarker in autoimmune diseases. Ann Rheum Dis 2010, 69: 770–774. 10.1136/ard.2009.118919
Hsieh HC, Chen YT, Li JM, Chou TY, Chang MF, Huang SC, Tseng TL, Liu CC, Chen SF: Protein profilings in mouse liver regeneration after partial hepatectomy using iTRAQ technology. J Proteome Res 2009, 8: 1004–1013. 10.1021/pr800696m
Jin J, Park J, Kim K, Kang Y, Park SG, Kim JH, Park KS, Jun H, Kim Y: Detection of differential proteomes of human beta-cells during islet-like differentiation using iTRAQ labeling. J Proteome Res 2009, 8: 1393–1403. 10.1021/pr800765t
Kolla V, Jeno P, Moes S, Tercanli S, Lapaire O, Choolani M, Hahn S: Quantitative proteomics analysis of maternal plasma in Down syndrome pregnancies using isobaric tagging reagent (iTRAQ). J Biomed Biotechnol 2010, 2010: 952047.
Shadforth IP, Dunkley TP, Lilley KS, Bessant C: i-Tracker: for quantitative proteomics using iTRAQ. BMC Genomics 2005, 6: 145. 10.1186/1471-2164-6-145
Laderas T, Bystrom C, McMillen D, Fan G, McWeeney S: TandTRAQ: an open-source tool for integrated protein identification and quantitation. Bioinformatics 2007, 23: 3394–3396. 10.1093/bioinformatics/btm467
Yu CY, Tsui YH, Yian YH, Sung TY, Hsu WL: The Multi-Q web server for multiplexed protein quantitation. Nucleic Acids Res 2007, 35: W707–712. 10.1093/nar/gkm345
Rodriguez-Suarez E, Gubb E, Alzueta IF, Falcon-Perez JM, Amorim A, Elortza F, Matthiesen R: Virtual expert mass spectrometrist: iTRAQ tool for database-dependent search, quantitation and result storage. Proteomics 2010, 10: 1545–1556. 10.1002/pmic.200900255
Lacerda CM, Xin L, Rogers I, Reardon KF: Analysis of iTRAQ data using Mascot and Peaks quantification algorithms. Brief Funct Genomic Proteomic 2008, 7: 119–126. 10.1093/bfgp/eln017
Ma B, Zhang K, Hendrie C, Liang C, Li M, Doherty-Kirby A, Lajoie G: PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom 2003, 17: 2337–2342. 10.1002/rcm.1196
Casado-Vela J, Martinez-Esteso MJ, Rodriguez E, Borras E, Elortza F, Bru-Martinez R: iTRAQ-based quantitative analysis of protein mixtures with large fold change and dynamic range. Proteomics 2010, 10: 343–347. 10.1002/pmic.200900509
Ow SY, Salim M, Noirel J, Evans C, Rehman I, Wright PC: iTRAQ underestimation in simple and complex mixtures: "the good, the bad and the ugly". J Proteome Res 2009, 8: 5347–5355. 10.1021/pr900634c
Callesen AK, Vach W, Jørgensen PE, Cold S, Mogensen O, Kruse TA, Jensen ON, Madsen JS: Reproducibility of mass spectrometry based protein profiles for diagnosis of breast cancer across clinical studies: a systematic review. J Proteome Res 2008, 7: 1395–1402. 10.1021/pr800115f
Mercier C, Truntzer C, Pecqueur D, Gimeno JP, Belz G, Roy P: Mixed-model of ANOVA for measurement reproducibility in proteomics. J Proteomics 2009, 72: 974–981. 10.1016/j.jprot.2009.05.005
Gan CS, Chong PK, Pham TK, Wright PC: Technical, experimental, and biological variations in isobaric tags for relative and absolute quantitation (iTRAQ). J Proteome Res 2007, 6: 821–827. 10.1021/pr060474i
Choe LH, Aggarwal K, Franck Z, Lee KH: A comparison of the consistency of proteome quantitation using two-dimensional electrophoresis and shotgun isobaric tagging in Escherichia coli cells. Electrophoresis 2005, 26: 2437–2449. 10.1002/elps.200410336
Boehm AM, Putz S, Altenhofer D, Sickmann A, Falk M: Precise protein quantification based on peptide quantification using iTRAQ. BMC Bioinformatics 2007, 8: 214. 10.1186/1471-2105-8-214
Li XJ, Zhang H, Ranish JA, Aebersold R: Automated statistical analysis of protein abundance ratios from data generated by stable-isotope dilution and tandem mass spectrometry. Anal Chem 2003, 75: 6648–6657. 10.1021/ac034633i
He Z, Zhao H, Yu W: Score regularization for peptide identification. BMC Bioinformatics 2011, 12(Suppl 1):S2. 10.1186/1471-2105-12-S1-S2
Karp NA, Huber W, Sadowski PG, Charles PD, Hester SV, Lilley KS: Addressing accuracy and precision issues in iTRAQ quantitation. Mol Cell Proteomics 2010.
Ortlepp JR, Kluge R, Giesen K, Plum L, Radke P, Hanrath P, Joost HG: A metabolic syndrome of hypertension, hyperinsulinaemia and hypercholesterolaemia in the New Zealand obese mouse. Eur J Clin Invest 2000, 30: 195–202. 10.1046/j.1365-2362.2000.00611.x
West DB, Boozer CN, Moody DL, Atkinson RL: Dietary obesity in nine inbred mouse strains. Am J Physiol 1992, 262: R1025–1032.
Chadt A, Leicht K, Deshmukh A, Jiang LQ, Scherneck S, Bernhardt U, Dreja T, Vogel H, Schmolz K, Kluge R, Zierath JR, Hultschig C, Hoeben RC, Schurmann A, Joost HG, Al-Hasani H: Tbc1d1 mutation in lean mouse strain confers leanness and protects from diet-induced obesity. Nat Genet 2008, 40: 1354–1359. 10.1038/ng.244
Jurgens HS, Schurmann A, Kluge R, Ortmann S, Klaus S, Joost HG, Tschop MH: Hyperphagia, lower body temperature, and reduced running wheel activity precede development of morbid obesity in New Zealand obese mice. Physiol Genomics 2006, 25: 234–241. 10.1152/physiolgenomics.00252.2005
Vizcaino JA, Cote R, Reisinger F, Barsnes H, Foster JM, Rameseder J, Hermjakob H, Martens L: The Proteomics Identifications database: 2010 update. Nucleic Acids Res 2010, 38: D736–742. 10.1093/nar/gkp964
Elias JE, Gygi SP: Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods 2007, 4: 207–214. 10.1038/nmeth1019
Kall L, Storey JD, MacCoss MJ, Noble WS: Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J Proteome Res 2008, 7: 29–34. 10.1021/pr700600n
Craig R, Beavis RC: TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20: 1466–1467. 10.1093/bioinformatics/bth092
Reinert K, Kohlbacher O: OpenMS and TOPP: open source software for LC-MS data analysis. Methods Mol Biol 2010, 604: 201–211. 10.1007/978-1-60761-444-9_14
Sturm M, Bertsch A, Gropl C, Hildebrandt A, Hussong R, Lange E, Pfeifer N, Schulz-Trieglaff O, Zerck A, Reinert K, Kohlbacher O: OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics 2008, 9: 163. 10.1186/1471-2105-9-163
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. 2nd edition. New York: Springer; 2009.
Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18(Suppl 1):96–104. 10.1093/bioinformatics/18.suppl_1.S96
Quackenbush J: Microarray data normalization and transformation. Nat Genet 2002, 32(Suppl):496–501.
Tibshirani R, Walther G, Hastie T: Estimating the Number of Clusters in a Dataset via the Gap Statistic. 2000.
Arima C, Hakamada K, Okamoto M, Hanai T: Modified fuzzy gap statistic for estimating preferable number of clusters in fuzzy k-means clustering. J Biosci Bioeng 2008, 105: 273–281. 10.1263/jbb.105.273
Xie XL, Beni G: A Validity Measure for Fuzzy Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 1991, 13(8):841–847. 10.1109/34.85677
Rasche A, Al-Hasani H, Herwig R: Meta-analysis approach identifies candidate genes and associated molecular networks for type-2 diabetes mellitus. BMC Genomics 2008, 9: 310. 10.1186/1471-2164-9-310
Li W, Ji L, Goya J, Tan G, Wysocki VH: SQID: an intensity-incorporated protein identification algorithm for tandem mass spectrometry. J Proteome Res 2011, 10: 1593–1602. 10.1021/pr100959y
Creasy DM, Cottrell JS: Unimod: Protein modifications for mass spectrometry. Proteomics 2004, 4: 1534–1536. 10.1002/pmic.200300744
Tenga MJ, Lazar IM: Impact of peptide modifications on the isobaric tags for relative and absolute quantitation method accuracy. Anal Chem 2011, 83: 701–707. 10.1021/ac100775s
Martens L, Chambers M, Sturm M, Kessner D, Levander F, Shofstahl J, Tang WH, Rompp A, Neumann S, Pizarro AD, Montecchi-Palazzi L, Tasman N, Coleman M, Reisinger F, Souda P, Hermjakob H, Binz PA, Deutsch EW: mzML-a community standard for mass spectrometry data. Mol Cell Proteomics 2011, 10: R110.000133. 10.1074/mcp.R110.000133

Acknowledgements

This work is part of the Sys-Prot project, funded by the European Commission under the sixth framework program for research and technical development (EU grant 37457).

Author information

Authors and Affiliations

MicroDiscovery GmbH, Marienburger Str. 1, 10405, Berlin, Germany
Chris Bauer, Frank Kleinjung & Johannes Schuchhardt
Functional Genomics Center, UNI ETH Zurich, Winterthurerstrasse 190, CH-8057, Zurich, Switzerland
Dorothea Rutishauser, Christian Panse & Ralph Schlapbach
German Institute of Human Nutrition, Department of Pharmacology, Arthur-Scheunert-Allee 114-116, 14558, Nuthetal, Germany
Alexandra Chadt, Tanja Dreja & Hadi Al-Hasani
German Diabetes-Center at the Heinrich-Heine-University, Düsseldorf, Germany
Hadi Al-Hasani
Department Computer Science and Mathematics, Free University of Berlin, Berlin, Germany
Knut Reinert

Corresponding author

Correspondence to Chris Bauer.

Additional information

Authors' contributions

CB developed and implemented the described methods and drafted the manuscript. TD, AC and HA were responsible for the generation of the mouse samples. DR and RS acquired the iTRAQ profile data. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2011_5188_MOESM1_ESM.R

Additional file 1: R-script of PPINGUIN. PPINGUIN.R: R-script with our implementation of PPINGUIN. The script requires OpenMS and X!Tandem to be installed. (R 13 KB)

12859_2011_5188_MOESM2_ESM.PDF

Additional file 2: Normalization - Results. More detailed description of the normalization strategy applied in this work. The effects of the normalization algorithms on channel bias and homoscedasticity are demonstrated. (PDF 9 MB)

12859_2011_5188_MOESM3_ESM.ZIP

Additional file 3: Examples for Potential Isoforms. Additional zip archive containing pdf images for 6 further examples with potential protein isoforms. (ZIP 82 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Bauer, C., Kleinjung, F., Rutishauser, D. et al. PPINGUIN: Peptide Profiling Guided Identification of Proteins improves quantitation of iTRAQ ratios. BMC Bioinformatics 13, 34 (2012). https://doi.org/10.1186/1471-2105-13-34

Received: 02 August 2011
Accepted: 16 February 2012
Published: 16 February 2012
DOI: https://doi.org/10.1186/1471-2105-13-34
