Abstract
Background
Quantification of peptides by mass spectrometry can be performed using the iTRAQ™ reagent. This technology yields information about the relative abundance of single peptides. A method for the calculation of reliable quantification information is required in order to obtain biologically relevant data at the protein expression level.
Results
A method comprising sound error estimation and statistical methods is presented that allows precise abundance analysis plus error calculation at the peptide as well as at the protein level. This yields the relevant information that is required for quantitative proteomics. Compared with existing approaches, the error estimation of our method, named Quant, is reliable and offers information for precise bioinformatic models. Quant is shown to generate results that are consistent with those produced by ProQuant™, thus validating both systems. Moreover, the results are consistent with those of Mascot™ 2.2. The MATLAB® scripts of Quant are freely available via http://www.proteinms.de and http://sourceforge.net/projects/protms/, each under the GNU Lesser General Public License.
Conclusion
The software Quant demonstrates improvements in protein quantification using iTRAQ™. Precise quantification data can be obtained at the protein level when using error propagation and adequate visualization. Quant integrates both and additionally provides the possibility to obtain more reliable results by calculating suitable quality measures. Peak area integration has been replaced by the sum of intensities, yielding more reliable quantification results. Additionally, Quant allows the combination of quantitative information obtained by iTRAQ™ with peptide and protein identifications from popular tandem MS identification tools. Hence Quant is a useful tool for the proteomics community and may help improve the analysis of proteomic experimental data. In addition, we have shown that a lognormal distribution fits the data of mass spectrometry-based relative peptide quantification.
Background
Mass spectrometry is a common technique employed for protein identification in proteomics. In tandem mass spectrometry, proteins are identified by matching the measured fragment ion spectra of peptides with theoretical spectra calculated from known DNA or protein sequences [1], for example the NCBI sequence database [2] or SwissProt [3].
Instead of studying a single protein in detail as done in former days of protein sciences, the analysis of all proteins of a cell – the proteome – became important [4]. The proteome comprises all the proteins present in an organism, tissue or cell at a particular time. In contrast to the genome, the proteome is not static but highly dynamic.
To understand the biological and biochemical processes in a cell or an organism, for example responses to different environmental influences or the difference between healthy and diseased tissue, analysis of all differences at genomic or proteomic level needs to be performed. The protein abundance changes over time are needed for understanding cellular processes [5].
Differences in protein expression are not accessible at the genomic level but often are accessible at the proteome level [6]. Some proteins are up- or downregulated in the different stages of a cell. Therefore, quantitative information about the expressed proteins is needed and constitutes a key step towards fully understanding the functions of organelles, cells and organisms as well as disease processes. Furthermore, the quantitative information about protein expression can be used for bioinformatic modelling of cellular processes such as pathways, cell maturation and metabolism [7].
The advantages of mass spectrometry-based peptide quantification are precision, sensitivity, throughput and convenient automation [8,9]. During the last decade, several techniques have been established [10], e.g. the isobaric tag for relative and absolute quantitation (iTRAQ™), which is currently the only technique capable of multiplexing up to four different samples for relative quantification. Four chemically identical iTRAQ™ reagents are available, named 114, 115, 116 and 117, which have the same overall mass. Each label is composed of a peptide reactive group (NHS ester) and an isobaric tag of 145 Da that consists of a balancer group (carbonyl) and a reporter group (based on N-methylpiperazine) [11], as shown in figure 1. A fragmentation site lies between the balancer and the reporter group. The peptide reactive group attaches specifically to free primary amino groups, i.e. N-termini and ε-amino groups of lysine residues. Side reactions on tyrosine have also been reported [11]. No labelling occurs if the primary amino groups are modified; for example, N-terminal glutamine or glutamic acid can form a ring (pyroglutamic acid), or an acetylation may occur. Therefore, iTRAQ™ labels those peptides in the sample that possess at least one free primary amino group.
Figure 1. Chemical structure of the iTRAQ™ reagent. The label is composed of a peptide reactive group (red, NHS ester) and an isobaric tag of 145 Da, which consists of a balancer group (blue, carbonyl group) and a reporter group (green, N-methylpiperazine). The four available tags of identical overall mass vary in their stable isotope compositions such that the reporter group has a mass of 114–117 Da and the balancer of 28–31 Da. The fragmentation site between the balancer and the reporter group is responsible for the generation of the reporter ions in the region of 114–117 m/z.
In fragment ion spectra of iTRAQ™ labelled peptides, additional peaks appear in the m/z range of 114 to 117, originating from the singly charged reporter group fragment of each iTRAQ™ label. Peptide quantification can be performed by interpretation of these peaks. In order to allow for judging the results calculated from the reporter peaks, a reliable quality measure is needed [12] not only at the peptide level.
The development of precise and transparent methods for analysis of proteomic data is one of the crucial challenges in protein sciences [8]. Software support for data evaluation is needed for quantification, because proteomics yields huge amounts of data [13]. These computer programs must be capable of providing results at the protein level. Some software is already available for analyzing iTRAQ™ data, such as iTracker [14], MassTRAQ [15], ProQuant™ (Applied Biosystems (ABI), Darmstadt, Germany), ProteinPilot™ (ABI) or Mascot™ 2.2 (Matrix Science, London, UK). Some of these are not freely available, such as ProQuant™, ProteinPilot™ and Mascot™. MassTRAQ and iTracker only provide data at the peptide level. These tools have in common that they are not capable of calculating reliable quantification information at the protein level or do not provide precise error estimation or a reliable quality measure. Some of them assume an inappropriate distribution for their peptide and signal statistics. We thus decided to develop our tool, named Quant, for quantification at the peptide level as well as at the protein level. We focus on the protein level, as only this allows meaningful interpretations of the experimental data, including a reliable transfer into bioinformatic modelling. Moreover, this software is freely available.
Methods
Experiments
The functionality of Quant has been proven by application to a standard protein mix provided by Applied Biosystems within the iTRAQ™ kit.
Sample preparation
A six-protein mix delivered with the iTRAQ™ kit was used for the analysis. The protein mix consisted of bovine serum albumin (Accession Number P02769), β-galactosidase (P00722), α-lactalbumin (P00711), β-lactoglobulin (P02754), lysozyme (P00698) and apo-transferrin (P02787).
The proteins were dissolved according to the iTRAQ™ reagent protocol [16] in 100 mM triethylammonium bicarbonate buffer at pH 8.5. The cysteine residues were blocked and alkylated with MMTS as described in the iTRAQ™ protocol and the proteins were digested overnight using trypsin. The obtained peptides were labelled with the iTRAQ™ reagent in 70% ethanol.
The sample was divided into two halves, one of which was labelled with the iTRAQ™ reagent 114 and the other with 117. These differently labelled samples were mixed 1:1 and 1:3. The samples were separated by multidimensional liquid chromatography. In the first dimension, the mixture was separated by strong cation exchange chromatography (PL-SCX; 2.1 mm inner diameter (ID), 150 mm length, 1000 Å pore size, 8 μm particle size, Polymer Laboratories, Darmstadt, Germany) using a linear binary gradient (solvent A: 50 mM KH_{2}PO_{4}, pH 3.5; solvent B: 50 mM KH_{2}PO_{4}, 0.25 M NaCl, 25% ACN, pH 3.5). The peptides were separated with a gradient in which the proportion of solvent B increased by 2% per minute. SCX fractions were collected every minute and the organic solvent was removed under vacuum. The fractions were then separated in a second dimension and analyzed using nano-LC-MS/MS.
The nano MS/MS analysis was conducted with a Qstar XL (ABI). Samples were preconcentrated using a C18 PepMap trapping column (300 μm ID, 1 mm length, 100 Å pore size, 5 μm particle size; Dionex, Idstein, Germany) and afterwards separated on a C18 PepMap main column (75 μm ID, 150 mm length, 100 Å pore size, 3 μm particle size; Dionex) using a linear binary gradient (solvent A: 0.1% FA; solvent B: 0.1% FA, 84% ACN). Full MS scans from 400 to 1500 m/z were recorded, and the two most intense peptide ions were subjected to further fragmentation. The MS/MS scans were recorded from 100 to 1500 m/z.
Protein identification
MS/MS data were exported using wiff2dta [13], version 1.1.10. Protein identification was performed using Mascot™, version 2.0 (Matrix Science, London, UK) and the database SwissProt (26012006). Identification data as well as fragment ion spectra were extracted using mres2x [17]. MS/MS peptide identifications were verified using theospec [1] and the visualization tools of resDB [18]. Protein identifications were verified using seqDB [19], as in former studies [18].
The quantification by ProQuant™ was performed using the Analyst QS™ Software, version 1.1. Proteins were implicitly identified by ProID™ 1.1 using the SwissProt database (26012006). An interrogator database was generated based on the database using the enzyme trypsin and allowing one missed cleavage site. The parameters for ProQuant™ (version 1.1) and Pro Group Report (version 1.0.2) were 1.30 for the protein score threshold, and competitor proteins were shown within a protein score of 2.00. The mass tolerance was set to 0.4 amu for precursor ions and 0.4 amu for fragment ions.
Additionally, Mascot™ 2.2 was used for iTRAQ™ analysis. The protein ratio type was set to median, the normalization method was median ratio, no outlier removal was chosen and the peptide threshold was set to at least homology.
Error estimation and error propagation
We introduce precise error propagation in quantification software. A common approach to error estimation uses the mean value μ and the standard deviation σ together with the k·σ rule and the Chebyshev inequality, and has been proposed for quantification [12]. However, this method assumes that the measured values are independent and simultaneously requires that they are normally distributed (normality). If one or both of these conditions cannot be assured, means other than this statistical approach to error estimation have to be applied. This is the case, for example, if each measurement is made only once and uncertainty arises from the limited precision of the instruments used. Moreover, the peptide count in quantitative proteomics is not large enough for reliable calculation of a mean and a standard deviation. In that case, errors have to be estimated by intervals, i.e. the minimum and maximum values are calculated.
Usually in error treatment, observations are denoted with their errors. Let a and b be two measurements of the true values a_{0} and b_{0} with the relative errors f_{a} and f_{b}, respectively. The corresponding absolute errors are denoted as e_{a} and e_{b}. Then the equations a = a_{0} (1 ± f_{a}) = a_{0} ± e_{a} and b = b_{0} (1 ± f_{b}) = b_{0} ± e_{b} are valid.
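Error propagation can be calculated depending on the mathematical operation as follows. For sum and difference, the absolute errors add:
a ± b = (a_{0} ± b_{0}) ± (e_{a} + e_{b}),
and for product as well as quotient, the relative errors add to first order:
a · b = a_{0} · b_{0} (1 ± (f_{a} + f_{b})) and a / b = (a_{0} / b_{0}) (1 ± (f_{a} + f_{b})).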
This can be applied to the calculation of the determinant of any n × n matrix M. If any two columns are exchanged, the propagated relative error is not affected. This is especially valid when determinants are calculated by using submatrices.
The absolute error e_{i} of the peak intensity I_{i} is 0.5 in the case of integer values. In all other cases, this error depends on the precision of the mass spectrometer and must be estimated individually during calibration. An MS/MS spectrum can be defined as a set M of 2-tuples M = {(x_{i}, I_{i}) | i ∈ {1,..., n}}, and the intensities I_{i} can be regarded as error-prone, I_{i} = y_{0} ± e_{i} = y_{0} (1 ± f_{i}), but derived from the true signal y_{0}.
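The propagation rules for the four basic operations can be sketched in a few lines of Python. This is a minimal illustration and not the Quant code: the class name `Measured` and its fields are hypothetical, and the sketch assumes the worst-case (interval) rules in which absolute errors add for sums and differences and relative errors add, to first order, for products and quotients.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    """A value with its absolute error (hypothetical helper, not part of Quant)."""
    value: float
    abs_err: float

    @property
    def rel_err(self) -> float:
        # Relative error f = e / |a|; undefined for a zero value.
        return self.abs_err / abs(self.value)

    def __add__(self, other: "Measured") -> "Measured":
        # Sum: absolute errors add.
        return Measured(self.value + other.value, self.abs_err + other.abs_err)

    def __sub__(self, other: "Measured") -> "Measured":
        # Difference: absolute errors still add (worst case).
        return Measured(self.value - other.value, self.abs_err + other.abs_err)

    def __mul__(self, other: "Measured") -> "Measured":
        # Product: relative errors add (first-order approximation).
        value = self.value * other.value
        return Measured(value, abs(value) * (self.rel_err + other.rel_err))

    def __truediv__(self, other: "Measured") -> "Measured":
        # Quotient: relative errors add (first-order approximation).
        value = self.value / other.value
        return Measured(value, abs(value) * (self.rel_err + other.rel_err))


# Two integer-valued intensities, each with an absolute error of 0.5.
a = Measured(20.0, 0.5)
b = Measured(15.0, 0.5)
ratio = a / b  # relative error is 0.5/20 + 0.5/15
```

Because every operation returns a new `Measured`, longer expressions such as determinants built from products and sums carry their maximum error along automatically.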
Purity correction of iTRAQ™ labels and error estimation
The iTRAQ™ reagent batches supplied by ABI are provided with sixteen purity values. These indicate the percentages of each reporter ion that have masses differing by −2, −1, +1 and +2 Da from the nominal reporter ion mass due to isotopic variants. Following the method proposed formerly [14], we use this information to correct the values of each reporter ion to account for the losses to and gains from other reporter ions. This results in simultaneous equations that can be framed such that they can be solved by applying Cramer's rule. This is where we extend the published method by means of error propagation: the relative error of the true reporter intensity W_{i}, with i ∈ {114, 115, 116, 117}, is obtained by propagating the intensity errors through the determinants used in Cramer's rule.
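The purity correction can be illustrated with a small Python sketch that solves the 4 × 4 system by Cramer's rule. It is only an illustration under stated assumptions: the purity fractions in `PURITY` are made up (real values come from the certificate of the reagent batch), the function names are hypothetical, and the error propagation that Quant adds on top is left out for brevity.

```python
LABELS = (114, 115, 116, 117)
OFFSETS = (-2, -1, 1, 2)

# Hypothetical purity fractions per label for the mass offsets above.
PURITY = {
    114: (0.000, 0.010, 0.059, 0.002),
    115: (0.000, 0.020, 0.056, 0.001),
    116: (0.000, 0.030, 0.045, 0.001),
    117: (0.001, 0.040, 0.035, 0.001),
}


def mixing_matrix(purity):
    """a[row][col]: fraction of the signal of label col observed in channel row."""
    n = len(LABELS)
    a = [[0.0] * n for _ in range(n)]
    for col, label in enumerate(LABELS):
        fractions = purity[label]
        a[col][col] = 1.0 - sum(fractions)  # share that stays at the nominal mass
        for offset, fraction in zip(OFFSETS, fractions):
            row = col + offset
            if 0 <= row < n:  # signal shifted outside 114-117 is lost
                a[row][col] = fraction
    return a


def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4 x 4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for col, entry in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * entry * det(minor)
    return total


def cramer_solve(a, rhs):
    """Solve a * w = rhs by Cramer's rule: w_i = det(a_i) / det(a)."""
    d = det(a)
    if d == 0:
        raise ValueError("singular purity matrix")
    solution = []
    for i in range(len(a)):
        # Replace column i of a by the right-hand side.
        a_i = [row[:i] + [rhs[k]] + row[i + 1:] for k, row in enumerate(a)]
        solution.append(det(a_i) / d)
    return solution


# Round trip: mix known "true" intensities, then recover them.
A = mixing_matrix(PURITY)
true_w = [100.0, 200.0, 300.0, 400.0]
observed = [sum(A[r][c] * true_w[c] for c in range(4)) for r in range(4)]
corrected = cramer_solve(A, observed)
```

The round trip at the end mixes known intensities with the purity matrix and checks that the correction recovers them, which is a convenient sanity test before applying the correction to measured spectra.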
In addition, we introduce an initial experiment error that is taken into consideration during calculation of peptide and especially for protein quantification. In former publications [14], a rough intensity error estimation has been proposed. We improve this by a more reliable estimation. Moreover, our method is not fixed to integer intensity values in the fragment ion spectra.
Quantification of proteins
When performing protein quantification, only unique peptides are taken into consideration, whereas peptides belonging to more than one protein sequence are only used for confirming the identification of the corresponding proteins. The ratios of the unique peptides are lognormally distributed if their count n is large enough, see figure 2. This has been previously reported for difference gel electrophoresis (DIGE) protein data [20,21]. The Shapiro-Wilk test, a powerful test of departure from normality, performed with Statistica™ (version 7.1, StatSoft Europe GmbH, Hamburg, Germany) yields W = 0.9629 and a p-value of 0.2095 for the data of the 1:3 mix. Therefore, the null hypothesis that the log-transformed data are normally distributed cannot be rejected due to the high p-value. The median of the ratios is calculated as well. In the case of a lognormal distribution, the logarithm of the median equals the mean value μ of the log-transformed and thus normally distributed peptide ratios. For large n, the median should be preferred to the mean value of the non-transformed data, because it represents the middle observation and is thus the more meaningful of the two. The median represents the protein ratio. Additionally, the protein ratio R_{P} is calculated by least-squares estimation (LSE), i.e. by minimizing the square root √(Σ_{i=1}^{n} (R_{i} − R_{P})^{2}). This yields a value with a minimal mean distance from the data points R_{i}; it can be shown that this value is the average of the points R_{i}. Both LSE and median represent the protein ratio derived from the peptide ratios. The choice of the median as protein ratio is based on the lognormal distribution of the peptide ratios and is a good choice for sufficiently large data sets. The LSE is appropriate for smaller data sets and does not depend on an underlying distribution. Both values should be nearly equal, and their difference can be regarded as an additional quality measure.
Moreover, if the peptide ratio count is large enough, the mean value μ and standard deviation σ of the log-transformed peptide ratios can be used as quality indicators, too.
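The two protein ratio estimates and the log-scale quality indicators can be sketched as follows. The function name `protein_ratio_summary` and the example ratios are hypothetical; the sketch only assumes that the least-squares estimate of a single constant is the arithmetic mean of the peptide ratios and that all ratios are positive.

```python
import math
import statistics


def protein_ratio_summary(peptide_ratios):
    """Median, LSE (arithmetic mean) and log-scale statistics of peptide ratios."""
    if len(peptide_ratios) < 2:
        raise ValueError("at least two peptide ratios are needed")
    if any(r <= 0 for r in peptide_ratios):
        raise ValueError("ratios must be positive for the log transform")
    logs = [math.log(r) for r in peptide_ratios]
    median = statistics.median(peptide_ratios)
    lse = statistics.mean(peptide_ratios)  # minimizes sum((R_i - R_P)**2)
    return {
        "median": median,
        "lse": lse,
        "difference": abs(median - lse),    # additional quality measure
        "log_mean": statistics.mean(logs),  # mu of the log-transformed ratios
        "log_sd": statistics.stdev(logs),   # sigma of the log-transformed ratios
    }


def sum_of_squares(peptide_ratios, candidate):
    """Objective of the least-squares estimation for a candidate protein ratio."""
    return sum((r - candidate) ** 2 for r in peptide_ratios)


# Made-up peptide ratios with one high value pulling the mean upwards.
example_ratios = [0.90, 0.95, 1.00, 1.10, 1.55]
summary = protein_ratio_summary(example_ratios)
```

In this made-up example the median is 1.00 while the LSE is about 1.10, so the difference of about 0.10 flags the single high ratio that a median-based protein ratio is robust against.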
Figure 2. The normal probability plot shows that a lognormal distribution fits the peptide ratio data. The transformed experimental data are plotted and lie on a line, so the data are nearly normally distributed. The x-axis denotes the inverse of the normal distribution function and the y-axis represents the sorted log-transformed values.
Implementation
The implementation was done in MATLAB® (The Mathworks, Ismaning, Germany), version 6.1. The program files are contained in additional file 1, the detailed documentation in additional file 2. We provide example data in additional file 3.
Additional File 1. Archive containing the MATLAB® scripts. This file contains the MATLAB® scripts of Quant that can be executed with MATLAB®.
Additional File 2. Documentation of the MATLAB® scripts. This file contains the full documentation of the Quant scripts and explains the usage of the MATLAB® scripts.
Additional File 3. Archive containing the example data. This file contains the example data that can be processed with the Quant scripts.
The quantification values are calculated by the script startquantitraq. It executes quantitraq that performs the iTRAQ™ quantification. The integration is done by calling sumquantitraq (sum of intensities) or flquantitraq (area calculation by trapezoids), depending on the user's choice. The function pcquantitraq implements the purity correction and is called by quantitraq. The peptide ratios are calculated by raquantitraq. The list of files being processed in batch is provided in the file names01.txt. These files contain the uncentroided MS/MS spectra in DTA format. We recommend not using centroided MS/MS spectra. Mascot™ results could be exported by using mres2x [17], for instance. The script startexperror performs the calculation of the experiment error by execution of the functions experror that calls logtrans, qplot and killzero. By running startplotitraq, the errors are plotted and the boxplots are created by iteratively calling plotitraq. The result files listed in the file names02.txt are processed.
Results
Peptide quantification based on fragment ion spectra
In contrast to other quantification software such as iTracker [14] or RelEx [22], Quant is able to cope with just one signal per iTRAQ™ reporter ion. We allow the choice between two methods of integration: trapezoid integration as implemented in existing software tools and the sum of intensities (see below). We introduce a constant minimal peak width b that is applied if only one peak is found, in order to allow calculation of a peak area A when trapezoid integration has been chosen. The error estimation in this single-peak case is as follows: f_{A} = f_{i} ⇒ e_{A} = e·b. Otherwise, the absolute error of trapezoid integration of peaks {(x_{i}, y_{i})} belonging to the mass spectrum S = {(x_{1}, y_{1}),...,(x_{n}, y_{n})} is e_{A} = e·(x_{n} − x_{1}). The absolute error when summing up the intensities is e_{S} = n·e.
Relative quantification is performed by calculation of peptide ratios. Each ratio is calculated as the quotient R_{i,j} of the true reporter intensities W_{i} and W_{j}, based on area or sum, for example R_{114,117} = W_{114} / W_{117}. Consequently, the implied relative error of the quotient is f_{R} = f_{W_{i}} + f_{W_{j}}, and the absolute error is e_{R} = R_{i,j} · f_{R}.
The effects of the chosen integration method are as follows. The quadratic effect of the integration process that comes from the area calculation does not disappear when quotients are formed to calculate ratios. Consider the example of two labels with two peaks each: P_{A} = {(114.0000, 6.0000), (114.2000, 9.0000)} (label A) and P_{B} = {(115.0000, 4.0000), (115.2000, 16.0000)} (label B), see figure 3. The summed intensities are 15.0000 and 20.0000, respectively. The trapezoid integrals amount to 1.5000 (A) and 2.0000 (B). The corresponding ratios are 1.3333 (summed) and 1.3333 (area). If an additional peak had been acquired at, for example, 115.0600 m/z with an intensity of 7.6000, the area of B would not change, but the summed intensity would change to 27.6000, yielding a ratio of 1.8400. This is a difference in relative quantification of about 38%. Therefore, we recommend using the sum of intensities instead of calculating an underlying area.
Figure 3. The example peaks of two labels A and B are depicted. The area of the peaks is not proportional to the sum of intensities if peak distances and peak count are not equal. This affects the quantification results and yields notable differences. The summed intensities of the example above are 15.0000 and 20.0000, respectively. The trapezoid integrals amount to 1.5000 (A) and 2.0000 (B). The corresponding ratios are 1.3333 (summed) and 1.3333 (area). Suppose an additional peak at 115.0600 m/z with an intensity of 7.6000. Then, the area of B would be identical, whereas the summed intensity would change to 27.6000, yielding a ratio of 1.8400. This is a difference in relative quantification of about 38%. The area-based ratio would not reflect the ion count of the three peaks detected by the mass spectrometer, but the sum-based ratio does, as the intensity of each signal represents the amount of ions detected and counted by the mass spectrometer.
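The worked example above can be reproduced with a short Python sketch. The helper names are hypothetical and the code is not taken from Quant; it simply compares the sum of intensities with trapezoid integration for the peaks given in the text.

```python
def summed_intensity(peaks):
    """Sum of the intensities of (m/z, intensity) pairs."""
    return sum(intensity for _, intensity in peaks)


def trapezoid_area(peaks):
    """Area under the piecewise linear curve through the peaks, sorted by m/z."""
    pts = sorted(peaks)
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


label_a = [(114.0, 6.0), (114.2, 9.0)]
label_b = [(115.0, 4.0), (115.2, 16.0)]
# The extra peak lies exactly on the line between the two peaks of label B,
# so it leaves the trapezoid area unchanged but adds to the summed intensity.
label_b_extra = label_b + [(115.06, 7.6)]

ratio_sum = summed_intensity(label_b) / summed_intensity(label_a)  # about 1.3333
ratio_area = trapezoid_area(label_b) / trapezoid_area(label_a)     # about 1.3333
ratio_sum_extra = summed_intensity(label_b_extra) / summed_intensity(label_a)  # about 1.84
ratio_area_extra = trapezoid_area(label_b_extra) / trapezoid_area(label_a)     # about 1.3333
relative_change = ratio_sum_extra / ratio_sum - 1.0  # about 0.38
```

The area-based ratio is blind to the third peak because that peak sits on the straight line between its neighbours, whereas the sum-based ratio counts its ions.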
These distorting effects of the integration method are independent of the peak-picking method (centroid, Gaussian peak detection etc.) that is applied by the data extraction software processing the raw data of the mass spectrometer. Quant itself uses MS/MS data extracted by other means and is therefore independent of any peak-picking method. Moreover, it is independent of the mass spectrometer manufacturer and of the controlling software.
Quant integrates an "experiment error" for protein quantification, i.e. a shift of peptide ratios that indicates the overall protein quantification. Previous studies have shown by plotting the ratio distribution of the proteins that most proteins of a sample are not regulated [23,24]. Therefore, the distribution of peptide ratios obtained by a quantification experiment should scatter around a value of one. If this is not the case, the shift indicates an error that occurred during sample preparation in the laboratory. Consider the example of mixing two samples 1:1. The protein concentration has to be known. This can be determined by a BCA [25,26] or Bradford assay [27], but neither is precise, which is also true of other colorimetric protein assays [28]. Thus no exact 1:1 mix can be guaranteed during sample preparation.
Moreover, errors could occur during pipetting, particularly when handling small amounts of protein sample. In order to quantify this shift, the distribution of the peptide ratios must be analyzed in detail.
Firstly, the type of distribution must be determined. We found all peptide ratios to be lognormally distributed, as reported previously for DIGE protein data [20,21]. The median was chosen as parameter, because the log-transformed median equals the mean of the log-transformed, normally distributed data. Besides the observation that biological data are mostly lognormally distributed, peptide quantification shows a distribution that rises steeply on the left and is skewed to the right. This can be explained by the fact that in peptide quantification, the ratios are greater than zero but very seldom take large values. Usually, they vary around 1. The lognormal distribution can be verified by a normal probability plot, as shown in figure 2.
The definition of the median in conjunction with the multiplicative characteristic of the lognormal distribution implies that the shift in question is multiplicative, too. This factor is the reciprocal of the median. All peptide ratios are multiplied with this value. Consequently, the median of the shifted peptide ratios is then near one.
Scaling the ratios by the median m affects the error estimation. The absolute error changes from e_{R} to e_{R}/m. The relative error f_{m} of m implies a relative error of f = f_{R} + f_{m} when calculating the quotients.
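The correction for the experiment error can be sketched in Python as follows. The function name and the example values are hypothetical; the sketch assumes that every peptide ratio is divided by the median m of all ratios and that the relative errors of ratio and median add, f = f_R + f_m, as described above.

```python
import statistics


def correct_experiment_error(ratios_with_errors, median_rel_err=0.0):
    """Divide all peptide ratios by their median and propagate the errors.

    ratios_with_errors: list of (ratio, absolute error) pairs.
    median_rel_err: assumed relative error f_m of the median.
    Returns the median and a list of (shifted ratio, absolute error) pairs.
    """
    m = statistics.median(r for r, _ in ratios_with_errors)
    corrected = []
    for ratio, abs_err in ratios_with_errors:
        rel_err = abs_err / ratio + median_rel_err  # f = f_R + f_m
        shifted = ratio / m
        corrected.append((shifted, shifted * rel_err))
    return m, corrected


# Made-up ratios of a nominal 1:3 mix, each with a relative error of 1%.
measured = [(2.7, 0.027), (3.0, 0.030), (3.3, 0.033)]
experiment_error, shifted = correct_experiment_error(measured)
```

After the correction the median of the shifted ratios is one, so deviations of individual proteins can be read directly as regulation relative to the bulk of the sample.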
Multiple labelling of peptides has no effects on the quantification results, because the peptides being compared have identical sequences, and thus are equally labelled.
Protein quantification and visualization
The in-house implementation of a pipeline that integrates Quant accepts peptide identifications from either Mascot™ [29] or Sequest™ [30] and integrates the tool mres2x [17] in order to preserve the linkage between the peptide identification and the corresponding MS/MS spectra.
Usually, only unique peptides are taken into consideration, whereas peptides pointing to more than one protein sequence are only used for improving protein identification as well as for verification and confirmation of identifications (see figure 4).
Figure 4. The amino acid sequence of the protein bovine serum albumin (P02769) is depicted as an example for sequence coverage. The uniquely identified peptide sequences are marked in red, whereas the blue marked regions are confirmed by nonunique peptides. The sequence coverage of the example shown above is 33.61%, the covered mass is 33.27%.
Visualization of protein quantification is done by providing a boxplot of the peptide ratios, as depicted in figures 5, 6, 7, 8, 9. This includes the first and third quartile of the data, i.e. the 25% and 75% quantile. The median is depicted by a horizontal red line. The whiskers mark the data range and are limited to 150% of the interquartile range (IQR). Outliers are marked in red. The IQR represents a quality measure as it quantifies the scatter of the data independent of the underlying distribution.
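The numbers behind such a boxplot can be computed with the Python standard library, as in the following sketch. The function name and the example ratios are hypothetical, and the quartile definition (`method="inclusive"` of `statistics.quantiles`) is an assumption; MATLAB® may interpolate quartiles slightly differently.

```python
import statistics


def boxplot_stats(ratios, whisker=1.5):
    """Quartiles, IQR, whisker ends and outliers of a list of peptide ratios."""
    q1, median, q3 = statistics.quantiles(ratios, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [r for r in ratios if low_fence <= r <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme data points within the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [r for r in ratios if r < low_fence or r > high_fence],
    }


# Made-up peptide ratios with one clearly deviating value.
stats = boxplot_stats([0.90, 0.95, 1.00, 1.05, 1.10, 1.52])
```

For these made-up values the IQR is 0.125, the upper fence lies at 1.275, and the ratio 1.52 is reported as an outlier without affecting the median.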
Figure 5. Quantification results of the protein BGAL_ECOLI (P00722). Samples were mixed in a ratio of 1:1. Figure a) shows the standard boxplot of the peptide ratios. The median is 1.0508. Figure b) depicts the protein ratio calculated by the LSE value of the single peptide ratios, 0.9106. The red line indicates the LSE value, i.e. the protein ratio calculated from the relative peptide abundances. The blue crosses mark the corresponding errors of each peptide ratio, the red ones the peptide ratios.
Figure 6. Quantification results of the protein TRFE_HUMAN (P02787). Samples were mixed in a ratio of 1:1. Figure a) shows the standard boxplot of the peptide ratios. The median is 0.9984. Outliers are marked in red. Figure b) depicts the protein ratio calculated by the LSE value of the single peptide ratios, 0.9673. The red line indicates the LSE value, i.e. the protein ratio calculated from the relative peptide abundances. The blue crosses mark the corresponding errors of each peptide ratio, the red ones the peptide ratios.
Figure 7. Quantification results of the protein ALBU_BOVIN (P02769). Samples were mixed in a ratio of 1:3. Figure a) shows the standard boxplot of the peptide ratios. The median is 0.9424. Figure b) depicts the protein ratio calculated by the LSE value of the single peptide ratios, 0.9742. The red line indicates the LSE value, i.e. the protein ratio calculated from the relative peptide abundances. The blue crosses mark the corresponding errors of each peptide ratio, the red ones the peptide ratios.
Figure 8. Quantification results of the protein BGAL_ECOLI (P00722). Samples were mixed in a ratio of 1:3. The sequence of the outlying peptide 7 is APLDNDIGVSEATR with a ratio of 1.5171 ± 0.0121. Figure a) shows the standard boxplot of the peptide ratios. The median is 0.9674. Outliers are marked in red. They do not distort the calculation of the protein ratio. Figure b) depicts the protein ratio calculated by the LSE value of the single peptide ratios, 0.9936. The red line indicates the LSE value, i.e. the protein ratio calculated from the relative peptide abundances. The blue crosses mark the corresponding errors of each peptide ratio, the red ones the peptide ratios.
Figure 9. Quantification results of the protein TRFE_HUMAN (P02787). Samples were mixed in a ratio of 1:3. Figure a) shows the standard boxplot of the peptide ratios. The median is 1.0511. Figure b) depicts the protein ratio calculated by the LSE value of the single peptide ratios, 1.0526. The red line indicates the LSE value, i.e. the protein ratio calculated from the relative peptide abundances. The blue crosses mark the corresponding errors of each peptide ratio, the red ones the peptide ratios.
As a measure of quality, the confidence interval [μ − k·σ, μ + k·σ] can be used for the log-transformed data. Additionally, the standard deviation σ of these data can be used as an indicator of the quantification quality in the case of a large peptide count per protein. As a numeric tool for measuring the overall quality of the data used, the root-mean-square value (RMS) can be applied to the relative errors f_{i} of the n peptide quantifications: RMS = √((1/n) · Σ_{i=1}^{n} f_{i}^{2}).
The smaller the RMS value, the lower the level of uncertainty. This measure is preferable to the norm of an error vector, because the dimensions of the error vectors are not identical. Moreover, the RMS is appropriate for small data sets.
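The difference between the RMS and the norm of an error vector can be made concrete with a short Python sketch. The function names and the error values are hypothetical; the sketch uses the standard definition of the root mean square, i.e. the square root of the mean of the squared relative errors.

```python
import math


def rms(values):
    """Root mean square: square root of the mean of the squared values."""
    if not values:
        raise ValueError("at least one value is required")
    return math.sqrt(sum(v * v for v in values) / len(values))


def euclidean_norm(values):
    """Length of the error vector; grows with the number of peptides."""
    return math.sqrt(sum(v * v for v in values))


# Two hypothetical proteins with the same relative error per peptide
# but different peptide counts.
errors_small = [0.02] * 3
errors_large = [0.02] * 12

rms_small, rms_large = rms(errors_small), rms(errors_large)
norm_small, norm_large = euclidean_norm(errors_small), euclidean_norm(errors_large)
```

Both proteins obtain the same RMS of 0.02, while the Euclidean norm doubles for the protein with four times as many peptides, which is why the RMS is the comparable measure across proteins.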
Experimental results
The standard protein mix supplied with the iTRAQ™ kit was used for testing our software tool Quant. The contents and amount of proteins are known and this protein mix is generally used to establish the iTRAQ™ workflow in laboratories.
Furthermore, we always test new software with a generalized and known sample. By doing this, the functionality and applicability can be easily shown.
The standard protein mix of iTRAQ™ consists of bovine serum albumin (Bos taurus), β-galactosidase (E. coli), α-lactalbumin (Bos taurus), β-lactoglobulin (Bos taurus), lysozyme (Gallus gallus) and apo-transferrin (Homo sapiens). The results acquired by following our standardised protein identification procedure, which comprises LC-MS/MS and the database search algorithm Mascot™ 2.0, are shown in tables 1 and 2. These data show that all expected proteins have been identified. However, several homologous proteins are detected, because the complete SwissProt database was used for identification. In tables 1 and 2, all peptides belonging to more than one protein are marked in red. To visualize the unique and non-unique peptides of a protein, an example is shown in figure 4.
Table 1. Identified proteins of the 1:1 sample mix
Table 2. Identified proteins of the 1:3 sample mix.
The list of identifications was then submitted to Quant for quantification. As only unique peptides can be used for reliable quantification, the software Quant implements a filter that removes all non-unique peptides. In a real, non-standard sample this is necessary, as otherwise protein isoforms can neither be distinguished correctly nor quantified reliably (see tables 3 and 4 as well as tables 5 and 6, respectively).
Table 3. Quantification results of the sample 1:1 mix.
Table 4. Quantification results of the sample 1:1 mix.
Table 5. Quantification results of the sample 1:3 mix.
Table 6. Quantification results of the sample 1:3 mix.
Running the software Quant with this filter, quantification results were calculated only for the proteins BGAL_ECOLI, TRFE_HUMAN and ALBU_BOVIN, as only non-unique peptides were detected for the other proteins. These quantification results are presented in tables 4 and 6. When a known standard protein mix with proteins from different organisms is used, the non-unique peptides are accessible by deactivating that filter. This can be avoided in a real sample, because the organism is usually known and the database search can be performed with a database containing only the proteins of this organism or by using a taxonomy filter as supported by Mascot™. The quantification results obtained with the filter (unique peptides only) and without it (unique and non-unique peptides) are summarized in tables 4, 6 and 3, 5, respectively. These tables list not only the results from our software Quant, but also the output of the software ProQuant™, which implements no restriction to unique peptides. Data obtained from Mascot™ 2.2 are presented, too. The absolute protein quantification ratios yielded by Quant, Mascot™ 2.2 and ProQuant™ are comparable. As shown in tables 3 and 5, including non-unique peptides distorts the quantification results. The experiment error of Quant (bias of ProQuant™) indicates the overall protein mixing ratio. The protein ratio results are normalized by this factor. The visualization of the protein results for BGAL_ECOLI, TRFE_HUMAN and ALBU_BOVIN is shown in figures 5, 6, 7, 8, 9. No peptides were detected that underwent N-terminal cyclization.
Discussion
Comparison with other software tools
In contrast to other peptide quantification software, which applies trapezoidal or other integration methods for area calculation, we decided to introduce the sum of intensities into MS-based quantification. We have shown that integration changes the relative quantification of peptides and proteins (see figure 3). Similar changes arise when absolute quantification is performed. The size of the effect depends on the precision, resolution and calibration of the mass spectrometer, but it is not zero. Consequently, Quant is able to cope with just one signal per iTRAQ™ reporter ion. For the integration of peak areas, we introduced a minimum peak width in order to provide this feature in that context. The sum of the signal intensities reflects the ion count recorded by the mass spectrometer more precisely than an integrated peak area, as shown in figure 3. Moreover, when intensities are summed, the problem of a single reporter signal does not arise. The peaks are filtered by applying a threshold to the peak intensity. This threshold is a user option, as the noise in mass spectra depends on the mass spectrometer used.
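The difference between the two intensity measures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Quant implementation: the peak list layout (pairs of m/z and intensity), the function names and the numeric values are hypothetical.

```python
def sum_of_intensities(peaks, threshold=0.0):
    """Sum the intensities of all reporter peaks above a noise threshold.

    peaks: list of (mz, intensity) pairs belonging to one reporter ion.
    """
    return sum(intensity for _, intensity in peaks if intensity > threshold)


def trapezoid_area(peaks):
    """Trapezoidal integration over m/z; yields 0 for a single data point."""
    pts = sorted(peaks)
    return sum((m2 - m1) * (i1 + i2) / 2.0
               for (m1, i1), (m2, i2) in zip(pts, pts[1:]))


# Hypothetical reporter signals around m/z 114.
single = [(114.1, 200.0)]
double = [(114.0, 100.0), (114.2, 300.0)]

single_sum = sum_of_intensities(single)    # the single signal is still counted
single_area = trapezoid_area(single)       # no area without a peak width
double_sum = sum_of_intensities(double)
double_area = trapezoid_area(double)       # depends on the m/z spacing
```

With a single reporter signal the trapezoid rule has no width to integrate over, which is why an integration-based approach needs an additional rule such as a minimum peak width, whereas the sum is defined for any number of signals.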
We improved on the error estimation of other approaches [14] by adding a precise error indication. Instead of taking only the maximum peak intensity as the basis of error estimation, as formerly proposed [14], we use all peaks belonging to an iTRAQ™ reporter for the error calculation. Additionally, we propagate the effect of the purity correction into the error estimate. When relative quantification is calculated, we propagate the estimated errors and use them to calculate a quantification error. This is the maximum possible error and can be used as a quality indicator. If reporter peaks are missing for a label, the relative quantification cannot be performed. Thus, no zero values appear in the peptide ratio lists of the proteins, and the log-transformation can be performed in all cases.
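How a maximum error can be propagated into a peptide ratio may be sketched as follows. The sketch assumes the usual first-order rule that the relative errors of numerator and denominator add up for a quotient; whether Quant uses exactly this formula is an assumption, and the function name and the numbers are hypothetical.

```python
def ratio_with_max_error(a, da, b, db):
    """Ratio a/b of two reporter intensities with its maximum absolute error.

    a, b   : reporter intensities (after any purity correction)
    da, db : estimated absolute errors of a and b
    Returns None if a reporter signal is missing, so that no zero ratios
    enter the later log-transformation.
    """
    if a <= 0 or b <= 0:
        return None
    ratio = a / b
    relative_error = da / a + db / b  # worst case: relative errors add up
    return ratio, ratio * relative_error


result = ratio_with_max_error(300.0, 15.0, 100.0, 10.0)
missing = ratio_with_max_error(0.0, 0.0, 100.0, 5.0)
```

Returning no value for a missing reporter, rather than a ratio of zero, keeps every remaining ratio strictly positive, so its logarithm is always defined.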
Multiple MS/MS spectra belonging to the same peptide sequence are not merged into one quantification value. We regard them as single measurements that are analyzed separately. Thus, Quant can distinguish modified from unmodified peptides. Moreover, modified peptides might appear as outliers in the boxplot and can be analyzed separately. Some examples are included in figures 5, 6, 7, 8, 9. If outliers are detected, the amino acid sequence should be analyzed in detail, and in some cases a new database search should be performed in order to confirm these sequences and to seek out further post-translational modifications, e.g. peptides that are not iTRAQ™-labelled because of N-terminal cyclization or acetylation of primary amino groups.
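Flagging boxplot outliers among the peptide ratios of one protein could look like the following Python sketch. It assumes the common boxplot convention of whiskers at 1.5 times the interquartile range and works on log2-transformed ratios; both choices, as well as the function name and the example ratios, are assumptions made for illustration and are not taken from Quant.

```python
import math
from statistics import quantiles


def boxplot_outliers(peptide_ratios, whisker=1.5):
    """Return the peptide ratios lying outside the boxplot whiskers."""
    logs = [math.log2(r) for r in peptide_ratios]
    q1, _, q3 = quantiles(logs, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [r for r, x in zip(peptide_ratios, logs) if x < low or x > high]


# Hypothetical peptide ratios of one protein; 4.0 might stem from a
# modified peptide and would be inspected separately.
flagged = boxplot_outliers([0.9, 1.0, 1.0, 1.1, 1.05, 0.95, 4.0])
```

The flagged ratios are only reported, not removed, in line with the idea that outliers are candidates for closer inspection of the sequence and its modifications.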
Quant uses MS/MS data extracted by other means and is therefore independent of any peak-picking method. Moreover, it is independent of the software controlling the mass spectrometer.
In contrast to Peakardt.FindPairs [31], which uses the mean value of peptide ratios for protein quantification, we use the median. This is statistically sound, as peptide ratios are lognormally distributed (see figure 2) and the mean value therefore does not equal the median. Moreover, the median is robust against outliers, which would affect the mean value. There is therefore no need to eliminate or reject outliers. In addition, Quant is able to point the user to outliers that should be analyzed further.
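The effect of a single outlier on mean and median can be seen in a small Python example. The peptide ratios are hypothetical and the function names are chosen for illustration only; the geometric mean is included merely as a point of comparison in log space.

```python
import math
from statistics import mean, median


def protein_ratio(peptide_ratios):
    """Protein ratio taken as the median of the peptide ratios."""
    return median(peptide_ratios)


def geometric_mean(peptide_ratios):
    """Back-transformed mean of the log ratios, shown for comparison."""
    logs = [math.log(r) for r in peptide_ratios]
    return math.exp(mean(logs))


# Hypothetical peptide ratios of a 1:1 protein; 8.0 plays the outlier.
ratios = [0.5, 1.0, 1.0, 2.0, 8.0]
by_median = protein_ratio(ratios)     # stays at the bulk of the data
by_mean = mean(ratios)                # pulled upwards by the outlier
by_geo_mean = geometric_mean(ratios)  # less affected, but still shifted
```

In this example the median stays at 1.0 while the arithmetic mean is 2.5, so only the median reports the expected 1:1 ratio without any outlier rejection.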
As a numeric tool for estimating the quantification quality, we introduce the root-mean-square value (RMS) into protein quantification. This value is calculated from the relative errors of the peptide ratios. In contrast to quality estimation by applying the standard deviation to the non-transformed data, the RMS is independent of the number of data points. Calculating the standard deviation requires enough data points to make a sound assumption about the underlying distribution of the data. Other tools, such as Peakardt.FindPairs [31], use the standard deviation σ of the non-transformed data as a quality measure. That approach uses σ and the Chebyshev inequality as the basis for identifying outliers. This is needed for Peakardt.FindPairs because the mean value, which is sensitive to outliers, is used as the parameter for protein quantification. Had the median been chosen, this problem would not occur.
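A minimal Python sketch of an RMS quality value follows. It uses the textbook definition of the root mean square applied to the relative errors of the peptide ratios; that Quant uses exactly this form is an assumption, and the function name and the numbers are hypothetical.

```python
import math


def rms(relative_errors):
    """Root mean square of the relative errors of the peptide ratios."""
    if not relative_errors:
        raise ValueError("at least one relative error is required")
    return math.sqrt(sum(e * e for e in relative_errors) / len(relative_errors))


# Hypothetical relative errors of the peptide ratios of one protein.
quality_few = rms([0.1])
quality_many = rms([0.1, 0.1, 0.1])
quality_mixed = rms([0.3, 0.4])
```

Because the sum of squares is divided by the number of values, a protein with one peptide ratio and a protein with many peptide ratios of the same relative error receive the same RMS.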
Mascot™ 2.2 provides an analysis of iTRAQ™ data that is described online [32]. It employs the lognormal distribution. We could show that peptide ratios follow a lognormal distribution and that, in consequence, the Shapiro-Wilk test is the appropriate choice. We suggest not relying on data with fewer than 5 observations when using this test; an upper limit for this procedure does not exist [33]. Mascot™ does not provide an experiment error or a bias in the result display. We could show that the iTRAQ™ evaluation of Mascot™ 2.2 is based on statistically correct and appropriate assumptions.
ProteinPilot™ itself uses the same statistical approach as ProQuant™, but restricts peptides to unique ones. According to the information available with the trial version, the software estimates the experiment error (bias correction) from at least 20 protein ratios, although the median is applied. In contrast to Quant, which makes use of the median and the LSE, ProteinPilot™ calculates the protein ratio as a weighted average. Similar to our approach, ProteinPilot™ yields a quality measure (error factor) that is derived from the 95% confidence interval, which is calculated from the standard deviation in log space.
In contrast to other quantification software, which is often restricted to the use of only one protein identification algorithm, such as Mascot™ (Mascot™ 2.2), Sequest™ (RelEx – no iTRAQ™ capability), ProID™ (ProQuant™) or Paragon™ (ProteinPilot™), our method named Quant is independent of the identification algorithm. Moreover, Quant implements the purity correction including error propagation and precise error estimation. Additionally, we give reasons for an appropriate manner of intensity calculation as preprocessing for peptide ratio analysis.
The data presented in tables 3, 4 suggest that including non-unique peptides in the calculation of protein quantification impairs the results. The results generated by Quant are consistent with those produced by ProQuant™ as well as with Mascot™ 2.2. Because of the precise error propagation and the adequate visualization, the data obtained by using Quant are reliable.
Conclusion
We have shown that relative quantification can be performed on data generated by tandem MS and iTRAQ™. We presented an analysis method named Quant that is capable of calculating precise data, as has been shown by application to the protein standard mix supplied with the iTRAQ™ kit. The protein ratios of this standard have been calculated precisely from MS/MS spectra of the identification results.
We showed that restriction of the data evaluation to unique peptides is the only way of obtaining reliable quantification results at the protein level. Identification of unique peptides can be easily automated. Moreover, Quant is independent of the underlying protein identification software.
We have shown that a lognormal distribution fits the data of relative peptide quantification by applying the Shapiro-Wilk test to the log-transformed data. Outliers can be identified with appropriate statistical tools, i.e. distribution analysis, boxplot, median, LSE and RMS. These are helpful as quality measures. We replaced peak area integration by the sum of intensities, yielding reliable quantification results.
The methods presented here scale well with the protein and peptide ratios. The quality of the results yielded by Quant does not depend on the peptide or protein ratios, but rather on the quality of the MS/MS experiment, the protein identification and the MS/MS spectra; the scale of the signal intensities is especially important. Therefore, and as supported by the statistically sound approach, the dynamic range of Quant is not limited by its inherent methods, in contrast to the instrumental methods. Moreover, Quant provides a precise quality measure of the protein quantification by means of the RMS value.
The presented method is expandable to the 8-plex iTRAQ™ [34], as it is independent of the number of different labels.
Our data analysis method is more robust than other published software tools. Quant demonstrates improvements in peptide and protein quantification using iTRAQ™. Precise quantification data can be obtained when using error propagation and adequate visualization in conjunction with consideration of an experiment error. Quant is shown to generate results that are consistent with those produced by ProQuant™ and Mascot™ 2.2, thus validating these systems.
Availability and requirements
The MATLAB® program scripts are freely available upon request from the authors and via http://www.proteinms.de and http://sourceforge.net/projects/protms/ under the GNU Lesser General Public License. A MATLAB® installation is required to execute the scripts.
List of abbreviations used
Å: Ångström
ABI: Applied Biosystems/MDS Sciex
ACN: acetonitrile
amu: atomic mass unit
BCA: bicinchoninic acid
Da: Dalton
DIGE: difference gel electrophoresis
DNA: deoxyribonucleic acid
DTA: file format for MS/MS spectrum data
EE: experiment error
EF: error factor of ProQuant™
FA: formic acid
ID: inner diameter
Id: database identifier of a protein
IQR: interquartile range
iTRAQ™: isobaric tag for relative and absolute quantitation
LSE: least squares estimator
μm: micrometre
mm: millimetre
mM: millimolar
MMTS: methyl methanethiosulfonate
MS: mass spectrometry
MS/MS: tandem mass spectrometry
NHS: N-hydroxysuccinimide
NN: No value available
pVal: p-value
RMS: root-mean-square value
SCX: strong cation exchange
Authors' contributions
AB initiated the project and implemented the program in the laboratory. DA implemented the MATLAB® scripts. MF and DA introduced precise error estimations and statistics to the project. AS and SP conducted the experiments and contributed with ideas and discussions. AB, SP and DA contributed equally to the manuscript. All authors have read and approved the final manuscript.
Acknowledgements
This work was supported by the Deutsche Forschungsgemeinschaft (FZT 82).
References

1. Boehm AM, Grosse-Coosmann F, Sickmann A: Command Line Tool for Calculating Theoretical MS Spectra for Given Sequences. Bioinformatics 2004, 20:2889-2891.
2. NCBI: National Center for Biotechnology Information. [http://www.ncbi.nih.gov/]
3. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Mischoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT Protein Knowledgebase and Its Supplement TrEMBL in 2003. Nucleic Acids Research 2003, 31:365-370.
4. Wilkins MR, Sanchez JC, Gooley AA, Appel RD, Humphery-Smith I, Hochstrasser DF, Williams KL: Progress with Proteome Projects: Why All Proteins Expressed by a Genome Should Be Identified and How to Do It.
5. Krijgsveld J, Heck AJR: Quantitative Proteomics by Metabolic Labelling with Stable Isotopes. Drug Discovery Today 2004, 3:S11-S15.
6. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R: Quantitative Analysis of Complex Protein Mixtures Using Isotope-Coded Affinity Tags. Nature Biotechnology 1999, 17:994-999.
7. Cavalieri D, Filippo CD: Bioinformatic Methods for Integrating Whole-Genome Expression Results into Cellular Networks. Drug Discovery Today 2005, 10:727-734.
8. Aebersold R, Mann M: Mass Spectrometry-Based Proteomics. Nature 2003, 422:198-207.
9. Aebersold R, Goodlett DR: Mass Spectrometry in Proteomics. Chemical Reviews 2001, 101:269-295.
10. Pütz S, Reinders J, Reinders Y, Sickmann A: Mass Spectrometry-Based Peptide Quantification: Applications and Limitations. Expert Review of Proteomics 2005, 2:381-392.
11. Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, Purkayastha S, Juhasz P, Martin S, Bartlet-Jones M, He F, Jacobson A, Pappin DJ: Multiplexed Protein Quantitation in Saccharomyces cerevisiae Using Amine-reactive Isobaric Tagging Reagents. Molecular & Cellular Proteomics 2004, 3:1154-1169.
12. Ong SE, Mann M: Mass Spectrometry-Based Proteomics Turns Quantitative. Nature Chemical Biology 2005, 1:252-262.
13. Boehm AM, Galvin RP, Sickmann A: Extractor for ESI Quadrupole TOF Tandem MS Data Enabled for High Throughput Batch Processing. BMC Bioinformatics 2004, 5.
14. Shadforth IP, Dunkley TPJ, Lilley KS, Bessant C: iTracker: For quantitative proteomics using iTRAQ(TM). BMC Genomics 2005, 6.
15. Wu KP, Lin WT, Hung WN, Yian YH, Chen YR, Chen YJ, Sung TY, Hsu WL: MassTRAQ: A Fully Automated Tool for iTRAQ-labeled Protein Quantification. Stanford, USA. Edited by Martin DC. IEEE Computer Society; 2005:157-158.
16. Applied Biosystems: Applied Biosystems iTRAQ™ Reagents Amine-Modifying Labeling Reagents for Multiplexed Relative and Absolute Protein Quantitation - Protocol. [http://docs.appliedbiosystems.com/pebiodocs/04350831.pdf]
17. Grosse-Coosmann F, Boehm AM, Sickmann A: Efficient Analysis and Extraction of MS/MS Result Data from Mascot™ Result Files. BMC Bioinformatics 2005, 6.
18. Zahedi RP, Sickmann A, Boehm AM, Winkler C, Zufall N, Schönfisch B, Guiard B, Pfanner N, Meisinger C: Proteomic Analysis of the Yeast Mitochondrial Outer Membrane Reveals Accumulation of a Subclass of Preproteins. Molecular Biology of the Cell 2006, 17:1436-1450.
19. Boehm AM, Sickmann A: A Comprehensive Dictionary of Protein Accession Codes for Complete Protein Accession Identifier Alias Resolving. Proteomics 2006, 6:4223-4226.
20. Jung K, Gannoun A, Sitek B, Apostolov O, Schramm A, Meyer HE, Stühler K, Urfer W: Statistical Evaluation Of Methods For The Analysis Of Dynamic Protein Expression Data From A Tumor Study.
21. Jung K, Gannoun A, Sitek B, Meyer HE, Stühler K, Urfer W: Analysis of Dynamic Protein Expression Data.
22. MacCoss MJ, Wu CC, Liu H, Sadygov R, Yates JR: A Correlation Algorithm for the Automated Quantitative Analysis of Shotgun Proteomics Data. Analytical Chemistry 2003, 75:6912-6921.
23. Patwardhan AJ, Strittmatter EF, Camp DG, Smith RD, Pallavicini MG: Quantitative Proteome Analysis of Breast Cancer Cell Lines Using 18O-Labeling and an Accurate Mass and Time Tag Strategy. Proteomics 2006, 6:2903-2915.
24. Kolkman A, Daran-Lapujade P, Fullaondo A, Olsthoorn MMA, Pronk JT, Slijper M, Heck AJR: Proteome Analysis of Yeast Response to Various Nutrient Limitations. Molecular Systems Biology 2006, 2.
25. Smith PK: Measurement of Protein Using Bicinchoninic Acid. US Patent 4839295, Pierce Chemical Company; 1987.
26. Smith PK, Krohn RI, Hermanson GT, Mallia AK, Gartner FH, Provenzano MD, Fujimoto EK, Goeke NM, Olson BJ, Klenk DC: Measurement of Protein Using Bicinchoninic Acid. Analytical Biochemistry 1985, 150(1):76-85.
27. Bradford MM: A Rapid and Sensitive Method for the Quantitation of Microgram Quantities of Protein Utilizing the Principle of Protein-Dye Binding. Analytical Biochemistry 1976, 72:248-254.
28. Sapan CV, Lundblad RL, Price NC: Colorimetric Protein Assay Techniques. Biotechnol Appl Biochem 1999, 29:99-108.
29. Perkins DN, Pappin DJC, Creasy DM, Cottrell JS: Probability-Based Protein Identification by Searching Sequence Databases Using Mass Spectrometry Data. Electrophoresis 1999, 20:3551-3567.
30. Eng JK, McCormack AL, Yates JR: An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. Journal of the American Society for Mass Spectrometry 1994, 5:976-989.
31. Reidegeld KA, Franke G, Hebeler R, Wiese S, Oeljeklaus S, Lakhal B, Meyer HE, Warscheid B: Peakardt.FindPairs - A Universal Software for Protein Quantitation via Stable Isotope-Labeling through Mass Spectrometry. München; 2005:S30.
32. Matrix Science Ltd.: Quantitation: Statistical procedures. [http://www.matrixscience.com/help/quant_statistics_help.html]
33. Royston P: Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 1992, 2:117-119.
34. Applied Biosystems: Multiplex Protein Quantitation using iTRAQ™ Reagents - 8-plex - Publication 114PB1501. [http://docs.appliedbiosystems.com/searchdodnum.taf?dodnum=116320]
35. Applied Biosystems: Using Pro Group Reports. [http://docs.appliedbiosystems.com/pebiodocs/00113913.pdf]