Abstract
Background:
Biological Mass Spectrometry is used to analyse peptides and proteins. A mass spectrum generates a list of measured mass to charge ratios and intensities of ionised peptides, which is called a peaklist. In order to classify the underlying amino acid sequence, the acquired spectra are usually compared with synthetic ones. Development of suitable methods of direct peaklist comparison may be advantageous for many applications.
Results:
The pairwise peaklist comparison is a multistage process composed of matching of peaks embedded in two peaklists, normalisation, scaling of peak intensities and dissimilarity measures. In our analysis, we focused on binary and intensity based measures. We have modified the measures in order to comprise the mass spectrometry specific properties of mass measurement accuracy and nonmatching peaks. We compared the labelling of peaklist pairs, obtained using different factors of the pairwise peaklist comparison, as being the same or different to those determined by sequence database searches. In order to elucidate how these factors influence the peaklist comparison we adopted an analysis of variance type method with the partial area under the ROC curve as a dependent variable.
Conclusion:
The analysis of variance provides insight into the relevance of various factors influencing the outcome of the pairwise peaklist comparison. For large MS/MS and PMF data sets the outcome of ANOVA analysis was consistent, providing a strong indication that the results presented here might be valid for many various types of peptide mass measurements.
Background
In recent years, mass spectrometry (MS) has emerged as a powerful technique to identify proteins in biological samples [14]. For their identification, proteins are usually cleaved into peptides by a protease of known and restricted cleavage specificity, e.g. trypsin. The resulting cleavage products can then be analysed by Peptide Mass Fingerprinting (PMF) [5] or subjected to MS/MS fragment ion analysis [6,7]. A PMF is a highly specific set of peptide molecular masses derived from one isolated protein. PMFs are employed to identify the analysed protein in large protein sequence databases by matching the determined peptide molecular masses to values calculated from the amino acid sequences in the database. Similarly, MS/MS spectra serve for protein identification by comparing the determined peptide fragment ion masses against predicted ones from amino acid sequence data and fragmentation characteristics of the employed MS instrumentation [8].
Before performing database searches, the MS spectra are processed and the most informative features, namely the monoisotopic peaks are extracted. The procedure consists of several steps and includes smoothing, baseline subtraction, peakextraction and monoisotopic peak determination [9,10]. A spectrum preprocessing usually requires proprietary software provided by the instrument vendors. It generates a list of mass over charge (m/z) values of the monoisotopic peaks and either the area under or the height of those peaks are obtained. The set of m/z and intensity value pairs is called a peaklist. In case of PMF datasets, the peaklists have on average 36 peak/intensity pairs compared to e.g. 100,000 data points of the unprocessed spectra.
The sensitivity and specificity of the peptide identification using database searches might be increased by several methods. These usually include calibration [1113], identification of nonpeptide peaks [12,14,15], identification and removal of lowquality spectra [16,17] or validation of the search results using machinelearning algorithms [18,19].
The subtractive analysis technique
The sensitivity and specificity of peptide and protein identification can further be increased by the pairwise comparison of the peaklists [11,2022]. Yates et al. [20] applied, as a measure of peaklist similarity, the crosscorrelation score normalised by the autocorrelation of the spectra. They demonstrated that when using this measure, the MS spectra could be correctly classified according to their peptide content even if acquired on two different instruments, namely a TripleQuadrupole Tandem (TSQ) or Quadrupole Ion Trap (LCQ) mass spectrometers.
They suggested, as part of a "subtractive analysis technique", using pairwise spectra comparisons to search MS spectra against a library of identified spectra before database searching. Tandem mass spectra (unique to an experiment) could be targeted for database searches or de novo interpretation.
A significant portion of identical peptides is analysed and identified many times even when the instrument control software attempts to prevent the repeated isolation and fragmentation of particular peptides in order to increase the diversity of acquired spectra. Gentzel et al. [11] used the crosscorrelation measure for MS/MS spectra comparison. They computed the similarity score for two parts of the spectra. If these parts exhibited a satisfying similarity score, the spectra were assumed to be identical. Tabb et al. [22] explored the performance of the normalised dotproduct (spectral angle) algorithm to identify duplicated samples. The advantage of the dotproduct measure over the crosscorrelation algorithm lies in its computational speed. Based on this measure, Tabb et al. [22] and Beer et al. [21] developed software to identify the duplicated MS/MS spectra. The analysis time saved by subtractive analysis can be used instead to perform more extensive searches in other databases, i.e. expressed sequence tag (EST) databases, or to apply computationally demanding, mutationtolerant search algorithms [23], which depend on partial spectra interpretation [2426].
Pairwise spectra comparison can also be put to use "as an informative marker to identify organisms or some other feature of an organism" [20]. For example, Svetnik and Liaw [27] used pairwise spectra comparison to detect novel outliers in largescale cosmid screening experiments [28]. They used the Pearson and Spearman correlation measures, as well as the Euclidean distance to compute the distances of the spectra, followed by sequential clustering. Serum protein and peptide fingerprints were used in diagnostic medicine to distinguish healthy individuals from those with cancer [2932].
The pairwise peaklist comparison process
Previously, only the performance of the dotproduct measure has been compared to the similarity index using MS/MS spectra of structural isomers [33]. Therefore, in our work we have reviewed a large group of dissimilarity measures and examined how these can be extended to include the mass spectrometry specific property of mass measurement accuracy. A new parameter weight of nonmatching peaks (θ) was introduced into the computation of distance measures. We have studied the Euclidean and the Manhattan distance, the covariance, the sum of agreeing intensities and the spectral angle. We have also examined the impact of the intensity scaling on the outcome of intensity based measures [34,35]. In addition, we have performed a systematic study of various intensity transformations [22] in order to determine the best variance stabilising transformation. Furthermore, we investigated quantitative measures, i.e. Huberts Γ or the relative mutual information measure [36]. The combination of these factors resulted in 96 choices of the comparison process for the binary measures and 2688 approaches for the intensity based measures. The first aim of the work presented here was to determine the pairwise peaklist comparison approach with highest sensitivity and specificity for the grouping of spectra. The second goal was to determine which factors studied had the highest effect on the outcome of the clustering, in order to foster the understanding of the pairwise peaklist comparison process. While the first goal could be easily achieved by ranking the various peaklist comparison approaches, the second goal was approached by analysis of variance (ANOVA) techniques. The partial area under the receiver operator characteristic (ROC) curve, determined for high sensitivity and specificity values was used as the dependent variable, while the various choices for the comparison process were the factors in the ANOVA.
Evaluation framework
PMF and MS/MS data represent mass spectrometric peptide peaklists. While the PMF measurement is characterised by highmass resolution, large mass range and the production of relatively few peaks, the MS/MS spectra have a lowermass resolution, a smaller mass range and a higher number of peaks. Figure 1 presents examples of peaklists for fragment ion MS/MS and PMF, respectively. We have analysed the pairwise comparison process using both datasets in order to determine if the differences of the data require different configuration of the pairwise comparison.
Figure 1. Example of a peaklist stick spectrum for fragment ion MS/MS (top panel) and PMF(bottom panel). Xaxis – mass of the peaks, Yaxis – area under the peak.
In order to determine the sensitivity and specificity of the measures used for classification, the grouping induced by the measures must be compared with the true cluster membership of the spectra. However, we did not have a dataset with a large number of groups of spectra and with identities known apriori.
Therefore, in our study we used PMF and MS/MS spectra resulting from studies of different proteomes. The identities of the spectra were determined by database searches (cf. Methods). We assumed that the database searches resulted in true identification of the peptides and proteins.
In the PMF dataset, only 176 proteins out of 668 were identified by a single spectrum, while the remaining 492 protein identifications resulted from 2160 database searches. The amount of duplicated samples in the two datasets (Table 1) was significant and recognising them could for example, significantly reduce the number of searches necessary to identify all proteins.
Table 1. Number of clusters of given cluster size N. The columns 2 and 3 describe the cluster size in the PMF and the MS/MS datasets. Number of spectra – number of peaklists submitted for database search, identified spectra – spectra assigned to a database ID with an either significant probability based Mowse score (PMFdata) or to a peptide sequence with Xcorr > 2, and an ion coverage > 20% (MS/MSdata) given a parent peptide charge z = 2. Identified proteins/peptides – the number of uniquely identified proteins or peptides. ^{A }– approximate number of spectra derived from ion fragments of peptides with charge z = 2. ^{B }– The number of spectra with charge z = 2 of the parent ion (≈ 53% of all identified spectra).
According to the protein database identifier (ID), in case of the PMF data or the peptide sequence and the same parent ion charge z = 2 for MS/MS data, we defined a peaklists pair (X, Y) to be within a cluster if it was assigned to the same protein ID or peptide sequence. Similarly, two peaklists were defined to lie between the clusters if their database IDs differed. This assignment of the peaklists pairs as within and between was recognised as the true condition status. The assignment induced by the pairwise comparison approach for a given thresholds of the discriminatory variable was compared with this true condition status and the sensitivities and specificities were computed.
The spectra in the large group of unidentified peaklists could not be used in this study. This is because we could not infer that all these spectra were derived from the same peptide/protein. Secondly, we could not assume that an unidentified spectrum was not obtained in a measurement of the same peptide as any of the spectra assigned to a database identifier or peptide sequence. The identification of a protein/peptide often fails because of a signal to noise ratio that is too small. If we would treat this group likewise the clusters formed by identified spectra we would then introduce an error during the determination of false positive/false negative rates.
Using the data, where the identities of samples were determined by database search algorithms we were able to examine whether the pairwise peaklist comparison made equal or different assignments to a group, in relation to the database search algorithm. We were not able to disclose if any of these measures had higher sensitivities and specificities than the database search algorithms used. However, these data were sufficient to expose relative differences between pairwise peaklist comparison approaches, as well as the degree by which the various factors of the comparison process influenced the outcome.
The computation of the pairwise peaklist distances was performed using the inhouse developed R [37] package msbase, which is available from the BioConductor Project [38] web page [39].
Results and Discussion
The factors of the pairwise peaklist comparison
Table 2 summarises the factors, which can influence the outcome of a pairwise peaklist comparison. The first step in the comparison of the peaklists is to determine matching and nonmatching peaks with given mass measurement accuracy. If one peak is ambiguously assigned to several peaks in the second peaklist (cf. Methods – Figure 6), the noncrossing matching can be computed. The next element to be considered is whether the mass measurement accuracy should be modelled [45,46] using Equation 2. Modelling of the mass differences between matching masses did not affect the nonmatching peaks. The influence of nonmatching peaks on the pairwise peaklist comparison was modulated by increasing or decreasing their weight by twofold (using the parameter θ in the dissimilarity equations (cf. Methods)).
Table 2. Factors considered in the comparison process and their levels. Column 1 – Factors: identification of factors, Column 2 – Levels: short summary of the levels (For more details please refer to the Methods section). Column 3 – Number: number of levels. Int. – comparisons considering the intensities; Bin. – binary measures.
The length of the aligned peaklists either equals the sum of the peaks in both peaklists minus the number of peaks matching or is userdefined. In Equation 3, we set N = 250 for the PMF dataset and N = 400 for the MS/MS dataset, which in both cases was approximately twice the length of the longest peaklist. In case of the intensity based measures the missing peak pairs were augmented by peaks of zero intensity. Further elements which affected only the intensitybased measures included the transformation and scaling of peak intensities. The distance measures (cf. Methods) were the last of the examined factors. Section 'Features of the pairwise peaklist comparison and their properties' in the Appendix section provides a descriptive analysis of the features of the pairwise peaklist comparison in order to introduce the data and to motivate the use of various factors of the peaklist comparison approach.
The intensitybased measures contained 2688 sets of factors while the binary measures covered 96 sets. To determine which of these factors were important and how they influenced the scores, we applied analysis of variance techniques (ANOVA). We first performed the ANOVA on the PMF dataset. Afterwards, we examined if the obtained linear model could be used to explain the properties of the pairwise peaklist comparison process computed on the MS/MS dataset.
The evaluation scores
In order to evaluate the capability of pairwise comparison approaches to identify peaklist pairs as being within or between cluster we used the partial area of interest under the Receiver Operating Characteristic (ROC) curve (PAUC) [44]. The ROC curve was generated by drawing the , where TP – are true positives, FN – false negatives, against the , where FP – false positives, TN – true negatives, for the same value of the discriminatory variable, i.e. the number of matching peaks as shown in Figure 2. For 4 matches we determined a specificity of 99% and sensitivity of 95%.
Figure 2. Receiver Operator Characteristic curve – The sensitivity (TPrate) is plotted against FP = 1  specificity using the number of matching peaks as the discriminatory variable. Red dashed area: sensitivityPAUC – partial area under the ROC curve for FPrate ∈ [0, 0.1]. Green dashed area: specificityPAUC – partial area under the ROC curve for sensitivities ∈ [0.9, 1].
We were particularly interested in the sensitivity of the pairwise peaklist comparison only for small values of the FPrate. Therefore, we computed the Partial Area under the ROC curve (PAUC) for 1  specificity ∈ [0, 0.1] (reddashed region Figure 2), denoted by sensitivityPAUC. Moreover, we were also interested in the specificities when high sensitivities are required (sensitivity ∈ [0.9, 1]), further abbreviated specificityPAUC. Hence, we computed the PAUC for the area indicated in Figure (2) by the greendashed region. Both sensitivityPAUC and specificityPAUC were utilised as the dependent variable in the analysis of variance.
ANOVA of the pairwise peaklist comparison approaches
The aim of the statistical analysis was to evaluate different strategies of the pairwise peaklist comparison with respect to the sensitivity and specificity partial area under the ROC curve (PAUC). A possible strategy to decide which pairwise comparison approach performs best would have been to choose the one with the largest partial area under the curve. However, we were also interested in determining the influence of and the dependence between the factors of the pairwise peaklist comparison on the outcome of the classification.
Each strategy of the pairwise peaklist comparison was defined by the combination of seven factors as given in Table 2. The whole set of pairwise peaklist comparison strategies shows a completely balanced factorial structure analogous to those used in analysis of variance (ANOVA) [42] with PAUCs as dependent variables and the specific strategies of pairwise peaklist comparison as combinations of factor levels. However, due to multimodality our data could not be transformed to approximate normality. Thus, we could not calculate Fratios and related statistical tests of significance. To assess the significance of the factors, we therefore used the relative sum of squares (%SSQ) and the relative mean sum of squares (%MSQ), defined as the ratio of the SSQ or MSQ with respect to the total SSQ or MSQ. We did not calculate Fratios or Pvalues for factors and interactions, due to the mentioned deviation from normality. A large value of the relative sums of squares (%SSQ) for a factor in Tables 3 and 4 indicates its importance for the correct classification of peaklists.
Table 3. Influence of factors specifying the pairwise peaklist comparison on partial areas under the ROC curve for binary PMF and MS/MS data. For each of the 96 pairwise comparison approaches, sensitivityPAUC (sensitivity given FPrate ∈ [0, 0.1]) and specificityPAUC (specificity given sensitivity ∈ [0.9, 1]) (Figure 2) were determined. A partitioning of sums of squares was performed analogously to analysis of variance. Column names: Factors – identification of factors; df – degrees of freedom (DF, number of factor levels – 1); %SSQ – relative sum of squares (%SSQ = SSQ/∑SSQ); %MSQ – relative mean sum of squares (%MSQ = MSQ/∑MSQ), where MSQ = SSQ/DF. %MSQ measures the importance of a specific factor for the size of specificityPAUC and sensitivityPAUC. × denotes interactions between factors. measure – distance measure, noncross – non crossing matching, length – alignment length, θ – weight of nonmatching peaks, residual – unexplained %SSQ or %MSQ, total – column sum of %SSQ, df, %MSQ.
Table 4. Influence of factors specifying the pairwise peaklist comparison on partial areas under the ROC curve for intensity PMF and MS/MS data. For each of the 2688 pairwise peaklist comparison approaches, sensitivityPAUC (sensitivity given FPrate ∈ [0, 0.1]) and specificityPAUC (specificity given sensitivity ∈ [0.9, 1]) (Figure 2) were determined. A partitioning of sums of squares was performed analogously to analysis of variance. Column names: Factors – identification of factors; df – degrees of freedom (DF, number of factor levels  1); %SSQ – relative sum of squares (%SSQ = SSQ/∑SSQ); %MSQ – relative mean sum of squares (%MSQ = MSQ/∑MSQ), where %MSQ = MSQ/∑MSQ. %MSQ measures the importance of a specific factor for the size of sensitivityPAUC and specificityPAUC. × denotes interactions between factors. measure – distance measure, noncross – non crossing matching, length – alignment length, θ – weight of nonmatching peaks, trans – peak intensity transformation, residual – unexplained %SSQ or %MSQ, total – column sum of %SSQ, df, %MSQ.
The ANOVA results
The high value of the %SSQ or %MSQ value in Table 3 and 4 reflect the change (variance) of the response variable PAUC caused by a factor or combination of factors. In case of the intensity based measures, the high value of the relative mean sum of squares (%MSQ) (Table 4, top panel) for the factor 'scale' (intensity scaling procedure) and 'measure' (dissimilarity measure) indicates that these factors were crucial for the correct classification of peaklists. The small values of the %MSQ for the factors 'weight of match accuracy' (weight) and 'computing the noncrossing matching' (noncross) shows that these factors had a negligible impact on the result of pairwise peaklist comparison.
A large value of %MSQ or %SSQ of an interaction term (denoted by × in Table 3 and 4, bottom panel) demonstrates that some combinations of factors were more useful than others. The high value of %SSQ for the interaction measure × scale reflects that, for example, the measure sum of agreeing intensities performed better in combination with the vector length scaling (N) or rootmeansquare scaling (S), than with the total ion count scaling (T) or with the zscore scaling (Z). We concluded that the crucial factors of the pairwise peaklist comparison were the measure and peak intensity scaling, followed by the weight of nonmatching peaks and the length of the peaklist.
In case of binary measures, the high %MSQ value (Table 3, column %MSQ) of factors 'measure', 'weight of nonmatching peaks' θ, 'peaklist length' N as well as of their interactions indicates that their are crucial for the outcome of the pairwise comparison.
In order to examine the extend to which the properties of the pairwise peaklist comparison, determined for the PMF data set can be generalised to other types of mass spectrometric data, we applied the ANOVA to the MS/MS dataset (Tables 3 and 4, right panel). The main difference between these two datasets is that the computation of the noncrossing matching which influenced the PAUC scores in case of MS/MS data, (Tables 3 and 4, row noncross). However, one can conclude that the same factors and factor interactions are significant if comparing PMF and MS/MS data.
Dissimilarity measures with small variance and high PAUC scores
The Figure 3 shows the boxplot of the sensitivity PAUCs measure (PMF data) itemised according to the factors explaining the largest variance. These included measure and weight of non matching peaks θ (binary measures, Figure 3A) and measure and scaling (intensity based measures, Figure 3B).
Figure 3. A: Boxplot of the sensitivityPAUC (sensitivity given a FPrate ∈ [0, 0.1]) itemised according the factors dissimilarity measure and θ (weighting of nonmatching peaks) for the binary measure based peaklist comparisons. B: Boxplot of the factors scale (cf. Methods – Scaling) and measure of the sensitivityPAUC (sensitivity given a FPrate ∈ [0, 0.1]) for intensity measure based peaklist comparisons. The top panels show a clip (ZOOM) of the bottom boxplot, indicated by the green horizontal line. Xaxis labels: fm – FowlkesMallows statistics, gower – Gower coefficients, hg – Huberts Γ, rmi – relative mutual information, canberra – Canberra distance, simindex – similarity index, manhattan – Manhattan distance, euclidean – Euclidean distance, dotprod – dotproduct measure, cov – covariance, soai – sum of agreeing intensities. Scaling: T – total ion count, N – vector length, S – root mean square, R – ranks.
In case of binary measures (Figure 3A) the largest PAUC scores were measured for the FowlkesMallows statistics (Figure 3, left panel) followed by the Gower coefficient. Other factors had a negligible impact on the measures as the small height of the boxes indicates. In Figure 4A (PMF data) and 4B (MS/MS data) we compared the scores obtained by the asymmetric binary measures (Fowlkes Mallows statistics and Gower coefficient) with those acquired by the symmetric binary measures (Huberts Gamma and Relative Mutual information). The figure revealed that for MS/MS data, the symmetric measures performed better (right panel). The conclusion, which can be drawn from this observation is that for MS/MS data a lack of peaks at given masses was more significant than for PMF data. This is in agreement with the higher peak density of MS/MS peaklists.
Figure 4. Boxplot A: Comparison of the sensitivityPAUCs (computed for FPrate ∈ [0, 0.1]) computed for the assymetric binary measures with sensitivityPAUCs of the symmetric binary measures, in case of the PMF dataset. Boxplot B: Comparison of the sensitivityPAUC (computed for FPrate ∈ [0, 0.1]) computed for the assymetric binary measures with the sensitivityPAUCs of the symmetric binary measures, in case of the MS/MS dataset.
The boxplot (Figure 3B) demonstrates that the PAUC scores computed using the Manhattan and Euclidean distances, exhibited higher overall variance, than the dot product measure and the sum of agreeing intensities. These measures (Manhattan and Euclidean distances) were influenced by the weighting of nonmatching peaks θ and peaklist pair length N (cf. Methods, Equation 3). These two factors did not influence the outcome of the dotproduct measure, which measures only the similarity of matching peaks. However, for a fixed combination of measure and scaling, the Manhattan distance with the total ion count (l_{1}norm) scaling and the Euclidean distance with the vector norm (l_{2}norm) scaling presented an eminently small variance of the PAUC measure. This reduction of the variance occurred because the factor 'peaklist length' did not influence the outcome of the comparison. Notably good choices of intensity scaling in the case of the sum of agreeing intensities (soai) and the dot product (dotprod) measure were either the vector norm or the root mean square scaling. Remarkably, the widely used dot product measure did not achieved the highest PAUC score (top panel Figure 3B).
The analysis presented here reproduced published results, demonstrating that the relative distances (cf. Methods 12) performed worse than the dot product measure [33]. Furthermore, it identified other measures that performed equally or better than the dot product measure.
Intensity transformation and ANOVA
As the %MSQ scores of the ANOVA analysis reveal, the intensity transformation has a smaller impact on the peaklist comparison process if set in relation with the distance measure and scaling. However, the proper choice of the intensity transformation can increase the PAUC scores. The Boxplot (Figure 5) of the sensitivityPAUC (left) and specificityPAUC (right) score, computed using the dotproduct and the sum of agreeing intensities measures (both computed on vector length scaled data), shows how the intensity transformation influenced the classification. As predicted by the analysis of variance stabilisation (cf. Appendix – Peak intensity transformation), the log transformation of intensities performed best for both measures.
Figure 5. A: Boxplot of the specificityPAUC (specificity given a TPrate ∈ [0.9, 1]) for the dotproduct measure (dotprod) and sum of agreeing intensities (soai). B Boxplot of the sensitivityPAUC (sensitivity given a FPrate ∈ [0, 0.1]). N – raw intensities, S – square root transformed intensities, L – log transformed intensities, R – intensity ranks.
Figure 6. Stick spectrum of two peaklists X (black lines) and Y (black dot dashed lines). Upper left corner – accuracy of the mass measurement a. A – ambiguous match of five peaks. B – unambiguous match of two peaks. C – peaks not matching.
Interestingly, in the case of MS/MS data we did not observe any differences in the PAUC score due to intensity transformation (not shown). This was due to the fact that we were able only, using a dataset where spectra IDs where assigned by database searches, to determine whether a pairwise comparison performed different or equal to the database search algorithms. It means, that the MS/MS database searches did not perform better than the pairwise peaklist comparison computed with the worst intensity transformation.
The peaklist length
We examined two ways of defining the length of the matched peaklists, first by setting N = 0 in Equation 3 (cf. Methods), and second to a user defined value N (N = 250 in the case of PMF data and N = 400 in case of MS/MS data). The missing peakpairs were augmented by peaks of zero intensity.
The symmetric binary measures Huberts Γ and relative mutual information significantly interacted with peaklist length (length × measure) as well as with the weight of nonmatching peaks (θ × length) (see high %MSQ values in Table 3, bottom panel). These interactions were not observed for the asymmetric binary measures, what caused the third order interaction measure × θ × length (Table 3, bottom panel: Final model). In case of PMF and MS/MS data, the best combination of factors for both symmetric binary measures was visible when N = 250 or N = 400 respectively, and θ = 0.5.
In case of the intensity based dissimilarity measures Manhattan and Euclidean distance, a strong interaction between the factor 'peaklist length' and 'measure' (Table 4, bottom panel: Final model row length × measure) was observed, except for a case when the L_{1}metric (Manhattan distance) was combined with the total ion count (l = 1norm) scaling and the L_{2 } metric (Euclidean distance) with the l_{2}norm (vector length scaling). All the other intensity based measures were practically not influenced by the choice of peaklist length N (see Figure 3).
Differences between binary and intensity based dissimilarities
The dashdotted line in Figure 3 indicates the maximal sensitivityPAUC determined for the binary based peaklist comparison, while the dashed line shows the maximal sensitivityPAUC computed for the intensity based peaklist comparison. If high sensitivities at a high specificity (sensitivityPAUC) were required, the intensity based peaklist comparison performed better than the binary based peaklist comparison. This is because it is very unlikely that samples from different sources would generate spectra where not only the peak masses, but also the peak intensities were similar. However, if high specificity at high sensitivities was required (PMF data only, not shown), the order reversed and binary measures performed better than intensity based measures. Using binary coding makes it unlikely that peaklists with matching peaks will generate a large distance because of erroneous peak intensity measurement.
Weighting of mass measurement accuracy, computing the noncrossing matching and weighting of nonmatching peaks
The variance, which is explained by the factor 'noncrossing matching' (cf. Methods – Finding the matching peaks) and 'weight of matching peaks' (cf. Methods – Weighting the missing mass measurement accuracy) was practically zero in case of the PMF data. In case of the MS/MS data, the variance explained by the factor 'weight of matching peaks' and 'noncrossing matching' was small, but not zero.
Using the measures, which take into account the nonmatching peaks (e.g. Euclidean distance, Huberts Γ), weighting of mass accuracy may decrease the PAUC obtained by a peaklist comparison. This is because weighting of mass accuracy decreases the weight of matching peakpairs, but does not affect the nonmatching peaks. For example, in the case of the Euclidean distance, which is given by weighting of match accuracy may decrease the weight of the term ab as well as b^{2 }and a^{2}. For nonmatching peaks the term ab equals zero, however, the terms a^{2 }and b^{2 }have a full weight. However, matching peaks have higher discriminating power than nonmatching peaks. Thus, exclusively decreasing the contribution of matching peaks decreases the discriminating power of a pairwise peaklist comparison. In order to compensate for the effect of mass measurement accuracy weighting we have introduced the weighting of nonmatching peaks by parameter θ.
We recommend the usage of both procedures if applying the pairwise peaklist comparison on MS/MS data. Noncrossing matching corrects for errors of the peak extraction procedure. Alternatively, instead of computing the noncrossing matching, the binning of the mass range as described, for example, by Tabb et al. [22] can be used. This however, is limited to data with small mass resolution and a mass range. To decrease the influence of random matches on the dissimilarity, which more frequently occurs in MS/MS peaklists, the weighting of mass measurement accuracy can be utilised.
Conclusion
Analysis of variance, based on the factorial structure presented in Table 2 and the PAUC as dependent variable, was used to determine the sensitivities of the factors of pairwise peaklist comparison. To test whether these results apply to different types of mass spectrometric data we used both PMF and MS/MS datasets. The amount of variance explained by the factors was similar for both datasets, which provides evidence that the obtained results might be of general interest.
Two factors, namely measure and intensity scaling and their interactions had the highest impact on the intensity based pairwise peaklist comparison. The combination of the Euclidean distance with vector norm scaling, the Manhattan distance with total ion count scaling and the sum of agreeing intensities with vector length scaling were the best performing measures. A high performing measure with small variance due to the choice of scaling methods was the dot product measure. A further factor, which can be used to increase the classification performance of the peaklist comparison is the intensity transformation with the log function as a best choice. In case of the MS/MS data we recommend to apply the weighting of mass measurement accuracy and combine it with a decrease of the weight of nonmatching peaks (θ = 0.5), as well as to implement the computation of noncrossing matching.
The most important factors for the comparison of the peaklists using binary measures are the measure, weight of nonmatching peaks (θ) and peaklist length N. Symmetric measures with large peaklist length N and a small weight of nonmatching peaks (θ = 0.5) performed best for MS/MS data, while asymmetric measures were the most useful during a comparison of PMF data.
A further possible direction to enhance measures of pairwise peaklist dissimilarity would be to combine them with methods that model peaklist properties i.e. peptide fragmentation patterns [17].
The pairwise peaklist comparison is a computer model of cluster affiliation, which for a given input of two peaklists and various control variables such as weight of nonmatching peaks generates a single output variable further used to classify the peaklist pair. To conduct the analysis presented here a total number of ≈ 4,000,000,000 pairwise peaklists comparisons was performed (cf. Methods Computation). In order to reduce a number of required computations and explore a wider range of factors and levels, it might be beneficial to apply different methods to design and analyse computer experiments [47].
The recommended pairwise peaklist comparison approaches can be used as predictive functions of within and between cluster associations of mass spectrometric peaklist pairs. However, the best value of the discriminatory variable still needs to be determined. This can be achieved, for example, by the use of ROC curves combined with cross validation analysis, but will require a dataset where the identities of the peaklists are known apriori.
Methods
Datasets and preprocessing
PMFdata
The PMF data employed in this study (4532 PMF MS spectra) were derived from three proteome studies. One set contains 1193 PMF MS spectra from bacterial (Rhodopirellula baltica) samples (unpublished data). These samples were measured on a Bruker Reflex III reflectron MALDITOF MS (Bruker Daltonics, Bremen, Germany). Another set, which contains 1539 PMF spectra from mouse (Mus musclus) brain tissue samples (unpublished data) was measured on a Bruker Ultraflex reflectron MALDITOF MS (Bruker Daltonics, Bremen, Germany), while the final set, which was measured on an Bruker Autoflex reflectron MALDITOF MS (Bruker Daltonics, Bremen, Germany), contains 1800 PMF MS spectra from plant tissue (Arabidopsis thaliana) [48,49]. All PMF MS spectra were derived from tryptic protein digests of individually excised protein spots. For this purpose the whole tissue/cell protein extracts of the former mentioned organisms were separated by twodimensional (2D) gel electrophoresis [48] and visualised with MS compatible Coomassie brilliant blue G250 [49]. The MALDITOF MS analysis was performed using delayed ion extraction and employing the MALDI AnchorChipTM targets (Bruker Daltonics, Bremen, Germany). For positively charged ions in the m/z range of 700 – 4,500 m/z were recorded. Subsequently, the monoisotopic masses of the measured peptides were detected by the SNAP algorithm of the XTOF spectra analysis software (Bruker Daltonics, Bremen, Germany). The sum of the detected monoisotopic masses constitute the raw peaklist (peaklist), which were calibrated to a mass accuracy of 0.05Da (or higher) by the inhouse developed software mscalib [39,50]. Moreover, the mscalib software was used to filter the peaklists for irregular peaks that did not follow the general peptide mass rule [12,51]. Additional background peaks (peaks occurring in more than 8% of the spectra [15]) were removed from the peaklists. The obtained processed peaklists were then used for the protein database searches with the Mascot search software (Version 1.8.1) [52] employing a mass accuracy of ± 0.1Da, setting methionine oxidation as a variable and carbamidomethylation of cysteine residues as fixed modification, and allowing a maximum of 1 missed proteolytic cleavage site. Samples with multiple content/identification were removed from the data. Multiple content of samples was determined by removal of all peaks, matching the highest significant hit in the first search and resubmission of the remaining peaks to a new database search.
MS/MS data
To evaluate the distances for the MS/MS data, 70 clusters (spectra assigned to one ID) were randomly chosen (5 replicates obtained) from a large dataset of identified yeast spectra [53]. The protein extraction, sample preparation, measurement and identification were performed as described by Wagner et al. [54]. The analysed MS/MS spectra were recorded on an ESI Ion Trap mass spectrometer (LCQ DECA, Thermo Electron) with the following instrument settings: spray voltage: 1.5 kVolt; data dependent scanning with one full MS spectrum is followed by four independent MS/MS spectra of the four most intensive ions; minimum signal intensity for a peptide to be selected for fragmentation set to 10^{6 }ion counts. These selected and fragmented ions were then excluded from further fragmentation events for 1 minute to prevent repeated MS/MS spectra of identical peptides. The collision energy for the peptide fragmentation was automatically set by the instrument, which was controlled by the Xcalibur software (Versionl.2, Thermo Electron). The post acquisition processing was performed with the Bioworks browser package (Thermo Electron). The resulting peaklists were automatically stored and assigned to peptide sequences in the yeast protein database [55], by using the Sequest database search algorithm (Version 27) [8,56]. The search parameters employed for the database searches were as follows: a) none or one of the four proteases, as defined by Sickmann et al. [53]; b) mass type: mono isotopic (parent ion and fragment ion) c) amino acid modifications: carbamidomythylated cysteine residues +57Da and oxidation on methionine residues +16Da, while missed cleavage sites (maximum allowed): 1 missed. We considered spectra identified if they had an Xcorr > 2 and an ion coverage of 20%.
Finding the matching peaks
We considered two peaks x and y from different peaklists X, Y to match with an accuracy a if x  y <a (absolute error) or (relative error in part per million (ppm)). Cases where more than one peak in Y match a peak x (Figure 6, case A) were resolved by computing a noncrossing matching of the peaklists. A noncrossing matching a maximum trace [57,58] can be computed in time O(n log n), where n is the number of peak matches. In our case, we resorted to a simple maximum similarity alignment, which could be banded to improve its O(n^{2}) time complexity. The optimal trace of two sorted mass lists of matching peaks was established by dynamic programming. Let qual be a measure of the goodness of the match of two peaks i.e.
qual_{abs }= max{0, a  x  y},
where x and y are masses of peaks from two distinct peaklists. Then our goal was to maximise the overall quality of all matched peaks. We recorded in the matrix M_{i, j }the best possible assignment of the first i peaks in the first list and the first j peaks in the second list. Hence M_{i,0 }= M_{j,0 }= 0 for all i, j, and it is easy to see that the following recurrence can be used to derive the overall best assignment:
In this way we could find a noncrossing matching, which minimised the overall errors and unambiguously assigned a peak x to a peak y.
Weighting the missing mass measurement accuracy
For computation of the dissimilarities we used the weighting of the mass measurement accuracy [45,46] and the alignment of peaklist by linear regression [12,59]. To model the accuracy of a given match either we weighted the peak intensities in the matching pairs (intensity based measures) or calculated the weight of the match (binary measure) by a triangular function w:
where a is the maximum displacement evaluated and x and y are peak masses. If the mass difference x  y of two matching peaks increases then the significance of the match is reduced. Before computing the weights, we minimised the overall error of the matching masses by adjusting the two peaklists using linear regression.
Non matching peak pairs
Peaks detected in one sample not occurring in the other one (Figure 6, case C), were included in the computation of the dissimilarities. The second peaklist was augmented with a peak of zero intensity, at the mass of the notmatching peak. In addition, the significance of such a peak pair (and peak intensities) could be weighted with the factor θ. In this study we examined three values of θ, namely 0.5, 1 and 2.
Binary measures
We have also investigated measures that only use qualitative information in the sense that they evaluate the number of matching and mismatching peaks of both peaklists. Essentially these measures are numerical functions in the contingency Table 5 derived from both peaklists. To include the weighting of missing accuracy by the w (see Equation 2) and the weighting of nonmatching peaks by θ we introduced a generalised version of the contingency table. All binary measures introduced below can be computed on the entries of the contingency Table 5.
Table 5. Modified contingency table. M = max{N, i + θ·(
+ ) + } with N defined by the user and c = 1 in case of Hubert's Gamma or c = 0 otherwise.Peaks present in list X, but not in list Y, are denoted by , likewise present in Y, but not in X by . We multiplied the mismatches by θ to assign a variable weight. Therefore, , as well as were replaced by and , respectively. To include the weighting of missing mass accuracy in computing the dissimilarities one can set , with defined by the Equation 2. Our data are asymmetric in the sense that we can only evaluate existing peaks and do not count the absence of peaks in both peaklists at a mass. Measures that utilise only this information are the Gower coefficient and FowlkesMallows statistics. Additionally, we were interested in the performance of measures that take into account the marginal M and hence the entry is required (Hubert's Γ (Appendix Equation 16) or the relative mutual information (Appendix Equation 19)).
Since the peaklists can have different length and the maximal peaklist length is undefined, we defined the entry M length of a matched peaklists pairs as follows:
where N is an arbitrary user defined constant and c = 1 in case of the Huberts Γ and c = 0 otherwise. By this definition, due to the use of the maximum function we avoided the case when becomes less than zero (see equation 4 for definition of ). In this study we used two different values of N. We set N = 0 and the second value equal to twice the length of the longest peaklist in each dataset.
Given all entries of the modified contingency Table (5), the marginals could be computed by equations (4 – 8):
A summary of the measures studied here can be found in the the Appendix section.
Measures based on peak intensities and intensity ranks
Before computing the measures based on peak intensities, we have applied intensity transformations and scaling procedures. The peak intensities were transformed by taking the square root, as suggested by Tabb et al. [22], and the logarithm. Furthermore, we replaced the intensities by their ranks within the peaklist [27]. The scaling procedures used included total ion current count normalisation [34], vector length normalisation, root mean square normalisation and zscore normalisation (a detailed description can be found in the Appendix section).
In our study, we investigated several pairwise similarity measures to compare two peaklists. These measures are either measures of similarity (such as covariance) or measures of distance (such as the l^{p }metrics). In order to make both classes of measures comparable we transformed each similarity measure into an appropriate dissimilarity measure. Moreover, we introduced the factor w_{i }to weight missing mass measurement accuracy (cf. Methods – Weighting the missing mass measurement accuracy) and nonmatching peaks (cf. Methods – Non matching peak pairs).
The dotproduct of two vectors is defined
where I^{X }and I^{Y }are the intensity vectors of two matched peaklists (cf. Methods – Finding the matching peaks) of length N, and is defined by the Equation 2 for matching peaks and equals θ for nonmatching peaks. In case of summeansquare, total ion count and vector length scaling, the product of non matching peakpairs is zero and therefore this measure is independent of θ. If the intensities of the matched peaklists are zscore scaled the outcome will depend on the value of θ. Furthermore, augmenting the peaklists by zero pairs in order to increase their length will increase DP for zscore and rootmeansquare scaled data. The most prominent representative of this family is the spectral angle (the dotproduct of vector length normalised data). It has a geometric interpretation. It is equal to the cosine of the angle enclosed by the two vectors.
Covariance
The covariance is a measure of dependency between random variables I^{x }and I^{y }[61] and is defined as:
where I^{X}, I^{Y}, N, are defined as above.
The best known representative of this family of measures is the Pearson correlation, which is obtained if we compute the covariance of zscore scaled intensity vectors.
Metricbased measures
The Euclidean and Manhattan distances belong to the family of l^{p }metrics and can be expressed using equation:
In case of the Euclidean distance p = 2, and for the Manhattan distance p = 1. The Euclidean distance penalises large intensity differences mores than the Manhattan distance. The outcome of this measure will change due to different sample wise scaling of the intensities. In case of the zscore scaling the outcome will depend on the user defined peaklist length N (Equation 3).
Similarity index and Canberra distance
The Similarity index [33] and Canberra Distance [62] measure the relative distance and can be expressed by the equation:
Setting p = 2 yields the similarity index, while p = 1 results in the Canberra distance. Similarly, as in case of the l^{p }metrics, the similarity index with p = 2 will be more influenced by large intensity differences than the Canberra distance.
If the term x + y in the denominator equals zero due to x = y, with x ≠ 0 ∨ y ≠ 0, infinity +∞ is returned [60].
Sum of agreeing intensities
The sum of agreeing intensities is defined by the equation:
It shares with the similarity index and the Canberra distance the property that each pair of matching peaks will contribute to the final score a proportion in the range of [0,1/n]. The sum of agreeing intensities however, puts more emphasis on the agreement of peak intensities. Peak pairs whose intensity differences are larger than their average intensity receive a weight of zero.
Computation
All scores presented in the results section were computed for 75 clusters. The clusters were sampled from the datasets without replacement. For each cluster we randomly chose 2 – 20 (PMFdata) 2 – 7 (MS/MS) samples. This procedure was repeated five times and the average of the scores was computed. The pairwise peaklist comparison approaches were computed with a mass measurement error of 0.7Da for the MS/MS data, and of 0.2Da for the PMF data. The PAUC areas were computed using inhouse developed R functions. Other R packages provide a huge variety of statistical tools for further analysis of the dissimilarities such as clustering algorithms and validation or multidimensional scaling methods [67].
Appendix
Binary measures
Jaccard/Gower coefficient
The matching peak count is the dotproduct of the two peaklists and counts the number of matching peaks (). Since peaklists have different numbers of nonzero elements, this dot product must be normalised by the total counts. The Jaccard coefficient is a normalised version of the matching peak count, whose distance version is given by:
A generalised version of the Jaccard coefficient in which and is weighted by a constant θ was introduced by Gower et al. [63].
FowlkesMallows statistics
The FowlkesMallows statistics [64] (introduced in the context of clustering validation by use of contingency tables) are the matching peak counts normalised by the geometric mean of the peaklists lengths. The equation of the distancelike version is given by:
Huberts Γ
Using binary signals, we can transform the formula of the correlation coefficient such that it uses the values of the contingency table to obtain:
We observed that the numerator was maximised if all signals were expressed equally. To avoid the fact that the denominator becomes zero (which is the case if or = 0 and occurs if one peaklist is included in the other) we set c = 1 in equation (3).
Relative mutual information
We were additionally interested in the performance of information theoretic concepts. Given the two peaklists, X and Y, the amount of information about peaklist X inherent in peaklist Y (and vice versa) is given by the mutual information (H) [65]:
To be able to use the mutual information as a similarity measure, so it could distinguish positive from negative correlation, we introduced the following scaling term [66]:
Furthermore, we adjusted it for the information inherent in the individual peaklists. The adjustment was done using the entropy of the individual peaklists, which for a peaklist X is given by:
Thus, we defined the relative mutual information:
The relative mutual information is small if both peaklists are similar and high if they differ. Since the inequalities
H(X; Y) ≥ 0 and H(X; Y) ≤ min{H(X), H(Y)}
holds, this measure is bounded to the interval [1,1]. The relative mutual information has been introduced before [36], in the context of clustering gene expression data.
Peak intensity scaling
The purpose of scaling is to allow the comparison of peaklists with different intensity values i.e. due to different scale of the detector used or due to different amount of sample. Since intensities in different peaklists could have different intensity ranges, we used standard scaling procedures to account for this bias.
• Total ion current count normalisation [34,35] is defined as:
where I_{i }is the intensity of the peak i in the peaklist of length N. Here, the intensities are divided by the sum of all intensities, so that after scaling the sum of the intensities in each peaklist equals one (). The total ion count is better known as the l_{1 }– norm since I_{i }> 0 ∀ i
• Vector length normalisation is defined as:
Here, the peak intensities are divided by the l = 2norm of the intensity vector, which causes that the Euclidean length of the vector equals one ().
• Root mean square normalisation is defined as:
Here, the intensities are divided by their rootmeansquare [60].
• zscore normalisation is defined as:
where I_{i}, i, N defined as above, Ī denotes the average intensity of a peaklist and . Here, two scaling steps are performed, centring and subsequent division by the standard deviation. This causes each scaled peaklist to have an average intensity of zero and a standard deviation of one.
The scaling is preferred if intensities and variance in an arbitrary sample are much higher than in the other samples, which will determine the outcome of the peaklist comparisons. Data transformation was applied before the peaklist matching, whereas data scaling was performed for already matched peaklists.
Features of the pairwise peaklist comparison and their properties
The number of matches
While the intensities of individual peaks may considerably vary between the spectra, the m/z values of fragment ions can be measured with at least the accuracy of a single m/z in the majority of mass spectrometers. If the primary fragment ions/peptides in a pair of spectra have the same m/z locations, the spectra are judged to result from the same peptide/protein, regardless of their peak intensities. Table 6 (rows one and four) summarises the properties of both: mass measurement error (MME) and the mass measurement range of the peaklists. It also provides the fivenumber summary and the mean of the peaklists lengths. The observed number of matches for within and between cluster peaklist pairs is shown in Table 6 (rows two, three, five and six). The theoretical probability of i matches if two independent peaklists of known length, mass measurement range and resolution are compared, can be modelled using the hypergeometric distribution [40]. In case of PMF data, for peaklists drawn from between clusters, a higher number of matches than expected for independent peaklists was observed. This difference might be due to incomplete separation of proteins obtained after twodimensional (2D) gel electrophoresis [41] and because the sequence database entries have different database IDs, even if the protein sequences exhibit a high fraction of sequence identity (i.e. protein families).
Table 6. Peptide (PMF) peaklist and peptide fragment ions (MS/MS) peaklist properties. MME – mass measurement error. The rows 1 and 4 provide &fivenumber summary and the mean of the peaklists lengths (number of peaks in peaklist) in the dataset. Rows 2,3 (PMF) and 5,6 (MS/MS) provide the fivenumber summary and the mean of the number of matches observed if comparing within and between cluster peaklists pairs. Min. – minimum, 1st Qu. – first quartile, 3rd Qu. – third quartile, Max. – maximum
For 75 clusters of various size (2 – 20 samples/cluster) sampled five times from the PMF dataset, we have computed the number of matching peaks for all peaklist pairs. The number of matching peaks was in almost all cases higher, if the peaklists compared laid within one cluster (magenta histogram, Figure 7A), than if they occurred between different clusters (green histogram, Figure 7A). For example, 95% of within cluster peaklist pairs had more than 4 matches, but only 1% of between cluster peaklist pairs had more than 4 matches. The cases where the number of matches between peaklist from within one cluster equalled zero, can be explained by the fact that the spectra were measured on nonoverlapping fragments of the same protein.
Figure 7. A – Histogram of the number (bandwith = 1) of matching peaks for peaklists chosen from the same cluster (magenta) and from different clusters (green). B – Histogram of the number (bandwith = 3) of nonmatching peaks, if peaklists were chosen from the same (magenta) or from different clusters (green).
The masses of randomly matching peaks differ, on average, more than the masses of nonrandom matching peaks. Therefore, weighting of mass measurement accuracy using a triangular function (see Equation 2) was implemented. This function reduced the weight of peaks with a small overlap.
Furthermore, in case of the MS/MS peaklists, clusters of peaks separated by a mass smaller than the mass measurement accuracy (which is used for searches of matching peaks) were observed. Therefore, during matching the two peaklists, some ambiguous matches (that is a peak is assigned to more than one peak in the second peaklist) occurred (cf. Methods – Figure 6, case A). In order to generate an unambiguous pairwise assignment of peaks we computed the noncrossing matching using standard dynamic programming techniques (cf. Methods – Finding the matching peaks).
We concluded that the probability of matches between independent peaklists is higher in case of MS/MS than PMF data because of its lower mass measurement accuracy, smaller mass range and larger number of peaks. Hence, the number of matches has a lower discriminating power in case of MS/MS than of PMF data.
The number of nonmatching peaks
To discriminate peaklist pairs as being within or between clusters the number of nonmatching peaks can be used. Figure 7B presents histograms of the number of nonmatching peaks between peaklist pairs (in magenta – the number of peaks that did not match if we compared two peaklists within a cluster; in green – the number of peaks that did not match if we compare peaklists pairs between two clusters). We observed that the probability of encountering a within peaklist pair increased if the number of nonmatching peaks was small.
We have evaluated the performance of the following asymmetric binary measures: Gower coefficient [63] (cf. Appendix Equation 14) and FowlkesMallows statistics (cf. Appendix Equation 15). These measures incorporate the number of matches and mismatches. If the length of the aligned peaklists is defined (see Equation 3), symmetric binary measures e.g. Huberts Γ (Appendix Equation 16) and relative mutual information (Appendix Equation 19) can also be used. Furthermore, we examined whether increasing or decreasing the weight of nonmatching peaks by a factor of two can increase the performance of the pairwise peaklist comparison (cf. Methods – Non matching peak pairs).
Peak intensities
Intensities associated with the masses observed at least twice within a cluster (magenta density, Figure 8A) tend to have higher peak intensities, compared to intensities of peaks whose masses are observed only once within a cluster (grey density, Figure 8A). Furthermore, intensities I_{X }and I_{Y}, of matching peaks in peaklists from within a cluster, were more strongly correlated (, ) (Figure 8B) than those of peaks matching between clusters (, ) (Figure 8C). The correlation was determined for log transformed and root mean square scaled peak intensities. This indicated that the intensity of peaks could be employed for better discrimination of within and between cluster peaklist pairs. We have studied the performance of several intensity based measures: the covariance (cf. Methods Equation 10), the dotproduct (cf. Methods Equation 9), the Manhattan and Euclidean distances (cf. Methods Equation 11), the relative distances Canberra and similarity index (cf. Methods Equation 12), and the sum of agreeing intensities (cf. Methods Equation 13).
Figure 8. Peak Intensities. A) Histogram of intensities: Xaxes – Intensity of log transformed rootmeanssquare scaled peak intensities. Yaxis – Frequency. In grey: Histogram of the peak intensities that do not match a peak in any other peaklists (peaklists) within the same cluster (this mass is observed only once in the cluster). In magenta: Histogram of intensities of peaks that do match a peak within any peaklist within cluster (this mass is observed at least twice in the cluster). B) Altman Bland plot of intensities of the matching peaks for peaklists pairs from within a cluster. C) Altman Bland plot of intensities of matching peaks for peaklists pairs of between clusters.
Peak intensity transformation
If two peaks match within a cluster, the peak intensities are very likely (except random matches) to be estimates of a number of the ions of the same peptide (PMF) or peptide fragment (MS/MS). These estimates might contain errors resulting from random noise, different levels of peptide fragmentation due to variations in collision energy and different signaltonoise ratios due to varying concentrations of sample present [22].
The observed error can depend on the observed intensity. Thus, any statistical model would either directly account for the variances or transform the data so that the variances are approximately equal for all peak intensity levels. To be able to determine the best variance stabilising transformation, one can examine the proportionate reduction in variation R^{2 }[42], obtained by analysis of the model ΔI ~ Ī + Ī^{2}, where ΔI = I_{X } I_{Y }are the residues and Ī = (I_{X }+ Y_{Y})/2 represent the average peak intensity of two matching peaks. This model accounts for a correlation of variance and intensity (ΔI ~ Ī), unlike the naive model ΔI = E(ΔI) [43]. If the variance is stable, the naive model suffices, and the proportionate reduction in variation obtained with the complex model ΔI ~ Ī + Ī^{2 }should be close to zero.
The AltmanBland plots [43] in Figures 8B and 8C, show the residues (ΔI = I_{X } I_{Y}) as a function of the average peak intensity Ī = (I_{X }+ I_{Y})/2, where I_{X }and I_{Y }are the intensities of a matching peak pair (X, Y). The peak intensities are logtransformed and root mean square scaled (cf. Methods – Equation 22). Table 7 shows the adjusted R^{2 }of the model ΔI ~ Ī + Ī^{2 }for various peak intensity transformations. The logtransformation gives the best variance stabilisation.
Table 7. The adjusted R^{2 }of the model ΔI ~ Ī + Ī^{2 }for the raw, squared (Tabb et al. [22]) and log transformated peak intensities. PMF – PMFdata; MS/MS – MS/MS data.
To elucidate to which extent the transformation influences the PAUC score, as compared to other factors, we kept the different transformations as factor levels of the pairwise peaklist comparison process. This wasdone despite the fact that the best transformation was determined by the analysis of the proportionate reduction of variance. In addition to the raw, rootsquared and logtransformed intensities we included the ranking of the intensities [27] among the transformations studied by ANOVA.
Abbreviations
• ANOVA – analysis of variance
• ROC – receiver operating characteristic curve.
• PAUC – partial area under the curve.
• TP – true positive.
• FP – false positive.
• FN – false negative.
• TN – true negative.
Authors' contributions
HL, JG, KR, ML and RH gave initial input to the research.
WEW implemented the dissimilarities, evaluation framework, and designed the figures.
WEW, RH and PM planned and carried out the analysis.
WEW, ML, PM, KR, PG and JG wrote the manuscript.
AS provided the MS/MS dataset.
JG and PG provided the PMF datasets.
All authors contributed to the final version of the manuscript and approved it.
Additional File 1. Column name and factor level definition of the following four tables.
Format: TXT Size: 1KB Download file
Additional File 2. The file contains the sensitivity and specificity PAUC obtained for peaklist comparison approaches based on binary measure and the PMF dataset.
Format: CSV Size: 4KB Download file
Additional File 3. The file contains the sensitivity and specificity PAUC obtained for peaklist comparison approaches based on measures considering peak intensities and the PMF dataset.
Format: CSV Size: 228KB Download file
Additional File 4. The file contains the sensitivity and specificity PAUC obtained for peaklist comparison approaches based on the binary measure based and the MS/MS dataset.
Format: CSV Size: 4KB Download file
Additional File 5. The file contains the sensitivity and specificity PAUC obtained for peaklist comparison approaches based on measures considering peak intensities and the MS/MS dataset.
Format: CSV Size: 229KB Download file
Acknowledgements
We would like to thank Florian Markowetz, Colin Gillespie, Stale Nygard, Andrew Golightly and Malcolm Farrow for proofreading the manuscript. We thank the anonymous referees for helpful comments. Further thanks to the Rhelp mailing list [37] for continuous help while implementing the package msbase. This project was funded by the National Genome Research Network (NGFN) of the German Ministry for Education and Research (BMBF), and the Max Planck Society.
References

Fenyo D: Identifying the proteome: software tools.
Current Opinion in Biotechnology 2000, 11:391395. PubMed Abstract  Publisher Full Text

Griffin TJ, Aebersold R: Advances in proteome analysis by mass spectrometry.
J Biol Chem 2001, 276:45497500. PubMed Abstract  Publisher Full Text

Patterson SD: Data analysis – the Achilles heel of proteomics.
Nat Biotechnol 2003, 21(3):2212. PubMed Abstract  Publisher Full Text

Aebersold R, Mann M: Mass spectrometrybased proteomics.
Nature 2003, 422(6928):198207. PubMed Abstract  Publisher Full Text

Mann M, Hojrup P, Roepstorff P: Use of mass spectrometric molecular weight information to identify proteins in sequence databases.
Biol Mass Spectrom 1993, 22(6):338345. PubMed Abstract

Johnson R, Martin S, Biemann K, Stults J, Watson J: Novel Fragmentation Process of Peptides by CollisionInduced Decomposition in a Tandem Mass Spectrometer: Differentiation of Leucine and Isoleucine.
Analytical Chemistry 1987, 59(21):26212625. PubMed Abstract

Smith RD, Loo JA, Edmonds CG, Barinaga CJ, Udseth HR: New developments in biochemical mass spectrometry: electrospray ionization.
Anal Chem 1990, 62(9):88299. PubMed Abstract

Sadygov RG, Cociorva D, Yates JRr: Largescale database searching using tandem mass spectra: looking up the answer in the back of the book.
Nat Methods 2004, 1(3):195202. PubMed Abstract  Publisher Full Text

Gras R, Muller M, Gasteiger E, Gay S, Binz PA, Bienvenut W, Hoogland C, Sanchez JC, Bairoch A, Hochstrasser DF, Appel RD: Improving protein identification from peptide mass fingerprinting through a parameterized multilevel scoring algorithm and an optimized peak detection.
Electrophoresis 1999, 20(18):35353550.
[(eng)].
PubMed Abstract  Publisher Full Text 
Strittmatter EF, Rodriguez N, Smith RD: High mass measurement accuracy determination for proteomics using multivariate regression fitting: application to electrospray ionization timeofflight mass spectrometry.
Anal Chem 2003, 75(3):4608. PubMed Abstract

Gentzel M, Kocher T, Ponnusamy S, Wilm M: Preprocessing of tandem mass spectrometric data to support automatic protein identification.
Proteomics 2003, 3(8):1597610. PubMed Abstract  Publisher Full Text

Wool A, Smilansky Z: Precalibration of matrixassisted laser desorption/ionizationtime of flight spectra for peptide mass fingerprinting.
Proteomics 2002, 2(10):13651373. PubMed Abstract  Publisher Full Text

Gobom J, Mueller M, Egelhofer V, Theiss D, Lehrach H, Nordhoff E: A calibration method that simplifies and improves accurate determination of peptide molecular masses by MALDITOF MS.
Anal Chem 2002, 74(15):39153923.
[(eng)].
PubMed Abstract 
Levander F, Rognvaldsson T, Samuelsson J, James P: Automated methods for improved protein identification by peptide mass fingerprinting.
Proteomics 2004, 4(9):2594601. PubMed Abstract  Publisher Full Text

Chamrad DC, Koerting G, Gobom J, Thiele H, Klose J, Meyer HE, Blueggel M: Interpretation of mass spectrometry data for highthroughput proteomics.
Anal Bioanal Chem 2003, 376(7):101422. PubMed Abstract  Publisher Full Text

Moore RE, Young MK, Lee TD: Method for screening peptide fragment ion mass spectra prior to database searching.
J Am Soc Mass Spectrom 2000, 11(5):4226. PubMed Abstract  Publisher Full Text

Sun W, Li F, Wang J, Zheng D, Gao Y: AMASS: Software for Automatically Validating the Quality of MS/MS Spectrum from SEQUEST Results.
Mol Cell Proteomics 2004, 3(12):11949. PubMed Abstract  Publisher Full Text

Anderson DC, Li W, Payan DG, Noble WS: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores.
J Proteome Res 2003, 2(2):13746. PubMed Abstract

Keller A, Nesvizhskii AI, Kolker E, Aebersold R: Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.
Anal Chem 2002, 74(20):538392. PubMed Abstract

Yates JR, Morgan SF, Gatlin CL, Griffin PR, Eng JK: Method To Compare CollisionInduced Dissociation Spectra of Peptides: Potential for Library Searching and Subtractive Analysis.
Anal Chem 1998, 70:35573565. PubMed Abstract

Beer I, Barnea E, Ziv T, Admon A: Improving largescale proteomics by clustering of mass spectrometry data.
Proteomics 2004, 4(4):95060. PubMed Abstract  Publisher Full Text

Tabb DL, MacCoss MJ, Wu CC, Anderson SD, Yates JRr: Similarity among tandem mass spectra from proteomic experiments: detection, significance, and utility.
Anal Chem 2003, 75(10):24707. PubMed Abstract

Pevzner PA, Dancik V, Tang CL: MutationTolerant Protein Identification by Mass Spectrometry.
Journal of Computational Biology 2000, 7(6):777787. PubMed Abstract  Publisher Full Text

Mann M, Wilm M: Errortolerant identification of peptides in sequence databases by peptide sequence tags.
Anal Chem 1994, 66:43904399. PubMed Abstract

Tabb DL, Saraf A, Yates JRr: GutenTag: highthroughput sequence tagging via an empirically derived fragmentation model.
Anal Chem 2003, 75(23):641521. PubMed Abstract  Publisher Full Text

Clauser KR, Baker P, Burlingame AL: Role of accurate mass measurement (+/ 10 ppm) in protein identification strategies employing MS or MS/MS and database searching.
Anal Chem 1999, 71(14):287182. PubMed Abstract

Svetnik V, Liaw AI: Detecting Novel Samples in Mass Spectral Data: A Clustering Approach.
In Proceedings of the 33rd Symposium on the Interface Edited by Wegman E, Braverman A, Goodman A, Smyth P. 2001, 321328.

An Z, Harris G, Zink D, Giacobbe R, Lu P, Sangari R, Bills G, Svetnik V, Gunter B, Liaw A, Masurekar P, Liesch J, Gould S, Strohl W: Expression of CosmidSize DNA of SlowGrowing Fungi in Aspergillus Nidulans for Secondary Metabolite Screening. In Handbook of Industrial Mycology. Edited by An Z. New York: Marcel Dekker; 2003:167187.

Li J, Zhang Z, Rosenzweig J, Yang Y, Chan D: Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer.
Clinical Chemistry 2002, 48(8):12961304. PubMed Abstract  Publisher Full Text

Jeffries N: Performance of a genetic algorithm for mass spectrometry proteomics. [http://www.biomedcentral.com/14712105/5/180] webcite
BMC Bioinformatics 2004, 5:180. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Adam B, Qu Y, Davis J, Ward M, Clements M, Cazares L, Semmes O, Schellhammer P, Yasui Y, Feng Z, Wright G: Serum protein fingerprinting coupled with a patternmatching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men.
Cancer Research 2002, 62:36093614. PubMed Abstract  Publisher Full Text

Petricoin E, Ardekani A, Hitt B, Levine P, Fusaro V, Steinberg S, Mills G, Simone C, Fishman D, Kohn E, Liotta L: Use of proteomic patterns in serum to identify ovarian cancer.
Lancet 2002, 359:572577. PubMed Abstract  Publisher Full Text

Wan KX, Vidavsky I, Gross ML: Comparing similar spectra: from similarity index to spectral contrast angle.
J Am Soc Mass Spectrom 2002, 13:8588. PubMed Abstract

Alfassi ZB: On the normalization of a mass spectrum for comparison of two spectra.
J Am Soc Mass Spectrom 2004, 15(3):385387. PubMed Abstract  Publisher Full Text

Rasmussen GT, Isenhour TL: The Evaluation of Mass Spectral Search Algorithms.
Journal of Chemical Information and Computer Sciences 1979, 19(3):179186.

Herwig R, Poustka AJ, Müller C, Lehrach H, O'Brien J: Largescale Clustering of cDNA Fingerprinting Data.
Genome Res 1999, 9:10931105. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

R Development Core Team: R: A language and environment for statistical computing. [http://www.rproject.org] webcite
R Foundation for Statistical Computing, Vienna, Austria; 2004.
[ISBN 3900051003].

Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, lacus S, Irizarry R, Li FLC, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. [http://genomebiology.com/2004/5/10/R80] webcite
Genome Biology 2004, 5:R80. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bioconductor – open source software for bioinformatics [http://www.bioconductor.org] webcite

Schmidt F, Schmid M, Jungblut PR, Mattow J, Facius A, Pleissner KP: Iterative data analysis is the key for exhaustive analysis of peptide mass fingerprints from proteins separated by twodimensional electrophoresis.
J Am Soc Mass Spectrom 2003, 14(9):94356. PubMed Abstract  Publisher Full Text

Fox J: Applied Regression Analysis, Linear Models, and Related Methods. Sage Publications; 1997.

Bland JM, Altman DG: Measuring agreement in method comparison studies.
Stat Methods Med Res 1999, 8(2):13560. PubMed Abstract  Publisher Full Text

Zhou XH, McClish DK, Obuchowski NA: Statistical Methods in Diagnostic Medicine. Wiley; 2002.

Zhang N, Aebersold R, Schwikowski B: ProbID: A probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data.
Proteomics 2002, 2:14061412. PubMed Abstract  Publisher Full Text

Hageman JA, Wehrens R, Gelder RD, Buydens LMC: Powder Pattern Indexing Using the Weighted Crosscorrelation and Genetic Algorithms.
Journal of Computational Chemistry 2003, 24(9):10431051. PubMed Abstract  Publisher Full Text

Santner T, Williams B, Notz W: The Design and Analysis of Computer Experiments. Springer Series in Statistics, Springer Verlag New York; 2003.

Klose J, Kobalz U: Twodimensional electrophoresis of proteins: an updated protocol and implications for a functional analysis of the genome.
Electrophoresis 1995, 16(6):103459. PubMed Abstract

Giavalisco P, Nordhoff E, Kreitler T, Kloppel KD, Lehrach H, Klose J, Gobom J: Proteome analysis of Arabidopsis thaliana by twodimensional gel electrophoresis and matrixassisted laser desorption/ionisationtime of flight mass spectrometry.
Proteomics 2005, 5(7):190213. PubMed Abstract  Publisher Full Text

Wolski WE, Lalowski M, Jungblut P, Reinert K: Calibration of mass spectrometric peptide mass fingerprint data without specific external or internal calibrants.

Gay S, Binz PA, Hochstrasser DF, Appel RD: Modeling peptide mass fingerprinting data using the atomic composition of peptides.
Electrophoresis 1999, 20(18):35273534. PubMed Abstract  Publisher Full Text

Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probabilitybased protein identification by searching sequence databases using mass spectrometry data.
Electrophoresis 1999, 20(18):35513567. PubMed Abstract  Publisher Full Text

Sickmann A, Reinders J, Wagner Y, Joppich C, Zahedi R, Meyer HE, Schonfisch B, Perschil I, Chacinska A, Guiard B, Rehling P, Pfanner N, Meisinger C: The proteome of Saccharomyces cerevisiae mitochondria.
Proc Natl Acad Sci USA 2003, 100(23):1320712. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wagner Y, Sickmann A, Meyer HE, Daum G: Multidimensional nanoHPLC for analysis of protein complexes.
J Am Soc Mass Spectrom 2003, 14(9):100311. PubMed Abstract  Publisher Full Text

IsselTarver L, Christie KR, Dolinski K, Andrada R, Balakrishnan R, Ball CA, Binkley G, Dong S, Dwight SS, Fisk DG, Harris M, Schroeder M, Sethuraman A, Tse K, Weng S, Botstein D, Cherry JM: Saccharomyces Genome Database.
Methods Enzymol 2002, 350:32946. PubMed Abstract

Yates JR, Eng JK, McCormack AL, Schieltz D: Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database.
Anal Chem 1995, 67(8):142636. PubMed Abstract

Kececioglu J: The maximum weight trace problem in multiple sequence alignment. In Proceedings of the 4th Symposium on Combinatorial Pattern Matching (CPM 93). no. 684 in Lecture Notes in Computer Science, Springer; 1993:106119.

Kececioglu JD, Lenhof HP, Mehlhorn K, Mutzel P, Reinert K, Vingron M: A Polyhedral Approach to Sequence Alignment Problems.

Egelhofer V, Gobom J, Seitz H, Giavalisco P, Lehrach H, Nordhoff E: Protein identification by MALDITOFMS peptide mapping: A new strategy.
Analytical Chemistry 2002, 74(8):17601771. PubMed Abstract

Becker RA, Chambers JM, Wilks AR: The New S Language. Wadsworth & Brooks/Cole; 1988.

Härdle W, Simar L: Applied Multivariate Statistical Analysis. Springer, Heidelberg; 2003.

Lance GN, Williams WT: MixedData Classificatory Programs I – Agglomerative Systems.

Gower JC, Legendre P: Metric and Euclidean Properties of Dissimilarity Coefficients.

Fowlkes EB, Mallows CL: A method for comparing two hierarchical clusterings.

Cover TM, Thomas J: Elements of Information Theory. New York: J. Wiley and Sons; 1991.

Herwig R, Ed: Largescale information theoretic clustering with application to the analysis of genetic fingerprinting data (PhD). Berlin: Logos; 2001.

Comprehensive R Archive Network [http://cran.rproject.org] webcite