Laboratory for Bioinformatics and Computational Biology, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, PR China
Abstract
Background
In mass spectrometry (MS) based proteomic data analysis, peak detection is an essential step for subsequent analysis. Recently, there has been significant progress in the development of various peak detection algorithms. However, neither a comprehensive survey nor an experimental comparison of these algorithms is yet available. The main objective of this paper is to provide such a survey and to compare the performance of single spectrum based peak detection methods.
Results
In general, we can decompose a peak detection procedure into three consecutive parts: smoothing, baseline correction and peak finding. We first categorize existing peak detection algorithms according to the techniques used in each phase. Such a categorization reveals the differences and similarities among existing peak detection algorithms. Then, we choose five typical peak detection algorithms to conduct a comprehensive experimental study using both simulation data and real MALDI MS data.
Conclusion
The comparison results show that the continuous wavelet-based algorithm provides the best average performance.
Background
Proteome research requires the analysis of large-volume protein data in a high-throughput manner. Mass spectrometry (MS) is a common analytical tool in proteome research. It can be used to measure the masses of proteins/peptides in complex mixtures obtained from biological samples. This provides tremendous potential to study the disease proteome and to identify drug targets directly at the protein/peptide level.
In a typical proteomic experiment, a huge volume (e.g. 1 GB) of MS data is often generated. Each MS spectrum consists of two large vectors corresponding to mass to charge ratio (
1. What is the working mechanism of each algorithm?
2. What are the differences and similarities among the algorithms?
3. How do they perform in MS data analysis?
To address the above questions, we study the peak detection process using a common framework: smoothing, baseline correction and peak finding. Such a decomposition enables us to better elucidate the fundamental principles underlying different peak detection algorithms. More importantly, it helps us to clearly identify the differences and similarities among existing peak detection algorithms.
We describe each part in the peak detection process with particular emphasis on their technical details, hoping that this can help readers implement their own peak detection algorithms.
During evaluation, we choose five typical peak detection algorithms to conduct a comparative experimental study. In the experiments, we use both simulation data and real MALDI MS data for performance comparison. The results show that the continuous wavelet-based algorithm provides the best average performance.
The remainder of this paper is organized as follows: section 2 provides details on existing peak detection algorithms and highlights their differences and similarities; section 3 conducts a performance comparison on some typical peak detection algorithms using simulation data and real MALDI MS data; section 4 concludes the paper.
Methods
Peak Detection Process
Usually, peptide signals appear as local maxima (i.e., peaks) in MS spectra. However, detecting these signals remains challenging for the following reasons:
(1) Some peptides with low abundance may be buried by noise, causing a high false positive rate in peak detection.
(2) Chemical, ionization and electronic noise often results in a decreasing curve in the background of MALDI/SELDI MS data, which is referred to as the baseline.
To facilitate peak detection, we often use the framework shown in Figure
Peak detection framework. The input mass spectrum is transformed into a list of peaks.
An example of the peak detection process. (a) A raw spectrum, (b) the spectrum after smoothing, (c) the spectrum after smoothing and baseline correction and (d) final peak detection results with peaks marked as circles.
Categorization
Existing peak detection algorithms can be categorized according to the methods used in each step of the peak detection process. Table
Open source software packages for MS data analysis

| Program   | S          | B      | P          |
|-----------|------------|--------|------------|
| Cromwell  | S7         | B1     | P1, P4     |
| LCMS2D    | -          | B5     | P1, P2     |
| LIMPIC    | S4         | B2     | P1, P3     |
| LMS       | S3         | B2     | P1, P4     |
| MapQuant  | S1, S2, S3 | -      | P7         |
| CWT       | S5         | B4     | P1, P6     |
| msInspect | S6         | B2     | P5         |
| mzMine    | S1, S2     | -      | P1, P2, P8 |
| OpenMS    | S5         | B4     | P7         |
| PROcess   | S1         | B2, B3 | P1, P2, P5 |
| PreMS     | S7         | B1     | P1, P4     |
| XCMS      | S3         | -      | P1, P4     |

Here "S" denotes the smoothing filter, "B" denotes the baseline correction method and "P" denotes the peak finding criterion; "-" means that no smoothing or baseline correction method is used. Cromwell, LIMPIC, LMS, CWT and PROcess are designed for single spectrum peak detection. LCMS2D, MapQuant, msInspect, mzMine, OpenMS and XCMS are designed for LC-MS (Liquid Chromatography Mass Spectrometry) data analysis. PreMS is a GUI (Graphical User Interface) package based on Cromwell.
(1) The algorithms in Table
• The software is mainly designed for MS data preprocessing.
• The software is open source.
• The software is described in a publication.
(2) In Table
• Smoothing
S1: Moving average filter
S2: Savitzky-Golay filter
S3: Gaussian filter
S4: Kaiser window
S5: Continuous Wavelet Transform
S6: Discrete Wavelet Transform
S7: Undecimated Discrete Wavelet Transform
• Baseline Correction
B1: Monotone minimum
B2: Linear interpolation
B3: Loess
B4: Continuous Wavelet Transform
B5: Moving average of minima
• Peak Finding Criterion
P1: SNR
P2: Detection/Intensity threshold
P3: Slopes of peaks
P4: Local maximum
P5: Shape ratio
P6: Ridge lines
P7: Model-based criterion
P8: Peak width
Smoothing Filters
These methods usually apply traditional signal processing techniques such as the moving average filter, the Savitzky-Golay filter and the Gaussian filter. For an input spectrum, we represent it as [
S1: Moving average filter
The output of the moving average filter
where
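As an illustration, a centered moving average filter can be sketched in Python as follows. This is a minimal sketch rather than the implementation used by any of the packages in the table above; the function name `moving_average`, the parameter `half_width` and the truncation of the window at the spectrum edges are assumptions made for the example.

```python
def moving_average(intensities, half_width):
    """Smooth a spectrum with a centered moving average.

    The window covers 2 * half_width + 1 points. Near the spectrum edges the
    window is truncated (an assumption; other edge rules are possible).
    """
    n = len(intensities)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        window = intensities[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single spike is spread evenly over the three points of the window.
print(moving_average([0.0, 0.0, 9.0, 0.0, 0.0], 1))  # [0.0, 3.0, 3.0, 3.0, 0.0]
```

A wider window removes more noise but also flattens narrow peaks, so the window size has to be chosen with the expected peak width in mind.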
S2: Savitzky-Golay filter
Savitzky-Golay filtering can be considered a generalized moving average filter. It performs a least squares fit of a small set of consecutive data points to a polynomial and takes the central point of the fitted polynomial curve as output.
The smoothed data point
where
Smoothing filters. In (a), "PO" stands for the polynomial order of the polynomial fitting in the Savitzky-Golay filter. In (b),
S3: Gaussian filter
After a signal
where
Some researchers use the second derivative of the Gaussian to perform smoothing. Their argument is that the second derivative of the Gaussian can implicitly remove the background when smoothing signals.
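For readers who want to experiment, plain Gaussian smoothing can be sketched as a discrete convolution with a sampled, normalized Gaussian kernel. The sketch below is an illustration only: the names `gaussian_kernel` and `gaussian_smooth`, the truncation of the kernel at three standard deviations and the repetition of edge values are assumptions, and `sigma` is measured in data points rather than in Da.

```python
import math


def gaussian_kernel(sigma, radius):
    """Sampled Gaussian weights on [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(intensities, sigma, radius=None):
    """Convolve a spectrum with a Gaussian kernel (sigma in data points)."""
    if radius is None:
        radius = max(1, int(math.ceil(3 * sigma)))  # assumed truncation at 3 sigma
    kernel = gaussian_kernel(sigma, radius)
    n = len(intensities)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + k, 0), n - 1)  # repeat edge values at the borders
            acc += w * intensities[j]
        smoothed.append(acc)
    return smoothed


spike = [0.0] * 10 + [10.0] + [0.0] * 10
smoothed = gaussian_smooth(spike, sigma=1.0)
print(round(max(smoothed), 3), round(sum(smoothed), 3))
```

Because the kernel is normalized, the total intensity of an interior peak is preserved while its height is reduced and its width increased.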
S4: Kaiser window
After a signal passing a Kaiser window:
where
S5, S6, S7: Wavelet based filters
Wavelet transforms can be grouped into the continuous wavelet transform and the discrete wavelet transform. The continuous wavelet transform can be written as
where
Then
The process of computing DWT. Here "↓ 2" means down sampling by 2,
(1) The signal is decomposed simultaneously by a low-pass filter
(2) The output of
(3) The output of
The advantage of the discrete wavelet transform over the continuous wavelet transform is its efficiency: it computes coefficients only at scales and positions based on powers of two. The redundancy of the continuous wavelet transform, on the other hand, makes the interpretation of MS peak detection easier.
The discrete wavelet transform is shift-variant. To achieve shift invariance, the undecimated discrete wavelet transform has been proposed.
Baseline Correction
Baseline correction is typically a two-step process: (1) estimating the baseline and (2) subtracting the baseline from the signal. In the following, we list details of some commonly used baseline correction methods. Since baseline subtraction is straightforward, we mainly focus on the baseline estimation procedure of each method.
B1: Monotone minimum
This method includes two steps to estimate baseline. The first step is to compute the difference, which can be used to determine the slope of each point. Then, this method starts from the leftmost point
• If the slope of a local point
• If the slope of a local point
• Let
B2: Linear interpolation
Linear interpolation takes two steps to estimate baseline:
• Divide the raw spectrum into small segments and use the mean, the minimum or the median of the points in each segment as the baseline point.
• Generate a baseline for the raw spectrum by linearly interpolating baseline points across all small segments.
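The two steps above can be sketched as follows. This is a hedged illustration, not the code of LIMPIC, LMS, msInspect or PROcess: the function names, the placement of each baseline point at the center of its segment, the constant extension beyond the outermost baseline points and the clipping of negative intensities to zero after subtraction are all assumptions.

```python
def segment_baseline(intensities, segment_size, statistic=min):
    """Estimate a baseline by interpolating between per-segment statistics.

    `statistic` may be min, statistics.mean or statistics.median.
    """
    n = len(intensities)
    anchors = []  # (position, baseline value), one pair per segment
    for start in range(0, n, segment_size):
        segment = intensities[start:start + segment_size]
        center = start + (len(segment) - 1) / 2.0
        anchors.append((center, statistic(segment)))

    baseline = []
    for i in range(n):
        if i <= anchors[0][0]:
            baseline.append(anchors[0][1])      # constant before first anchor
        elif i >= anchors[-1][0]:
            baseline.append(anchors[-1][1])     # constant after last anchor
        else:
            for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
                if x0 <= i <= x1:
                    t = (i - x0) / (x1 - x0)
                    baseline.append(y0 + t * (y1 - y0))
                    break
    return baseline


def subtract_baseline(intensities, baseline):
    """Subtract the baseline, clipping negative values to zero (assumption)."""
    return [max(0.0, x - b) for x, b in zip(intensities, baseline)]


spectrum = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
baseline = segment_baseline(spectrum, segment_size=5)
print(baseline)
print(subtract_baseline(spectrum, baseline))
```

Using the minimum keeps the baseline below most of the signal, whereas the mean or median is less sensitive to isolated low points but can overestimate the baseline in peak-rich segments.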
B3: Loess
First, it divides the raw spectrum into small segments. Then, in each small segment, it computes the quantile. After that, it estimates a predictor in every small segment for baseline estimation. The predictor in each small segment is obtained using the following rules:
• If the intensity of a point
• If the intensity of a point is larger than or equal to the quantile in the segment, then the intensity of corresponding point on predictor equals the quantile.
Baseline is obtained by applying local polynomial regression fitting to the predictor.
B4: Continuous Wavelet Transform
In local regions, baselines are monotonic. Baseline can be modeled as the following function:
where
where
zero. If we use a symmetric wavelet function (like Mexican Hat wavelet), the first item in Equation (8) is also zero. Thus, continuous wavelet transform removes baseline automatically.
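The claim that a symmetric, zero-mean wavelet cancels a locally linear baseline can be checked numerically. The sketch below is illustrative only: the sampling of the Mexican Hat wavelet at integer offsets, the explicit removal of the small residual mean of the sampled kernel, the kernel radius of five times the scale and all function names are assumptions.

```python
import math


def mexican_hat(t):
    """Mexican Hat wavelet, up to a normalization constant."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)


def sampled_wavelet(scale, radius):
    """Sample the wavelet at integer offsets and force an exactly zero mean."""
    values = [mexican_hat(k / scale) for k in range(-radius, radius + 1)]
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def cwt_at_scale(signal, scale):
    """CWT coefficients at one scale, for positions where the kernel fits."""
    radius = int(5 * scale)
    kernel = sampled_wavelet(scale, radius)
    coeffs = []
    for b in range(radius, len(signal) - radius):
        window = signal[b - radius:b + radius + 1]
        coeffs.append(sum(w * x for w, x in zip(kernel, window)) / math.sqrt(scale))
    return coeffs


# A linearly decreasing baseline, alone and with one Gaussian peak at index 100.
baseline = [100.0 - 0.5 * i for i in range(200)]
peak = [50.0 * math.exp(-((i - 100) ** 2) / (2.0 * 2.0 ** 2)) for i in range(200)]
signal = [b + p for b, p in zip(baseline, peak)]

print(max(abs(c) for c in cwt_at_scale(baseline, 2.0)))  # numerically zero
coeffs = cwt_at_scale(signal, 2.0)
print(coeffs.index(max(coeffs)) + int(5 * 2.0))  # position of the largest coefficient
```

The constant part of the baseline vanishes because the kernel sums to zero, and the linear part vanishes because the kernel is symmetric, so only the peak contributes to the coefficients.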
B5: Moving average of minima
This method uses two steps to estimate baseline:
• Estimate a rough baseline by finding the local minimum within a 2 Da window for each point.
• Use a moving window to smooth the rough baseline obtained in the first step.
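A minimal sketch of these two steps is given below. It is not the LCMS2D code: the function name, the centering of the 2 Da window on each point, the use of a moving average as the smoothing window and the parameter `smooth_half_width` (in data points) are assumptions. The m/z values are assumed to be sorted in increasing order.

```python
def moving_minimum_baseline(mz, intensities, window_da=2.0, smooth_half_width=2):
    """Baseline as a smoothed running minimum over a window given in Da."""
    n = len(mz)
    half = window_da / 2.0

    # Step 1: rough baseline, the minimum within +/- half Da of each point.
    rough = []
    left = 0
    right = 0
    for i in range(n):
        while mz[left] < mz[i] - half:
            left += 1
        while right + 1 < n and mz[right + 1] <= mz[i] + half:
            right += 1
        rough.append(min(intensities[left:right + 1]))

    # Step 2: smooth the rough baseline with a moving average.
    baseline = []
    for i in range(n):
        lo = max(0, i - smooth_half_width)
        hi = min(n, i + smooth_half_width + 1)
        baseline.append(sum(rough[lo:hi]) / (hi - lo))
    return baseline


mz = [1000.0 + 0.5 * i for i in range(9)]
intensities = [5, 4, 9, 4, 3, 8, 3, 2, 2]
print(moving_minimum_baseline(mz, intensities, smooth_half_width=0))
print(moving_minimum_baseline(mz, intensities))
```

The two index pointers only move forward, so the rough baseline is found in a single pass over the spectrum apart from the cost of taking each window minimum.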
Peak Finding Criteria
There are many peak detection methods. Most methods detect peaks after smoothing and baseline correction. One special case should be noted: CWT does not have explicit smoothing and baseline correction steps. Du
P1: SNR
SNR stands for signal-to-noise ratio. Different methods define noise differently. Below are two examples:
• Noise is estimated as the 95% quantile of the absolute continuous wavelet transform (CWT) coefficients at scale one within a local window.
• Noise is estimated as the median absolute deviation (MAD) of the points within a window.
P2: Detection/Intensity threshold
This threshold is used to filter out small peaks in flat regions. In these regions, the median absolute deviation (MAD) is quite small, which may result in a large SNR, so using SNR alone may identify many noisy points as peaks.
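The interplay between criteria P1 and P2 can be illustrated with a short sketch that uses the MAD-based noise definition. The function names, the window given in data points, the treatment of a zero MAD as an infinite SNR and the threshold values in the example are assumptions; the intensities are assumed to be baseline-corrected.

```python
import statistics


def mad(values):
    """Median absolute deviation of a sequence of values."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def local_snr(intensities, index, half_window):
    """Intensity at `index` divided by the MAD of the surrounding window."""
    lo = max(0, index - half_window)
    hi = min(len(intensities), index + half_window + 1)
    noise = mad(intensities[lo:hi])
    if noise == 0:
        return float("inf")  # perfectly flat window: SNR is unbounded
    return intensities[index] / noise


def passes_filters(intensities, index, half_window, min_snr, min_intensity):
    """Apply the SNR criterion (P1) together with an intensity threshold (P2)."""
    return (local_snr(intensities, index, half_window) >= min_snr
            and intensities[index] >= min_intensity)


noisy = [1.0, 1.2, 0.9, 1.1, 20.0, 1.0, 0.8, 1.2, 1.1]
flat = [0.01, 0.01, 0.01, 0.01, 0.02, 0.01, 0.01, 0.01, 0.01]

print(passes_filters(noisy, 4, 4, min_snr=3.0, min_intensity=1.0))  # True
print(local_snr(flat, 4, 4))                                        # inf
print(passes_filters(flat, 4, 4, min_snr=3.0, min_intensity=1.0))   # False
```

In the flat example the tiny bump has an unbounded SNR because the MAD is zero, and only the intensity threshold prevents it from being reported as a peak.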
P3: Slopes of peaks
This criterion uses the shape of peaks to remove false peak candidates. In order to compute the left slope and the right slope of a peak, both the left end point and the right end point of the peak need to be identified. A peak candidate is discarded if both its left slope and its right slope are less than a threshold. The threshold is defined as half of the local noise level.
P4: Local maximum
A peak is a local maximum of
P5: Shape ratio
Peak area is computed as the area under the curve within a small distance of a peak candidate. Shape ratio is computed as the peak area divided by the maximum of all peak areas. The shape ratio of a peak must be larger than a threshold.
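A possible reading of this criterion is sketched below. The trapezoidal rule for the area, the definition of "a small distance" as a fixed number of data points on each side (`half_width`) and the function names are assumptions made for illustration.

```python
def peak_area(mz, intensities, index, half_width):
    """Trapezoidal area under the curve within half_width points of a candidate."""
    lo = max(0, index - half_width)
    hi = min(len(mz) - 1, index + half_width)
    area = 0.0
    for i in range(lo, hi):
        area += 0.5 * (intensities[i] + intensities[i + 1]) * (mz[i + 1] - mz[i])
    return area


def filter_by_shape_ratio(mz, intensities, candidates, half_width, min_ratio):
    """Keep candidates whose area relative to the largest area exceeds min_ratio."""
    areas = [peak_area(mz, intensities, c, half_width) for c in candidates]
    largest = max(areas, default=0.0)
    if largest <= 0.0:
        return []
    return [c for c, a in zip(candidates, areas) if a / largest > min_ratio]


mz = [float(i) for i in range(11)]
intensities = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(filter_by_shape_ratio(mz, intensities, [2, 8], half_width=1, min_ratio=0.2))  # [2]
```

Because the ratio is taken relative to the largest peak in the spectrum, the threshold adapts to the overall intensity scale of each spectrum.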
P6: Ridge lines
Ridge lines are obtained in the following steps:
• Carry out the continuous wavelet transform on the raw spectrum. This step produces a 2D coefficient matrix with size of
• Connect nearest local maximal coefficients of adjacent scales to obtain ridge lines. The distance between two adjacent points on a ridge line should be smaller than a window size.
• Use a variable
Ridge lines are used in the following ways:
• False peaks are removed if the length of their ridge line is smaller than a user-supplied threshold.
• The width of a peak is proportional to the scale corresponding to the maximum amplitude on the ridge line.
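The linking step and the length-based filter can be sketched as follows. This is a simplified illustration and not the algorithm of the CWT package: it assumes a precomputed coefficient matrix `coeffs[s][p]` (scale index `s`, position `p`), starts linking from the largest scale, and ends a ridge as soon as no local maximum is found within the window at the next scale, i.e. it does not tolerate gaps. All function names are hypothetical.

```python
def local_maxima(row):
    """Positions of local maxima in one row of the coefficient matrix."""
    return [i for i in range(1, len(row) - 1) if row[i - 1] < row[i] >= row[i + 1]]


def link_ridges(coeffs, window):
    """Link nearest local maxima of adjacent scales into ridge lines.

    Each ridge is a list of (scale_index, position) pairs, ordered from the
    scale where it starts down to the smallest scale it reaches.
    """
    maxima = [local_maxima(row) for row in coeffs]
    top = len(coeffs) - 1
    ridges = [[(top, p)] for p in maxima[top]]
    open_ridges = list(ridges)
    for s in range(top - 1, -1, -1):
        unused = set(maxima[s])
        still_open = []
        for ridge in open_ridges:
            last_pos = ridge[-1][1]
            near = [p for p in unused if abs(p - last_pos) < window]
            if near:
                best = min(near, key=lambda p: (abs(p - last_pos), p))
                ridge.append((s, best))
                unused.discard(best)
                still_open.append(ridge)
            # otherwise the ridge ends here (no gap tolerance in this sketch)
        for p in sorted(unused):  # unclaimed maxima start new ridges
            new_ridge = [(s, p)]
            ridges.append(new_ridge)
            still_open.append(new_ridge)
        open_ridges = still_open
    return ridges


def keep_long_ridges(ridges, min_length):
    """Discard ridges shorter than a user-supplied length threshold."""
    return [r for r in ridges if len(r) >= min_length]


coeffs = [
    [0, 1, 0, 0, 5, 0, 0, 2, 0],  # smallest scale
    [0, 0, 0, 0, 3, 6, 2, 0, 0],
    [0, 0, 0, 1, 7, 1, 0, 0, 0],  # largest scale
]
ridges = link_ridges(coeffs, window=2)
print(keep_long_ridges(ridges, min_length=3))  # [[(2, 4), (1, 5), (0, 4)]]
```

In the toy matrix, the maxima at positions 1 and 7 appear only at the smallest scale and therefore form ridges of length one, which the length filter removes.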
P7: Model-based criterion
The application of this criterion consists of three steps:
• Locate the endpoints of both sides for each peak. The left endpoint and right endpoint of a peak define its peak area.
• Estimate the centroid for each peak. For
• Use a model function to fit peaks.
Different methods choose different model functions to fit peaks. OpenMS
P8: Peak width
The two end points of a peak define its peak area. The intensities of all points within the peak area should be larger than a given noise level. A simple way to locate a peak area is to start from a point with intensity above a given noise level and move to the right until we run into a point with intensity below the noise level.
After the peak end points have been identified, the peak width is computed as the mass difference between the right end point and the left end point. The peak width should be within a given range.
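The scan described above can be sketched in a few lines. The function names and the choice to treat a point exactly at the noise level as being outside a peak are assumptions; the noise level and the width range are user-supplied values.

```python
def peak_regions(intensities, noise_level):
    """Return (left, right) index pairs of runs strictly above the noise level."""
    regions = []
    start = None
    for i, y in enumerate(intensities):
        if y > noise_level and start is None:
            start = i                        # entering a peak area
        elif y <= noise_level and start is not None:
            regions.append((start, i - 1))   # leaving a peak area
            start = None
    if start is not None:                    # peak area runs to the last point
        regions.append((start, len(intensities) - 1))
    return regions


def filter_by_width(mz, regions, min_width, max_width):
    """Keep regions whose width in m/z units lies within [min_width, max_width]."""
    return [(l, r) for l, r in regions if min_width <= mz[r] - mz[l] <= max_width]


mz = [0.5 * i for i in range(12)]
intensities = [0, 5, 0, 0, 4, 6, 7, 5, 0, 3, 3, 0]
regions = peak_regions(intensities, noise_level=1)
print(regions)                                   # [(1, 1), (4, 7), (9, 10)]
print(filter_by_width(mz, regions, 0.5, 2.0))    # [(4, 7), (9, 10)]
```

The single-point spike at index 1 has zero width and is rejected, which is how this criterion suppresses isolated noise spikes.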
Results and discussion
Data Description and Algorithm Selection
For the comparison, we use one group of simulation data and one group of real MALDI MS data. The low-resolution simulation data is downloaded from the website of the M. D. Anderson Cancer Center
Data and results. This file lists the data used in this paper and the results for the experiments.
Software programs for LC-MS data analysis consider additional information along the LC axis during peak detection. In order to obtain a fair comparison, here we focus only on single spectrum based peak detection algorithms. According to this criterion, only five algorithms in Table
Evaluation Criteria
In simulation data, the list of ground-truth peaks is the input before data generation. In real data, the trypsin-digested theoretical peaks (without adding isotope masses) are used as the ground-truth peaks. In both cases, a detected peak is labeled as a false peak if its mass is not within the ± 1% error range of the expected
It is difficult for two algorithms to produce the same false discovery rate. We therefore divide the false discovery rate into small segments that have clear interpretations. For example, the FDR range [0, 0.1) reveals an algorithm's ability to recognize the most abundant (based on SNR) peaks in the spectrum. Each time we obtain a peak list, we compute both its false discovery rate and its sensitivity. We group the sensitivity values whose false discovery rates fall into the same segment and compute their average within each group. This average sensitivity is used to evaluate the performance of an algorithm in that segment.
As the ground truth is known for both simulation data and real data in this paper, the ROC curve is probably the most informative measure for evaluating different peak detection methods. However, the false discovery rates of wavelet-based methods are limited to a relatively small range across all possible parameter settings. On the one hand, this reflects the robustness of wavelet-based methods. On the other hand, it makes the ROC curve difficult to plot for wavelet-based methods. We therefore use the following alternative for performance comparison: we select four regions of false discovery rate, [0, 0.1), [0.2, 0.3), [0.4, 0.5) and [0.6, 0.7), and compare the sensitivity of different algorithms in these regions using boxplots. This strategy provides an overall performance evaluation, since it is roughly a "discrete" ROC curve in four regions. Moreover, the boxplots illustrate the performance variance of the different algorithms.
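The evaluation procedure can be sketched as follows. The sketch assumes the usual definitions, which are not spelled out in the text above: sensitivity is the fraction of ground-truth peaks matched by at least one detected peak, and the false discovery rate is the fraction of detected peaks that match no ground-truth peak. The function names and the example values are hypothetical.

```python
FDR_REGIONS = [(0.0, 0.1), (0.2, 0.3), (0.4, 0.5), (0.6, 0.7)]


def sensitivity_and_fdr(detected_mz, true_mz, rel_tol=0.01):
    """Score one peak list against the ground truth with a +/- 1% tolerance."""
    def is_match(d, t):
        return abs(d - t) <= rel_tol * t

    true_detected = sum(any(is_match(d, t) for t in true_mz) for d in detected_mz)
    true_found = sum(any(is_match(d, t) for d in detected_mz) for t in true_mz)
    sensitivity = true_found / len(true_mz) if true_mz else 0.0
    fdr = ((len(detected_mz) - true_detected) / len(detected_mz)
           if detected_mz else 0.0)
    return sensitivity, fdr


def average_sensitivity_by_region(runs):
    """Average sensitivity over runs whose FDR falls in the same region.

    `runs` is an iterable of (sensitivity, fdr) pairs, one per parameter
    setting. Runs whose FDR lies outside every region are ignored.
    """
    grouped = {region: [] for region in FDR_REGIONS}
    for sensitivity, fdr in runs:
        for lo, hi in FDR_REGIONS:
            if lo <= fdr < hi:
                grouped[(lo, hi)].append(sensitivity)
    return {region: sum(v) / len(v) for region, v in grouped.items() if v}


truth = [1000.0, 2000.0, 3000.0, 4000.0]
detected = [1001.0, 2005.0, 2500.0]
print(sensitivity_and_fdr(detected, truth))

runs = [(0.5, 0.05), (0.7, 0.08), (0.9, 0.25), (0.95, 0.15)]
print(average_sensitivity_by_region(runs))
```

Collecting one average per region and per data group yields the values that are summarized as boxplots in the comparison.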
Different programs have different parameters to adjust when performing peak detection. Since it is very time consuming to optimize each algorithm using all potential combinations of different parameters, we mainly test combinations of parameters that are related to peak finding and use default values for other parameters. Please refer to additional file
Comparison of Algorithms Using Simulation Data
The simulation data is generated using a model that incorporates some characteristics of real MALDI-TOF mass spectrometers. The simulation engine takes a peak list with both
This data set has 25 groups of data, and each group has 100 spectra. Each spectrum has a true peak list provided by the data set. We directly use these peak lists as ground truth in our experiment. We use different parameter settings to perform peak detection repeatedly on the 100 spectra in the same group, and then compute the average sensitivity over the runs whose false discovery rates fall in the same small region. For each algorithm, we obtain 25 average values of sensitivity in each small region.
Figure
Performance of different algorithms at different false discovery rates using simulation data. In this figure, (a), (b), (c) and (d) show the average sensitivity when false discovery rate is around 0.05, 0.25, 0.45 and 0.65, respectively.
Comparison of Algorithms Using Aurum Data
The Aurum dataset is a high-resolution data set that contains spectra from 246 known, individually purified and trypsin-digested protein samples acquired with an ABI 4700 MALDI TOF/TOF mass spectrometer. In the experiments, we do not use MS/MS data and limit our analysis to MS spectra. For each MS spectrum, we generate the ground-truth peaks in silico using the following parameters: trypsin digestion with a maximum of one missed cleavage, monoisotopic peaks and single charge state. We also consider some typical PTMs (post-translational modifications): carboxyamidomethyl cysteine as the fixed modification and oxidation of methionine as the variable modification. Note that peptides having missed cleavages and PTMs are also used to generate ground-truth peaks. After obtaining the theoretical peak list, we merge identical peaks into one peak and delete peaks whose
Figure
Performance of different algorithms at different false discovery rates using Aurum data. In this figure, (a), (b), (c) and (d) show the average sensitivity when false discovery rate is around 0.05, 0.25, 0.45 and 0.65, respectively.
Report of Running Time
For high-throughput data analysis, high efficiency is always desirable. In Table
Average processing time per spectrum using different programs

| Program  | Platform | Simulation data (s) | Real data (s) |
|----------|----------|---------------------|---------------|
| Cromwell | Matlab   | 0.21                | 1.71          |
| LMS      | Matlab   | 0.50                | 3.23          |
| LIMPIC   | Matlab   | 1.74                | 1.59          |
| CWT      | R        | 3.31                | 11.00         |
| PROcess  | R        | 4.56                | 33.21         |
Parameter Tuning
When the false discovery rate is 5%, half of true peaks are not detected; when 90% of true peaks are detected, many other identified peaks are noise peaks. We use the
The larger
Parameter setting. This file gives parameters settings in experiments for each program compared in this work.
Conclusion
In this paper, we provide a comprehensive survey of existing peak detection methods. In addition, we compare performance of five single spectrum based peak detection algorithms. Results show that CWT provides the best performance.
The reasons that CWT provides the best performance are twofold:
(1) CWT optimally characterizes the shape of peaks in mass spectra. In a real spectrum, peak width varies a lot
(2) True peptide-related peaks are more consistent at multiple scales than false positive peaks, which are mainly caused by high-frequency noise. The concept of forming ridge lines in CWT effectively removes false positive peaks.
The algorithms studied in this paper mainly focus on how to identify peak positions correctly. They do not address how to compute peak abundance, which is very important in some applications (e.g. protein quantification). In our future work, we plan to study the issue of peak detection in LC-MS data. It will be interesting to see whether additional information along the LC axis can help to improve peak detection results.
Authors' contributions
CY performed the implementations and drafted the manuscript. ZH participated in the categorization of related work. WY conceived the study and finalized the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We are grateful to the anonymous reviewers for their valuable comments and suggestions, which greatly helped us improve the manuscript. This work was supported with the GRF Grant 621707 from the Hong Kong Research Grant Council, a research proposal competition award RPC07/08.EG25 and a postdoctoral fellowship award from the Hong Kong University of Science and Technology.