Abstract
Background
Liquid chromatographymass spectrometry (LCMS) is one of the major techniques for the quantification of metabolites in complex biological samples. Peak modeling is one of the key components in LCMS data preprocessing.
Results
To quantify asymmetric peaks with high noise level, we developed an estimation procedure using the biGaussian function. In addition, to accurately quantify partially overlapping peaks, we developed a deconvolution method using the biGaussian mixture model combined with statistical model selection.
Conclusions
Using extensive simulations and real data, we demonstrated the advantage of the biGaussian mixture model over the Gaussian mixture model and the method of kernel smoothing combined with signal summation in peak quantification and deconvolution. The method is implemented in the R package apLCMS: http://www.sph.emory.edu/apLCMS/ webcite.
Background
Liquid chromatographymass spectrometry (LCMS) is one of the major techniques in metabolomics [14], as well as a key component in MSbased proteomics [5,6]. The preprocessing of LCMS data involves a complex workflow including noise reduction, peak identification and quantification, retention time correction, peak alignment and weak signal recovery [7,8]. We have previously reported the apLCMS package which carries out the entire workflow with new algorithms specifically designed for LCMS data with high mass resolution [9]. Highresolution mass spectrometry, such as Fourier transform mass spectrometry (FTMS), allows the separation of m/z values at or below 10 ppm level [10], resulting in good separation between metabolites. The high resolution facilitates the use of empirical peak shape models to accurately quantify peaks, which is critical in biomarker studies where the relative quantities of metabolites are compared across samples.
Currently, LCMS peaks are quantified either by summation of ion count, or using symmetric peak shape models, such as the Gaussian function [79]. Both methods have serious drawbacks. The method of ion count summation results in biased quantification when the ion trace has missing intensities, which often occurs in highresolution LCFTMS data. The Gaussian peak model can result in bias in peak location estimation and peak quantification when the peaks are asymmetric. Hence asymmetric peak models are necessary for the accurate quantification and identification of metabolites. In addition, some metabolites may share m/z and partially overlap in retention time, which necessitates the development of deconvolution procedures.
A large number of empirical peak shape models have been developed for asymmetric peaks in chromatography, most of which were summarized by Di Marco and Bombi [11]. For a few of the models, advanced deconvolution procedures are available [1217]. Examples include the nonlinear deconvolution based on Powell's method [18] for the polynomialmodified Gaussian (PMG) model [16,19], regressionbased methods for the parabolicLorentzian modified Gaussian (PLMG) model [17], and various deconvolution methods for the exponentially modified Gaussian (EMG) model [12,13].
The estimating procedures for asymmetric peak models in chromatographic data generally assume low noise level. In LCMS data, the noise level is magnitudes higher, and the intensity observations are obtained at much fewer time points. Thus a simple, robust model that can be fitted using a limited number of intensity observations is necessary. The biGaussian peak model (Figure 1a) has been described in the context of chromatography [11,20]. Empirical and theoretical results have shown that the biGaussian model is well suited for asymmetric peaks [20,21]. With four parameters and a simple functional form that's amenable to maximum likelihood estimation, the biGaussian model is suitable for LCMS data. A parameter estimation method for the biGaussian model has been developed in the openMS environment [22]. The method relies on the observed maximum intensity for the determination of the peak summit location, which could lead to inaccurate estimates when the signaltonoise ratio is low. Currently no deconvolution method is available for the biGaussian mixture model.
Figure 1. The characteristics of the biGaussian function. (a) the four parameters that define the biGuassian function; (b) The function A(τ)B(τ) used in our estimation. Different σ_{1}/σ_{2 }ratios are plotted.
In this paper, we first develop a new algorithm to fit the biGaussian function to noisy ion traces. Secondly, we develop a deconvolution procedure for partially overlapping peaks using the biGaussian mixture model. Thirdly, the low signaltonoise ratio causes uncertainty in the number of components of the mixture model. We address this issue by a procedure involving statistical model selection. All the algorithms described here have been implemented to improve the apLCMS package for highresolution LCMS data analysis [9].
Methods
The biGaussian peak model
The model involves four parameters  the location of the peak summit α, the standard deviation of the half Gaussian function to the left of the summit σ_{1}, the standard deviation of the half Gaussian function to the right of the summit σ_{2}, and the scaling factor δ (Figure 1a). The intensity as a function of retention time is modeled by:
The areas of the two regions to the left/right of the peak summit are δσ_{1}/2 and δσ_{2}/2, respectively.
The estimation procedure for a single peak
For the estimation of the parameters from the observed data, the most important is to find the peak summit α. When the data is noisy, we cannot rely on the observed high point as the estimate. Rather, information from the entire ion trace must be used to estimate the parameter. We define two quantities as a function of retention time τ. The first one is the logratio of the areas to the left and right of τ:
The second quantity is the logratio of the cuberoot of the noncentered second moments of the left and right truncated portions of the function:
When τ = α, the quantity A(α) is the log ratio between the areas of the two half Guassian functions, which is equal to the log ratio between the two standard deviations; B(α) is the log ratio between the cubic roots of the variances of the two Gaussian functions multiplied by their scaling factors, which is also equal to the log ratio between the two standard deviations. Thus τ = α is a root for A(τ)B(τ) = 0.
Simulations using a reasonable range of σ_{1}/σ_{2 }showed that A(τ)B(τ) is a monotone function (Figure 1b), which indicates the solution is unique.
In LCMS data, the intensity values {x_{1},x_{2}, ..., x_{n}} are collected at discrete time points {t_{1},t_{2}, ..., t_{n}}, which means the function g(t) is approximated by a step function. We first define the step sizes of the function:
We approximate A(τ) by
And B(τ) by
Because the data are generated from discrete time points, we first find for all the middle points between adjacent t's. Then we interpolate between the largest point below zero and the smallest point above zero to find . After finding , estimating σ_{1 }and/σ_{2 }becomes straightforward:
To estimate the scaling factor δ, we first find the fitted values without scaling:
Then the estimate is found by a weighted average of the ratio between the observed intensities and the fitted values without scaling. Because ion counts are highly skewed, the calculation is carried out in log scale, giving higher weights to points closer to the summit of the curve,
Fitting the biGaussian mixture model
In LCMS data from complex samples, e.g. serum or urine, sometimes peaks sharing m/z value may also partially overlap in the retention time dimension. Here we propose an EMlike iterative algorithm to fit partially overlapping asymmetric peaks. The expectationmaximization (EM) algorithm finds maximum likelihood estimates of parameters in the presence of latent variables. It iterates between finding the expectation of the loglikelihood with regard to the latent variables given the current estimate of the parameters, and finding the parameters that maximize the likelihood [23]. In our application, the parameter estimation is not obtained using the maximum likelihood procedure, and an extra step of eliminating components that explain too small a proportion of the data is added to deal with the noise.
(1) Fit a kernel smoother to the data {(t_{i},x_{i})}. Split the data points into groups at the valleys of the smoother. For every group j of the data points, use the smoother peak as the initial estimate of peak summit , and estimate , , and using the procedure in the previous subsection. More discussion about smoother parameter selection is presented in the next subsection.
(2) Iterate until convergence:
(2.1) Find the fitted values at every t_{i }for component j,
(2.2) For every component j, find the proportion of data explained by the component:
Remove component j if Q_{j }is smaller than a threshold.
(2.3) For every time point, we find the expected proportion of the observed intensities that belong to each component j, denoted q_{ij}.
Then for every component j, reestimate from the data {(t_{i},x_{i}q_{ij})}, using the procedure described in the previous subsection.
Choosing the number of components of the mixture by statistical model selection
In the previous subsection, the kernel smoother is employed to obtain an initial estimate of the number of components and the parameters. When the data is noisy, changing the window size of the kernel smoother could result in different numbers of components of the mixture. To find the best model to explain the data, we utilize statistical model selection based on the Bayesian information criterion (BIC) [24]. BIC is one of the most popular criteria for the selection among a set of parametric models with different number of parameters. It penalizes the number of free parameters. The model with lower BIC value is preferred.
First, a reasonable range of the windowsize parameter is determined based on biological/chemical considerations about potential peak width. It can be quite lenient to cover a wide range of potential values. Several window size values spanning the range are selected. Starting from each of the windowsize value, we compute the kernel smoother, and run the EMlike algorithm described in the previous subsection. The corresponding BIC value is computed by:
where N is the total number of time points with observed intensities, and J is the number of biGaussian components in the model. The model with the lowest BIC value is selected. In the setting of LCMS data, this is a heuristic criterion, because the data we observe are not random samples, and the Gaussian error assumption of BIC may not be satisfied. We justify the usage of the criterion by extensive simulations.
Simulations
To assess the performance of the proposed method, extensive simulations were conducted. The biGaussian mixture model with BIC model selection was compared with two other methods  the Gaussian mixture model [9] with BIC model selection, and the peak quantification based on kernel smoother and signal summation.
The data were generated from a 3component biGaussian mixture model, with different levels of peak asymmetry, noise and peak overlap. Given the parameters (Additional file 1: Table S1), the data from each component are generated from the biGaussian functions:
Additional file 1. Supporting Material. The file contains details of the simulation study, additional results of the simulation study, extra figure illustrating the method workflow, and description of the likelihoodbased estimation procedure of the biGaussian model.
Format: DOC Size: 1.4MB Download file
This file can be viewed with: Microsoft Word Viewer
After summing the intensities from the components, multiplicative noise was added to the data. In addition, a portion of the values were turned into zero to mimic the behavior of real highresolution LCMS data:
The parameter ξ is the standard deviation of the noise added at the logscale. Three levels of ξ were used in the simulations (0.2, 0.4, 0.6). At the high noise level of ξ = 0.6, 50% of the intensity values were changed by 1.5 fold or more, and 25% were changed by two fold or more. The parameter θ controls the percentage of values turned into zero using random samples from the binomial distribution. Three levels of θ were used (0, 0.25, 0.5). The value of θ directly corresponds to the proportion of intensities turned into zero. In addition, various levels of peak asymmetry and overlap were considered (Additional file 1: Table S1). In total 864 parameter combinations were tested. At each parameter setting, the simulation was performed 100 times. For detailed information, please refer to Additional file 1.
Results
Simulation results
First, we compared the rate of successfully selecting the correct number of components between the biGaussian mixture model and the Gaussian mixture model (Figure 2). The method of kernel smoother combined with signal summation wasn't compared because no BIC model selection could be performed using this method, which is a shortcoming in itself. In summarizing the results, the level of peak overlap is defined by the ratio r between the lowest point of the valley between two peaks and the lower of the peak summits, before noise is introduced. Because two valleys exist between the three simulated peaks, the larger r value is taken for each simulation setting. For the purpose of plotting, we roughly divide the amount of overlap into four categories: little overlap (r < 0.2), moderate overlap (0.2 ≤ r < 0.5), strong overlap (0.5 ≤ r < 0.75), and severe overlap (r ≥ 0.75). The level of overlapping is colorcoded. The point size corresponds to the three levels of noise added to the data (ξ = 0.2,0.4, 0.6). The fill of the point represents the proportion of missing values (0%, 25% and 50%).
Figure 2. Comparison of the rate of successfully selecting the correct number of components between the biGaussian mixture model and the Gaussian mixture model. Each subplot corresponds to a different degree of asymmetry, as shown in the titles of the subplots (ratios between the right and left standard deviations). Each dot represents a simulated situation. The values were obtained by averaging the results from 100 simulations. The color represents the level of overlaps between the simulated peaks. The size of the dot represents the amount of noise added to the data. The fill of the dot represents the percentage of values missing in the ion trace.
When the peaks were symmetric (Figure 2, upperleft panel), the Gaussian mixture model showed a slight advantage when the overlapping was strong (red and magenta points). When the peaks were asymmetric (Figure 2, upperright and lowerleft panels), the biGaussian mixture model showed a clear advantage. When the peak overlapping was not strong (blue and green points), the success rate of the biGaussian mixture model was mostly higher than 90%, even when the noise level was high. When there was strong peak overlapping and the noise level was high (larger sized red and magenta points), the rate of successfully selecting the correct number of components was reduced for both the biGaussian mixture model and the Gaussian mixture model.
Secondly, we compared the percentage error in peak area quantification between the three methods, when all three methods were able to identify the correct number of components (not necessarily the best BIC value). Compared to the Gaussian mixture model, the biGaussian mixture model yielded much smaller errors when the peaks were asymmetric (Figure 3, upperright and lowerleft panels). Compared to the method of kernel smoother combined with signal summation, the biGaussian mixture model showed a clear advantage when some of the intensity values were missing (filled points) (Figure 4). When the peak overlapping was not strong (blue and green points), the error of the biGaussian mixture model was mostly under 15%. Further comparisons on peak location and peak spread estimation are presented in Additional file 1. The biGaussian mixture model also clearly outperformed the other two methods in those aspects (Additional file 1: Fig. S2~S4).
Figure 3. Comparison of the accuracy in peak size quantification between the biGaussian mixture model and the Gaussian mixture model. Each subplot corresponds to a different degree of asymmetry, as shown in the titles of the subplots (ratios between the right and left standard deviations). Each dot represents a simulated situation. The values were obtained by averaging the results from 100 simulations. The color represents the level of overlaps between the simulated peaks. The size of the dot represents the amount of noise added to the data. The fill of the dot represents the percentage of values missing in the ion trace.
Figure 4. Comparison of the accuracy in peak size quantification between the biGaussian mixture model and the method of kernel smoother combined with signal summation. Each subplot corresponds to a different degree of asymmetry, as shown in the titles of the subplots (ratios between the right and left standard deviations). Each dot represents a simulated situation. The values were obtained by averaging the results from 100 simulations. The color represents the level of overlaps between the simulated peaks. The size of the dot represents the amount of noise added to the data. The fill of the dot represents the percentage of values missing in the ion trace.
Analysis of highresolution LCMS data
We implemented the new algorithms in the apLCMS package for LCMS metabolomics data analysis [9]. When analyzing the example dataset at the apLCMS website, which contains 8 highresolution LCMS profiles, we observed many examples where the peaks were clearly asymmetric. We show two examples in Figure 5, where both peak asymmetry and peak overlapping exist. In both examples, the inability of the Gaussian curve to fit asymmetric peaks left residuals to be fitted by the smaller peaks, which caused the smaller fitted peaks to deviate from the local peak shape (Figure 5, lower panels). Clearly the biGaussian mixture model fitted the data much better (Figure 5, upper panels).
Figure 5. Comparison of the fit of the biGaussian mixture model and the Gaussian mixture model to real asymmetric peaks. (a) The ion trace at m/z = 446.8913. (b) The ion trace at m/z = 301.1409. Colored curves: fitted components; black curve: summation of the signal from all the components. Upperpanel: the biGaussian mixture fit; lowerpanel: the Gaussian mixture fit.
At the global level, in 21.0% of the ion traces, the biGaussian mixture model and the Gaussian mixture model selected different number of components. Among these cases, the biGaussian mixture model fitted the data with smaller number of components 93.7% of the time. In addition, it achieved better BIC scores in 66.2% of the cases. Overall, in 59.4% of all the ion traces, the biGaussian (mixture) model achieved better BIC values compared to the Gaussian (mixture) model. Considering the biGaussian model is penalized more heavily by BIC with the extra parameter, which puts it in disadvantage when the peak is close to symmetric, these results indicate that the biGaussian peak model is indeed better suited for the data.
Discussions
Compared to the Gaussian peak shape model, which has been used in some modelbased data processing pipelines [8,9], the biGaussian model provides extra flexibility to fit asymmetric peaks, while suffering little disadvantage when the true peak shape is symmetric. Compared to the method of kernel smoother combined with signal summation, fitting a biGaussian mixture model disentangles partially overlapping peaks, and copes with the issue of missing intensities in highresolution LCFTMS data much better. The biGaussian model is among many asymmetric peak models in chromatographic peak modeling. A large number of other models could potentially be used for the processing of LCMS data [11]. Advanced deconvolution methods already exist for a few of the models [1217,19]. However, modifications to the existing estimation procedures may be necessary to suit the characteristics of LCMS data, i.e. sparser data points and much higher noise.
In this study, the parameter estimation for a single peak is done by numerically solving an equation that involves the zero and second moments of the truncated distribution functions. An alternative route is to use the maximum likelihood method. We developed a likelihoodbased algorithm (Additional file 1: Section S4) and compared its performance with the momentbased method in simulations. The likelihoodbased algorithm was slower in computation due to its iterative nature, and it didn't achieve better estimation accuracy over the momentbased method. Under the settings of our simulations, five window size values were used for the initiation of the model selection process. With both methods programmed in R, using a single core of a 2.26 GHz Xeon CPU, the median CPU time for solving the threecomponent mixture was 0.15 second for the momentbased method, and 0.33 second for the likelihoodbased method.
Conclusion
In this manuscript, we presented a method to fit the biGaussian curve to noisy LCMS ion traces, as well as an EMlike algorithm paired with BIC model selection for the deconvolution of partially overlapping peaks. Currently, the methods were implemented in the apLCMS package for the preprocessing of highresolution LCMS data. The same modeling procedure can be adapted easily into other pipelines for the quantification of both metabolites and peptides.
Authors' contributions
TY designed the study, developed the methods, conducted data analysis, and drafted the manuscript. HP developed the likelihoodbased estimation procedure, and drafted the corresponding method description (Additional file 1: Section S4.1). Both authors have read and approved the final manuscript.
Acknowledgements
This research was partially supported by NIH grants 1P01ES01673101, 2P30A1050409 and 1UL1RR02500801, and a grant from the University Research Committee of Emory University. The authors thank Dr. Dean Jones, Dr. Youngja Park, and Ms. Jennifer Johnson for helpful discussions.
References

Issaq HJ, Van QN, Waybright TJ, Muschik GM, Veenstra TD: Analytical and statistical approaches to metabolomics research.
J Sep Sci 2009, 32(13):21832199. PubMed Abstract  Publisher Full Text

Dettmer K, Aronov PA, Hammock BD: Mass spectrometrybased metabolomics.
Mass Spectrom Rev 2007, 26(1):5178. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dunn WB: Current trends and future requirements for the mass spectrometric investigation of microbial, mammalian and plant metabolomes.
Phys Biol 2008, 5(1):11001. Publisher Full Text

Griffin JL, Kauppinen RA: A metabolomics perspective of human brain tumours.
Febs J 2007, 274(5):11321139. PubMed Abstract  Publisher Full Text

Chen G, Pramanik BN: Application of LC/MS to proteomics studies: current status and future prospects.
Drug Discov Today 2009, 14(910):465471. PubMed Abstract  Publisher Full Text

Ahmed FE: Utility of mass spectrometry for proteome analysis: part II. Ionactivation methods, statistics, bioinformatics and annotation.
Expert Rev Proteomics 2009, 6(2):171197. PubMed Abstract  Publisher Full Text

Katajamaa M, Oresic M: Data processing for mass spectrometrybased metabolomics.
J Chromatogr A 2007, 1158(12):318328. PubMed Abstract  Publisher Full Text

Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification.
Anal Chem 2006, 78(3):779787. PubMed Abstract  Publisher Full Text

Yu T, Park Y, Johnson JM, Jones DP: apLCMSadaptive processing of highresolution LC/MS data.
Bioinformatics 2009, 25(15):19301936. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ahmed FE: Utility of mass spectrometry for proteome analysis: part I. Conceptual and experimental approaches.
Expert Rev Proteomics 2008, 5(6):841864. PubMed Abstract  Publisher Full Text

Di Marco VB, Bombi GG: Mathematical functions for the representation of chromatographic peaks.
J Chromatogr A 2001, 931(12):130. PubMed Abstract  Publisher Full Text

Felinger A: Deconvolution of Overlapping Skewed Peaks.
Analytical Chemistry 1994, 66(19):30663072. Publisher Full Text

Johansson M, Berglund M, Baxter DC: Improving Accuracy in the Quantitation of Overlapping, Asymmetric, Chromatographic Peaks by Deconvolution  Theory and Application to Coupled GasChromatography AtomicAbsorption Spectrometry.
Spectrochim Acta B 1993, 48(11):13931409. Publisher Full Text

Papai Z, Pap TL: Determination of chromatographic peak parameters by nonlinear curve fitting using statistical moments.
Analyst 2002, 127(4):494498. PubMed Abstract  Publisher Full Text

Youn DY, Yun SJ, Jung KH: Improved Algorithm for Resolution of Overlapped Asymmetric Chromatographic Peaks.
J Chromatogr 1992, 591(12):1929. Publisher Full Text

TorresLapasio JR, GarciaAlvarezCoque MC, BaezaBaeza JJ: Global treatment of chromatographic data with MICHROM.
Anal Chim Acta 1997, 348(13):187196. Publisher Full Text

Caballero RD, GarciaAlvarezCoque MC, BaezaBaeza JJ: ParabolicLorentzian modified Gaussian model for describing and deconvolving chromatographic peaks.
Journal of Chromatography A 2002, 954(12):5976. PubMed Abstract  Publisher Full Text

Powell MJD: A Method for Minimizing a Sum of Squares of NonLinear Functions without Calculating Derivatives.

TorresLapasio JR, BaezaBaeza JJ, GarciaAlvarezCoque MC: A model for the description, simulation, and deconvolution of skewed chromatographic peaks.
Analytical Chemistry 1997, 69(18):38223831. Publisher Full Text

Buys TS, De Clerk K: BiGaussian fitting of skewed peaks.
Analytical Chemistry 1972, 44(7):12731275. Publisher Full Text

Felinger A: Data Analysis and Signal Processing in Chromatography. 1st edition. Amsterdam: Elsevier Science; 1998.

Sturm M, Bertsch A, Gropl C, Hildebrandt A, Hussong R, Lange E, Pfeifer N, SchulzTrieglaff O, Zerck A, Reinert K, et al.: OpenMS  an opensource software framework for mass spectrometry.
BMC Bioinformatics 2008, 9:163. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data Via Em Algorithm.

Schwarz G: Estimating Dimension of a Model.
Ann Stat 1978, 6(2):461464. Publisher Full Text