Abstract
Background
Nuclear magnetic resonance spectroscopy (NMR) is a powerful technique to reveal and compare quantitative metabolic profiles of biological tissues. However, chemical and physical sample variations make the analysis of the data challenging, and typically require the application of a number of preprocessing steps prior to data interpretation. For example, noise reduction, normalization, baseline correction, peak picking, spectrum alignment and statistical analysis are indispensable components in any NMR analysis pipeline.
Results
We introduce a novel suite of informatics tools for the quantitative analysis of NMR metabolomic profile data. The core of the processing cascade is a novel peak alignment algorithm, called hierarchical Clusterbased Peak Alignment (CluPA). The algorithm aligns a target spectrum to the reference spectrum in a topdown fashion by building a hierarchical cluster tree from peak lists of reference and target spectra and then dividing the spectra into smaller segments based on the most distant clusters of the tree. To reduce the computational time to estimate the spectral misalignment, the method makes use of Fast Fourier Transformation (FFT) crosscorrelation. Since the method returns a highquality alignment, we can propose a simple methodology to study the variability of the NMR spectra. For each aligned NMR data point the ratio of the betweengroup and withingroup sum of squares (BWratio) is calculated to quantify the difference in variability between and within predefined groups of NMR spectra. This differential analysis is related to the calculation of the Fstatistic or a oneway ANOVA, but without distributional assumptions. Statistical inference based on the BWratio is achieved by bootstrapping the null distribution from the experimental data.
Conclusions
The workflow performance was evaluated using a previously published dataset. Correlation maps, spectral and grey scale plots show clear improvements in comparison to other methods, and the downtoearth quantitative analysis works well for the CluPAaligned spectra. The whole workflow is embedded into a modular and statistically sound framework that is implemented as an R package called "speaq" ("spectrum alignment and quantitation"), which is freely available from http://code.google.com/p/speaq/ webcite.
Background
Nuclear magnetic resonance spectroscopy (NMR) is a powerful and widely applied analytical highthroughput technique to reveal and compare the quantitative metabolic profile of a given tissue in relation to various environmental and clinical parameters. A typical NMR spectrum is composed out of an xaxis, which indicates the resonance frequencies of the observed molecule, and a yaxis, which denotes the corresponding intensities, i.e., abundance. To analyse experimental NMR datasets, multivariate methods such as principle components analysis (PCA) or univariate techniques like Studentt test are commonly applied. However, chemical and physical sample variations due to, among others, differences in pH, temperature, ion content and the concentration of metabolites, make the analysis of the data challenging. To address these challenges, several preprocessing steps are commonly applied, including noise reduction, normalization, baseline correction, peak picking and spectrum alignment, prior to statistical analysis.
A crucial and often depreciated aspect in this process is peak alignment, which aims to compensate for small variations in the position of corresponding peaks between spectra. A number of spectral alignment approaches have previously been proposed. However, most of them come with particular disadvantages. For example, some methods use dynamic programming, like Correlation Optimized Warping (COW) and Dynamic Time Warping (DTW) [1,2]. Due to their computational complexity an alignment task based on these techniques may take hours. Several authors worked towards solutions to speed up this alignment process [3] used a Fast Fourier Transformation (FFT) crosscorrelation engine to improve the alignment speed (PAFFT). They also introduced an advanced extension, called recursive peak alignment by FFT (RAFFT), which recursively divides the spectrum into meaningful segments and aligns them until a certain degree of goodness is obtained. Some advanced peak picking approaches are Recursive Segmentwise Peak Alignment (RSPA) [4] and Generalized Fuzzy Hought Transform (GFHT) [5]. Other authors applied search algorithms to peak alignment, such as genetic algorithms in PAGA [6] and beam searching in PABS [7]. Recently [8], introduced the intervalcorrelationshifting (Icoshift) algorithm, which aligns spectra by maximizing the crosscorrelation between userdefined intervals.
Another approach that is commonly employed for the peak alignment of mass spectral data is based on hierarchical clustering and could be applied as well on NMR spectral data [914]. Most of these methods apply hierarchical clustering to the entire collection of all peaks from the individual spectra and "cut off" the resulting dendrogram at a suitable height to produce a number of clusters used for alignment. This approach works well on NMR data that is already calibrated to some extent. However, in some datasets, the peak positions of chemical resonances are significantly shifted between the samples. This strong shift could make the NMR spectra unclear to separate, which may lead to the wrong clustering, i.e. alignment, of peaks. The effect of strongly shifted spectra also challenges the methods based on spectral binning, like COW and Icoshift, because peaks could mistakenly be assigned to the wrong bins.
To address the problems with misaligned spectra, we first focus on the development of a robust and highly confident alignment algorithm. The method is based on a peakpicking approach for NMR spectra, called hierarchical Clusterbased Peak Alignment (CluPA). The alignment is embedded in a workflow (called speaq: "spectrum alignment and quantitation") that starts with deriving the peak list of each spectrum separately, followed by selecting the reference spectrum to which other spectra are aligned.
The highquality of the alignment introduced here allows for the use of standard and stable components up and downstream the processing pipeline. For example, prior to the peakpicking, noise reduction is accomplished by wavelet filtering. Further more, differential regions in an NMR spectra between two conditions (control versus case) are determined based on the ratio of the betweengroup to withingroup sums of squares (BWratio), as proposed by [15]. The BWratio is calculated for each data point in the NMR spectra. Because the spectra are well aligned, no variability, due to shifted peaks, is added to the BWstatistic.
Implementation
The complete workflow combines four major steps, illustrated in Figure 1 and itemized below. Each of these major steps is discussed in more detail in separate subsections. The core of the workflow, is a novel approach for spectral alignment.
Figure 1. Outline of the quantitative analysis suite. An overview of all the steps of the workflow.
1. Noise reduction and peak picking: remove instrument noise and collect peaks from the NMR spectra.
2. Reference determination: selecting a dominant and representative spectrum, which is used as template for the alignment.
3. Peak alignment: shifts are estimated by hierarchical clustering on the selected peaks and the template.
4. Quantification: determination of differential region in the NMR spectra based on the BWratio.
Noise reduction and peak finding
Although peak detection is a common step in the analysis of spectral data, it remains a challenging task, due to the noise inherent to many spectrometric techniques. A typical peak detection approach consists of three steps: smoothing or denoising, baseline correction and the actual peak picking. For each step, several methods have been developed, but there does not seem to be a "golden standard". Since the detected peaks are the input of the alignment procedure, selecting an appropriate technique is important. In our implementation, the method of [16], recommended in [17] is chosen. It employs continuous wavelet transform (CWT) to detect peaks in a NMR spectrum. We used the package MassSpecWavelet [16] which integrates waveletbased filters in the smoothing step, continuous wavelet transform for baseline correction, and signal to noise ratio thresholding and ridge lines for the peak picking [16,17].
In our approach, the method of Du et al. [16] could not be applied to the whole spectrum directly, due to computational complexity and possible insensitivity towards low intensity peaks in the NMR spectrum. To address this, the function peakdetectionCWT of the library MassSpecWavelet is used only on partitions of the spectrum. Therefore the spectrum was divided into predefined equal windowsized segments. Next, the function peakdetectionCWT is applied to a foursegment sized window that slides along the spectrum, one segment at a time. All detected peaks are subsequently merged into a final peaklist. This strategy avoids missing peaks in the margin of a segment. The segment size was chosen as a power of 2, in order to boost the speed of the CWT [16,17]. It should be noted that the segment size has to be wide enough to cover the widest peaks in spectra. Noisy peaks are removed by applying a baseline minimum intensity threshold (baselineThresh). The optimal threshold is datasetdependent, and should be supplied a priori by the user. For the selection of suitable parameters for the peak detection, we refer to [16]. The user may also use alternative peak detection methods, but it is important to take into account that the quality of the alignment and the downstream analysis can suffer from counterfeit peak detection.
Reference determination
The second step in the workflow consists of the selection of a reference spectrum. The reference spectrum serves as a kind of template for other spectra to be aligned against. The ideal reference contains the highest number of common chemical constituents [4]. Several methods exist to select an appropriate reference spectrum: either prior knowledge about the dataset can be used, or the reference can be selected according to a predefined criterion [4,18] propose a method based on the product of the Pearson correlation coefficients between spectra. The approach of [8] creates a virtual "average" spectrum as a reference. Alternatively, the spectrum that is most similar to the loading of the first principle component in a PCA model can be chosen [18].
It is difficult to define "common chemical constituents" when spectra are ill aligned. To ensure robust reference selection, we introduce an alternative method that incorporates a heuristic strategy to find the optimal template. It selects the spectrum with the largest goodness as the reference, which is defined by the sum of the interspectrum distances, according to formula 1 and 2:
where pS and pT represent the resonance frequencies, i.e., location along the xaxis, of the corresponding peaks of spectrum S and T, respectively.
The distance is computed by the sum of the minimum distances between every peak pT of T and the peaks pS of S, because the closest peak is chosen as the best candidate to be matched with pT. Note, that the distance is not symmetric, that is the distance of S to T and the distance of T to S are not necessarily equal.
In principle, the method projects each highdimensional NMR spectrum as a datapoint in a onedimensional space and selects the datapoint in the middle of the data cloud as the reference. A possible disadvantage of this method is that it favours the spectrum with a high number of peaks, which is not necessarily in the center of the data cloud. If that is the case, a lower ranked spectrum could be selected instead. Ideally, the perfect reference candidate has enough peaks corresponding to chemical resonance areas, i.e. frequencies and the peaks of the reference stay at the center of the distribution of peaks of other spectra in order to minimize the distance.
The computational complexity to find the reference spectrum includes O(n) for finding the minimum value and the second sum in formula (1); and O(m) for the other sum of formula (1) and the maximum of formula (2); where n is the maximum number of peaks of a spectrum and m is the number of spectra. The total complexity is then O(O(n) * O(n) * O(m) + O(m)) or O(n^{2}m + m) or O(n^{2}m). It should be noticed that good peak detection is required to restrain the computational complexity within acceptable limits. In practice, finding the reference spectrum is fast, since the number of peaks n of a spectrum is small, given that peak selection upstream of the workflow is adequate.
Spectrum alignment
The alignment method is called hierarchical Clusterbased Peak Alignment (CluPA), and is based on shifting segments of the original spectra, first globally and then in smaller local segments. In the first iteration, CluPA does the alignment on the whole spectrum. In subsequent iterations, CluPA recursively splits the segment(s) of previous steps into two smaller segments, and applies an individual alignment shift to each of these. The boundaries of these segments are defined by applying a hierarchical clustering algorithm on the combined peak list of the reference and the target spectrum. The clustering is based on the distance between peaks: peaks are hierarchically combined into a tree according to their distance. After each round, the segment is split into two smaller segments by cutting at the second level of the hierarchical tree. At each iteration, the present segments of the original spectra are aligned to the reference spectrum. To this end, an optimal shift step is determined between the target and reference spectrum within the present segments. This shift is subsequently applied to the corresponding segment in the target spectrum. The process is repeated until a segment contains only peaks of a single spectrum, or until there are only two peaks left in the list. A detailed outline of the algorithm is presented in the following pseudocode.
Input
• targetSpec: original spectrum to be aligned
• refSpec: original reference spectrum
• mergedPeakList: merged list of all peaks detected in targetSpec and refSpec
Output
• targetSpec: spectrum after alignment
CluPA(refSpec, targetSpec, mergedPeakList)
1. startPoint, endPoint = findShiftRegion(refSpec, targetSpec, mergedPeakList)
2. shiftStep = findShiftStep(refSpec, targetSpec, startPoint, endPoint)
3. targetSpec, mergedPeakList = shift(targetSpec, mergedPeakList, shiftStep, startPoint, endPoint)
4. hClusterTree = buildhCluster(mergedPeakList)
5. leftPeakList = cutLeftTree(hClusterTree, 2)
6. if satifiedCondition(leftPeakList) then
7. targetSpec = CluPA(refSpec, targetSpec,leftPeakList)
8. rightPeakList = cutRightTree(hClusterTree, 2)
9. if satifiedCondition(rightPeakList) then
10. targetSpec = CluPA(refSpec, targetSpec,rightPeakList)
11. return(targetSpec)
Step 1: findShiftRegion(refSpec, targetSpec, mergedPeakList): determines the region of the spectra for alignment, which must contain all peaks of the mergedPeakList. The left most of the region is the lowest intensity point in the spectral region on the left side of the list. Similarly, the end point of the region is the lowest intensity point in the spectral region on the right side of the list. It takes linear time O(d) to select the lowest intensity point, where d is the length of the spectral region. This value rapidly decreases when the algorithm iterates with increasingly smaller regions.
Step 2: findShiftStep(refSpec, targetSpec, startPoint, endPoint): finds a shift step by using FFT CrossCorrelation. The step has a computational cost O(dlogd). In implementation, the function is simulated from the function FFTcorr in RAFFT [3]. Alternatives, for example Pearson correlation and Weighted Cross Correlation in [19] could also be applied as an alternative, however, Pearson correlation takes more time than using FFT cross correlation.
Step 3: shift(targetSpec, mergedPeakList, shiftStep, startPoint, endPoint): moves shiftStep  the number of points on the current region of targetSpec spectrum and peak list. This simple step takes linear time O(d) to complete.
Step 4: buildhCluster(mergedPeakList): creates a hierarchical cluster tree from the peaks of the mergedPeakList. In the first step, each cluster has a single peak. The hierarchical cluster tree is built by in turn grouping the closest two clusters together. There are three common ways to define the distance between two clusters: Single linkage, complete linkage and average linkage [20]. Single linkage distance of two clusters is the minimum distance of two objects or peaks in each cluster, complete linkage distance is computed by maximum distance of the two objects, and average linkage is calculated by the average of all distances between any pairs of objects between the two clusters. By default, the average linkage is selected as the distance for clustering, however, other types of distance could be applied. This step takes O(n^{2}) to O(n^{2}logn) [20], depending on which linkage distance is selected, where n is the number of elements in the peak list. The calculation of single linkage distance is faster than its alternatives. In the implementation, the standard function hclust() in the stats package of the R software (R version 2.11.1 (20100531)) is applied.
Step 5: cutLeftTree(hClusterTree, 2): gets the peaks corresponding to the left tree at the second level of hClusterTree.
Step 8: cutRightTree(hClusterTree, 2): gets the peaks corresponding to the right tree at the second level of hClusterTree.
In the implementation, the function cutree() of the stats package is applied. It takes O(n) to be finished.
Step 6 and 9: satifiedCondition(leftPeakList) verifies whether the condition that determines whether a tree should be further divided are met. The tree has to contain peaks of both reference and target spectra, otherwise the spectral region does not need to be further aligned.
The maximum computational complexity in each stage is O(O(d) + O(dlogd) + O(d) + O(n^{2}logn) + 2 * O(n)) or O(dlogd + n^{2}logn), and O(logn) for the recursive loop. Therefore, the total complexity of the algorithm is O(logn) * O(dlogd + n^{2}logn).
Figure 2 demonstrates how the algorithm selects peak groups from the merged peak list, by hierarchical clustering. The red lines are bounding the secondlevel children of the tree, which corresponds to the segments that will be individually processed. The trees shown in Figure 2 correspond to regions in the original spectrum, which are depicted in Figure 3, and will undergo individual alignment.
Figure 2. Peak list segmentation by CluPA. At each iteration, CluPA cuts the hierarchical tree into two subtrees (marked by red lines). This process is recursively iterated with each subtree until the stopping criteria are satisfied. The arrows point to the corresponding subtrees after the cut.
Figure 3. Spectral view that corresponds to the hierarchical segmentation of Figure 2. The spectra in range of 25000 to 35000 are divided (by the vertical red line) into two smaller ranges that will undergo further alignment. The arrows point to the corresponding spectra after the cut.
Quantitative analysis
The thorough alignment of the NMR spectra, outlined in previous section, allows for a very straightforward and simple quantification analysis. The data returned by CluPA are well aligned and presented in a rectangular format. This means that the data are represented as a matrix in which the rows denote the spectra and the columns indicate the resonance frequency instances. The quantification strategy is inspired by the method proposed by Tan et al [21] for annotating regions of significance in SELDITOF spectra. In our approach, for each frequency instance in the matrix, we performed an analysis of the data based on the ratio of the betweengroup to withingroup sums of squares (BWratio) as proposed by Dudoit et al [15]. This ratio is related to the Fstatistic or oneway ANOVA, but we do not make assumptions about the distribution and the independency of the data. For example, the BWratio of frequency instance j is calculated as
where and denote the average intensity level of frequency instance j between all NMR spectra and within the NMR spectra belonging to group k only, respectively. The variable I is an indicator for the group. The variable x_{ij }denotes the ith intensity values of frequency j belonging to group k. Statistical inference on the BWstatistic is based on the sampled null distribution from the observed data. For this purpose the average of the groups is subtracted from the data. Next, data points are sampled for each frequency instance j using the same grouping as in the original setting. Based on the sampled data, the BWratio is calculated, which is representative for data showing no difference in the group means. This computation is iterated with replacement of the data until a well sampled null distribution is obtained. For each frequency j, the value of the BWstatistic is determined for the desired alphalevel based on the sampled null distribution.
We are aware that NMRspectra are correlated between consecutive frequency instances and cannot be treated as independent variables. A model which takes into account such a correlation structure to discern noise peaks from bona fide molecule peaks is part of future research. In addition, testing for every possible data entry can make the detection of differential frequency instances inefficient, because a valid NMRpeak can be composed out of multiple, redundant, data points. Therefore, correction of multiplicity is done by using the conservative Bonferoni correction. Doing so, the alphalevel is adjusted for the average number of peaks returned by the peak picking algorithm used in the first step of the workflow. We can already announce that this adhoc quantitative analysis is very fast and robust.
An interesting point is that the differential frequencies do not necessarily represent the apex of the NMRpeak, but are rather found in the shoulder of the peak. We also want to stress that such an adhoc approach is only suitable for well aligned data. Misalignment would increase the variability, making it difficult to discriminate differential frequencies.
Results and Discussion
The evaluation of the workflow proposed in this paper is outlined in three sections. First, we compare the alignment method against other available methods. Second, we evaluate the quantitative power of the approach. Third, we discuss the efficiency of the workflow.
Spectral alignment evaluation
We compared the proposed method to two alternatives that use FFT cross correlation to measure correlation: Icoshift [8] and RAFFT [3], The comparison was made using two different datasets: 1) The wine dataset, which is a known benchmark NMR dataset [8]; 2) A Huntington dataset [22]. Details regarding the data and the methodology are outlined in the methods section at the end of this paper, and in the original papers that describe the data [8,22].
As mentioned in [8], most methods such as COW, RSPA and Icoshift perform well on certain regions of the spectra, but perform less in specific regions, such as the lactate and ethanol region. Figure 4 shows gray scale representations and spectrum plots of the unaligned spectra and the results of the different alignment methods within this region. RAFFT did not properly align this region. Icoshift fails to properly align the lactic region in the automatic interval selection mode. However, when we specified the intervals from [8] using the userdefined interval mode, a satisfying alignment was obtained. CluPA aligns well on all main resonance peaks mentioned in [8], without the requirement of user interaction. In the case of the lactic acid region, CluPA yields a good alignment, as long the peak detection algorithm successfully detects the peaks in this region.
Figure 4. Spectral plots and grey scale plots of the region 70007500 of the Wine dataset. The spectrum region before and after alignment by different methods (indicated at the bottom) was visualised as a spectral overlay (top) (multiple spectra overlaid in specific color for each spectrum) and as a twodimensional greyscale plot (bottom) (x axis: resonance frequency, y axis: spectrum index, lighter zones correspond to peaks).
The same methods were also applied to the Huntington dataset. We show two types of selected regions: a region containing the highest peaks and the largest region containing strongly overlapping peaks.
a) Region with highest peaks: between 39600 and 40600 or 1.54 to 1.07 ppm
This region (Figure 5) is also the Lactic acid region, similar to the wine dataset, in which there are many small peaks around a very intense peak. Considering the grey scale plots, RAFFT, Icoshift and CluPA align the highest peak region in an almost identical way. The spectral plots, which boost the lower intensity scales, however show that CluPA generally outperforms the other alignment options with regards to the lower intensity peaks surrounding the highest peaks. Nevertheless there are still some misalignments, potentially caused by the fact that the peak detection algorithm does not fully detect these peaks. This problem may be overcome by adjusting the peak detection parameters. The greyscale plots also reveal spectra 5 and 20 as commonly misaligned, in accordance to their outlier status (see methods section).
Figure 5. Spectral plots and grey scale plots of the highintensity region (3960040600) of the Huntington dataset. Visualisation equivalent to Figure 4. The highest intensity reaches approximately to 1.3 * 10^{8}. In order to be able to see the smaller peaks, only the lower end of the intensity range was shown in the spectral plots. Due to the high intensity, only the highest peaks are clearly observed in the corresponding grey scale plots.
b) Region with high overlap: from 34300 to 35700 index value or from 3.96 until 3.32 ppm
The large number of peaks in this region (Figure 6) makes the alignment of spectra by a binning method challenging, since the bin may contain the wrong peaks for the alignment. Icoshift does give a very good alignment for some spectra, but it is not always successful. In some cases, the peaks are not fully aligned together and there are some artefact lines in the spectral plots of the Icoshift results. Both RAFFT and CluPA show better results in this region. The grey scale pictures show that RAFFT gives a perfect alignment for the left side of the region and yields an adequate improvement in the right side region. The best alignment is obtained by the CluPA algorithm, which correctly matches the high peaks of the region. The grey scale plots also reveals the noisy spectra 5, 17 and 20 as outliers.
Figure 6. Spectral plots and grey scale plots of the large and dense region (3430035700) of the Huntington dataset. Visualisation equivalent to Figure 4. Prior to alignment, the peaks corresponding to different chemical resonances show significant overlap.
A different way to assess the efficiency of the alignment algorithm consists of a correlation analysis between spectra. To this aim, the Pearson correlation coefficient was calculated between each pair of spectra within the (wine) dataset. A visual representation of all pairwise correlations for each method is shown in Figure 7. In general, all alignment methods yield an apparent improvement. Although ManualIcoshift works well on all main resonance peaks (as shown on previous figure), its correlation map shows lesser correlation for several spectra. The reason may be found in the fact that ManualIcoshift does alignment on some userdefined regions of the spectra, and does not fully consider all data points. The correlation maps of RAFFT, AutoIcoshift and CluPA are similar.
Figure 7. Pairwise Pearson correlation within the Wine dataset before and after alignment. Pairwise Pearson correlation coefficients between all spectra, for the different methods. Each dot within the heatmap corresponds to pair of spectra, with colors ranging from white (correlation = 0) to red (correlation = 1). The small inset on the topleft of each map is the histogram showing the distribution of the correlation coefficients and the value in its right side is the average correlation value.
We proposed a heuristic method to find the reference spectrum where other spectra are aligned against. The method selects the spectrum with the maximum goodness, which is computed from the distance between the peaks of the reference candidate and the others. To further evaluate the impact of reference selection, we used a simple test based on the average correlation values, obtained as explained above. Each sample was selected once as reference, and the others were aligned against it through the CluPA alignment algorithm. The distribution of the average correlation values for each run was drawn, and the correlation value corresponding to the use of the reference found by the heuristic method was marked by red lines, in Figure 8.a and 8.b. In the case of the wine dataset, the correlations are high with a narrow distribution. The reference is close to the center of this distribution. In the case of the Huntington dataset, the heuristic selects a reference in the center of the distribution of the nonoutliers, clearly demonstrating that the heuristic method works. This is further confirmed by the fact that this heuristic ranks the outlier spectra (5, 17 and 20, see methods section) lowest among all candidate reference spectra (results not shown).
Figure 8. Distribution of the average correlation values when using different references. a) The case of the wine data. b) The case of Huntington data. The red lines indicate the correlation values of the references found by the proposed heuristic.
Quantitative analysis
The quantitative analysis was performed as outlined in the workflow section. The alphalevel of 0.05 was corrected for the 1411 peaks found above an intensity of 50,000. The analysis returned 20 regions of interest, which had a significant BWratio. The top panel of Figure 9.a displays the significant regions over the entire xaxis. The blue line indicates the BWstatistic and the black line denotes the BWstatistic at a percentile of alpha/1411 using the null distribution. The null distribution is composed out of 1000 samples. BWvalues above the black line indicate a differential region. The plot in the bottom panel displays the NMRspectra for the red and white wine. From the figures it becomes clear that differential regions do not necessarily correspond to regions with intense peak abundances because the BWstatistic takes into account the variability between the groups instead of looking at absolute differences.
Figure 9. Quantitative analysis for the wine dataset. a) top panel: The blue line represents the BWstatistic for the red and white wine dataset. The black line indicates the critical BWvalue for rejecting the null hypothesis (i.e., no difference between the groups) at a Bonferonni adjusted alphalevel. The null hypothesis can be rejected when the blue line surpasses the black line. bottom panel: Intensity values of the frequency instances. Black lines indicate the red wine group. Red lines indicate the white wine group. b) closeup in region 75807690. c) closeup in the region 41804215. d) closeup in the region 62006235.
In order to demonstrate the proper functioning of the approach, we present several closeups of interesting regions. For example, Figure 9.b presents a closeup of the most intense peaks in the data (bottom panel). From the top panel it is clear that this region is not selected as a differential region; the BWstatistics stays under the critical value corresponding to the significance level. A different scenario is observed in Figure 9.c and 9.d, where differential regions are detected. A remarkable observation is that in Figure 9.c and 9.d the maxmimum BWstatistic is not reached at the apex of the NMRpeaks but in the ascending and descending shoulder of the peak. It seems that the due to lower variability in the shoulder, i.e., more robust measurements, it should be easier to detect differences. An approach which focuses on finding differences in the peak heights might discard valuable information.
We compared these results with the information in the original research paper [23], and found that most findings are in accordance to the published results. In figure 9.b, the highest intensity region or the ethanol region ranging 1.101.25 ppm is not a good region for classification because the authors had to use a very highly complex iPLS model with 6 components for it. Leave one out for RMSECV(Root Mean Squard Error of CrossValidation) and R2 (Squared correlation cofficient) are 0.31 and 84%, however, the authors reported that they were overfitting due to the complex model [23]. Figure 9.c is the methanol region ranging 3.33.4 ppm. The result from the paper showed that they used simpler iPLS model with 4 components, achieve 0.03 and 75% for RMSECV and R2, respectively. This means that this region is suitable for classification (see [23] for more details). We do not have quantification information corresponding to figure 9.d, or the acid acic region.
Algorithm speed and parameter settings
Since the alignment algorithm (CluPA) requires most of the computational resources of the workflow, in comparison with the statistical analysis, we briefly discuss its operational characteristics in this section. On a Macbook Pro 2,66 GHz (using R package version 2.11.1.) the whole process needs minutes to complete. As other peak based alignment methods, the peak detection process is the most time consuming. The alignment of the Wine dataset takes totally 2.4 minutes, consisting of 1.9 minutes for peak detection, 0.06 minutes for finding the reference and 0.5 minutes for the alignment process. The Huntington dataset required nearly four times the time for the Wine dataset: 9.5 minutes. Most of the time cost is reserved for peak detection (8.0 minutes). The remainder includes 0.04 minutes for finding the reference spectrum and 1.5 minutes for the alignment.
Besides the parameters for the peak detection, the workflow does not require parameters for the initialization prior to running. One extra optional parameter for CluPA, but also for both RAFFT and Icoshift, is the maximum shift point, which is inherited from the FFT crosscorrelation function. This value is usually 0.05 ppm, more or less 100 points, depending on the data. If this parameter is not set, the FFT crosscorrelation function will check on whole spectral regions to get the optimal shift step. Furthermore, users need to set the parameters for the peak detection algorithm or use other methods instead. Even though default values are set in the software implementation, they are not guaranteed to work optimally for any dataset. The selection of optimal parameters for peak detection algorithm is beyond the scope of this paper.
Conclusions
In this paper, we proposed a simple and efficient workflow which is centered around recursive hierarchical clustering. This strategy, which is referred to as speaq ("spectrum alignment and quantitation") is feasible because prior to the alignment, a peakpicking and reference selection approach is used to reduce the complexity related to alignment. The method aligns the target spectrum to the reference spectrum in a topdown fashion and makes use of Fast Fourier Transformation (FFT) crosscorrelation to reduce the computation time to find the shift step. As conventional recursive strategy like RSPA or RAFFT, the CluPA algorithm aligns a larger segment and then recursively divides it into smaller segments to refine the alignment. The differences between them lie in the method and criteria to divide the spectra. RAFFT selects the dividing point based on the optimal shifts computed from Fast Fourier Transform CrossCorrelation, RSPA detects in the reference and target spectrum lists of segments, then tries to find the corresponding segments between them, and goes into segments for further alignment. Our proposed method builds a hierarchical cluster tree from peak lists of reference and target, and divides the spectra into smaller regions based on the most distant clusters of the tree. However, instead of considering many clusters simultaneously, we cut off the dendrogram into just two sub trees (corresponding to two clusters) and then go further into each of them for finer alignment. Another feature of the workflow which is different from most other peakpicking based alignment methods is that instead of grouping peaks together to create segments for alignment, speaq does the alignment first, then it groups peaks and creates smaller segments. In general, this strategy decreases the shift of peaks before the hierarchical cluster tree is built, thus, it significantly reduces overlapping peaks that yield wrong clustering.
Despite the heuristic approach, selecting the perfect reference remains a nontrivial task, since one spectrum may be a good reference for a specific region but not for other regions. Within the software implementation, we therefore also allow users to intervene into the process by setting reference in the region they are confident about. This also allows to skip certain regions during the alignment. The maximum shift can also be set in a regionspecific manner. All this information can be readily provided to the software in an extra text file.
The speaq workflow was compared to two widely used methods Icoshift and RAFFT. The method shows several advantages over the others. It is easy to use because users do not need to set many initial parameters. Experiments on both a public dataset and an Huntington dataset show that it performs very well compared to the others. All method use FFT cross correlation for finding the shift because of its advantage in computational time. However, speaq is slower than both Icoshift and RAFFT, due to the time for peak detection and its recursive nature, but this is generally not a problem with the powerful computers we have nowadays.
In summary, the speaq workflow is unique in the way the subsequent steps are ordered. Because of rearranging the order of processing steps, simple and robust methods could be adopted for the workflow. A case in point is the effective analysis based on the BWratio. This could only be achieved since NMR spectra are well aligned. If this were not the case, misalignment would introduce sources of variability into the statistic making it difficult to select differential regions. I should be noticed that the proposed method is still open for improvements. For example, from the top panel of Figure 9.a it can be seen that the null hypothesis (black line) is remarkably constant. This constant line might suggest that the properties of the null distribution are applicable over the entire frequency range. Parametrization of the method can avoid the sampling of the null distribution and improve computational efficiency. The method currently does not take into account covariation of the intensities of peaks of the same molecule across samples [24]. The method was tested only on NMR datasets. Similar data types, such as chromatography data, may also benefit from the approach presented here.
Methods
Data sets
Wine data
The wine data from [23] includes the 1H NMR spectra of 40 table wines of different origin and color (red, white and rosé) with 8713 data points from 6.00 to 0.50 ppm. The dataset was aligned and analyzed by three different methods. First, Icoshift was used under two parameter settings, namely, autoIcoshift and manualIcoshift. The former setting divides the dataset into equally sized intervals. In the latter setting, the dataset is split into predefined intervals, which are defined in the spectra according to the knowledge of the user, prior to alignment [8]. Both settings use an average spectrum as reference. Second, RAFFT was applied with the automatically found reference. Third, the speaq workflow, described in this paper, was applied on peaks with an intensity larger than 50,000. The reference was automatically found as described.
Huntington data
To further demonstrate the quality of our method, a Huntington data set was processed with the proposed workflow. The metabolomics data set [22] combines 1H NMR spectroscopy and chemometrics in order to identify biomarkers in transgenic presymptomatic rats for Huntington disease. It is worth mentioning that there are a few outliers in this dataset: spectra 17 and 20 contain very few and mainly lowintensity peaks, whereas spectrum 5 shows different peak shapes. These irregularities can be attributed to experimental artefacts, but are intentionally not removed in order to assess their impact on the stability of the results.
The data has a large region from 34000 (4.10 ppm) to 36000 (3.18 ppm) wherein different peaks strongly overlap with each other. This makes it a challenge to select suitable regular intervals for Icoshift, as it is easy to divide peaks into wrong intervals. In Icoshift, this problem can be solved by letting the user select the intervals manually. However, even if the user does this, in order to get a more clear separation of small peak regions, these small regions should be considered to be refined after the alignment of the large region, in a way equivalent to recursive alignment algorithms like RAFFT, RSPA and CluPA. As in this dataset, the prior knowledge of the intervals is considered to be unknown, only the automatic version of Icoshift was applied, with a different number of intervals, i.e. 50 and 300. These values correspond to low and highresolution cases, respectively. RAFFT was applied with the reference found by the proposed algorithm. For CluPA, all detected peaks with intensity less than 50,000 were removed in order to reduce noiserelated peaks. The reference was automatically detected as described above, prior to alignment.
Software availability
A software implementation of the workflow described in this paper is freely available under an Apache 2 license. Source code and documentation can be downloaded from http://code.google.com/p/speaq/ webcite.
Authors' contributions
TNV, KL and DV developed the workflow and methods and wrote the manuscript. The software was implemented by TNV and DV. KS analysed the algorithmic complexity. KV and RD provided and interpreted the NMR data. The computational aspects of this work were supervised by BG and AV, whereas the chemical aspects were under supervision of FL. All authors read and approved the final manuscript.
Acknowledgements
TNV acknowledges support by a BOF interdisciplinary grant of the University of Antwerp. Koen Smets is supported by a Ph.D. Fellowship of the Research Foundation Flanders (FWO). This work is further supported by an SBO grant (IWT600450).
References

Nielsen NV, Carstensen JM, Smedsgaard J: Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. [http:/ / www.sciencedirect.com/ science/ article/ B6TG83SR91XC2/ 2/ 54c6fd7af365c6ca984d8a1a14c8995b] webcite
Journal of Chromatography A 1998, 805(12):1735. Publisher Full Text

Tomasi G, Berg Fvd, Andersson C: Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data.
Journal of Chemometrics 2004, 18(5):231241. Publisher Full Text

Wong JWH, Durante C, Cartwright HM: Application of Fast Fourier Transform CrossCorrelation for the Alignment of Large Chromatographic and Spectral Datasets.
Analytical Chemistry 2005, 77(17):56555661. PubMed Abstract  Publisher Full Text

Veselkov KA, Lindon JC, Ebbels TMD, Crockford D, Volynkin VV, Holmes E, Davies DB, Nicholson JK: Recursive SegmentWise Peak Alignment of Biological 1H NMR Spectra for Improved Metabolic Biomarker Recovery.
Analytical Chemistry 2009, 81:5666. PubMed Abstract  Publisher Full Text

Alm E, Torgrip R, Åberg K, SchuppeKoistinen I, Lindberg J: A solution to the 1D NMR alignment problem using an extended generalized fuzzy Hough transform and mode support.
Analytical and Bioanalytical Chemistry 2009, 395:213223. PubMed Abstract  Publisher Full Text

Forshed J, SchuppeKoistinen I, Jacobsson SP: Peak alignment of NMR signals by means of a genetic algorithm. [http:/ / www.sciencedirect.com/ science/ article/ B6TF448V92TX3/ 2/ c787de4e2494245095accc2c715ba8e4] webcite
Analytica Chimica Acta 2003, 487(2):189199. Publisher Full Text

Lee G, Woodruff DL: Beam search for peak alignment of NMR signals. [http:/ / www.sciencedirect.com/ science/ article/ B6TF44CB66JC6/ 2/ cfd8cf26338b84100d49902c0625f4c0] webcite
Analytica Chimica Acta 2004, 513(2):413416. Publisher Full Text

Savorani F, Tomasi G, Engelsen SB: icoshift: A versatile tool for the rapid alignment of 1D NMR spectra. [http://www.ncbi.nlm.nih.gov/pubmed/20004603] webcite
Journal of Magnetic Resonance (San Diego, Calif.: 1997) 2010, 202(2):190202.
[PMID: 20004603]

Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A, Le Q: Sample classification from protein mass spectrometry, by 'peak probability contrasts'. [http://bioinformatics.oxfordjournals.org/content/20/17/3034.abstract] webcite
Bioinformatics 2004, 20(17):30343044. PubMed Abstract  Publisher Full Text

Zhang S, Wang H, Zhou X, Hoehn GT, DeGraba TJ, Gonzales DA, Suffredini AF, Ching W, Ng M, Wong STC: A Novel Peak Detection Approach with Chemical Noise Removal Using ShortTime FFT for prOTOF MS Data.
Proteomics 2009, 9(15):38333842.
[PMID: 19681055 PMCID: 2782493]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Codrea MC, Jiménez CR, Piersma S, Heringa J, Marchiori E: Robust peak detection and alignment of nanoLCFT mass spectrometry data. [http://portal.acm.org/citation.cfm?id=1761490] webcite
Proceedings of the 5th European conference on Evolutionary computation, machine learning and data mining in bioinformatics Valencia, Spain: SpringerVerlag; 2007, 3546.

Ishii S, Kadota K, Senoo K: Application of a clusteringbased peak alignment algorithm to analyze various DNA fingerprinting data. [http:/ / www.sciencedirect.com/ science/ article/ B6T304WSRDVH5/ 2/ fe6aa7fafad83de1d25020437447b745] webcite
Journal of Microbiological Methods 2009, 78(3):344350. PubMed Abstract  Publisher Full Text

Kazmi SA, Ghosh S, Shin D, Hill DW, Grant DF: Alignment of high resolution mass spectra: development of a heuristic approach for metabolomics. [http://www.springerlink.com/content/pu4m22840k68813n/fulltext.html] webcite
Metabolomics 2006, 2(2):7583. Publisher Full Text

De Souza DP, Saunders EC, McConville MJ, Likić VA: Progressive peak clustering in GCMS Metabolomic experiments applied to Leishmania parasites. [http://bioinformatics.oxfordjournals.org/content/22/11/1391.abstract] webcite
Bioinformatics 2006, 22(11):13911396. PubMed Abstract  Publisher Full Text

Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. [http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.114.5140] webcite
Journal of the American Statistical Association 2002, 97(457):7787. Publisher Full Text

Du P, Kibbe WA, Lin SM: Improved peak detection in mass spectrum by incorporating continuous wavelet transformbased pattern matching. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/17/2059] webcite
Bioinformatics 2006, 22(17):20592065. PubMed Abstract  Publisher Full Text

Yang C, He Z, Yu W: Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. [http://www.biomedcentral.com/14712105/10/4] webcite
BMC Bioinformatics 2009, 10:4. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Skov T, Berg Fvd, Tomasi G, Bro R: Automated alignment of chromatographic data.
Journal of Chemometrics 2006, 20(1112):484497. Publisher Full Text

Hageman JA, Wehrens R, De Gelder R, Buydens LMC: Powder pattern indexing using the weighted crosscorrelation and genetic algorithms. [http://www.ncbi.nlm.nih.gov/pubmed/12759904] webcite
Journal of Computational Chemistry 2003, 24(9):10431051.
[PMID: 12759904]
PubMed Abstract  Publisher Full Text 
Manning CD, Raghavan P, Schütze H: Introduction to information retrieval. Cambridge, UK; New York: Cambridge University Press; 2008.

Tan CS, Ploner A, Quandt A, Lehtiö J, Pernemalm M, Lewensohn R, Pawitan Y: Annotated regions of significance of SELDITOFMS spectra for detecting protein biomarkers. [http://www.ncbi.nlm.nih.gov/pubmed/17072907] webcite
Proteomics 2006, 6(23):61246133.
[PMID: 17072907]
PubMed Abstract  Publisher Full Text 
Verwaest KA, Vu TN, Laukens K, Clemens LE, Nguyen HP, Van Gasse B, Martins JC, Van Der Linden A, Dommisse R: 1H NMR based metabolomics of CSF and blood serum: A metabolic profile for a transgenic rat model of Huntington disease. [http://www.sciencedirect.com/science/article/pii/S0925443911001712] webcite
Biochimica et Biophysica Acta (BBA)  Molecular Basis of Disease 2011, 1812(11):13711379. Publisher Full Text

Larsen FH, Berg Fvd, Engelsen SB: An exploratory chemometric study of 1H NMR spectra of table wines.
Journal of Chemometrics 2006, 20(5):198208. Publisher Full Text

Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G: XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. [http://www.ncbi.nlm.nih.gov/pubmed/16448051] webcite
Analytical Chemistry 2006, 78(3):779787.
[PMID: 16448051]
PubMed Abstract  Publisher Full Text