Beijing NMR Center, Peking University, Beijing, PR China

College of Life Sciences, Peking University, Beijing, PR China

College of Chemistry and Molecular Engineer, Peking University, Beijing, PR China

Cancer Institute & Hospital, Chinese Academy of Medical Sciences, Beijing, PR China

No. 304 Hospital, Beijing, PR China

Chinese PLA General Hospital, Beijing, PR China

Abstract

Background

Spectral processing and post-experimental data analysis are the major tasks in NMR-based metabonomics studies. While there are commercial and free licensed software tools available to assist these tasks, researchers usually have to use multiple software packages for their studies because software packages generally focus on specific tasks. It would be beneficial to have a highly integrated platform, in which these tasks can be completed within one package. Moreover, with open source architecture, newly proposed algorithms or methods for spectral processing and data analysis can be implemented much more easily and accessed freely by the public.

Results

In this paper, we report an open source software tool, Automics, which is specifically designed for NMR-based metabonomics studies. Automics is a highly integrated platform that provides functions covering almost all the stages of NMR-based metabonomics studies. Automics provides high throughput automatic modules with most recently proposed algorithms and powerful manual modules for 1D NMR spectral processing. In addition to spectral processing functions, powerful features for data organization, data pre-processing, and data analysis have been implemented. Nine statistical methods can be applied to analyses including: feature selection (Fisher's criterion), data reduction (PCA, LDA, ULDA), unsupervised clustering (K-Mean) and supervised regression and classification (PLS/PLS-DA, KNN, SIMCA, SVM). Moreover, Automics has a user-friendly graphical interface for visualizing NMR spectra and data analysis results. The functional ability of Automics is demonstrated with an analysis of a type 2 diabetes metabolic profile.

Conclusion

Automics facilitates high throughput 1D NMR spectral processing and high dimensional data analysis for NMR-based metabonomics applications. Using Automics, users can complete spectral processing and data analysis within one software package in most cases. Moreover, with its open source architecture, interested researchers can further develop and extend this software based on the existing infrastructure.

Background

Since Nicholson et al. introduced the terminology

Nuclear Magnetic Resonance (NMR) is widely used in metabonomics studies. Compared to other analytical techniques, NMR has the advantage of fully quantitative analysis and minimal requirement for sample preparation

In this report, we introduce a new software tool, Automics, the first highly integrated open source software designed specifically for metabonomics to our knowledge. Automics runs on the Microsoft Windows platform, and it is developed with Visual C++. Automics provides features for almost all stages of metabonomics studies, including: NMR spectral processing (high throughput automatic modules and convenient manual modules), data organization, data pre-processing (four data filtering methods), and data analysis (nine data analysis methods), along with other useful functions such as statistical total correlation spectroscopy method (STOCSY), expression calculator and database resource exploring. Automics enables researchers to carry out most of their studies within only one software tool, and thus avoid extensive training on different software tools in order to use them properly. Furthermore, with the keen interest of researchers in developing new algorithms for spectral processing, data pre-processing and data analysis, Automics can serve as a framework for quickly implementing these new data processing algorithms and other useful features, due to its open source architecture. As it provides basic data structures, data management and low-level functions, Automics enables interested developers to focus on kernel algorithmic approaches instead of implementing infrastructure within this framework.

Implementation

Overview of software system

Automics contains modules for almost all stages of post-experimental spectral processing and data handling for NMR-based metabonomics [see Additional file

**Software package of Automics.** This compressed file contains executable installation package, source code package, HTML help file, test dataset and readme file of Automics. The latest version is available at Automics homepage

Click here for file

Functional modules and the workflow of Automics

**Functional modules and the workflow of Automics**.

The main interface and some of the feature windows of Automics are shown in Figure

Screenshot of the main graphical interface

**Screenshot of the main graphical interface**. (A) Directory browser for spectral dataset; (B) List of a selected spectral dataset; (C) Spectral processing window: displaying, moving, zooming and labeling etc. for spectral visualization either in single mode or multiple mode; (D) Column plot of a bucket/binning digitized spectrum; (E) Worksheet of the data organization module; (F) Column plot of the explained variance for R^{2 }(cum) and Q^{2 }(cum); (G) Scatter plot of PLS scores; (H) Column plot of PLS regression coefficients.

Spectra format conversion

Automics takes Bruker (Bruker Biospin, Germany) format raw FIDs and XWIN-NMR processed spectra by default. For NMR data collected on spectrometers from other venders, FID format conversion should be carried out with a conversion module. Before conversion, a meta-data file is first created in a format definition module, which contains parameters for the source format: FID filename, acquisition parameter filename, spectral width (Hz), chemical shift of spectral center (carrier position), observe frequency, optional FID file header size, byte order (little/big endian) and variable type of data points in computer memory (integer, 16/32 bits; float, 32/64 bits). Based on these parameters, Automics can convert raw FIDs of a variety of existing NMR FID formats, such as those from Varian (Varian Inc., USA) and JEOL (JEOL Ltd, Japan), to Bruker FID data format and rewrite them in Bruker style directories.

Manual spectral processing

Although high throughput automatic spectral processing is one of the major goals and an important feature of Automics, we believe a powerful and convenient manual processing module is still necessary. For situations when the automatic method does not work well (e.g. processing severely distorted spectra), the manual method may be an effective way to correct spectra.

Automics provides an easy to use manual spectral processing module. For spectral visualization, features such as spectral editing, peak labeling, peak information browsing, moving and zooming, are supported. To change the display properties of the visualized spectrum, a dialog can be used to set properties such as line color, line width, line style and background color. For spectral processing, floating tool bars can be used to continuously adjust zero-order and first-order phases for the interactive phase correction, or adjust coefficients of a selected fitting function (polynomial function, sine function and exponential function) for the interactive baseline correction. Other commonly used features, such as referencing, peak picking and spectral derivative, are also supported.

High throughput automatic spectral processing

1D NMR spectra can be automatically processed in batch mode in five steps: fast Fourier transform (Fig.

High throughput automatic spectral processing modules

**High throughput automatic spectral processing modules**. (A) Fast Fourier transform; (B) Automatic phase correction; (C) Peak alignment; (D) Bucket/Binning.

Fast Fourier transform

FFT converts NMR signals from time domain to frequency domain. This module can perform both complex FFT and real FFT. In addition, DC offset, zero filling, window function with a specific line broadening factor and removal of a potential digital filter imposed on FID (such as that from Bruker Avance spectrometer) are also carried out in this module.

Automatic phase correction

Traditionally, phase correction is generally accomplished manually by trial and error until the real part of the Fourier transformed spectrum appears globally as a pure NMR absorption spectrum. The corrected spectrum is dependent on one's experience. Several automatic methods have been proposed to estimate zero-order and first-order phases

Besides two global methods (maximization of the spectrum integral and minimization of the spectrum entropy) implemented in Automics, we have introduced another easier to implement method for automatic phase correction. This method does not require detection of isolated individual peaks and is efficient for processing large quantity of similar spectra in metabonomics studies. Considering a 1D ^{1}H NMR spectrum with good baseline, the regions near the two ends of the spectrum are normally free of signals. These regions usually belong to the baseline, and they should have nearly horizontal straight line shape after correction. Based on this principle, enough information can be acquired to calculate the zero-order (

(1) Define two pairs of small regions T_{i }and T_{j}, T_{k }and T_{l}(i, j, k and l are the center position of each region in data point) with a certain window length L (for example, 30 data points). T_{i }and T_{j }belong to the higher frequency baseline region of the spectrum, and T_{k}and T_{l }are in the lower frequency baseline region of the spectrum. Sum up each region and get their real parts (_{i}, _{j}, _{k}, _{l}) and imaginary parts (_{i}, _{j}, _{k}, _{l});

(2) Determine two phase errors (_{0}, _{1}) at positions of (i + j)/2 and (k + l)/2. Because these four regions all belong to the baseline, and the distance between two regions of each pair is very small, the two regions of each pair should have approximately the same phase errors (with a distance of 100 data points and a 300° first-order phase error, the difference between T_{i }and T_{j }usually is smaller than 4°). Therefore, T_{i }and T_{j }should have nearly the same intensity after correction, so do T_{k }and T_{l}. The two phase errors can be thus calculated with the following equations:

_{i}cos(_{0})+I_{i}sin(_{0}) = R_{j}cos(_{0})+I_{j}sin(_{0})

_{K}cos(_{1})+I_{K}sin(_{1}) = R_{1}cos(_{1})+I_{1}sin(_{0})

Therefore, the phase errors can be expressed as:

_{0 }= arctan((_{i }- _{j})/(_{j }- _{i}))+

_{1 }= arctan((_{k }- _{l})/(_{l }- _{k}))+

(3) Calculate the zero-order and first-order phases using the follow equations:

(4) Correct the spectrum with the determined

Automatic baseline correction

Current version of Automics provides two methods for automatic baseline correction: linear fitting and non-parametric recognition. Linear fitting method uses pre-defined positions of the spectrum to calculate coefficients of a linear function, which are then used to construct a baseline. Our experience has shown that this method works well in most cases, despite its simplicity.

A non-parametric method was implemented with a variant of Sergey's algorithm

Peak alignment

Frequency shifts due to unstable experimental and instrumental conditions are one of the main sources of unwanted variations for further data analysis. These variations obscure the process of pattern discovery and impede the performance of data analysis. Peak alignment is an essential step to remove effects of such variations from the spectral datasets. Spectral referencing, which sets the inner reference peak (DSS/TSP) of each spectrum to 0 ppm, can be regarded as a simple global method for peak alignment. This method shifts the entire spectrum based on the same reference peak position. Thus, all the spectra with global peak misalignments are well aligned. However, it is not sufficient for correcting individual peak misalignments in spectra, such as those from urine samples with variant solution conditions. Several methods have been proposed to solve this problem. A genetic algorithm can align peaks in automatically selected segments of each spectrum to the corresponding peaks in a pre-selected reference spectrum

Bucket/binning and normalization

Bucket/binning is a commonly used technique for digitizing a spectrum into a row vector. It has the advantages of minimizing misalignment effects and reducing data dimensionality (usually from several thousand to several hundreds of bins) for further analysis. However, it also leads to a lower data resolution. An extreme case of bucket/binning is that each bin contains a single data point, thus it has a full resolution. However, the quality of data generated in this way highly relies on accurate peak alignment, and this method may bear heavy computing burden such as calculating element-based leave-one-out cross validation.

Automics provides both full resolution (data point) bucket/binning and traditional bucket/binning with equal bin width options (Fig.

(1) Noise filtering: use a Savitsky-Golay filtering window to smooth the spectrum by removing high frequency noise with a pre-defined noise threshold.

(2) Peak finding: Among the data points whose intensities are above the threshold, find all maximal points. A maximal point is defined as a point with a number of adjacent consecutive data points on both sides that all have smaller intensities than this point; meanwhile, the intensities of these points on each side are in descending order.

(3) Peak width determining: For each side of a maximal point, the total number of data points whose intensities are in descending order is counted. The sum of the two numbers for both sides is used as peak width for this peak.

In addition to the above mentioned two methods, an intelligent adaptive binning method was also implemented in Automics

Data organization and data pre-processing

A worksheet module was developed in Automics for data organization. Data pre-processing and data analysis procedures are all based on data in the active worksheet. Automics can import/export data files in text format or Microsoft EXCEL spreadsheet format. Commonly used editing functions and some basic statistical analysis (column based statistics, row based statistics and matrix standardization) are supported.

To remove undesirable systematic variations in the spectroscopic data before data analysis, four commonly used data filter methods were integrated into Automics: multiplicative signal correction (MSC)

Data analysis module

After data pre-processing, data analysis modules can be invoked to analyze the data and build data models. Automics provides nine different pattern recognition methods for data analysis. These methods include feature (variable) selection method (Fisher's criterion (FC)

FC

Fisher's criterion method is a feature selection technique. The general purpose of the feature selection is to find significant features (variables) from the original data space in order to produce a better prediction result. Irrelevant features that introduce noises should be eliminated. The importance of an individual feature for discriminating different groups in the training dataset is expressed by the Fisher's ratio, which is the ratio of between-class variance to within-class variance for the training group. A feature with larger Fisher's ratio means that it is more important for classification. Users can select a number of features with the largest Fisher's ratios for further analysis.

PCA and PLS

These are two commonly used multivariate analysis techniques in metabonomics studies. The main purpose of PCA is to eliminate the collinear problem and then reduce the dimensionality of the original feature (variable) space. It is an unsupervised method used to reveal the internal structure of datasets in an unbiased way. PLS is a supervised method for regression. The overall goal of PLS is to maximize the covariance between the predictor space and the response space, and then use the predictor matrix to predict responses in the population. With a cutoff for predicted responses, PLS regression can be used for classification and discrimination analysis (PLS-DA). PCA and PLS reveal the variable contribution for the separation between different groups by loadings or regression coefficients.

LDA and ULDA

LDA is a well-known technique for the dimension reduction and the feature extraction closely related to PCA. Differing from PCA, LDA is a supervised method. It aims to find an optimal transformation that maps the data into a lower dimensional space with minimized within-class distance and maximized between-class distance, thus achieving the best separation between two or more classes of observations. A variant algorithm, ULDA, was proposed for solving singular problem limitation in LDA. ULDA employs the generalized singular value decomposition method to handle singular data. The advantage of this method is that features in the transformed space are uncorrelated, which makes it attractive for the feature dimension reduction. The work on plasma fatty acid metabolic profiling analysis by Yi et al.

K-Mean

K-Mean is an unsupervised method for grouping samples into a fixed number (

KNN and SIMCA

KNN and SIMCA are two supervised methods for classification using data similarity. In KNN method, the test dataset is classified by a majority vote of its neighbors, with a sample being assigned to the class most common amongst its

SVM

It was integrated into Automics as another powerful classification tool. SVM is based on rigorous statistical learning theory, and it has been used in a wide range of problems for the classification of datasets such as proteomics data and genomics data. SVM takes a set of features (variables) as input and outputs a classification or a regression vector. It maps input vectors into a higher dimensional feature space using a kernel function. The training procedure leads to the finding of a hyper plane in the feature space, which optimally separates training vectors of two classes. Then, it finds several support vectors that contribute most for the classification. When a new feature vector (sample or row vector in metabonomics) is input, its class membership is predicted on the basis of which side of the plane it maps.

Before invoking data analysis methods, new datasets must be created based on the data in the active worksheet by a dialog interface (Fig.

In addition to the implementation of these data analysis algorithms, Automics provides a convenient way to visualize parameters of data models in 2D scatter plot, line plot, or column plot. These parameters usually include scores, loading, explained variances (R^{2}, Q^{2}), residual matrix, Hotelling's T^{2 }etc. Plot properties (color, legend, title, footnote, scale of an axis etc.) can be conveniently changed. These features in Automics are comparable to those in commercial statistics software.

The performance of many pattern recognition methods are related to the inner data relationship and the data structure. Automics is flexible enough to combine different data pre-processing methods (noise filter, feature selection, dimension reduction) with different classification methods, and produce different classification methods such as FC/KNN, ULDA/KNN, FC/PLS-DA, O-PLS/PLS, and FC/DOSC/SVM etc., to facilitate different metabonomics applications and achieve better results.

Results and discussion

Automatic spectral processing

Automics provides an efficient way for processing a large number of NMR spectra in batch mode by a series of automatic methods, and produces reasonably well corrected spectra. Figure

Overlay display of 1D ^{1}H NMR spectra automatically processed by Automics

**Overlay display of 1D ^{1}H NMR spectra automatically processed by Automics**. The spectra were processed by automatic modules in the following steps: fast Fourier transform, phase correction (new introduced phase correction method), baseline correction (linear fitting) and peak alignment (global shift method).

For several different spectral datasets, the new automatic phase correction algorithm we proposed worked well. We have also examined this method on a set of spectra that contained significant zero-order and first-order phase distortions ranging from 20° to 300°, and we achieved less than 8° errors for the two phase values (data not shown). As our method does not use peaks for determining phases, our method has no weaknesses due to peak shape, digitization rate or peak overlap. The signal-to-noise ratio of a spectrum has little influence on phase determination, owing to our summation procedure. In practice, the baseline regions of a spectrum selected for determining phase errors are not always in a horizontal straight line, sometimes they could be a little tilted and have small slope angles. However, the angles can be determined from a corrected reference spectrum and then be applied to uncorrected spectra as a prior knowledge for compensation. The main drawback of this method is that it relies on a not severely distorted baseline. Nevertheless, the fact that a spectrum can not be phased correctly due to severe baseline problem may indicate an abnormal situation in the NMR experiment, which should not happen very frequently. Therefore, this drawback will not be a significant problem in metabonomics studies.

A metabonomics application using Automics

We have tested Automics for several application datasets. Here, with an example on the study of metabolic profile in type 2 diabetes, we provide an overview of the validity and the ability of Automics.

Sample preparation

Human blood samples were collected from 41 healthy adults and 57 patients with type 2 diabetes mellitus from No. 304 Hospital in Beijing. The ages of patients were between 21 and 79 years (44 ± 17, mean ± STD.). All the samples were collected under the same clinical condition before breakfast. The plasma samples were first allowed to clot in plastic tubes for about 1 hour at room temperature, and then aliquots of serum were collected and stored at -80°C until assayed. Right before the NMR experiment, each serum sample (150 _{2}O and 3

NMR experiment

All the 1D ^{1}H NMR spectra were collected at a temperature of 298 K on a Bruker Avance 600 MHz NMR spectrometer using Bruker pulse sequence NOESY PRESAT, which can be depicted as: RD-90°-t_{1}-90°-t_{m}-90°-acquisition. RD represents a relaxation delay of 1.5 s during which the water resonance is selectively irradiated, and t_{1 }is a fixed time interval. During the mixing time t_{m }(150 ms), the water resonance is irradiated for a second time. For each sample, 32 scans were collected into 16 K data points with a spectral width of 9615.4 Hz.

Spectral processing

Raw NMR FIDs from 41 healthy samples and 57 diabetic patient samples were processed in Automics using automatic spectral processing module, including Fast Fourier Transform (correction of DC offset, exponential window function with a line broadening factor of 0.3 Hz), phase correction (new method we proposed), baseline correction (linear fitting method) and peak alignment (global shift method). These automatic spectral processing procedures produced a good result (data not shown) and no further manual correction was carried out. All the processed spectra were data reduced to 470 segments between 0.2 ppm and 10.0 ppm using bucket/binning module with a bin width of 0.02 ppm. Due to the strong solvent signal, the spectral region between 4.6 ppm and 5.0 ppm was excluded.

PLS analysis together with DOSC and O-PLS

To determine whether it is possible to distinguish healthy and diabetic samples based on the NMR spectra, we carried out PCA and PLS analysis using data analysis module on the mean-centered data. PCA analysis showed that the two groups are severely overlapped. The PLS score plot of the second and the third principal components (t2 and t3) show that some clustering is evident even though there is overlap between the two groups (Fig. _{3 }groups from fatty acid side chains of lipids, in particular LDL and VLDL; Signals at 1.26 and 1.30 ppm are assigned to (CH_{2})_{n }groups from fatty acid side chains of lipids (mainly in VLDL, LDL); the signal at 1.34 ppm is assigned to lactate; the small region near 3.5 ppm should contain a set of signals from CH groups of glucose, sugars, glycerol and amino acids; the signals around 5.24 and 5.30 ppm are from CH groups of

PLS analysis of type 2 diabetic samples (57) and healthy samples (41) using Automics

**PLS analysis of type 2 diabetic samples (57) and healthy samples (41) using Automics**. (A) PLS scores show evident clustering between diabetic (○) and healthy (△) samples. The optimal separation occurs in the second and third components (t2, t3). (B) Regression coefficients of the corresponding PLS model. (C) PLS scores after application of DOSC for removal of one orthogonal component. (D) Regression coefficients of the PLS model after application of DOSC. (E) PLS scores after application of O-PLS. Note that the significant improvement for separation is both achieved by DOSC and O-PLS, and now the optimal separation occurs in the first principal component. (F) Regression coefficients of the PLS model after application of O-PLS.

In Figure

Feature for overlay display of regression coefficients and the corresponding spectrum in Automics

**Feature for overlay display of regression coefficients and the corresponding spectrum in Automics**. (A) Display of the whole spectral region; (B) Display of a zoomed part; (Green curve, NMR spectrum; Red curve, regression coefficients).

Comparison of different classification methods in Automics

Classification is an important objective in some metabonomics applications. PLS-DA, KNN, SIMCA and SVM implemented in Automics were applied to the dataset for predicting the class membership of unknown samples. Approximately two-thirds of the samples (66/98) were randomly selected to build a training model. The prediction ability of a model was calculated by cross validation. After training, the constructed model was then used to determine the class membership of the remaining one-third samples (testing set, 32/98). We have used different combinations of data pro-processing methods and classification methods to compare their prediction abilities. For example, FC/DOSC/SVM means Fisher's criterion is first used to select an appropriate number of features from the original feature space; then direct orthogonal signal correction is applied to the selected features to filter out unwanted variations; finally, SVM is applied to the corrected data for classification. The results of different combinations are shown in Table

Comparison of different classification methods

**Recognition rate**

**Prediction rate**

**Sensitivity**

**Specificity**

**Accuracy rate**

**Ctrl/PLS-DA**

95.5% (63/66)

75.0% (24/32)

91.2% (52/57)

85.4% (35/41)

89.8% (88/98)

**UV/PLS-DA**

98.5% (65/66)

78.1% (25/32)

91.2% (52/57)

92.7% (38/41)

91.8% (90/98)

**DOSC/PLS-DA**

100% (66/66)

84.4% (27/32)

93.0% (53/57)

95.1% (40/41)

94.9% (93/98)

**O-PLS/PLS-DA**

100% (66/66)

81.3% (26/32)

91.2% (52/57)

95.1% (40/41)

93.9% (92/98)

**FC/DOSC/PLS-DA**

98.5% (65/66)

90.6% (29/32)

94.7% (54/57)

97.6% (40/41)

95.9% (94/98)

**KNN (K = 3)**

95.5% (63/66)

71.9% (23/32)

84.2% (48/57)

92.7% (38/41)

87.8% (86/98)

**SIMCA**

90.9% (60/66)

75.0% (24/32)

87.7% (50/57)

82.9% (34/41)

85.7% (84/98)

**FC/KNN (K = 3)**

95.5% (63/66)

81.3% (26/32)

93.0% (53/57)

87.8% (36/41)

90.8% (89/98)

**SVM**

100% (66/66)

81.3% (26/32)

94.7% (54/57)

92.7% (38/41)

93.9% (92/98)

**DOSC/SVM**

100% (66/66)

87.5% (28/32)

94.7% (54/57)

97.6% (40/41)

95.9% (94/98)

**FC/SVM**

100% (66/66)

90.6% (29/32)

96.5% (55/57)

97.6% (40/41)

96.9% (95/98)

**FC/DOSC/SVM**

100% (66/66)

96.9% (31/32)

100% (57/57)

97.6% (40/41)

99.0% (97/98)

Prediction results were from different classification methods (25 healthy and 41 diabetic samples in the training set; 16 healthy and 16 diabetic samples in the testing set). Recognition rate is the correctly classified rate in the training set. Prediction rate is the correctly classified rate in the testing set. Sensitivity is the rate of true positive classified as positive. Specificity is the rate of true negative classified as negative (Ctrl, mean-centered scaling; UV, auto scaling; DOSC, direct orthogonal signal correction; O-PLS, orthogonal projections to latent structures; FC, Fisher's criterion for feature selection).

Without data pre-processing, the number of correctly classified samples in the testing set (prediction rate) decreases in the following order: SVM gives the best result (prediction rate 81.3%); SIMCA and PLS-DA show similar results (prediction rate 75.0%); and KNN produces the worst result (prediction rate 71.9%). With data pre-processing methods (FC was used to select the top 30 significant features from the original variables; DOSC and O-PLS were used to remove orthogonal variations for the first component) applied to the dataset, all the classifiers show better prediction performance. For example, DOSC/PLS-DA gives an improved prediction rate of 84.4%; O-PLS/PLS-DA, DOSC/SVM and FC/SVM also show improved prediction rates of 81.3%, 87.5% and 90.6%, respectively. These demonstrate the general ability of DOSC and O-PLS for removing noise from data set. As shown in Table

Whether using data pre-processing or not, SVM gives the best prediction result compared with PLS-DA, KNN and SIMCA on this dataset. The superiority of SVM in prediction suggests that not only well known collinear relationships, but also nonlinear relationships may exist in this dataset. To our knowledge, there are very few applications of SVM in metabonomics studies. The better performance of SVM on our dataset is consistent with the result from Bullinger et al

Other related aspects and the future of Automics

Currently, Automics provides a module for conveniently exploring database resources. This function is helpful when researchers want to explore the structure and chemical shift information of metabolites from available database such as Madison Metabolomics Consortium Database (MMCD) ^{1}H spectra to generate a correlation matrix about the intensity correlations among various peaks across the whole dataset

There is still plenty of room for improving the functionality and usability of Automics. For example, some commonly used data analysis approaches such as O2-PLS will be implemented in the near future. A more user-friendly interface is also in our plan for the future development. As Automics was designed with a free open architecture, interested researchers are encouraged to implement new algorithms and extend the software based on the existing infrastructure. We also expect that applications of Automics by metabonomics researchers will help us to get valuable feedbacks and suggestions for improving the platform.

Conclusion

In this paper, we introduced Automics, the first open source software tool with highly integrated modules specifically designed for NMR-based metabonomics applications. This tool covers almost all stages of the NMR-based metabonomics study workflow. The spectral processing modules in Automics are efficient and convenient for either processing a large number of spectra or processing single spectrum offline without commercial software. In addition, features such as data organization, data pre-processing and a wide range of data analysis techniques for multivariate data analysis, classification and regression have been implemented in Automics. Some of the useful data analysis methods in Automics (such as SVM) are not available in widely used commercial software tools, such as SIMCA-P (Umetrics, Sweden). Automics enables researchers to complete spectral processing and data analysis in one software package. Moreover, Automics could be applied to metabonomics data generated from other analytical techniques (such as mass spectroscopy), owing to its flexible and independent module designs. More details about the usage of Automics can be found in the well-documented HTML help [see Additional file

Availability and Requirements

**Project name: **Automics

**Project home page: **

**Operating system: **Windows 2000/NT/XP/2003

**Programming language: **Visual C++

**License: **open source under GNU license

Abbreviations

NMR: Nuclear Magnetic Resonance; MSC: Multiplicative Signal Correction; SNV: Standard Normal Variate Transform; DOSC: Direct Orthogonal Signal Correction; O-PLS: Orthogonal Projection to Latent Structures; FC: Fisher's Criterion; PCA: Principal Component Analysis; LDA: Linear Discriminant Analysis; ULDA: Uncorrelated Linear Discriminant Analysis; PLS: Partial Least Squared; KNN: K Nearest Neighbors; SIMCA: Soft Independent Modeling of Class Analogy; SVM: Support Vector Machine; STOCSY: Statistical Total Correlation Spectroscopy.

Authors' contributions

BX, CWJ and TW suggested desired features and algorithmic approaches. TW carried out the designing and implementation. KS, QYC, YFR, YMM, LJQ and JH provided samples or experimental data and helped in the evaluation of the package. The online documentation and the manuscript were written by TW with inputs from all co-authors. All authors have read and approved the final manuscript.

Acknowledgements

All NMR experiments were carried out at the Beijing Nuclear Magnetic Resonance Center (BNMRC), Peking University. This research was supported by Grant 2009CB521703 from 973 Program of China and Grant 30125009 from NSFC to BX.