Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Highly Accessed Methodology article

On the analysis of glycomics mass spectrometry data via the regularized area under the ROC curve

Jingjing Ye1, Hao Liu2*, Crystal Kirmiz3, Carlito B Lebrilla3 and David M Rocke4

Author Affiliations

1 Department of Statistics, University of California, Davis, Davis, CA, 95616, USA

2 Division of Biostatistics, Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, TX, 77030, USA

3 Department of Chemistry, University of California, Davis, Davis, CA, 95616, USA

4 Division of Biostatistics, University of California, Davis, Davis, CA, 95616, USA

For all author emails, please log on.

BMC Bioinformatics 2007, 8:477  doi:10.1186/1471-2105-8-477


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/8/477


Received:18 May 2007
Accepted:12 December 2007
Published:12 December 2007

© 2007 Ye et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Novel molecular and statistical methods are in rising demand for disease diagnosis and prognosis with the help of recent advanced biotechnology. High-resolution mass spectrometry (MS) is one of those biotechnologies that are highly promising to improve health outcome. Previous literatures have identified some proteomics biomarkers that can distinguish healthy patients from cancer patients using MS data. In this paper, an MS study is demonstrated which uses glycomics to identify ovarian cancer. Glycomics is the study of glycans and glycoproteins. The glycans on the proteins may deviate between a cancer cell and a normal cell and may be visible in the blood. High-resolution MS has been applied to measure relative abundances of potential glycan biomarkers in human serum. Multiple potential glycan biomarkers are measured in MS spectra. With the objection of maximizing the empirical area under the ROC curve (AUC), an analysis method was considered which combines potential glycan biomarkers for the diagnosis of cancer.

Results

Maximizing the empirical AUC of glycomics MS data is a large-dimensional optimization problem. The technical difficulty is that the empirical AUC function is not continuous. Instead, it is in fact an empirical 0–1 loss function with a large number of linear predictors. An approach was investigated that regularizes the area under the ROC curve while replacing the 0–1 loss function with a smooth surrogate function. The constrained threshold gradient descent regularization algorithm was applied, where the regularization parameters were chosen by the cross-validation method, and the confidence intervals of the regression parameters were estimated by the bootstrap method. The method is called TGDR-AUC algorithm. The properties of the approach were studied through a numerical simulation study, which incorporates the positive values of mass spectrometry data with the correlations between measurements within person. The simulation proved asymptotic properties that estimated AUC approaches the true AUC. Finally, mass spectrometry data of serum glycan for ovarian cancer diagnosis was analyzed. The optimal combination based on TGDR-AUC algorithm yields plausible result and the detected biomarkers are confirmed based on biological evidence.

Conclusion

The TGDR-AUC algorithm relaxes the normality and independence assumptions from previous literatures. In addition to its flexibility and easy interpretability, the algorithm yields good performance in combining potential biomarkers and is computationally feasible. Thus, the approach of TGDR-AUC is a plausible algorithm to classify disease status on the basis of multiple biomarkers.

Background

With rapidly developing biotechnology, the use of high-throughput clinical laboratory data to detect disease conditions and predict patients' outcomes is becoming a reality for medical practice. These technologies include microarray, mass spectrometry applied to proteomics, and new imaging modalities, which have been engaged in research on detecting clinical disease, predicting patients' responses to different treatments and evaluating the prognosis of patients with disease [1].

Among those new biotechnologies, mass spectrometry (MS) is used increasingly for protein profiling in cancer research. The basic goal is to predict cancer on the basis of peptide/protein abundance from the MS data. Recent literatures on cancer classification using MS have identified some potential protein biomarkers in serum to distinguish cancer from normal samples (Baggerly et al.[2], Wagner et al.[3], Adam et al.[4]). However, sensitivity and reproducibility remains as a major concern in making the protein technology reliable [5].

As an alternative, glycomics is proposed as a new trend for biomarker detection at the end of 20th century. Glycomics is the study of glycans (oligosaccharides), and glycoproteins. Apweiler et al.[6] estimated that at least 50% of human proteins are glycosylated. Since glycans play crucial roles in cell communication and signalling events [7], they may be implicated in cancer. Compared to potential protein or peptide biomarkers, oligosaccharides are highly sensitive to biochemical environment and are more easily identified and quantified [8]. Therefore, in a study conducted in this lab, clinical glycomics is used to identify potential biomarkers for the early detection of ovarian cancer.

Ovarian cancer is one of the most deadly types of cancer among women [8]. Many investigators believe that early detection of ovarian cancer would improve the patients' survival. CA 125 is the only FDA approved biomarker for the early detection of ovarian cancer. However, it has unreasonably low sensitivity and specificity. For instance, only 50% women with Stage I ovarian cancer will have an elevated CA 125 and many benign conditions can cause elevated levels [8].

One technique of profiling oligosaccharides into serum was developed in this lab. The idea that glycoproteins which are sloughed off cells may be detected in patients' serum was utilized for this analysis. Serum samples were analyzed by the high-resolution MS to assess the variation of glycans in cancer patient sera compared to healthy patient sera. MS data are high-dimensional. Figure 1 shows a typical mass spectrum. In this experiment, a single spectrum contains 500,000 distinct mass-to-charge values (which measure the ratio of mass to the charge of glycans) and the corresponding relative intensities (which measure relative abundances of glycans) in the serum sample. It is desirable to use all informative glycans because multiple biomarkers may allow improved sensitivity and specificity of cancer detection [9].

thumbnailFigure 1. Typical Glycomic Mass Spectrum. A typical mass spectrometry glycomics spectra is plotted. The x-axis is m/z value and the y-axis is the corresponding intensity, which measures the relative abundance of glycans.

A number of recent studies have implemented machine learning algorithms to classify high-dimensional MS cancer data. Artificial neural networks (ANNs) were utilized by Ball et al.[10], Lancashire et al.[11] and Mian et al.[12], to discriminate different tumor states. Fushiki et al. [13] explored the efficient learning algorithm AdaBoost to extract potential biomarkers for classifying MS cancer from control samples. Decision tree based ensemble methods were proposed by Geurts et al.[14] to identify biomarkers for inflammatory diseases. Other algorithms, such as support vector machine (SVM) used by Xiong et al.[15], random forest (RF) applied by Wu et al.[16] and linear discriminant analysis (LDA) employed by Miketova et al.[17] and Lilien et al.[18], were also studied. Comparisons among algorithms in a case study of ovarian cancer classification were evaluated by Datta and DePadilla [19] and Wu et al.[16]. All of the above studies aim to discover the potential MS biomarkers that can distinguish one group from another. However, for prognostic and diagnostic purposes, how to combine those MS biomarkers and whether or not the combination is optimal are not addressed in those studies. Thus, an objective of this research is to consider a statistical method that combines the high-dimensional MS measurements into a single score to classify cancer status jointly with suitable preprocessing of the data.

There are several studies on combining biomarkers. Su and Liu [20] studied the case where markers follow a multivariate normal distribution. They gave a closed form of optimal solution to the linear parameters. Normality is not suitable for mass spectrometry data because measurements of relative abundance are always positive. Pepe and Thompson [21] considered linearly combining two biomarkers by optimizing the area under the ROC curve. The method was developed only for low-dimensional situation. And it is not trivial to generalize the approach to high-dimensional case. Ma and Huang [22] applied Pepe and Thompson's idea of optimizing AUC to microarray experiment. They used multivariate normal distribution in the simulation study and assumed independence between biomarkers, which is not true for mass spectrometry data. In addition, they implemented the threshold gradient algorithm, first proposed by Friedman and Popescu [23], without correctly recognizing the regularization parameter. In this work, the question on how to combine the high-dimensional mass spectrometry predictors into a single score for the purpose of classification is addressed. The performance of a classifier by maximizing the area under the ROC curve for linearly combining the biomarkers is evaluated. The technical difficulty of this optimization problem is that the empirical AUC function is not differentiable. The objective function is in fact an empirical 0–1 loss function with a large number of linear predictors, and it is well known that such optimization problem is ill-imposed. An approach that regularizes the area under the ROC curve while replacing the 0–1 loss function with a sigmoid function was investigated. A constrained threshold gradient descent regularization algorithm, which is first introduced by Friedman and Popescu [23], to stabilize the estimates is applied. In Friedman and Popescu, they demonstrated their algorithm in a quadratic objective function. In this study, their objective function is replaced with the area under the ROC. A simulation is also conducted on mass spectrometry data under the case-control design that will generate joint distribution of diseased samples and normal samples to evaluate this algorithm.

The article is organized as the following. Simulation study is described in the Testing Section which describes how effective the proposed TGDR-AUC approach is. The Implementation Section is for real mass spectrometry ovarian cancer data analysis after a description of our preprocessing method. TGDR-AUC method is applied to low-dimensional and high-dimensional ovarian cancer data. In the Conclusion and Discussion Section, it is concluded that the TGDR-AUC algorithm is appropriate in the analysis of mass spectrometry glycomic data. A detailed description for TGDR-AUC algorithm is in Method Section. The definition and properties of the ROC curve are reviewed. The area under the ROC curve as the objective function for maximizing the performance of the classifier is proposed. Furthermore, several sigmoid functions that replace the 0–1 loss function are introduced and a simple comparison among the sigmoid functions is shown. Threshold gradient direct regularization algorithm is explained after selection of sigmoid function as well as the detailed algorithm for parameter estimations.

Results

Testing

Testing of the TGDR-AUC algorithm was demonstrated through a simulation study. Since the mass spectrometry measures the relative abundance of molecules, the measurement is always positive. Hence, a positive distribution is a reasonable choice for data simulation. In contrast to Ma and Huang [22], who simulated data under normal distribution and assumed independence among biomarkers, exponential distribution was chosen to generate the simulation data. Since a better classifier is desired, the data used for simulation were chosen so that the true AUC equals to 0.95. The simulation is generated as the following:

• Denote X as normal patient, and m is the number of normal patients. Denote Y as the disease patient, and n is the number of diseased patients. For simplicity, we choose m = n. The dimension of the biomarkers is denoted as p.

• Simulate Xi as an exponential distribution with parameter λ = 1, i = 1, ..., n.

• Generate a Bernoulli trial B(1, 0.95) for n times.

• The data for the diseased patients are generated as Yi = Xi + 1 when Bernoulli trial is 1; Yi = max{0, Xi - 1} when Bernoulli trail is 0.

The number of the replication was chosen to be 500. The data is generated as joint distribution of X and Y. The true probability is P(X <Y) = 0.95, no matter what the linear combination β is. The goal of the simulation study is to show that the maximizer of empirical AUC by TGDR regularization is in fact our targeted maximization problem (4) (see Additional file 1, 2 for examples of simulated data).

Additional file 1. Simulated data with different combinations of n and p used to evaluate TGDR-AUC algorithm.

Format: ZIP Size: 107KB Download fileOpen Data

Additional file 2. Introduction on how to use the provided C++ code and the simulated data.

Format: DOC Size: 20KB Download file

This file can be viewed with: Microsoft Word ViewerOpen Data

To study whether the ratio of p and n has any impact on the results, the simulation cases are considered for the different combination pair of p and n as p/n → + ∞, p/n c, where c is a constant and p/n → 0. The p and n pairs are (10,5),(10,10),(10,25),(10,50),(25,5), (25,10),(25,25),(25,50) and (50,5),(50,10),(50,25),(50,50). The data are partitioned randomly into a training set of size n1 and a testing size of n2 with n1 + n2 = 2n. Dudoit [24] suggested that <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M1">View MathML</a>. The TGDR algorithm described in the Method Section was applied to the simulated data and summary statistics of estimated empirical AUC based on 500 simulated data sets are reported in Table 1.

Table 1. Simulation Study Result. Summary statistics of the simulation results of TGDR-AUC algorithm.

The simulation conducted is for a relatively small sample size and a small number of biomarkers. From Table 1, the regularization of TGDR tends to overestimate the AUC when p is less than n. As sample size increases, the estimated AUC by regularization approximates the true AUC. Furthermore, as sample size increases, the standard error stabilizes to be around 0.03.

Table 1 gives a guideline on how to use the TGDR-AUC algorithm for different situations. When a larger sample size n than biomarker p is observed, the algorithm is trustworthy, in the sense that the estimated AUC approximated the true AUC and the best ratio for p/n is around 1/5.0. However, when the number of biomarkers is much larger than the sample size, it remains unclear whether the TGDR-AUC algorithm tends to overestimate the AUC.

Implementation

Two real MS data sets are analyzed, one low-dimensional and another high-dimensional real ovarian cancer data. The low-dimensional data is preprocessed, and high-dimensional data is the mass spectrometry raw data. So high-dimensional data will be preprocessed first before it is carried on any further analysis. The high-dimensional data is a subset of the low-dimensional data because of missing raw data files. The TGDR-AUC algorithm is applied to both data sets for cancer diagnosis. All of the programming were done in C and R (see Additional file 2, 3, 4 for algorithm codes).

Additional file 3. C++ source code for TGDR-AUC algorithm to estimate optimal parameters of λ and τ.

Format: CPP Size: 21KB Download fileOpen Data

Additional file 4. C++ source code for TGDR-AUC algorithm to estimate linear combination parameters and AUC after selecting optimal λ and τ.

Format: CPP Size: 9KB Download fileOpen Data

Low-dimensional Ovarian Cancer Data Analysis

The data contained 73 patients, among which 24 were healthy patients and 49 were ovarian cancer patients. Total 14 glycan biomarkers were pre-selected for the low-dimensional data set. The 3-fold cross-validation TGDR-AUC algorithm was applied to select regularization parameter λ and then select the optimal tuning parameter τ. Using the determined λ and τ, the estimated empirical AUC for the data was further obtained. The result is summarized in the first column in Table 2. The estimated empirical AUC was as high as 0.95, which indicates an excellent cancer diagnosis for linear combing the 14 biomarkers. The ratio of biomarker number 14 to sample size number 73 was less than 1/5.0, so the estimated empirical AUC was trustworthy as suggested by the simulation results. The 500 bootstrap data sets were performed for given both λ and τ. The bootstrap standard error (SE) was 0.00126. The 95% confident interval for AUC was (0.9513,0.9560). The estimated coefficients of the 14 biomarkers are plotted in Figure 2. From the figure, biomarkers number 3 and 8 had the highest coefficients, suggesting the highest influence on the cancer diagnosis. Biomarkers number 11 and 13 had smaller estimated values, indicating less importance than other biomarkers. The low-dimensional data ROC based on the 14 peaks are plotted in Figure 3.

Table 2. Ovarian Cancer Data (with the bootstrap standard error in parenthesis). Implementation results of TGDR-AUC algorithm to MS ovarian cancer data.

thumbnailFigure 2. Low-dimensional Data Result. Plot of estimated coefficients by TGDR-AUC algorithm of potential biomarkers for Low-dimensional data.

thumbnailFigure 3. Low-dimensional Data ROC. Estimated ROC curve given the estimated optimal combination by TGDR-AUC algorithm of biomarkers for low-dimensional data.

High-dimensional Ovarian Cancer Data Analysis

In this section, the analysis starts from the raw ovarian cancer data. The problem with the raw mass spectrometry data is that the data is high-dimensional, so extracting useful information is crucial. For this high-dimensional data set, there were 19 normal patients and 21 cancer patients. Each patient had three measurements, called 10%, 20% and 40% fractions, corresponding to different sample extraction methods carried out prior to the mass spectrometry experiments. For each spectrum, the raw data contained 500,000 data points. The mass spectrometer's manufacturer(Varian FTMS Systems, Lake Forest, CA) provided their software for peak-selection of individual spectra from the 500,000 data points. The problem with their peak-identification is that it is based on an individual spectrum, meaning that for any two spectra, peaks are selected at different mass-to-charge values. Therefore, the peaks are not consistent between samples and not trustworthy for cancer diagnosis. Before any data analysis is performed, the preprocessing of the data is critical.

Preprocessing High-Dimensional Data

First, the peaks selected by the instrument's software for the spectra were used. The selected-peaks were grouped into a matrix, with each column corresponding to one spectrum and each row corresponding to one distinct mass-to-charge (m/z) value. If the spectrum did not have the peak at the m/z value, a zero was replaced to indicate missing value in the matrix. However, this resulted in many zeros in the matrix. The corresponding raw data was chosen to be substituted into the zero intensity. In this way, much more similar information could be included as the raw data.

However, the scale of raw data was not the same as the corresponding selected-peaks data file. The ratio factor between the raw and its corresponding selected-peaks file needed to be estimated. The estimation of the ratio factor was done as the following: find the nearest point in the raw data to its corresponding selected-peaks data, defined as the absolute distance between the corresponding m/z values. Two cases may happen: if the nearest m/z value in the raw data is unique, the ratio is calculated by the raw data intensity to the selected-peaks data intensity at the nearest value; if the nearest value is not unique, the ratio is calculated by the averaged intensities of raw data at those nearest values to the selected-peaks data intensity. For each file, the ratio was then averaged to give the unique factor estimation.

Using the estimated ratio for each file, the intensities from raw data could then be calculated to fill in the data matrix. There, again, may be two scenarios: if the closest m/z in raw data is unique to the selected-peaks data, then substitute in the corresponding raw data intensity with the adjustment by its ratio factor; if the closest m/z in raw data is not unique, then substitute in the maximum intensities of those closest m/z raw data to the corresponding column of the data matrix, with adjusting by its ratio factor. This was chosen as the maximum intensity because we wanted to include the strongest signal to be substituted in the data matrix.

After imputing the data as above, the data matrix was formed with each spectrum as one column of the matrix and intensities at all same m/z values. Before any statistical analysis could be completed, each column of the data matrix was further normalized by dividing total ion current of the corresponding raw data intensities to make sure the comparison of the spectrum would be made on the same level. An arbitrary factor 100000 times the intensities in the data matrix was to amplify the normalized intensities to a reasonable magnitude. Because the data variation was dependent on the mean, log transform was carried out on the data matrix. An arbitrary 0.00000001 was added in the intensities to ensure valid log transformation. The 10%, 20% and 40% fractions were combined by adding intensities up at each m/z value to group the data into one patient as one column in the data matrix.

After appropriate preprocessing of normalization and log transformation, the intensities for MS data are assumed to approximately meet the t-test requirements. We then performed a t-test to each m/z value on factor whether the patient had cancer or not. The p-values of the tests were recorded. The false discovery rate (FDR) by Benjamini and Hochberg [25] was applied to the t-test p-values to adjust to multi-test problem. Only adjusted p-values less than 0.05 were selected out to be potential biomarkers. As a result, 1228 biomarkers were selected for the high-dimensional data case.

High-dimensional Data analysis result

A 3-fold cross-validation of the TGDR-AUC algorithm was applied to the data. The result is listed as the second column in Table 2. The estimated empirical AUC was almost perfect, close to 1. The bootstrap method was applied to estimate the empirical AUC confident interval for 1228 biomarkers. 500 bootstrap data sets were generated. The bagged empirical AUC for the given optimal pair of λ and τ was calculated from the bootstrap sample. The estimated standard error was 0.0002527. The confident interval for bootstrap was (0.9945, 0.9955). The high-dimensional data had a small sample size compared to a large number of biomarkers, which suggests that the estimated empirical AUC may be overestimated by the simulation.

The estimated coefficients were also of interest, and the estimated coefficients are plotted in Figure 4. There were 139 biomarkers (more than 10% of total biomarkers) that had zero coefficients. Only 63 biomarkers had larger than 0.01 estimated coefficients. The TGDR-AUC algorithm provided a simultaneous dimension reduction technique so that if the estimated coefficient was zero, the corresponding biomarker did not have contribution to the AUC optimization. In this case, the dimension was reduced to 5% of the original dimension of 1228. McIntosh and Pepe [26] mentioned in their work that the AUC increases with the number of combined biomarkers. However, this may not be true. To see this, only the 63 biomarkers that have larger estimated coefficients were chosen and all the rest biomarkers coefficients were set to be zero. The ROC using all estimated 1228 biomarkers was compared to the ROC using only 63 biomarkers in Figure 5. The ROC with 63 biomarkers had a higher AUC value compared to AUC using all 1228 biomarkers. Although there was earlier doubt about the empirical AUC being optimistic, the resulting empirical AUC with smaller biomarker numbers indicated that this was a valid approach. Therefore, TGDR-AUC algorithm is a good classifier that provides the sufficiently unbiased AUC. Selected biomarker lists were further compared to those of peaks selected by their biochemical properties. Those 63 biomarkers were used because of their larger estimated coefficients, which indicated their potential as cancer biomarkers. Table 3 lists this result. The oligosaccharide composition was from Hyun Joo An, et al.[8]. The observed masses were also from that work for comparison. The biomarkers selected in that study matched well to those in this analysis. All of these biomarkers had high positive coefficients, which again suggest their potential contribution to cancer identification. More peaks of biomarkers are detected by more objective optimization method of regularized AUC.

Table 3. Comparisons of selected biomarkers between Hyun Joo An, et al. [8] and the TGDR-AUC algorithm. The biomarkers in Hyun Joo An, et al. [8] were selected based on their biochemical properties.

thumbnailFigure 4. High-dimensional Data Result. Plot of estimated coefficients by TGDR-AUC algorithm of potential biomarkers for high-dimensional data.

thumbnailFigure 5. High-dimensional Data ROC. Estimated ROC curve given the estimated optimal combination by TGDR-AUC algorithm of biomarkers for high-dimensional data.

Three m/z values with large positive coefficients are plotted in Figure 6. The m/z values 712.28, 915.43 were selected because of their relative large estimated coefficients and m/z value 1442.72 was illustrated for a higher mass range. The three plots correspond to the three areas. Black is for healthy patients and red for cancer patients. From the Figure 6, all of the areas visually showed larger intensities for cancer patients than healthy patients. Larger coefficients had visually larger differences between the groups. The estimations were verified to make biological sense.

thumbnailFigure 6. Three m/z value area to distinguish cancer from non-cancer. Plot of m/z values areas which can discriminate ovarian cancer (red) from non-cancer (black). The upper left plot is for m/z value 712.28; the upper right plot is for m/z 915.38 and lower plot is for m/z 1442.72. The m/z values are visually larger for cancer patients. For quantification purpose, simple descriptive statistics for the three areas are reported as the following: for m/z value 712.28, mean(standard deviation) of non-cancer samples are 4.506(3.181) and cancer samples are 10.633(3.615). The FDR-adjusted p-value for the comparison is 7.827e-07; for m/z value 915.38, mean(standard deviation) of non-cancer samples are 3.667(2.614) comparing to cancer samples 8.443(3.307). This results in FDR-adjusted p-value of 7.642e-06. The higher m/z 1442.72 has 2.958(2.435) for non-cancer samples of mean(standard deviation), while cancer samples are 7.275(3.1). The FDR-adjusted p-value for high mass area is 1.841e-05.

Conclusion and Discussion

The key contribution of this work is that the optimal rules, for purpose of classifying disease status on the basis of multiple biomarkers, are based on the maximization of the area under the ROC subjected to constrained threshold gradient direct algorithm. The approach presented here relaxes the normality assumption and the approach of Pepe and Thompson [21] is generalized. The analysis is applied to the high-dimensional mass spectrometry glycomics data. In contrast to Ma and Huang [22], the simulation data is generated based on the joint distribution of disease samples and control samples with non-normality assumption, which is more appropriate for mass spectrometry data. This simulation is able to assess the difference between the maximization of the empirical AUC and the target AUC. The simulation proves the asymptotic properties that estimated TGDR-AUC approaches the true AUC when the sample size is increasing compared to the dimensionality of p biomarkers. When applied to the real ovarian cancer data, the algorithm also provides the build-in dimension reduction technique. For the high-dimensional ovarian cancer data, we can detect the 63 most important biomarkers among the total of 1228 biomarkers with simultaneously estimating the linear combination coefficients. The selected 63 biomarkers match very well with the 14 peaks pre-selected based on biological evidence. The algorithm is a non-parametric approach, very flexible and easy to interpret. The algorithm is aimed to have optimal classification. The resulting AUC of the linear combination is plausible and should be optimal among all other possible combinations. The computation of TGDR-AUC is computational feasible of high-dimensional data. This high-dimensional analysis evaluates more than 1000 biomarkers in the algorithm and essentially could consider more.

Although the result of simulation cannot guarantee the estimated AUC comes close enough to the true AUC in large p biomarker number for small sample size situation, the algorithm still provides enough information in recognizing potential biomarkers. Since the performance of small dimension of biomarkers in the large sample size scenario gives excellent overall result, iteratively combining the biomarkers in high-dimensional and reducing the dimension of biomarker to some reasonable size is considered. With the reduced dimension, the biomarkers are combined again using the TGDR-AUC algorithm. The algorithm would continue until the ratio of number of biomarker and the sample size are in the comfortable zone suggested by our simulation. The TGDR-AUC algorithm proves to be a promising algorithm and might be recommended in combining mass spectrometry biomarker analysis for cancer diagnosis.

Methods

ROC Curve

A case-control study is considered where the main outcome is binary denoted as D, where D = 1 as the case and D = 0 as the control. Denote the relative abundance of the p glycomics Rp × 1 = (R1, ..., Rp)T. We consider the linear combination score of the form

Lβ(R) = β'R = β1R1 + β2R2 + ... + βpRp(1)

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M2">View MathML</a> is an unknown p-vector parameter and R serves as the classification predictors. The classification rule is constructed by β'R. To be more specific, we classify D = 1 if β'R c and D = 0 otherwise, for a cutoff value c. By varying the cutoff value c, we obtain the Receiver Operating Characteristic(ROC) curve.

ROC is a graphical plot of the sensitivity and 1-specificity, also known as true positive rate (TPR) and false positive rate (FPR), respectively. The TPR and FPR are defined by

TPR(c) = Pr(β'R c|D = 1),(2)

FPR(c) = Pr(β'R c|D = 0)(3)

for any cutoff value c. By varying the discrimination value of c, the TPR and FPR are plotted to generate the ROC curve, which is a two-dimensional plot of FPR(c) vs TPR(c) with -∞ ≤ c ≤ +∞. There is a balance between TPR and FPR. A completely random predictor would give a straight line at an angle of 45 degrees from the horizontal, from bottom left to top right, because as the threshold is raised, there would be equal numbers of true and false positives. ROC above the no-discrimination line would be preferred with better classification as the line closer to the upper left-corner point (0,1).

The overall performance of the classifier can be evaluated by the area under the ROC curve (AUC). Denote n as the number of diseased samples, m as the number of normal samples and p as the dimension of biomarkers. Denote Xi = (Xi1, ..., Xip) as the i-th normal subject, and Yj = (Yj1, ..., Yjp) as the j-th diseased subject, i = 1, ..., m, j = 1, ..., n. For a given parameter β, the corresponding ROC curve is generated by linearly combining the p biomarkers for classifier β'Y or β'X. It has been shown by Bamber [27] that the theoretical area under the ROC curve is a probability Pr(β'Y - β'X ≥ 0). To achieve the optimal performance, we need to maximize

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M3">View MathML</a>

(4)

Statistically, the empirical AUC is given by

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M4">View MathML</a>

(5)

where the function Ψ(β; Xi, Yj) in (5) is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M5">View MathML</a>

The empirical AUC is the same as the form of Mann-Whitney test statistics. The optimal estimator <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M6">View MathML</a> is then defined as the maximizer of AUC(β).

The Sigmoid Function

The main problem of the maximization (5) is that the objective function is not continuous, and thus not differentiable. The maximization is difficult to achieve and the maximizer is not unique. To overcome the difficulty, a smooth sigmoid function was chosen to approximate the objective function. The sigmoid function is a monotonically increasing function with a parameter r > 0 and limx → -∞ Sr(x) = 0, and limx → +∞ Sr(x) = 1. There are many choices of sigmoid functions, including:

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M7">View MathML</a>

The larger r is, the better the approximation will be. For a given value r = 10, Figure 7 shows the approximation result. The first function S1, r(x) approximates the best. The first function S1, r(x) can be simply written as

thumbnailFigure 7. Sigmoid Function Approximation with r = 10. Several candidate sigmoid functions are plotted to approximate the indicator function.

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M8">View MathML</a>

Therefore, for the choice of sigmoid function, the first function was used for further analysis. We now refer to the estimator <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M6">View MathML</a> as the maximizer of

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M9">View MathML</a>

(6)

where Y, X and index i, j are defined in (5).

Since the exponential part in sigmoid function may result in unbounded situation when larger r is selected or the data itself may be large, the exponential part was chosen to be controlled by normalizing the data in the following way: denote Zji = Yj - Xi, i = 1, ..., m, j = 1, ..., n. Z is a pairwise difference matrix between is the disease and normal patients. We then normalize Z as Z/||Z||2, where <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M10">View MathML</a> and p is the dimension of Z.

Threshold Gradient Direct Regularization (TGDR)

The TGDR approach constructs a parameter path β(λ) in parameter space that some of the points on that path are close to the point β* in(6) representing the optimal solution. The best parameter path will be selected by k-fold cross-validation technique. Consider now to minimize

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M11">View MathML</a>

(7)

where P(β) is the penalties. There are several candidate penalty terms, <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M12">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M13">View MathML</a> and P(β) = maxi = 1, ..., p βi, where p is the number of parameters. We choose the quadratic penalty term <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M13">View MathML</a> because it is the most common one and use this in our following analysis.

Let ν denotes the starting value on the path and Δν as the increment on the path. To implement the algorithm, we select Δν = 0.01. For any given r of the sigmoid function, a threshold τ is varying between 0 and 1. τ was chosen in the algorithm to be a vector {0,0.1,0.2,...,0.9,1}. The TGDR algorithm performs the following iteration steps:

• Initialize βi = 0.01 and ν = 0, for i = 1, ..., p.

• Compute the negative gradient gi(ν) = -∂G(β; λ)/∂βi evaluatedat βi(ν). Denote gi(ν) as the i-th component of g(ν). If maxi{|gi(ν)|} = 0, stop the iteration.

• Compute vector of fi(ν) as the i-th component of f(ν); fi(ν) = I{|gi(ν)| ≥ τ·maxl|gl(ν)|, l = 1, ..., p}.

• Update βi(ν + Δν) = βi(ν) + Δν × gi(ν) × fi(ν). Replace by ν + Δν.

• Repeat step 2–4 until β converges, which means <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M14">View MathML</a>. ε is a pre-select small number and k is the number of iteration steps. We choose ε = 1 × 10-8.

The tuning parameter or the threshold τ controls the distribution of estimator <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M6">View MathML</a>. When τ = 0, the estimator β is updated on every gradient and therefore the converged estimator is close to ridge regression (RR); while τ = 1, the estimated β is only updated on the maximum gradients, and the result is roughly corresponding to LASSO (Least Absolute Shrinkage and Selection Operator, [28]). The τ values in between 0 and 1 create the estimators more diverse than RR, but less than LASSO.

Double Cross-Validation

The selection of the parameter path λ is determined by k-fold cross-validation(CV). Since the empirical AUC is in fact a nonparametric two-sample comparison test, a slight variant of k-fold CV is considered and it is called double k-fold CV for two samples. The n number of diseased patients Y is randomly split into roughly equal-sized K1 parts and m number of normal patients X into roughly equal-sized K2 parts. Let k1 index which of {1, ..., n} is in {1, ..., K1} groups, and k2 index which of {1, ..., m} is in {1, ..., K2} groups. The cross-validation estimate of prediction error is defined to be

<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M15">View MathML</a>

(8)

where <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M16">View MathML</a> is calculated by (7) when the k1(j) part of Y and k2(i) part of X data are removed. The optimal λ* is found by minimizing (8).

The function in (8) reduces the p-dimensional optimization problem to be one-dimensional of minimizing λ. The golden section search method [29] was implemented, which does not require the calculation of the derivative, as the optimization method to search for the minimal value between 0 to a large number.

For any given τ in {0,0.1,0.2,...,0.9,1}, we run a double k-fold cross-validation. Denote τl, l = 1, ..., 11 as the l-th component in the vector {0,0.1,0.2,...,0.9,1}. After selection of regularized <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M17">View MathML</a> estimator for given τl, l = 1, ..., 11, the optimal <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M18">View MathML</a> is chosen to be min{CV(<a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M17">View MathML</a>), l = 1, ..., 11}, where CV(λ) is evaluated by substituting the regularized <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M17">View MathML</a> in (8). One may also adapt Stone's two-stage cross-validation [30], but it will be computationally intensive.

Positive constraints

Because of the positive nature of the mass spectrometry data, a positive constraint on the parameter β is reasonable. The objective function (7) is minimized subjected to the constraints βk ≥ 0 for k = 1, ..., p. Hence, the estimation <a onClick="popup('http://www.biomedcentral.com/1471-2105/8/477/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/8/477/mathml/M6">View MathML</a> will result in some exact zero coefficients when the optimization hits the positive constraint boundary, which means that the biomarker has no contribution to maximize the AUC.

Authors' contributions

JY implemented the software for the mass spectrometry preprocessing method and threshold gradient direct regularization and area under the curve and drafted the manuscript. HL designed the TGDR-AUC algorithm and helped draft the manuscript. CK performed the mass spectrometry glycomics experiment and helped revise the manuscript. CBL participated in the design of the mass spectrometry glycomics experiment. DMR contributed to the methods of preprocessing the mass spectrometry spectrum and oversaw the overall project. All authors read and approved the final manuscript.

Acknowledgements

The statistical part of work is funded by NIH grant: NIH NHGRI R01-HG003352, NIH NIEHS P42-ES04699 and NIH NCI P30-CA093373. The biological part of work is funded by NIH grant: NIH RO1 GM49077. In addition, the authors thank two anonymous referees for their insightful comments.

References

  1. Pepe M, Cai T, Longton GM: Combining Predictors for Classification using the Area Under the Receiver Operating Characteristic Curve.

    Biometrics 2006, 62:221-229. PubMed Abstract | Publisher Full Text OpenURL

  2. Baggerly A, Morris J, Wang J, Gold D, Xiao L, Coombes K: A comprehensive approach to the analysis of matrix-assited laser desorption/ionization-time of flight proteomics spectra from serum samples.

    Proteomics 2003, 3:1667-1672. PubMed Abstract | Publisher Full Text OpenURL

  3. Wagner M, Naik D, Pothen A: Protocols for disease classification from mass spectrometry data.

    Proteomics 2003, 3:1692-1698. PubMed Abstract | Publisher Full Text OpenURL

  4. Adam B, Qu Y, Davis JW, Ward MD, Clements MA, Cazares LH, Semmes OJ, Schellhammer PF, Yasui Y, Feng Z, Wright GL Jr: Serum Protein Fingerprinting Coupled with a Pattern-matching Algorithm Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men.

    Cancer Research 2002, 62:3609-3614. PubMed Abstract | Publisher Full Text OpenURL

  5. Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments.

    Bioinformatics 2004, 20(5):777-785. PubMed Abstract | Publisher Full Text OpenURL

  6. Apweiler R, Hermjakob H, Sharon N: On the Frequency of Protein Glycosylation, as deduced from analysis of the SWISS-PROT database.

    Biochimica et Biophysica Acta 1999, 1473(1):4-8. PubMed Abstract | Publisher Full Text OpenURL

  7. Varki A: Biological roles of oligosaccharides: all of the theories are correct.

    Glycobiology 1993, 3(2):97-130. PubMed Abstract | Publisher Full Text OpenURL

  8. An H, Miyamoto S, Lancaster K, Kirmiz C, Li B, Lam K, Leiserowitz G, Lebrilla C: Profiling of Glycans in Serum for the Discovery of Potential Biomarkers for Ovarian Cancer.

    Journal of Proteome Research 2006, 5:1626-1635. PubMed Abstract | Publisher Full Text OpenURL

  9. Pepe M, Etzioni R, Feng Z, Potter J, Thompson M, Thornquist M, Winget M, Yasui Y: Phases of biomarker development for early detection of cancer.

    Journal of the National Cancer Institute 2001, 93(14):1054-1061. PubMed Abstract | Publisher Full Text OpenURL

  10. Ball G, Mian S, Holding F, Allibone RO, Lowe J, Ali S, Li G, McCardle S, Ellis IO, Creaser C, Rees RC: An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumours and rapid identification of potential biomarkers.

    Bioinformatics 2002, 18(3):395-404. PubMed Abstract | Publisher Full Text OpenURL

  11. Lancashire LJ, Mian S, Rees RC, Ball GR: Preliminary artificial neural network analysis of SELDI mass spectrometry data for the classification of melanoma tissue. In 17th European Simulation Multiconference, Nottingham. Society for Modeling and Simulation International, SCS European Publishing House, Erlanger, Germany; 2003:131-135. OpenURL

  12. Mian S, Ball G, Hornbuckle J, Holding F, Carmichael J, Ellis I, Ali S, Li G, McArdle S, Creaser C, Rees R: A prototype methodology combining surface-enhanced laser desorption/ionization protein chip technology and artificial neural network algorithms to predict the chemoresponsiveness of breast cancer cell linear exposed to Paclitaxel and Doxorubicin under in vitro condition.

    Proteomics 2003, 3:1725-1737. PubMed Abstract | Publisher Full Text OpenURL

  13. Fushiki T, Fujisawa H, Eguchi S: Identification of biomarkers from mass spectrometry data using a "common" peak approach.

    BMC Bioinformatics 2006., 7(358) PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Geurts P, Fillet M, Seny D, Meuwis M, Malaise M, Merville M, Wehenkel L: Proteomics mass spectra classification using decision tree based ensemble methods.

    Bioinformatics 2005, 21(15):3138-3145. PubMed Abstract | Publisher Full Text OpenURL

  15. Xiong X, Fang X, Zhao J: Biomarker identification by feature wrappers.

    Genome Research 2001, 11:1878-1887. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data.

    Bioinformatics 2003, 19(13):1636-1643. PubMed Abstract | Publisher Full Text OpenURL

  17. Miketova P, Abbas-Hawka C, Hadfiled T: Microorganism gram-type differentiation of whole cells based on pyrolysis high-resolution mass spectrometry data.

    Journal of Analytical and Applied Pyrolysis 2003, 67:109-122. Publisher Full Text OpenURL

  18. Lilien RH, Farid H, Donald BR: Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum.

    Journal of Computational Biology 2003, 10:925-946. PubMed Abstract | Publisher Full Text OpenURL

  19. Datta S, DePadilla LM: Feature selection and machine learning with mass spectrometry data for distinguishing cancer and non-cancer samples.

    Statistical Methodology 2006, 3:79-92. Publisher Full Text OpenURL

  20. Su J, Liu J: Linear Combinations of Multiple Diagnostic Markers.

    Journal of the American Statistical Association 1993, 88(424):1350-1355. Publisher Full Text OpenURL

  21. Pepe M, Thompson M: Combining diagnostic test results to increase accuracy.

    Biostatistics 2000, 1(2):123-140. PubMed Abstract | Publisher Full Text OpenURL

  22. Ma S, Huang J: Regularized ROC method for disease classification and biomarker selection with microarray data.

    Bioinformatics 2005, 21:4356-4362. PubMed Abstract | Publisher Full Text OpenURL

  23. Friedman J, Popescu B: Gradient Directed Regularization for linear regression and classification. [http://www-stat.stanford.edu/jhf/ftp/path.pdf] webcite

    Technical report Department of Statistics, Stanford University, CA; 2004. OpenURL

  24. Dudoit S, Fridlyand J, Speed T: Comparison of discrimination methods for tumor classification based on microarray data.

    Journal of the American Statistical Association 2002, 97:77-87. Publisher Full Text OpenURL

  25. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.

    Journal of Royal Statistical Society B 1995, 57(1):289-300. OpenURL

  26. McIntosh M, Pepe M: Combining Several Screening Tests: Optimality of the Risk Score.

    Biometrics 2002, 58:657-664. PubMed Abstract | Publisher Full Text OpenURL

  27. Bamber D: The area above the ordinal dominance graph and the area below the receiver operating characteristic graph.

    Journal of Mathematical Psychology 1975, 12:387-415. Publisher Full Text OpenURL

  28. Tibshirani R: Regression shrinkage and selection via lasso.

    Journal of the Royal Statistical Society B 1996, 58:267-288. OpenURL

  29. Press W, Teukolsky S, Vetterling W, Flannery B: Golden Section Search in One Dimension. In Numerical Recipies in C: the Art of Scientific Computing. 2nd edition. Combridge University Press; 1992. OpenURL

  30. Stone M: Cross-validatory choice and assessment of statistical predictions.

    Journal of Royal Statistical Society 1974, 36:111-147. OpenURL

  31. Gui J, Li H: Threshold gradient descent method for censored data regression with applications in pharmacogenomics.

    Pac Symp Biocomput 2005, :272-283. PubMed Abstract | Publisher Full Text OpenURL