Abstract
Background
Novel molecular and statistical methods are in rising demand for disease diagnosis and prognosis with the help of recent advanced biotechnology. Highresolution mass spectrometry (MS) is one of those biotechnologies that are highly promising to improve health outcome. Previous literatures have identified some proteomics biomarkers that can distinguish healthy patients from cancer patients using MS data. In this paper, an MS study is demonstrated which uses glycomics to identify ovarian cancer. Glycomics is the study of glycans and glycoproteins. The glycans on the proteins may deviate between a cancer cell and a normal cell and may be visible in the blood. Highresolution MS has been applied to measure relative abundances of potential glycan biomarkers in human serum. Multiple potential glycan biomarkers are measured in MS spectra. With the objection of maximizing the empirical area under the ROC curve (AUC), an analysis method was considered which combines potential glycan biomarkers for the diagnosis of cancer.
Results
Maximizing the empirical AUC of glycomics MS data is a largedimensional optimization problem. The technical difficulty is that the empirical AUC function is not continuous. Instead, it is in fact an empirical 0–1 loss function with a large number of linear predictors. An approach was investigated that regularizes the area under the ROC curve while replacing the 0–1 loss function with a smooth surrogate function. The constrained threshold gradient descent regularization algorithm was applied, where the regularization parameters were chosen by the crossvalidation method, and the confidence intervals of the regression parameters were estimated by the bootstrap method. The method is called TGDRAUC algorithm. The properties of the approach were studied through a numerical simulation study, which incorporates the positive values of mass spectrometry data with the correlations between measurements within person. The simulation proved asymptotic properties that estimated AUC approaches the true AUC. Finally, mass spectrometry data of serum glycan for ovarian cancer diagnosis was analyzed. The optimal combination based on TGDRAUC algorithm yields plausible result and the detected biomarkers are confirmed based on biological evidence.
Conclusion
The TGDRAUC algorithm relaxes the normality and independence assumptions from previous literatures. In addition to its flexibility and easy interpretability, the algorithm yields good performance in combining potential biomarkers and is computationally feasible. Thus, the approach of TGDRAUC is a plausible algorithm to classify disease status on the basis of multiple biomarkers.
Background
With rapidly developing biotechnology, the use of highthroughput clinical laboratory data to detect disease conditions and predict patients' outcomes is becoming a reality for medical practice. These technologies include microarray, mass spectrometry applied to proteomics, and new imaging modalities, which have been engaged in research on detecting clinical disease, predicting patients' responses to different treatments and evaluating the prognosis of patients with disease [1].
Among those new biotechnologies, mass spectrometry (MS) is used increasingly for protein profiling in cancer research. The basic goal is to predict cancer on the basis of peptide/protein abundance from the MS data. Recent literatures on cancer classification using MS have identified some potential protein biomarkers in serum to distinguish cancer from normal samples (Baggerly et al.[2], Wagner et al.[3], Adam et al.[4]). However, sensitivity and reproducibility remains as a major concern in making the protein technology reliable [5].
As an alternative, glycomics is proposed as a new trend for biomarker detection at the end of 20th century. Glycomics is the study of glycans (oligosaccharides), and glycoproteins. Apweiler et al.[6] estimated that at least 50% of human proteins are glycosylated. Since glycans play crucial roles in cell communication and signalling events [7], they may be implicated in cancer. Compared to potential protein or peptide biomarkers, oligosaccharides are highly sensitive to biochemical environment and are more easily identified and quantified [8]. Therefore, in a study conducted in this lab, clinical glycomics is used to identify potential biomarkers for the early detection of ovarian cancer.
Ovarian cancer is one of the most deadly types of cancer among women [8]. Many investigators believe that early detection of ovarian cancer would improve the patients' survival. CA 125 is the only FDA approved biomarker for the early detection of ovarian cancer. However, it has unreasonably low sensitivity and specificity. For instance, only 50% women with Stage I ovarian cancer will have an elevated CA 125 and many benign conditions can cause elevated levels [8].
One technique of profiling oligosaccharides into serum was developed in this lab. The idea that glycoproteins which are sloughed off cells may be detected in patients' serum was utilized for this analysis. Serum samples were analyzed by the highresolution MS to assess the variation of glycans in cancer patient sera compared to healthy patient sera. MS data are highdimensional. Figure 1 shows a typical mass spectrum. In this experiment, a single spectrum contains 500,000 distinct masstocharge values (which measure the ratio of mass to the charge of glycans) and the corresponding relative intensities (which measure relative abundances of glycans) in the serum sample. It is desirable to use all informative glycans because multiple biomarkers may allow improved sensitivity and specificity of cancer detection [9].
Figure 1. Typical Glycomic Mass Spectrum. A typical mass spectrometry glycomics spectra is plotted. The xaxis is m/z value and the yaxis is the corresponding intensity, which measures the relative abundance of glycans.
A number of recent studies have implemented machine learning algorithms to classify highdimensional MS cancer data. Artificial neural networks (ANNs) were utilized by Ball et al.[10], Lancashire et al.[11] and Mian et al.[12], to discriminate different tumor states. Fushiki et al. [13] explored the efficient learning algorithm AdaBoost to extract potential biomarkers for classifying MS cancer from control samples. Decision tree based ensemble methods were proposed by Geurts et al.[14] to identify biomarkers for inflammatory diseases. Other algorithms, such as support vector machine (SVM) used by Xiong et al.[15], random forest (RF) applied by Wu et al.[16] and linear discriminant analysis (LDA) employed by Miketova et al.[17] and Lilien et al.[18], were also studied. Comparisons among algorithms in a case study of ovarian cancer classification were evaluated by Datta and DePadilla [19] and Wu et al.[16]. All of the above studies aim to discover the potential MS biomarkers that can distinguish one group from another. However, for prognostic and diagnostic purposes, how to combine those MS biomarkers and whether or not the combination is optimal are not addressed in those studies. Thus, an objective of this research is to consider a statistical method that combines the highdimensional MS measurements into a single score to classify cancer status jointly with suitable preprocessing of the data.
There are several studies on combining biomarkers. Su and Liu [20] studied the case where markers follow a multivariate normal distribution. They gave a closed form of optimal solution to the linear parameters. Normality is not suitable for mass spectrometry data because measurements of relative abundance are always positive. Pepe and Thompson [21] considered linearly combining two biomarkers by optimizing the area under the ROC curve. The method was developed only for lowdimensional situation. And it is not trivial to generalize the approach to highdimensional case. Ma and Huang [22] applied Pepe and Thompson's idea of optimizing AUC to microarray experiment. They used multivariate normal distribution in the simulation study and assumed independence between biomarkers, which is not true for mass spectrometry data. In addition, they implemented the threshold gradient algorithm, first proposed by Friedman and Popescu [23], without correctly recognizing the regularization parameter. In this work, the question on how to combine the highdimensional mass spectrometry predictors into a single score for the purpose of classification is addressed. The performance of a classifier by maximizing the area under the ROC curve for linearly combining the biomarkers is evaluated. The technical difficulty of this optimization problem is that the empirical AUC function is not differentiable. The objective function is in fact an empirical 0–1 loss function with a large number of linear predictors, and it is well known that such optimization problem is illimposed. An approach that regularizes the area under the ROC curve while replacing the 0–1 loss function with a sigmoid function was investigated. A constrained threshold gradient descent regularization algorithm, which is first introduced by Friedman and Popescu [23], to stabilize the estimates is applied. In Friedman and Popescu, they demonstrated their algorithm in a quadratic objective function. In this study, their objective function is replaced with the area under the ROC. A simulation is also conducted on mass spectrometry data under the casecontrol design that will generate joint distribution of diseased samples and normal samples to evaluate this algorithm.
The article is organized as the following. Simulation study is described in the Testing Section which describes how effective the proposed TGDRAUC approach is. The Implementation Section is for real mass spectrometry ovarian cancer data analysis after a description of our preprocessing method. TGDRAUC method is applied to lowdimensional and highdimensional ovarian cancer data. In the Conclusion and Discussion Section, it is concluded that the TGDRAUC algorithm is appropriate in the analysis of mass spectrometry glycomic data. A detailed description for TGDRAUC algorithm is in Method Section. The definition and properties of the ROC curve are reviewed. The area under the ROC curve as the objective function for maximizing the performance of the classifier is proposed. Furthermore, several sigmoid functions that replace the 0–1 loss function are introduced and a simple comparison among the sigmoid functions is shown. Threshold gradient direct regularization algorithm is explained after selection of sigmoid function as well as the detailed algorithm for parameter estimations.
Results
Testing
Testing of the TGDRAUC algorithm was demonstrated through a simulation study. Since the mass spectrometry measures the relative abundance of molecules, the measurement is always positive. Hence, a positive distribution is a reasonable choice for data simulation. In contrast to Ma and Huang [22], who simulated data under normal distribution and assumed independence among biomarkers, exponential distribution was chosen to generate the simulation data. Since a better classifier is desired, the data used for simulation were chosen so that the true AUC equals to 0.95. The simulation is generated as the following:
• Denote X as normal patient, and m is the number of normal patients. Denote Y as the disease patient, and n is the number of diseased patients. For simplicity, we choose m = n. The dimension of the biomarkers is denoted as p.
• Simulate X_{i }as an exponential distribution with parameter λ = 1, i = 1, ..., n.
• Generate a Bernoulli trial B(1, 0.95) for n times.
• The data for the diseased patients are generated as Y_{i }= X_{i }+ 1 when Bernoulli trial is 1; Y_{i }= max{0, X_{i } 1} when Bernoulli trail is 0.
The number of the replication was chosen to be 500. The data is generated as joint distribution of X and Y. The true probability is P(X <Y) = 0.95, no matter what the linear combination β is. The goal of the simulation study is to show that the maximizer of empirical AUC by TGDR regularization is in fact our targeted maximization problem (4) (see Additional file 1, 2 for examples of simulated data).
Additional file 1. Simulated data with different combinations of n and p used to evaluate TGDRAUC algorithm.
Format: ZIP Size: 107KB Download file
Additional file 2. Introduction on how to use the provided C++ code and the simulated data.
Format: DOC Size: 20KB Download file
This file can be viewed with: Microsoft Word Viewer
To study whether the ratio of p and n has any impact on the results, the simulation cases are considered for the different
combination pair of p and n as p/n → + ∞, p/n → c, where c is a constant and p/n → 0. The p and n pairs are (10,5),(10,10),(10,25),(10,50),(25,5), (25,10),(25,25),(25,50) and (50,5),(50,10),(50,25),(50,50).
The data are partitioned randomly into a training set of size n_{1 }and a testing size of n_{2 }with n_{1 }+ n_{2 }= 2n. Dudoit [24] suggested that
Table 1. Simulation Study Result. Summary statistics of the simulation results of TGDRAUC algorithm.
The simulation conducted is for a relatively small sample size and a small number of biomarkers. From Table 1, the regularization of TGDR tends to overestimate the AUC when p is less than n. As sample size increases, the estimated AUC by regularization approximates the true AUC. Furthermore, as sample size increases, the standard error stabilizes to be around 0.03.
Table 1 gives a guideline on how to use the TGDRAUC algorithm for different situations. When a larger sample size n than biomarker p is observed, the algorithm is trustworthy, in the sense that the estimated AUC approximated the true AUC and the best ratio for p/n is around 1/5.0. However, when the number of biomarkers is much larger than the sample size, it remains unclear whether the TGDRAUC algorithm tends to overestimate the AUC.
Implementation
Two real MS data sets are analyzed, one lowdimensional and another highdimensional real ovarian cancer data. The lowdimensional data is preprocessed, and highdimensional data is the mass spectrometry raw data. So highdimensional data will be preprocessed first before it is carried on any further analysis. The highdimensional data is a subset of the lowdimensional data because of missing raw data files. The TGDRAUC algorithm is applied to both data sets for cancer diagnosis. All of the programming were done in C and R (see Additional file 2, 3, 4 for algorithm codes).
Additional file 3. C++ source code for TGDRAUC algorithm to estimate optimal parameters of λ and τ.
Format: CPP Size: 21KB Download file
Additional file 4. C++ source code for TGDRAUC algorithm to estimate linear combination parameters and AUC after selecting optimal λ and τ.
Format: CPP Size: 9KB Download file
Lowdimensional Ovarian Cancer Data Analysis
The data contained 73 patients, among which 24 were healthy patients and 49 were ovarian cancer patients. Total 14 glycan biomarkers were preselected for the lowdimensional data set. The 3fold crossvalidation TGDRAUC algorithm was applied to select regularization parameter λ and then select the optimal tuning parameter τ. Using the determined λ and τ, the estimated empirical AUC for the data was further obtained. The result is summarized in the first column in Table 2. The estimated empirical AUC was as high as 0.95, which indicates an excellent cancer diagnosis for linear combing the 14 biomarkers. The ratio of biomarker number 14 to sample size number 73 was less than 1/5.0, so the estimated empirical AUC was trustworthy as suggested by the simulation results. The 500 bootstrap data sets were performed for given both λ and τ. The bootstrap standard error (SE) was 0.00126. The 95% confident interval for AUC was (0.9513,0.9560). The estimated coefficients of the 14 biomarkers are plotted in Figure 2. From the figure, biomarkers number 3 and 8 had the highest coefficients, suggesting the highest influence on the cancer diagnosis. Biomarkers number 11 and 13 had smaller estimated values, indicating less importance than other biomarkers. The lowdimensional data ROC based on the 14 peaks are plotted in Figure 3.
Table 2. Ovarian Cancer Data (with the bootstrap standard error in parenthesis). Implementation results of TGDRAUC algorithm to MS ovarian cancer data.
Figure 2. Lowdimensional Data Result. Plot of estimated coefficients by TGDRAUC algorithm of potential biomarkers for Lowdimensional data.
Figure 3. Lowdimensional Data ROC. Estimated ROC curve given the estimated optimal combination by TGDRAUC algorithm of biomarkers for lowdimensional data.
Highdimensional Ovarian Cancer Data Analysis
In this section, the analysis starts from the raw ovarian cancer data. The problem with the raw mass spectrometry data is that the data is highdimensional, so extracting useful information is crucial. For this highdimensional data set, there were 19 normal patients and 21 cancer patients. Each patient had three measurements, called 10%, 20% and 40% fractions, corresponding to different sample extraction methods carried out prior to the mass spectrometry experiments. For each spectrum, the raw data contained 500,000 data points. The mass spectrometer's manufacturer(Varian FTMS Systems, Lake Forest, CA) provided their software for peakselection of individual spectra from the 500,000 data points. The problem with their peakidentification is that it is based on an individual spectrum, meaning that for any two spectra, peaks are selected at different masstocharge values. Therefore, the peaks are not consistent between samples and not trustworthy for cancer diagnosis. Before any data analysis is performed, the preprocessing of the data is critical.
Preprocessing HighDimensional Data
First, the peaks selected by the instrument's software for the spectra were used. The selectedpeaks were grouped into a matrix, with each column corresponding to one spectrum and each row corresponding to one distinct masstocharge (m/z) value. If the spectrum did not have the peak at the m/z value, a zero was replaced to indicate missing value in the matrix. However, this resulted in many zeros in the matrix. The corresponding raw data was chosen to be substituted into the zero intensity. In this way, much more similar information could be included as the raw data.
However, the scale of raw data was not the same as the corresponding selectedpeaks data file. The ratio factor between the raw and its corresponding selectedpeaks file needed to be estimated. The estimation of the ratio factor was done as the following: find the nearest point in the raw data to its corresponding selectedpeaks data, defined as the absolute distance between the corresponding m/z values. Two cases may happen: if the nearest m/z value in the raw data is unique, the ratio is calculated by the raw data intensity to the selectedpeaks data intensity at the nearest value; if the nearest value is not unique, the ratio is calculated by the averaged intensities of raw data at those nearest values to the selectedpeaks data intensity. For each file, the ratio was then averaged to give the unique factor estimation.
Using the estimated ratio for each file, the intensities from raw data could then be calculated to fill in the data matrix. There, again, may be two scenarios: if the closest m/z in raw data is unique to the selectedpeaks data, then substitute in the corresponding raw data intensity with the adjustment by its ratio factor; if the closest m/z in raw data is not unique, then substitute in the maximum intensities of those closest m/z raw data to the corresponding column of the data matrix, with adjusting by its ratio factor. This was chosen as the maximum intensity because we wanted to include the strongest signal to be substituted in the data matrix.
After imputing the data as above, the data matrix was formed with each spectrum as one column of the matrix and intensities at all same m/z values. Before any statistical analysis could be completed, each column of the data matrix was further normalized by dividing total ion current of the corresponding raw data intensities to make sure the comparison of the spectrum would be made on the same level. An arbitrary factor 100000 times the intensities in the data matrix was to amplify the normalized intensities to a reasonable magnitude. Because the data variation was dependent on the mean, log transform was carried out on the data matrix. An arbitrary 0.00000001 was added in the intensities to ensure valid log transformation. The 10%, 20% and 40% fractions were combined by adding intensities up at each m/z value to group the data into one patient as one column in the data matrix.
After appropriate preprocessing of normalization and log transformation, the intensities for MS data are assumed to approximately meet the ttest requirements. We then performed a ttest to each m/z value on factor whether the patient had cancer or not. The pvalues of the tests were recorded. The false discovery rate (FDR) by Benjamini and Hochberg [25] was applied to the ttest pvalues to adjust to multitest problem. Only adjusted pvalues less than 0.05 were selected out to be potential biomarkers. As a result, 1228 biomarkers were selected for the highdimensional data case.
Highdimensional Data analysis result
A 3fold crossvalidation of the TGDRAUC algorithm was applied to the data. The result is listed as the second column in Table 2. The estimated empirical AUC was almost perfect, close to 1. The bootstrap method was applied to estimate the empirical AUC confident interval for 1228 biomarkers. 500 bootstrap data sets were generated. The bagged empirical AUC for the given optimal pair of λ and τ was calculated from the bootstrap sample. The estimated standard error was 0.0002527. The confident interval for bootstrap was (0.9945, 0.9955). The highdimensional data had a small sample size compared to a large number of biomarkers, which suggests that the estimated empirical AUC may be overestimated by the simulation.
The estimated coefficients were also of interest, and the estimated coefficients are plotted in Figure 4. There were 139 biomarkers (more than 10% of total biomarkers) that had zero coefficients. Only 63 biomarkers had larger than 0.01 estimated coefficients. The TGDRAUC algorithm provided a simultaneous dimension reduction technique so that if the estimated coefficient was zero, the corresponding biomarker did not have contribution to the AUC optimization. In this case, the dimension was reduced to 5% of the original dimension of 1228. McIntosh and Pepe [26] mentioned in their work that the AUC increases with the number of combined biomarkers. However, this may not be true. To see this, only the 63 biomarkers that have larger estimated coefficients were chosen and all the rest biomarkers coefficients were set to be zero. The ROC using all estimated 1228 biomarkers was compared to the ROC using only 63 biomarkers in Figure 5. The ROC with 63 biomarkers had a higher AUC value compared to AUC using all 1228 biomarkers. Although there was earlier doubt about the empirical AUC being optimistic, the resulting empirical AUC with smaller biomarker numbers indicated that this was a valid approach. Therefore, TGDRAUC algorithm is a good classifier that provides the sufficiently unbiased AUC. Selected biomarker lists were further compared to those of peaks selected by their biochemical properties. Those 63 biomarkers were used because of their larger estimated coefficients, which indicated their potential as cancer biomarkers. Table 3 lists this result. The oligosaccharide composition was from Hyun Joo An, et al.[8]. The observed masses were also from that work for comparison. The biomarkers selected in that study matched well to those in this analysis. All of these biomarkers had high positive coefficients, which again suggest their potential contribution to cancer identification. More peaks of biomarkers are detected by more objective optimization method of regularized AUC.
Table 3. Comparisons of selected biomarkers between Hyun Joo An, et al. [8] and the TGDRAUC algorithm. The biomarkers in Hyun Joo An, et al. [8] were selected based on their biochemical properties.
Figure 4. Highdimensional Data Result. Plot of estimated coefficients by TGDRAUC algorithm of potential biomarkers for highdimensional data.
Figure 5. Highdimensional Data ROC. Estimated ROC curve given the estimated optimal combination by TGDRAUC algorithm of biomarkers for highdimensional data.
Three m/z values with large positive coefficients are plotted in Figure 6. The m/z values 712.28, 915.43 were selected because of their relative large estimated coefficients and m/z value 1442.72 was illustrated for a higher mass range. The three plots correspond to the three areas. Black is for healthy patients and red for cancer patients. From the Figure 6, all of the areas visually showed larger intensities for cancer patients than healthy patients. Larger coefficients had visually larger differences between the groups. The estimations were verified to make biological sense.
Figure 6. Three m/z value area to distinguish cancer from noncancer. Plot of m/z values areas which can discriminate ovarian cancer (red) from noncancer (black). The upper left plot is for m/z value 712.28; the upper right plot is for m/z 915.38 and lower plot is for m/z 1442.72. The m/z values are visually larger for cancer patients. For quantification purpose, simple descriptive statistics for the three areas are reported as the following: for m/z value 712.28, mean(standard deviation) of noncancer samples are 4.506(3.181) and cancer samples are 10.633(3.615). The FDRadjusted pvalue for the comparison is 7.827e07; for m/z value 915.38, mean(standard deviation) of noncancer samples are 3.667(2.614) comparing to cancer samples 8.443(3.307). This results in FDRadjusted pvalue of 7.642e06. The higher m/z 1442.72 has 2.958(2.435) for noncancer samples of mean(standard deviation), while cancer samples are 7.275(3.1). The FDRadjusted pvalue for high mass area is 1.841e05.
Conclusion and Discussion
The key contribution of this work is that the optimal rules, for purpose of classifying disease status on the basis of multiple biomarkers, are based on the maximization of the area under the ROC subjected to constrained threshold gradient direct algorithm. The approach presented here relaxes the normality assumption and the approach of Pepe and Thompson [21] is generalized. The analysis is applied to the highdimensional mass spectrometry glycomics data. In contrast to Ma and Huang [22], the simulation data is generated based on the joint distribution of disease samples and control samples with nonnormality assumption, which is more appropriate for mass spectrometry data. This simulation is able to assess the difference between the maximization of the empirical AUC and the target AUC. The simulation proves the asymptotic properties that estimated TGDRAUC approaches the true AUC when the sample size is increasing compared to the dimensionality of p biomarkers. When applied to the real ovarian cancer data, the algorithm also provides the buildin dimension reduction technique. For the highdimensional ovarian cancer data, we can detect the 63 most important biomarkers among the total of 1228 biomarkers with simultaneously estimating the linear combination coefficients. The selected 63 biomarkers match very well with the 14 peaks preselected based on biological evidence. The algorithm is a nonparametric approach, very flexible and easy to interpret. The algorithm is aimed to have optimal classification. The resulting AUC of the linear combination is plausible and should be optimal among all other possible combinations. The computation of TGDRAUC is computational feasible of highdimensional data. This highdimensional analysis evaluates more than 1000 biomarkers in the algorithm and essentially could consider more.
Although the result of simulation cannot guarantee the estimated AUC comes close enough to the true AUC in large p biomarker number for small sample size situation, the algorithm still provides enough information in recognizing potential biomarkers. Since the performance of small dimension of biomarkers in the large sample size scenario gives excellent overall result, iteratively combining the biomarkers in highdimensional and reducing the dimension of biomarker to some reasonable size is considered. With the reduced dimension, the biomarkers are combined again using the TGDRAUC algorithm. The algorithm would continue until the ratio of number of biomarker and the sample size are in the comfortable zone suggested by our simulation. The TGDRAUC algorithm proves to be a promising algorithm and might be recommended in combining mass spectrometry biomarker analysis for cancer diagnosis.
Methods
ROC Curve
A casecontrol study is considered where the main outcome is binary denoted as D, where D = 1 as the case and D = 0 as the control. Denote the relative abundance of the p glycomics R_{p × 1 }= (R_{1}, ..., R_{p})^{T}. We consider the linear combination score of the form
where
ROC is a graphical plot of the sensitivity and 1specificity, also known as true positive rate (TPR) and false positive rate (FPR), respectively. The TPR and FPR are defined by
for any cutoff value c. By varying the discrimination value of c, the TPR and FPR are plotted to generate the ROC curve, which is a twodimensional plot of FPR(c) vs TPR(c) with ∞ ≤ c ≤ +∞. There is a balance between TPR and FPR. A completely random predictor would give a straight line at an angle of 45 degrees from the horizontal, from bottom left to top right, because as the threshold is raised, there would be equal numbers of true and false positives. ROC above the nodiscrimination line would be preferred with better classification as the line closer to the upper leftcorner point (0,1).
The overall performance of the classifier can be evaluated by the area under the ROC curve (AUC). Denote n as the number of diseased samples, m as the number of normal samples and p as the dimension of biomarkers. Denote X_{i }= (X_{i1}, ..., X_{ip}) as the ith normal subject, and Y_{j }= (Y_{j1}, ..., Y_{jp}) as the jth diseased subject, i = 1, ..., m, j = 1, ..., n. For a given parameter β, the corresponding ROC curve is generated by linearly combining the p biomarkers for classifier β'Y or β'X. It has been shown by Bamber [27] that the theoretical area under the ROC curve is a probability Pr(β'Y  β'X ≥ 0). To achieve the optimal performance, we need to maximize
Statistically, the empirical AUC is given by
where the function Ψ(β; X_{i}, Y_{j}) in (5) is defined as
The empirical AUC is the same as the form of MannWhitney test statistics. The optimal
estimator
The Sigmoid Function
The main problem of the maximization (5) is that the objective function is not continuous, and thus not differentiable. The maximization is difficult to achieve and the maximizer is not unique. To overcome the difficulty, a smooth sigmoid function was chosen to approximate the objective function. The sigmoid function is a monotonically increasing function with a parameter r > 0 and lim_{x → ∞ }S_{r}(x) = 0, and lim_{x → +∞ }S_{r}(x) = 1. There are many choices of sigmoid functions, including:
The larger r is, the better the approximation will be. For a given value r = 10, Figure 7 shows the approximation result. The first function S_{1, r}(x) approximates the best. The first function S_{1, r}(x) can be simply written as
Figure 7. Sigmoid Function Approximation with r = 10. Several candidate sigmoid functions are plotted to approximate the indicator function.
Therefore, for the choice of sigmoid function, the first function was used for further
analysis. We now refer to the estimator
where Y, X and index i, j are defined in (5).
Since the exponential part in sigmoid function may result in unbounded situation when
larger r is selected or the data itself may be large, the exponential part was chosen to be
controlled by normalizing the data in the following way: denote Z_{ji }= Y_{j } X_{i}, i = 1, ..., m, j = 1, ..., n. Z is a pairwise difference matrix between is the disease and normal patients. We then
normalize Z as Z/Z_{2}, where
Threshold Gradient Direct Regularization (TGDR)
The TGDR approach constructs a parameter path β(λ) in parameter space that some of the points on that path are close to the point β_{* }in(6) representing the optimal solution. The best parameter path will be selected by kfold crossvalidation technique. Consider now to minimize
where P(β) is the penalties. There are several candidate penalty terms,
Let ν denotes the starting value on the path and Δν as the increment on the path. To implement the algorithm, we select Δν = 0.01. For any given r of the sigmoid function, a threshold τ is varying between 0 and 1. τ was chosen in the algorithm to be a vector {0,0.1,0.2,...,0.9,1}. The TGDR algorithm performs the following iteration steps:
• Initialize β_{i }= 0.01 and ν = 0, for i = 1, ..., p.
• Compute the negative gradient g_{i}(ν) = ∂G(β; λ)/∂β_{i }evaluatedat β_{i}(ν). Denote g_{i}(ν) as the ith component of g(ν). If max_{i}{g_{i}(ν)} = 0, stop the iteration.
• Compute vector of f_{i}(ν) as the ith component of f(ν); f_{i}(ν) = I{g_{i}(ν) ≥ τ·max_{l}g_{l}(ν), l = 1, ..., p}.
• Update β_{i}(ν + Δν) = β_{i}(ν) + Δν × g_{i}(ν) × f_{i}(ν). Replace by ν + Δν.
• Repeat step 2–4 until β converges, which means
The tuning parameter or the threshold τ controls the distribution of estimator
Double CrossValidation
The selection of the parameter path λ is determined by kfold crossvalidation(CV). Since the empirical AUC is in fact a nonparametric twosample comparison test, a slight variant of kfold CV is considered and it is called double kfold CV for two samples. The n number of diseased patients Y is randomly split into roughly equalsized K_{1 }parts and m number of normal patients X into roughly equalsized K_{2 }parts. Let k_{1 }index which of {1, ..., n} is in {1, ..., K_{1}} groups, and k_{2 }index which of {1, ..., m} is in {1, ..., K_{2}} groups. The crossvalidation estimate of prediction error is defined to be
where
The function in (8) reduces the pdimensional optimization problem to be onedimensional of minimizing λ. The golden section search method [29] was implemented, which does not require the calculation of the derivative, as the optimization method to search for the minimal value between 0 to a large number.
For any given τ in {0,0.1,0.2,...,0.9,1}, we run a double kfold crossvalidation. Denote τ_{l}, l = 1, ..., 11 as the lth component in the vector {0,0.1,0.2,...,0.9,1}. After selection of regularized
Positive constraints
Because of the positive nature of the mass spectrometry data, a positive constraint
on the parameter β is reasonable. The objective function (7) is minimized subjected to the constraints
β_{k }≥ 0 for k = 1, ..., p. Hence, the estimation
Authors' contributions
JY implemented the software for the mass spectrometry preprocessing method and threshold gradient direct regularization and area under the curve and drafted the manuscript. HL designed the TGDRAUC algorithm and helped draft the manuscript. CK performed the mass spectrometry glycomics experiment and helped revise the manuscript. CBL participated in the design of the mass spectrometry glycomics experiment. DMR contributed to the methods of preprocessing the mass spectrometry spectrum and oversaw the overall project. All authors read and approved the final manuscript.
Acknowledgements
The statistical part of work is funded by NIH grant: NIH NHGRI R01HG003352, NIH NIEHS P42ES04699 and NIH NCI P30CA093373. The biological part of work is funded by NIH grant: NIH RO1 GM49077. In addition, the authors thank two anonymous referees for their insightful comments.
References

Pepe M, Cai T, Longton GM: Combining Predictors for Classification using the Area Under the Receiver Operating Characteristic Curve.
Biometrics 2006, 62:221229. PubMed Abstract  Publisher Full Text

Baggerly A, Morris J, Wang J, Gold D, Xiao L, Coombes K: A comprehensive approach to the analysis of matrixassited laser desorption/ionizationtime of flight proteomics spectra from serum samples.
Proteomics 2003, 3:16671672. PubMed Abstract  Publisher Full Text

Wagner M, Naik D, Pothen A: Protocols for disease classification from mass spectrometry data.
Proteomics 2003, 3:16921698. PubMed Abstract  Publisher Full Text

Adam B, Qu Y, Davis JW, Ward MD, Clements MA, Cazares LH, Semmes OJ, Schellhammer PF, Yasui Y, Feng Z, Wright GL Jr: Serum Protein Fingerprinting Coupled with a Patternmatching Algorithm Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men.
Cancer Research 2002, 62:36093614. PubMed Abstract  Publisher Full Text

Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDITOF protein patterns in serum: comparing datasets from different experiments.
Bioinformatics 2004, 20(5):777785. PubMed Abstract  Publisher Full Text

Apweiler R, Hermjakob H, Sharon N: On the Frequency of Protein Glycosylation, as deduced from analysis of the SWISSPROT database.
Biochimica et Biophysica Acta 1999, 1473(1):48. PubMed Abstract  Publisher Full Text

Varki A: Biological roles of oligosaccharides: all of the theories are correct.
Glycobiology 1993, 3(2):97130. PubMed Abstract  Publisher Full Text

An H, Miyamoto S, Lancaster K, Kirmiz C, Li B, Lam K, Leiserowitz G, Lebrilla C: Profiling of Glycans in Serum for the Discovery of Potential Biomarkers for Ovarian Cancer.
Journal of Proteome Research 2006, 5:16261635. PubMed Abstract  Publisher Full Text

Pepe M, Etzioni R, Feng Z, Potter J, Thompson M, Thornquist M, Winget M, Yasui Y: Phases of biomarker development for early detection of cancer.
Journal of the National Cancer Institute 2001, 93(14):10541061. PubMed Abstract  Publisher Full Text

Ball G, Mian S, Holding F, Allibone RO, Lowe J, Ali S, Li G, McCardle S, Ellis IO, Creaser C, Rees RC: An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumours and rapid identification of potential biomarkers.
Bioinformatics 2002, 18(3):395404. PubMed Abstract  Publisher Full Text

Lancashire LJ, Mian S, Rees RC, Ball GR: Preliminary artificial neural network analysis of SELDI mass spectrometry data for the classification of melanoma tissue. In 17th European Simulation Multiconference, Nottingham. Society for Modeling and Simulation International, SCS European Publishing House, Erlanger, Germany; 2003:131135.

Mian S, Ball G, Hornbuckle J, Holding F, Carmichael J, Ellis I, Ali S, Li G, McArdle S, Creaser C, Rees R: A prototype methodology combining surfaceenhanced laser desorption/ionization protein chip technology and artificial neural network algorithms to predict the chemoresponsiveness of breast cancer cell linear exposed to Paclitaxel and Doxorubicin under in vitro condition.
Proteomics 2003, 3:17251737. PubMed Abstract  Publisher Full Text

Fushiki T, Fujisawa H, Eguchi S: Identification of biomarkers from mass spectrometry data using a "common" peak approach.
BMC Bioinformatics 2006., 7(358) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Geurts P, Fillet M, Seny D, Meuwis M, Malaise M, Merville M, Wehenkel L: Proteomics mass spectra classification using decision tree based ensemble methods.
Bioinformatics 2005, 21(15):31383145. PubMed Abstract  Publisher Full Text

Xiong X, Fang X, Zhao J: Biomarker identification by feature wrappers.
Genome Research 2001, 11:18781887. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data.
Bioinformatics 2003, 19(13):16361643. PubMed Abstract  Publisher Full Text

Miketova P, AbbasHawka C, Hadfiled T: Microorganism gramtype differentiation of whole cells based on pyrolysis highresolution mass spectrometry data.
Journal of Analytical and Applied Pyrolysis 2003, 67:109122. Publisher Full Text

Lilien RH, Farid H, Donald BR: Probabilistic disease classification of expressiondependent proteomic data from mass spectrometry of human serum.
Journal of Computational Biology 2003, 10:925946. PubMed Abstract  Publisher Full Text

Datta S, DePadilla LM: Feature selection and machine learning with mass spectrometry data for distinguishing cancer and noncancer samples.
Statistical Methodology 2006, 3:7992. Publisher Full Text

Su J, Liu J: Linear Combinations of Multiple Diagnostic Markers.
Journal of the American Statistical Association 1993, 88(424):13501355. Publisher Full Text

Pepe M, Thompson M: Combining diagnostic test results to increase accuracy.
Biostatistics 2000, 1(2):123140. PubMed Abstract  Publisher Full Text

Ma S, Huang J: Regularized ROC method for disease classification and biomarker selection with microarray data.
Bioinformatics 2005, 21:43564362. PubMed Abstract  Publisher Full Text

Friedman J, Popescu B: Gradient Directed Regularization for linear regression and classification. [http://wwwstat.stanford.edu/jhf/ftp/path.pdf] webcite
Technical report Department of Statistics, Stanford University, CA; 2004.

Dudoit S, Fridlyand J, Speed T: Comparison of discrimination methods for tumor classification based on microarray data.
Journal of the American Statistical Association 2002, 97:7787. Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.

McIntosh M, Pepe M: Combining Several Screening Tests: Optimality of the Risk Score.
Biometrics 2002, 58:657664. PubMed Abstract  Publisher Full Text

Bamber D: The area above the ordinal dominance graph and the area below the receiver operating characteristic graph.
Journal of Mathematical Psychology 1975, 12:387415. Publisher Full Text

Tibshirani R: Regression shrinkage and selection via lasso.
Journal of the Royal Statistical Society B 1996, 58:267288.

Press W, Teukolsky S, Vetterling W, Flannery B: Golden Section Search in One Dimension. In Numerical Recipies in C: the Art of Scientific Computing. 2nd edition. Combridge University Press; 1992.

Stone M: Crossvalidatory choice and assessment of statistical predictions.

Gui J, Li H: Threshold gradient descent method for censored data regression with applications in pharmacogenomics.
Pac Symp Biocomput 2005, :272283. PubMed Abstract  Publisher Full Text