Department of Pediatrics, Stanford University, MSOB X111, Stanford, CA 94305, USA
Division of Biostatistics, University of California Berkeley, 101 Haviland Hall, Berkeley, CA 94720, USA
Abstract
Background
When a large number of candidate variables are present, a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a candidate gene list of more manageable length that ideally includes all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, dimension reduction is often considered a necessary precursor of the analysis: it not only reduces the cost of handling numerous variables, but also has the potential to improve the performance of the downstream analysis algorithms.
Results
We propose TMLE-VIM, a dimension reduction procedure based on the variable importance measurement (VIM) in the framework of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum likelihood estimation that targets the parameter of interest. TMLE-VIM is a two-stage procedure. The first stage resorts to a machine learning algorithm, and the second stage improves the first-stage estimate with respect to the parameter of interest.
Conclusions
We demonstrate with simulations and data analyses that our approach not only enjoys the prediction power of machine learning algorithms, but also accounts for the correlation structures among variables and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM helps to obtain the shortest possible list containing the most truly associated variables.
Background
Gene expression microarray data are typically characterized by large quantities of variables with unknown correlation structures.
As always, there is no one-size-fits-all solution to this problem, and one often needs to resort to a mix-and-match strategy. Univariate-measurement-based gene selection is a very popular approach in the field. It is fast and scales easily with the dimension of the data. The output is usually stable and easy to understand, and fulfills the objective of biologists to directly pursue interesting findings. However, it often relies on oversimplified models. For instance, the univariate analysis evaluates every gene in isolation from the others, under the unrealistic assumption of independence among genes. As a result, it carries a lot of noise, and the selected genes are often highly correlated, which itself creates problems in subsequent analysis. Also, due to the practical limit on the size of the gene subset, truly informative genes with weaker signals will be left out. In contrast, PCA/PLS constructs a few gene components as linear combinations of all genes in a dataset. This "Super Gene" approach assumes that the majority of the variation in the dataset can be explained by a small number of underlying variables. One then uses these gene components to predict the outcome. These approaches can better handle the dependence structure among genes, and their performance is quite acceptable.
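As a toy illustration of the noise a purely univariate screen can carry, the following sketch (not from the paper; the data-generating model is an assumption for illustration) ranks simulated genes by marginal correlation with the outcome. Genes that are merely correlated with the single causal gene crowd the top of the list:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 50
z = rng.normal(size=n)
X = rng.normal(size=(n, p))
# genes 0-9 form a correlated block; only gene 0 is causal
X[:, :10] = 0.9 * z[:, None] + 0.45 * X[:, :10]
y = 1.0 * X[:, 0] + rng.normal(size=n)

# univariate screen: rank genes by absolute marginal correlation with y
score = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)]))
top10 = np.argsort(score)[::-1][:10]
# the whole correlated block dominates the top ranks, not just the causal gene
```

A multivariate method that adjusts each gene for the others would, in contrast, down-weight the nine merely correlated genes.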
Methods
Suppose the observed data are n i.i.d. copies of O = (Y, A, W), where Y is the outcome, A is the variable of interest (e.g., the expression of one gene), and W is the vector of remaining covariates.

Consider the semiparametric regression model:

E(Y | A, W) = beta * A + g(W),

where g(W) is an unspecified function of W.

Our goal is to estimate the variable importance parameter beta, the effect of A on Y adjusted for the covariates W.
More detailed descriptions of the TMLE methodology and the conducted simulations.
1. Obtain the initial estimator Q0(A, W) of the regression E(Y | A, W), for instance with a machine learning algorithm, and denote the corresponding initial estimate of beta by beta0.
2. Obtain the estimator of E(A | W).
3. Compute the "clever covariate": h(A, W) = A - E(A | W).
4. Fit the regression of the residuals Y - Q0(A, W) on h(A, W) without intercept, and denote the fitted coefficient by epsilon.
5. Update the initial estimate of beta: beta1 = beta0 + epsilon,
and update the initial fitted values: Q1(A, W) = Q0(A, W) + epsilon * h(A, W).
6. Compute the variance estimate sigma^2 = (1/n) sum_i IC(O_i)^2,
where IC(O) = h(A, W)(Y - Q1(A, W)) / E[h(A, W)^2] is the estimated influence curve.
7. Construct the test statistic: T = sqrt(n) * beta1 / sigma, which is approximately standard normal under the null hypothesis beta = 0 and yields a p-value for each variable.
The TMLE estimator beta1 is double robust: it remains consistent if either the initial estimator of E(Y | A, W) or the estimator of E(A | W) is consistent.
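The steps above can be sketched in a few lines. Everything below (the data-generating model, plain OLS in place of a machine learning fit) is an illustrative assumption, not the paper's implementation; the initial regression of Y on A alone is deliberately confounded, and the targeting step recovers the adjusted effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=(n, 3))
A = 0.8 * W[:, 0] + rng.normal(size=n)            # variable of interest, confounded by W
Y = 0.5 * A + W @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

def ols(X, y):
    """Least-squares fit with intercept; returns fitted values and coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ coef, coef

Q0, c = ols(A[:, None], Y)           # step 1: (misspecified) initial estimator of E(Y|A,W)
beta0 = c[1]                         #         initial estimate of beta
A_hat, _ = ols(W, A)                 # step 2: estimator of E(A|W)
h = A - A_hat                        # step 3: clever covariate
eps = h @ (Y - Q0) / (h @ h)         # step 4: regress residuals on h, no intercept
beta1 = beta0 + eps                  # step 5: targeted update of beta ...
Q1 = Q0 + eps * h                    #         ... and of the fitted values
IC = h * (Y - Q1) / np.mean(h * h)   # step 6: influence-curve variance estimate
sigma2 = np.mean(IC ** 2)
T = np.sqrt(n) * beta1 / np.sqrt(sigma2)   # step 7: approx. N(0,1) under beta = 0
```

In this sketch beta0 is biased away from the true value 0.5 by the confounder W, while the updated beta1 is close to it; T is then compared with the standard normal to produce the per-variable p-value.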
In the application to dimension reduction, for each variable in the dataset we compute a TMLE-VIM p-value, and we then reduce our variable space based on these p-values. Two points deserve mention. First, in principle, a separate initial estimator of E(Y | A, W) must be obtained for each candidate variable A. Second, in practice, a single initial fit can be shared across variables to keep the computation feasible.
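With one test statistic per variable, the reduction step itself is a simple p-value cutoff. A minimal sketch (the statistics below are made-up inputs; the 0.05 cutoff mirrors the one used in the simulations):

```python
import math
import numpy as np

def reduce_by_pvalue(t_stats, alpha=0.05):
    """Two-sided normal p-values for TMLE-VIM test statistics,
    then keep the indices of variables passing the cutoff."""
    t = np.asarray(t_stats, dtype=float)
    pvals = np.array([math.erfc(abs(x) / math.sqrt(2.0)) for x in t])
    return pvals, np.flatnonzero(pvals < alpha)

pvals, keep = reduce_by_pvalue([0.3, -2.5, 4.1, 1.2])
# only variables with |T| large enough (indices 1 and 2) survive the 0.05 cutoff
```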
Results and Discussion
Simulation studies
We performed two sets of simulations. The first set investigates how TMLE-VIM responds to changes in the number of confounding variables, the correlation level among variables, and the noise level. The second set studies TMLE-VIM with more complex correlation structures and model misspecification. The performance of the dimension reduction procedure was primarily evaluated by the prediction accuracy achieved by a prediction algorithm on the reduced sets of variables, as illustrated in the following analysis flow:
Two prediction algorithms, LASSO and D/S/A (Deletion/Substitution/Addition), were employed.
Part I
In simulation I, we varied the number of non-causal variables, the correlation level among the variables, and the noise level of the data-generating model.
Simulations were run for combinations of:
• the number of non-causal variables,
• the correlation level among the variables (0.1, 0.3, 0.5, 0.7, 0.9),
• and the noise level.
For each combination, we simulated a training set of 500 data points and a testing set of 5000 data points. The training set was used to obtain the prediction model, while the testing set was used to calculate the L2 risk. We also calculated a cross-examined L2 risk using a testing set generated with a different correlation level.
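This evaluation loop can be sketched as follows; plain least squares stands in for LASSO / D/S/A, and the data-generating model is an assumption for illustration:

```python
import numpy as np

def l2_risk(y_true, y_pred):
    """Empirical L2 risk: mean squared prediction error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

rng = np.random.default_rng(2)
beta = np.array([1.0, -0.5, 0.0, 0.0, 0.0])
X_train = rng.normal(size=(500, 5))      # training set of 500 points
X_test = rng.normal(size=(5000, 5))      # testing set of 5000 points
y_train = X_train @ beta + rng.normal(size=500)
y_test = X_test @ beta + rng.normal(size=5000)

# fit the prediction model on the training set only
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
risk = l2_risk(y_test, X_test @ coef)    # close to the irreducible noise variance 1
```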
TMLE-VIM used LASSO to obtain both the initial estimator of E(Y | A, W) and the estimator of E(A | W).
The simulation I results

rho    MVR           DSA           MVR           DSA
0.1    0.2341 ;      0.2436 ;      0.4035 ;      0.4230 ;
       1.0870 / —    1.2136 / —    0.6522 / —    0.8846 / —
       na            1.0130        na            1.0680
0.3    0.2202 ;      0.2231 ;      0.2341 ;      0.2027 ;
       1.0776 / —    1.0684 / —    0.1528 / —    0.0958 / —
       na            1.0345        na            1.0299
0.5    0.2425 ;      0.1169 ;      0.4883 ;      0.1268 ;
       1.0373 / —    1.0331 / —    0.0355 / —    0.0149 / —
       na            1.0251        na            1.0335
0.7    0.3599 ;      0.1307 ;      0.8001 ;      0.1740 ;
       1.0081 / —    1.0000 / —    0.0275 / —    0.0162 / —
       na            1.0693        na            1.1055
0.9    0.2262 ;      0.1364 ;      0.9390 ;      0.5498 ;
       0.8415 / —    0.5502 / —    0.0364 / —    0.0204 / —
       na            1.2630        na            1.6103

Bold fonts: testing set (a). Italic fonts: testing set (b).
na: not available. —: the same value as the previous entry.
Several observations can be made from these results, concerning the proportion of the risk reduction, the number of variables retained, and the behavior as the correlation level changes.
A typical example in simulation I

A typical example in simulation I. This graph presents the average L2 risk of the final prediction model on the candidate lists from the UR-VIM and the TMLE-VIM, for simulation I data.
Additional materials for the conducted simulations.
Part II
Simulation II examines TMLE-VIM on larger-scale datasets with much more complex correlation structures. The simulation consists of 500 samples and 1000 variables. We used a correlation matrix derived from the top 800 genes in a previously published real dataset.
Details of this simulation are provided in the Additional File.
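Correlated Gaussian variables with a prescribed correlation matrix can be generated with a Cholesky transform, which is presumably how the gene-derived matrix is used here; the 3x3 matrix below is a small stand-in for the 1000-variable matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
# stand-in correlation matrix; in the simulation it was derived from real genes
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
L = np.linalg.cholesky(R)        # R = L @ L.T
Z = rng.normal(size=(500, 3))    # 500 samples of independent standard normals
X = Z @ L.T                      # rows of X now have correlation matrix (approx.) R

R_hat = np.corrcoef(X, rowvar=False)   # empirical check against R
```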
The simulation II results (p-value)

Simulation              Linear                      Polynomial
                                T.P.    F.P.                T.P.    F.P.
URVIM                   0.2887  13.8    605.3       0.1851  13.4    555.9
                        0.4849  16.6    280.5       0.3245  14.7    255.5
                        0.6289  19.7    29.1        0.4203  17.9    24
TMLE-VIM(               0.6479  20      41.6        0.4498  19.2    105.9

The candidate variable list contains all variables with p-values less than 0.05.
The numbers in Table
The simulation II results (top 100)

Simulation              Linear                      Polynomial
                                T.P.    cor.                T.P.    cor.
URVIM                   0.1444  9.0     0.2956      0.0862  8.2     0.3642
                        0.1907  8.8     0.2534      0.1605  7.2     0.2590
                        0.6059  19.9    0.2289      0.4132  19.2    0.2234
TMLE-VIM(               0.5916  20      0.1242      0.3859  17.7    0.0867

The candidate variable list contains the top 100 variables ranked by their p-values.
We also carried out the TMLE-VIM(
Data Analysis
Breast cancer patients are often put on chemotherapy after the surgical removal of the tumor. However, not all patients respond to chemotherapy, and proper guidance for selecting the optimal regimen is needed. Gene expression data have the potential to support such predictions, as studied in previous work.
The goal of the study is to select a set of genes that best predicts the clinical response pCR. The first step is to reduce the number of genes worthy of consideration, and we applied both the UR-VIM and the TMLE-VIM.
Analysis results are tabulated below.
The analysis result of the breast cancer dataset

            Num. of genes in      C.V. classification    Corr. level among
            the candidate list    accuracy               the top 100 genes
URVIM       327                   0.7669                 0.43
            660                   0.7744                 0.18
            818                   0.7744                 0.21
The Venn diagram of the breast cancer data

The Venn diagram of the breast cancer data. This Venn diagram shows the overlaps of identified candidate genes from the breast cancer dataset using the UR-VIM, the RF-VIM, and the TMLE-VIM.
The TMLE-VIM(
In summary, the UR-VIM and RF-VIM seem to have identified genes that are strong predictors of the clinical variable ER status. The ER status is a strong indicator of the outcome pCR; hence, their final prediction accuracy still looks quite good. The TMLE-VIM has identified a list of genes of which a small proportion are strong predictors of ER status while the rest are not associated with it. Its prediction accuracy is slightly better than that of the UR-VIM and RF-VIM.
Conclusions
We have shown in this paper with extensive simulations that the TMLE-based variable importance measurement can be incorporated into a dimension reduction procedure to improve the quality of the list of candidate variables. It requires an initial estimator of the regression E(Y | A, W), which can be supplied by any machine learning algorithm.
A popular dimension reduction approach is principal component analysis (PCA). The PCA computation does not involve the outcome, so it can be less powerful when prediction is the primary goal. Its output is a linear combination of all the genes. Though PCA is not a gene selection approach, we still ran it on our simulation I data as an interesting comparison to our approach. PCA demonstrates a performance intermediate between the UR-VIM and the TMLE-VIM at small p-value cutoffs, meaning that a few top components carry all the prediction power. When the p-value cutoff is increased and more components enter the candidate list, its results become quite unsatisfactory. When the correlation structure among the genes changes, PCA predicts poorly. The PCA results are contained in the Additional File.
The PCA results.
Usually, the reduced set of variables serves as the input to a prediction algorithm to build a model. The algorithms used for this purpose in this article include MVR, LASSO, and D/S/A. We have noticed that in most of our simulations, the MVR prediction often achieves a risk similar to that of LASSO and D/S/A on the TMLE-VIM-reduced set of variables. This suggests that further variable selection may not be necessary for the TMLE-VIM candidate list, and simpler algorithms can yield a good prediction. In fact, the TMLE-VIM can go beyond the scope of dimension reduction: it can be applied iteratively to the data until it converges to a list of several variables that are most likely to be causal for the outcome. In this case, one may want to use the Super Learner.
TMLE-VIM is a quite general approach. Besides gene expression data, it can also be applied to genetic mapping problems. Genome-wide association studies (GWAS) can involve more than a million genetic markers, in which case only the univariate analysis seems feasible for ranking every marker. With the TMLE-VIM procedure, we can run more complex algorithms on a subset of top-ranked markers, take the result as the initial estimator, and then evaluate every single marker. The variable importance of each marker is thus obtained through a multi-marker approach and adjusted for its confounders. However, human GWAS usually involve case-control data, and the current TMLE-VIM needs to be extended to accommodate such outcomes.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MvdL conceived the project and designed the algorithm. HW implemented the algorithm, designed the simulation studies, and collected and analyzed the data. All authors participated in drafting the manuscript.
Acknowledgements
The authors want to thank Cathy Tuglus for sharing her code and for her helpful comments on this work. The authors also thank the reviewers for their valuable appraisal of an earlier version of this manuscript. This work was supported by NIH R01 AI074345. The authors declare no conflicts of interest.