Genome-wide association studies have been firmly established in investigations of the associations between common genetic variants and complex traits or diseases. However, a large portion of complex traits and diseases cannot be explained well by common variants. Detecting rare functional variants becomes a trend and a necessity. Because rare variants have such a small minor allele frequency (e.g., <0.05), detecting functional rare variants is challenging. Group iterative sure independence screening (ISIS), a fast group selection tool, was developed to select important genes and the single-nucleotide polymorphisms within. The performance of the group ISIS and group penalization methods is compared for detecting important genes in the Genetic Analysis Workshop 17 data. The results suggest that the group ISIS is an efficient tool to discover genes and single-nucleotide polymorphisms associated to phenotypes.
Understanding the inherited basis of genetic variation in human health and disease is currently one of the most challenging tasks. Genome-wide association studies have been used to establish the statistical association between hundreds of loci across the genome and common complex traits. Although this approach has brought substantial knowledge and understanding of the diverse molecular pathways that underlie specific diseases, more evidence shows that a large portion of complex diseases cannot be explained by common genetic variants [1,2]. Therefore alternative approaches are needed to detect and analyze rare variants associated with disease susceptibility genes. Although statistical methods for the detection of common functional variants (e.g., with minor allele frequencies [MAF] > 0.05) have been extensively developed and successively applied to numerous studies, methods for detecting rare functional variants are limited. Some methods developed for analysis of common variants can be easily extended to rare variants, for example, single-marker test, multiple-marker test, and collapsing methods, but their performance may not be optimal [3-5].
The primary purpose of this paper is to analyze quantitative traits Q1 and Q2 in replicates 1–200 of the Genetic Analysis Workshop 17 (GAW17) simulated mini-exome data . We study the GAW17 data set using modern ultra-high-dimensional model selection and group selection techniques. Given the natural group structure (i.e., genes) among single-nucleotide polymorphisms (SNPs), group selection tools can select the groups that consist of a number of weak predictors (i.e., SNPs with small MAFs) whose effect as a group on the phenotypes could be significant. In the context of the GAW17 data set, these weak predictors are just rare genetic variants. Contrary to collapsing methods, modern ultra-high-dimensional model selection techniques consider the joint effect among groups as well as among individuals and avoid oversimplification of the model. We propose group iterative sure independence screening (ISIS) for gene and SNP selection. We apply the method to analyze the GAW17 data and to compare it with penalized likelihood methods, such as the group least absolute shrinkage and selection operator (LASSO) and the group minimax concave penalty (MCP) in terms of the true significant genes (i.e., genes with significant SNPs) in the simulated GAW17 data. Functional variants are referred to as important variants throughout the text.
Because the SNPs are naturally grouped by genes, we consider a linear model with J groups of variables:
where y is an n × 1 response vector, Xj is an n × pj matrix corresponding to the jth group of variables, βj is a pj × 1 coefficient vector, and ε is a random noise vector with normal distribution. Denote X = (X1, …, XJ ) and .
We assume that the model is bilevel sparse, which means that only a small number of βj are nonzero vectors and, moreover, that each nontrivial βj is itself a sparse vector. In our analysis of the GAW17 data set, the response y is the quantitative phenotype Q1 or Q2, and the predictors are the 24,487 SNPs grouped in 3,205 genes. The bilevel sparse assumption, interpreted in this study, says that only a small number of genes are related to the phenotype of interest and that only some of the SNPs in these related genes are important. The assumption on sparsity plays a critical role in high-dimensional statistical modeling. The bilevel sparse assumption is appropriate for models with grouped predictors.
Because the GAW17 data are mini-exome human data, we use 0, 1, and 2 to denote genotypes AA, Aa, and aa, respectively. Thus each column in the design matrix X consists of the numbers 0, 1, and 2. Among the 24,487 SNPs in the data set, there are 9,433 SNPs with a MAF of 0.07% [= 1/(697 × 2)]; that is, this is the smallest MAF in the GAW17 data because only 1 individual out of 697 individuals has a variant at each such SNP locus. The fact that 9,433 is much greater than 697 makes no statistical model identifiable. Because of the nonidentifiability of the model, it is necessary to consider the group (gene) selection.
Many high-dimensional model selection and group selection techniques have been developed recently. One of the most popular methods is the penalized likelihood method, such as the LASSO, the smoothly clipped absolute deviation (SCAD) penalty, the MCP, and their extensions for group selection (e.g., group LASSO and group MCP) [7-11]. Two popular algorithms are used to find the maximizer of the penalized likelihood. The first algorithm is the least angle regression (LARS) algorithm and its extensions . It is efficient when the number of parameters (p) is comparable with the number of samples (n). However, when p is ultrahigh, as in our setting (24,487 potential predictors with only 697 observations), the LARS algorithm usually cannot be applied. The second algorithm is coordinate descent, which optimizes a target function with respect to a single variable at a time, iteratively cycling through all variables until convergence is reached . This method is even faster than LARS and can converge to the same solutions as LARS in many cases. However, it is unknown whether the two methods work well in the ultrahigh setting.
An alternative method for model selection that is different from the penalization approach is correlation learning, for example, sure independence screening (SIS) and iterative sure independence screening (ISIS) and their extensions for the group selection, called group SIS and group ISIS . Suppose that there are J groups of variables, denoted by G1, …, GJ. In the first step of this approach, regression is performed using a single group of variables for each Gj, j = 1, …, J. (Here, if the number of variables in Gj is small, we can simply perform a least-squares regression. Otherwise, when the size of Gj is large, we can use penalized regressions [LASSO or SCAD] or greedy algorithms [forward stepwise selection] to select the model and to estimate the coefficients.) After the first step, the residual sum of squares (RSS) is calculated by dividing by the degrees of freedom (df) for each regression using a single group. Then, the groups are ranked, and the top k groups are selected on the basis of the smallest RSS/df value. This is the main procedure for the group SIS method.
The group ISIS is just the iterative version of group SIS. First, one or several groups, say, Gi1, …, Gik, are selected using group SIS. Then, regression is performed using these selected groups together with a single new group from the rest of the groups for each of the new groups. Next, all the new groups are ordered by the RSS/df values, and top groups with the smallest RSS/df values are selected and added to the pool of selected groups. This process is iterated until some criterion is met. Note that this version of group ISIS corresponds to a version of ISIS introduced by Fan et al. , which improves the original ISIS given by Fan and Lv .
Group SIS and group ISIS are general strategies, and they have many variants in different applications. In this paper, we describe two procedures related to group ISIS for gene selection and apply them to analyze the GAW17 data. In the GAW17 data, the phenotypes may have strong associations with factors such as sex, age, and smoking. Therefore these factors are always considered important by default.
For procedure 1 we preset a maximal iteration K (12 or 15 in our study) and a constant C (5 in our simulation). In the first step, we perform a regression using each single group Gj and sex, age, and smoking, where j goes from 1 to J. We restrict the model size to at most C + 3 when we do the regressions. For example, if the jth group Gj consists of 12 variables, we use a forward stepwise selection method to select 5 among those 12 variables besides sex, age, and smoking. Then, we rank the RSS/df values for all groups. After ranking, we select the group, denoted by G(1), with the lowest RSS/df value in the model. At the kth iteration, we perform a regression using selected groups G(1), …, G(k−1), sex, age, and smoking together with a single group for each group from the rest. We restrict the whole model size to at most Ck + 3 when we do the regressions. Then, we select the new group G(k) by ranking the RSS/df values. The procedure ends when k >K. In short, we select one group each time until the kth iteration. As we will see, this procedure can be considered a special case of procedure 2. However, as the counterpart of forward stepwise selection in the setting of group selection, procedure 1 is an interesting procedure in itself. Therefore we include it in our numeric studies.
Procedure 2 is similar to procedure 1 except that it selects multiple groups each time and allows deletion. To make the algorithm more efficient, we select k0 = 5 groups with the lowest RSS/df values at each iteration. After adding these five new groups, we do group selection immediately and keep only the selected groups before entering the next iteration. The procedure stops when it is stable, that is, when it deletes newly selected groups in the group selection. We use group LASSO and group MCP with either the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) for group selection in the deletion step and discard the groups that are not selected by group LASSO or group MCP. We find that the procedure usually stops within 10 iterations, so we set the maximum number of iterations at 10. Group MCP and group LASSO are implemented using R package grpreg and can also be directly applied to select important genes. It is interesting to compare procedure 1 and procedure 2 with these methods.
Two phenotypes of Q1 and Q2 are studied as interesting responses in this paper. Because there are 200 replicates of data sets in GAW17 and because each replicate has a certain amount of noise, we consider three noise levels (low, medium, and high). In the low noise level case, we take average of 200 replicates as the response variable; the medium noise level refers to the average of 10 replicates. For the high noise level, only one replicate is considered. We select the first replicate as an example of high noise data to compare the different selection methods. Other individual replicates could have been used, but the same conclusion is obtained.
Both group ISIS procedures (procedure 1 and procedure 2) are used for gene selection. In particular, for procedure 2 we apply group MCP and group LASSO by means of the AIC and BIC in the deletion step of each iteration. We also compare procedure 1 and procedure 2 with the penalization approaches (group MCP and group LASSO). The results are listed in Table 1. One can see that group ISIS can select most of the important genes with a small number of false positives from a huge pool when the noise level is low or medium.
Table 1. Comparisons of penalized likelihood methods and group ISIS procedures 1 and 2 at three different noise levels
To compare these methods more carefully, we applied them to 20 data sets with medium noise level. The data were obtained by taking the average of replicates 1–10, 11–20, …, 191–200. The results are summarized in Table 2. It seems that the performance of the penalized likelihood methods highly depends on the penalty structure (group LASSO or group MCP) and information criterion (AIC or BIC). Group MCP with BIC performed best among all four methods we investigated. We see that group ISIS procedure 2 performed well with lower numbers of false positives and lower false discovery proportions compared with penalized likelihood methods. Furthermore, based on the result of group ISIS, penalized likelihood methods or other methods can be applied in the second stage to select important SNPs. In fact, as a by-product in the final iteration of procedure 2, group MCP does perform SNP selection within selected genes (see Table 3). For example, in the case of low noise level Q2, The group MCP procedure 2 with AIC selected 13 genes with one false positive (OR1Q1) and one false negative (INSIG1). The procedure became stable at iteration 5. In the final deletion step, group MCP with AIC selected 13 genes among 18 and 32 SNPs within these 13 selected genes. We found only 3 false-positive SNPs (2 of them from OR1Q1 and 1 from SREBF1) among these 32 SNPs. Group MCP is quite aggressive for the individual selection within groups. One could use other methods to lower the number of false negatives regarding SNP selection.
Table 2. Comparisons of penalized likelihood methods and group ISIS procedure 2 at medium noise level over 20 data sets
Table 3. Genes and SNPs selected by group ISIS procedure 2 (group MCP with AIC) for the low noise data
Although it is unlikely in practice, when additional information is known in advance (e.g., all important SNPs are nonsynonymous or have MAF less than 0.2), we can filter out parts of the SNPs first and apply group ISIS to the rest. There are 13,572 nonsynonymous SNPs in 2,196 genes and 23,131 low-MAF (<0.2) SNPs in 3,100 genes. For the phenotypes Q1 and Q2, all important SNPs are nonsynonymous and have low MAF, so we lose nothing by searching among the restricted data set. We applied group ISIS to these data sets and found that the results were almost identical with the results reported in Table 1. Therefore we conclude that ultra-high dimensionality and existence of common variants (say, MAF > 0.2) hardly influence the performance of group ISIS. That is one of the main advantages of group ISIS.
The high noise level is the main reason that no method works well using only one replicate of the data. Besides that, to understand why some genes (e.g., VEGFC for Q1) cannot be selected even for the low noise data, we checked the data carefully. We found that SNP C4S4935, the only important SNP in gene VEGFC, is identical to SNP C13S348 in gene FLT1 in the design matrix. Gene FLT1 includes many SNPs related to Q1. So if FLT1 is selected, VEGFC does not have any priority to be selected. In short, the nonidentifiability may bring up some issue in gene selection when some important genes have few related SNPs.
Discussion and conclusions
In this study, we used modern ultra-high-dimensional model selection tools to detect important genes and important SNPs related to the phenotypes of interest. Group SIS and group ISIS were developed to conquer the difficulty of ultra-high dimensionality and nonidentifiability. As group LASSO is to LASSO, group SIS (or group ISIS) is the counterpart to SIS (ISIS) in the setting of group selection. These group selection tools work well under the bilevel sparse assumption. We used the penalized likelihood methods and group ISIS procedures to analyze the GAW17 data and compared their performance in recovering important genes associated with phenotypes Q1 and Q2. It seems that the group ISIS approach performs better than the penalized likelihood methods in terms of number of false positives and false discovery proportion. In particular, the proposed methods work well when additional replicates are available, that is, at the low and medium noise levels. When only one replicate with high noise is available, it seems that no method works well.
The authors have no conflict of interest.
YSN designed the study, performed the statistical analysis, and drafted the manuscript. NH participated in its design and statistical analysis, and helped to refine the statistical procedures. LA conceived of the study, and participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.
This research is supported partly by University of Arizona internal grants. The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
Stat Appl Genet Mol Biol 2008., 7
J Am Stat Assoc 2001, 96:1348-1360. Publisher Full Text
J R Stat Soc Ser B 2006, 68:49-67. Publisher Full Text
Ann Stat 2010, 38:894-942. Publisher Full Text
Ann Stat 2004, 32:407-499. Publisher Full Text
Ann Appl Stat 2007, 1:302-332. Publisher Full Text
J R Stat Soc Ser B 2008, 70:849-911. Publisher Full Text