Abstract
Background
Different highdimensional regression methodologies exist for the selection of variables to predict a continuous variable. To improve the variable selection in case clustered observations are present in the training data, an extension towards mixedeffects modeling (MM) is requested, but may not always be straightforward to implement.
In this article, we developed such a MM extension (GAMMMMI) for the automated variable selection by a linear regression based genetic algorithm (GA) using multimodel inference (MMI). We exemplify our approach by training a linear regression model for prediction of resistance to the integrase inhibitor Raltegravir (RAL) on a genotypephenotype database, with many integrase mutations as candidate covariates. The genotypephenotype pairs in this database were derived from a limited number of subjects, with presence of multiple data points from the same subject, and with an intraclass correlation of 0.92.
Results
In generation of the RAL model, we took computational efficiency into account by optimizing the GA parameters one by one, and by using tournament selection. To derive the main GA parameters we used 3 times 5fold crossvalidation. The number of integrase mutations to be used as covariates in the mixed effects models was 25 (chrom.size). A GA solution was found when R^{2}_{MM} > 0.95 (goal.fitness). We tested three different MMI approaches to combine the results of 100 GA solutions into one GAMMMMI model. When evaluating the GAMMMMI performance on two unseen data sets, a more parsimonious and interpretable model was found (GAMMMMI TOP18: mixedeffects model containing the 18 most prevalent mutations in the GA solutions, refitted on the training data) with better predictive accuracy (R^{2}) in comparison to GAordinary least squares (GAOLS) and Least Absolute Shrinkage and Selection Operator (LASSO).
Conclusions
We have demonstrated improved performance when using GAMMMMI for selection of mutations on a genotypephenotype data set. As we largely automated setting the GA parameters, the method should be applicable on similar datasets with clustered observations.
Keywords:
Variable selection; Linear regression; Genetic algorithm; Mixedeffects model; Multimodel inferenceBackground
In recent studies, classical regression methods for prediction of a continuous variable from a large number of covariates have been extended for the training of a model when the data set is hierarchical in nature [14]. In this article we extend our genetic algorithm (GA) variable selection methodology in [5] to allow for clustering in the data. We compare the performance of multimodel inference (MMI) using restricted maximum likelihood (REML) mixedeffects modeling [6,7] (MM) with ordinary least squares regression [8] (OLS) and compare GAMMI with the commonly used penalized regression method Least Absolute Shrinkage and Selection Operator [9] (LASSO). We also show how to optimally set the GA parameters.
As an example, the training of a linear regression model for prediction of Raltegravir (RAL) resistance (“phenotype”) from mutations in the HIV integrase region (“genotype”) is worked out. The data sets used for training and testing were described in more detail in [5]. The training set consisted of n = 991 clonal genotypephenotype measurements, from multiple clones derived from 153 clinical isolates (on average 5 à 6 clones per isolate) and repeated measurements (on average 3) from 28 sitedirected mutants (invitro lab created clones with a designed mutational pattern), and the number of candidate mutations for selection was p = 322. Two test sets were used: the first consisted of population data of 171 clinical isolates (test set 1), the second consisted of 67 integrase sitedirected mutants containing most of the known RAL resistance associated mutational patterns [10] (test set 2). As it was found in [5] that a second order model did not significantly outperform a first order model, we did not consider interaction terms.
The paper is organized as follows. We begin by recalling the Simple Genetic Algorithm for variable selection in OLS linear regression. Then, we introduce GAMM as an extension for clustered data. Finally, we introduce MMI for estimation of the model parameters, combining the results from multiple GAMM (or GAOLS) runs, followed by a short section on how we applied the LASSO method for comparison. In the remainder of the paper, we illustrate our methodology on an example for the predictive modeling of RAL resistance. For this example, we describe in detail how we optimized the GA parameter settings, and we report the results of comparing GAMMMMI with LASSO and GAOLSMMI. When nominating one ‘best’ model, from all models evaluated in the comparison, we chose the GAMMMMI TOP18 model as a sparse model with high biological relevance (17 out of 18 integrase mutations in this model have been confirmed to be associated with resistance [5]), and having better predictive accuracy than LASSO and GAOLSMMI models with equal number of mutations selected. Throughout the text of this article GA related terminology is written in italic.
Methods
GAOLS
The Simple Genetic Algorithm, due to John Holland [1115], is used to evaluate a set P of regression models M with p_{sel} variables. In GA terminology: P is a population, and a model M ∈ P, is called an individual, the p_{sel} model variables determine the individuals’ chromosome. The number of models in a population, P, is fixed, as well as the number of variables p_{sel} in a model M. In GA terminology: P is called the population size (pop.size), and p_{sel} is called the chromosome size (chrom.size). Thus, each regression model M represents a candidate subset of p_{sel} variables (in GA terminology variables are called genes), and a GA fitness function has to be defined to identify the better or ‘more fit’ individuals. In GAOLS, we used the linear model R^{2} (OLS) goodnessoffit statistic as fitness function. The better the model M fits to the data, the higher R^{2} (with 0 ≤ R^{2} ≤ 1). Models with R^{2} > goal fitness are termed solutions to the optimization problem. A Darwinian evolution is applied to modify the population over several generations. The GA finds a solution using the search procedure as given in Table 1.
Table 1. Simple genetic algorithm
In step 3 of Table 1, the mutation genetic operator alters a gene (replacing it with another gene from the pool of candidate genes) in a chromosome with probability Pm. The crossover genetic operator recombines the genotypes of two individuals. The probability of an individual to be selected for crossover is Pc. The key in the optimization is to keep a good balance between selective pressure (Table 1 step 2) and genetic diversity (Table 1 step 3). The GA run is completed when an individual is found with fitness > goal fitness. When no solution is found within a maximum number of generations (max.generations), the GA run is halted. For step 2 of Table 1, we used tournament selection as detailed in Section II (Results and discussion). Also, elitism is used, meaning that the best chromosome (highest R^{2}) is passed through to the next generation, with a probability Pe.
The running of the GA is done multiple times to generate a set S of solutions. A ranking by importance can then be made for all variables based on their frequency in S.
GAMM
Although OLS parameter estimates are known to be unbiased when neglecting the correlation structure [6], in this article we want to evaluate whether using a mixed model for the GA models, using a random subject effect in addition to the fixed effects (variables as in the OLS model), can improve the interpretability or performance of the final linear regression model, derived with MMI (next section).
The GAMM methodology makes use of the Simple Genetic Algorithm (Table 1), completely analogous to GAOLS, producing a ranking of variables by their frequency in a set S of GA solutions. However, there is no single commonly used definition for the R^{2} statistic as is the case for OLS [16,17]. Several definitions have been suggested that all have different interpretations in the presence of correlated errors. Here, we used the marginal R^{2}_{MM} definition from [18], quantifying the variance explained by the fixed effects. As new data will originate from other subjects than those used for the training of the model, the random effects cannot be used for prediction. In [1] it has also been described that conditional R^{2} (variance explained by the entire model, including the random effects) should not be used for fixedeffect variable selection. For us, the main motivation for using R^{2}_{MM} was that the MM can be fitted using REML, resulting in better estimates for the variance components, needed in the estimation of the fixed effects, especially in models with many fixed effects [7].
In the example, for predicting the RAL phenotype y from the integrase clonal genotype x ∈ [0, 1]^{p}, the mixed model M uses one random effect/ cluster factor α_{i} (clones are clustered per clinical isolate/ sitedirected mutant):
with β_{0} the intercept, and y_{ij} the jth response of cluster i,
and
If x_{k} ∉ M: β_{k} ≡ 0.
The marginal R^{2} is calculated as:
where is the variance calculated from the fixed effects β_{k}:
is the betweencluster variance, and is the withincluster variance.
The intraclass correlation: for the model without fixed effects was 0.92, showing very strong withincluster correlation, and suggesting that accounting for this correlation may improve the performance of our model.
GAMMI
In [19,20] it has been described that, when the number of samples in the training data is small, making inference from a single best model, e.g., produced with stepwise regression, leads to the inclusion of noise variables. Here, we used MMI to combine the information from the GA solutions into a final model for making predictions. As a GA run is stopped as soon as the goal fitness (calculated in section VI (Results and discussion)) is achieved (Table 1, step 4), GA solutions were ‘equally fit’. Thus, we used equal weighting of the GA solutions in the MMI. In [6] it was shown that for stepwise regression using an information criterion for selection – as we used in [5] for deriving a consensus model from the GA ranking of variable frequencies – one should for MM use the biased ML estimators. An advantage of using MMI in combination with GAMM is that REML can still be used. Thus, using MMI, we could make a fair comparison between GAOLS and GAMM.
For estimation of the parameters for the final model, we used the following three MMI approaches on the GA solutions:
1. Refitting for a TOP selection of the GA ranking: from the GAranking, the variables with highest frequencies were retained for the final model, which was then refitted using OLS/MM.
2. Averaging of parameter estimates using all GA solutions (, if x_{k} not in GA solution) (MMI1):
with S the number of GA solutions.
3. Averaging of parameter estimates using GA solutions where (MMI2):
For the model averaging in 2 and 3, parameters were (re)fitted using OLS/MM for all m variables with presence at least once in a GA solution or for a TOP selection of variables in the GA ranking only.
LASSO
LASSO [9] is a regularization method that performs variable selection by constraining the size of the coefficients, also called shrinkage. By applying an L1 absolute value penalty, regression coefficients are ‘shrunk’ towards zero, forcing some of the regression coefficients to zero. Using the R package glmselect 1.93 [21], for the described example in this paper we performed variable selection using the LASSO technique on the clonal genotypephenotype database returning a LASSO ranking of variables (solution path) as selected by decreasing the amount of penalty applied. Besides using the shrinkage coefficients for variable estimation (default LASSO) we also applied OLS and MM to the LASSO selected variables (postLASSO [22]).
Results and discussion
GA parameter settings
We optimized the GA parameters one by one in the order (I  > VI) as described below, and taking computational efficiency into account (see Additional file 1). Tournament selection was used as selection method to form a new population of more fit individuals. GA parameters Pm and Pc were optimized together using a metaGA. Pe and pop.size were fixed in advance and were not optimized. Pe was set to conserve the best chromosome in three consecutive generations, followed by a generation where the probability of keeping the best chromosome was set to 20%. Pop.size was set equal to 20. To set the main GA parameters: max.generations, chrom.size, and num.runs we used crossvalidation (Additional file 1 point 7).
Additional file 1. GA parameter settings and computational efficiency. Computational efficiency is taken into account by setting the GA parameters one by one: I and II: Pm and Pc, IItournament.size, IIImax.generations, IVchrom.size, Vnum.runs (needed until goal.fitness is set in VI). Preset values are used for the parameters to be optimized later in the process. For deriving the GA parameters in III  > V we used 3 × 5fold crossvalidation.
Format: PDF Size: 13KB Download file
This file can be viewed with: Adobe Acrobat Reader
For running the GA, we used the R package GALGO 1.0.11 [23]. After inspection of the R^{2}_{CV} results, with exception of goal.fitness, we took the same optimized GA parameters values for GAOLS and GAMM (for the model comparison): pop.size = 20, chrom.size = 25, Pm = 0.1, Pc = 0.6, Pe = (1,1,1,0.2), max.generations = 500, tournament.size = 10, num.solutions = 100, goal.fitness.ols = 0.957, and goal.fitness.mm = 0.95. In Additional file 2 is the R code we used to derive these settings and to run GAMMMMI.
Additional file 2. R code. GAMMMMI source code in R programming language (R ≥ 2.15.2 from http://cran.rproject.org webcite).
Format: R Size: 28KB Download file
I. MetaGA for selection of Pm and Pc
For the metaGA optimization of the parameters Pm and Pc (Table 2), we used the R package gaoptim 1.0 with the default settings (except for pop.size_{meta}= 20, instead of 100 (default)) [24]. GAOLS was used as the metaGA fitness function returning the R^{2} from the best chromosome for the (Pm, Pc) combinations. Different random numbers were generated for each of the GAOLS runs, thus the same realvalued combination (Pm, Pc) with multiple presence in the metaGA population did not give the same fitness result. The fitness landscape from 2000 (pop.size_{meta} × num.generations_{meta}) points is shown in Figure 1.
Table 2. MetaGA optimization of Pm and Pc
Figure 1. GA parameter: mutation probability (Pm) and crossover probability (Pc). R^{2}fitness landscape from metaGA.
Crossover was a fairly weak genetic operator as can be seen from the red band in Figure 1. Oppositely, the mutation genetic operator was a strong operator and was best taken in the range [0.1, 0.4]. The metaGA converged at (Pm,Pc) = (0.258,0.372). For further evaluation in Section II, we also selected (0.1,0.6) and (0.2,0.6) located in the largest dark red area in Figure 1 (R^{2} > 0.91).
II.Tournament selection
Tournament selection [15,25] is a selection method to bias the selection towards the more fit individuals. Pop.size tournaments are organized with k randomly selected chromosomes. The winner of a tournament is the chromosome with the best fitness (highest R^{2}). The pop.size tournament winners become the new population. Selection pressure, the degree to which better individuals are favoured, is increased when the tournament size is increased, as the winner from a large tournament will, on average, have a higher fitness than the winner of a small tournament.
In the optimization (Table 3), all tournament sizes 1 ≤ k ≤ pop.size were considered. From section I, we selected the following (Pm,Pc) combinations for evaluation: (0.1,0.6), (0.2,0.6), and (0.258,0.372). We also considered (0.05,0.7), the (Pm,Pc) combination used in [5]. From Figure 2, to improve the R^{2}(OLS) performance the tournament.size k should be taken > 2. We chose to continue to use k = 10 (as preset in section I). Slightly better R^{2} performance was seen for the (Pm,Pc) combinations (0.1,0.6) and (0.2,0.6). The former was chosen for reasons of computational efficiency.
Table 3. GA parameter settings to evaluate tournament.size and ( Pm , Pc )
Figure 2. GA parameter: tournament size. Mean R^{2}(OLS) from best chromosomes (from 10 runs) on the training data for tournament selection with tournament.size 1–20 (pop.size).
III. Maximum number of generations
In Table 4, the GA settings for evaluating max.generations are summarized. Evaluation was done for both GAOLS and GAMM, calculating as the mean from 3 repetitions of 5fold crossvalidation: where was calculated as the correlation between the phenotype measurements of all observations in the database (contained exactly once in test set T_{ij}, j = 1…5) and their mean prediction (MMI1) of the 10 best chromosomes from GAOLS/GAMM (trained on train set TR_{ij} containing 4/5 of the subjects).
Table 4. GA parameter settings to evaluate max.generations
From Figure 3, it can be seen that, while the improvement in R^{2}_{CV} when increasing max.generations from 100 to 300 was larger for GAMM than for GAOLS, the R^{2}_{CV} performance for GAMM was found to be lower than for GAOLS. Stabilization of R^{2}_{CV} was seen for both GAOLS and GAMM for num.generations ≥ 400. We chose max.generations = 400 to be used further in the sections IV and V. Note that for the preset goal fitness = 1, max.generations was the number of generations always executed. For the model comparison, with the goal fitness calculated (Section VI), we set max.generations = 500 for both GAOLS and GAMM.
Figure 3. GA parameter: number of generations. R^{2}_{CV} from mean prediction of best chromosomes from 10 runs (3 × 5fold CV) (MMI1).
IV.Chromosome size
In Table 5 the GA settings for evaluating chrom.size are presented. Analogously as for num.generations, evaluation was done for GAOLS as well as GAMM, using 3 × 5fold crossvalidation (see section III).
Table 5. GA parameter settings to evaluate chrom.size
The R^{2}_{CV} performance is shown in Figure 4. Stabilization in performance was seen for both GAOLS and GAMM for chrom.size ≥ 25. We chose chrom.size = 25 to be used further. Thus, after optimizing chrom.size, the GAMM performance was now equal to the GAOLS performance (R^{2}_{CV} = 0.87).
Figure 4. GA parameter: chromosome size. R^{2}_{CV} from mean prediction of best chromosomes from 10 runs (3 × 5fold CV) (MMI1).
V. Number of GA runs
The GA settings for evaluating num.runs are shown in Table 6. Analogously as for max.generations and chrom.size, evaluation was done for both GAOLS and GAMM using 3 × 5fold crossvalidation (see section III).
Table 6. GA parameter settings to evaluate num.runs
In Figure 5 the R^{2}_{CV} performance is shown using all best chromosomes from num.runs in the model averaging (MMI1) (cf. sections III and IV), including the cases where the GA variable selection is done with MM and reestimation of the variables in the MMI is done with OLS and vice versa. A similar R^{2} performance was observed when using GAOLS or GAMM for the variable selection. However, a higher R^{2}_{CV} performance was observed when using OLS for estimation of the best chromosome parameters. The R^{2}_{CV} performance was stable for num.runs ≥ 10. When increasing num.runs from 100 to 500 for GAOLS variable selection, only a slight increase in R^{2}_{CV} performance was seen.
Figure 5. GA parameter: number of GA runs. R^{2}_{CV} from mean prediction of best chromosomes from num.runs (3 × 5fold CV) (MMI1).
In Figure 6 the R^{2}_{CV} performance is shown using only the ‘best’ best chromosome from num.runs for prediction in the crossvalidation. Overall, a similar R^{2}_{CV} performance was observed when using OLS or MM for estimation of the ‘best’ best chromosome parameters, and using GAMM or GAOLS for variable selection.
Figure 6. GA parameter: number of GA runs. R^{2}_{CV} from prediction of ‘best’ best chromosome from num.runs (3 × 5fold CV).
In Figure 7, the ‘x% best’ best chromosomes (shown on the xaxis in log scale) were used in the model averaging (MMI1). Evaluation was done for num.runs = 100 or 500 and num.runs = 50 or 100 for GAOLS and GAMM, respectively. For GAOLS, the highest R^{2}_{CV} was 0.89 and was obtained when including the ‘five best’ best chromosomes (top 1% chromosomes with highest R^{2}(OLS) from num.runs = 500). Also, for GAMM, including the ‘five best’ best chromosomes (top 5% of num.runs = 100 with highest R^{2}_{MM}) gave the highest R^{2}: 0.879 and 0.885 for MMIMM and MMIOLS, respectively. Thus, both for GAOLSMMI and GAMMMMI inclusion of the ‘five best’ best chromosomes yielded an improvement in R^{2}_{CV} performance in comparison to using all best chromosomes (Figure 5) or ‘the best’ best chromosome (Figure 6). As previously noted from Figure 5, also a better R^{2}_{CV} performance was again found using OLS estimation of parameters than when using MM estimation. For GAOLS num.runs = 100 was repeated 5 times (splitting the best chromosomes as available from num.runs = 500 in five consecutive parts for evaluation using MMI1). The mean curve of these 5 repeats is shown, together with the 95% confidence interval error bars. The peak of this mean curve is at ‘6% best’ best chromosomes included. The GAMM curve with MMIOLS estimation (num.runs = 100) is situated within these error bars. Thus, for num.runs = 100 GAMM and GAOLS perform equally well in selecting variables for the model in the crossvalidation. For GAOLS the R^{2}_{CV} performance using the ‘five best’ best chromosomes using num.runs = 500 is better than when using num.runs = 100. Therefore, calculation of the goal fitness for GAOLS in section VI was done from the num.runs = 500 best chromosomes. For calculation of the goal fitness for GAMM, num.runs = 100 was used. Note that, once the goal fitness was calculated, num.runs was set to NA (not applicable). Instead, the model comparison will be based on num.solutions = 100 (number of best chromosomes with R^{2} > goal.fitness) for both GAOLS and GAMM.
Figure 7. GA parameter: number of GA runs. R^{2}_{CV} from mean prediction of ‘x% best’ best chromosomes from num.runs (3 × 5fold CV) (MMI1).
VI.Goal fitness
As derived from Figure 7, for calculating the goal fitness we considered the fitness of the ‘5 best’ best chromosomes: this is the top 1% (of num.runs = 500) for GAOLS and the top 5% (of num.runs = 100) for GAMM. For each of the 15 (3 × 5) CV training sets we calculated the nonparametric onesided (1p,1α) tolerance upper limit [26] on the R^{2}fitness distribution of best chromosomes from num.runs with p = 1% and p = 5% for GAOLS and GAMM, respectively, and α = 0.05 (95% confidence). The interpretation is that with confidence level 1α not more than (100 × p)% of the best chromosomes have R^{2}fitness values exceeding this limit. To be able to calculate these tolerance limits the requested number of runs was [26]. This requirement was met for num.runs = 100 ≥ 59, and num.runs = 500 ≥ 299 for GAMM and GAOLS, respectively. The goal fitness was then calculated as the mean of the CV tolerance upper limits:
which equals goal.fitness = 0.957 for GAOLS and goal.fitness = 0.95 for GAMM. For the calculation we used the R package tolerance 0.5.2 [27].
GAOLS vs. GAMM: variable selection
GAOLS and GAMM variable selection were performed on the clonal genotypephenotype database using the GA parameters as specified in the above sections. The percentage of runs that failed reaching the goal.fitness with max.generations = 500 was 16% and 23.1% for GAOLS and GAMM, respectively.
Figure 8 shows the relation between the frequency of the variables selected in the GA using OLS and MM. While frequency differences were clearly observed (e.g. for 74M, 151I, 230R, 84L, 140S, 143C, 155S, and 140A), a strong correlation was obtained (R^{2} = 0.95). Eight integrase mutations were selected as variables in all 100 GA solutions for both GAOLS and GAMM: 92Q, 97A, 143G, 143R, 148H, 148K, 148R, 155H. This number would possibly be lower when increasing num.solutions, leading to nonselection for a few GA solutions. This was now already the case with num.runs = 100 for 66K (always selected by GAMM, 99/100 selected by GAOLS) and 121Y (always selected by GAOLS, 99/100 selected by GAMM).
Figure 8. Pearson correlation of GA frequency rankings using MM vs. OLS (R^{2 }= 0.95). Mutations with frequency > 20 for OLS or MM are labeled.
GAOLS and GAMM variable selection vs. LASSO
Figure 9 shows the comparison of the GAOLS and GAMM ranking with the LASSO top 50 ranking of variables, shown on the xaxis. The variables selected are integrase mutations, indicated as primary/secondary/other. Primary and secondary mutations have been associated with RAL resistance in [10]. Note that whereas a single primary mutation causes RAL resistance, the effect on resistance of secondary mutations not in a combination with a primary mutation is minor [5]. In Figure 9, most of the primary and secondary mutations had a high ranking for GAOLS, GAMM and LASSO. However, some of the ‘other’ mutations such as 66K, 121Y, 143G and 155S with presence in one or more of the publically available genotypic algorithms: ANRS (http://www.hivfrenchresistance.org webcite), Rega (http://regaweb.med.kuleuven.be webcite), and Stanford (http://hivdb.stanford.edu webcite), had a lower ranking for LASSO. We note that 66K, 121Y and 155S were introduced in the database as sitedirectedmutants not in a combination with other mutations [5], and LASSO was less sensitive in selecting these. Another observation was that for LASSO, the secondary mutations 140A and 140S were ranked higher than the primary mutations. Also, mutations not listed by any of the public algorithms such as 6E, 125A, and 200L, had a higher ranking for LASSO in comparison to GAOLS and GAMM. Oppositely, one of the “other” integrase mutations ranked higher by GAOLS and GAMM, and not listed by any of the public genotypic algorithms was 84L. In [5] we already discussed that its selection may result from a more complex interaction of three secondary mutations with which 84L cooccurred in the clones of one clinical isolate.
When we compared the GAOLS ranking with the GAMM ranking (Figures 8 and 9), a relatively low ranking was seen e.g., for GAOLS for 140A and 155S, which favours GAMM for its interpretation.
GAOLSMMI vs. GAMMMMI vs. LASSO: R^{2} performance on test set 1 and test set 2
In Tables 7 and 8 are the results of the R^{2} performance comparison of GAOLSMMI, GAMMMMI, and LASSO on the two test sets with n = 171 clinical isolates and n = 67 sitedirected mutants, respectively. Models containing the TOP151821242730 or ALL variables as selected by LASSO, GAOLS and GAMM were considered. Note that as randomness is incorporated in the GA optimization techniques there are more mutations with presence in at least one of the GA solutions, m = 193 and m = 200 for GAOLS and GAMM respectively, compared to m = 51 mutations with absolute value of the regression coefficients above zero in the LASSO solution path.
On test set 1, using MM for the variable estimation had a slightly better R^{2} performance than using OLS, for all models considered. Note that this was not the case in the crossvalidation (section V) where OLS R^{2}_{CV} performance was higher, possibly due to the inclusion of multiple clinical isolates from the same patient. However, as patient information was not given for the training set, we could not take this into account. For the TOP15/TOP18 models containing the smallest number of variables, the best performance was seen for GAMMMMI1 (R^{2} = 0.839). For the TOP21 > ALL models with more variables considered, the best performance was seen for GAOLSMMI1 (R^{2} = 0.8390.841). When estimating ALL GAOLS/GAMM variables, the worst performance was seen for MMI2 (R^{2} = 0.7010.725) where noise variables were clearly overweighted. For LASSO, the best R^{2} performance on test set 1 was obtained using MM for the variable estimation for the TOP15 > TOP24 selection of variables (R^{2} = 0.8240.835). For LASSO TOP27 > ALL, the best R^{2} performance was obtained using the LASSO shrinkage coefficients (R^{2} = 0.8260.828).
On test set 2, for the sparse models the best performance was observed for LASSOOLS TOP15 (R^{2} = 0.734), GAMMMM TOP18 (R^{2} = 0.770), and GAMMOLS TOP21 (R^{2} = 0.777). For the TOP21 > ALL models, the best performance was seen for LASSO (R^{2} = 0.7710.789). In contrast to the results for test set 1, the MMI2 R^{2} performance was now found to be higher than for MMI1, for the GAOLS/MM models. The reason is that while test set 1 consisted of clinical samples, with 82.5% not containing any of the primary RAL resistance mutations [5], test set 2 consisted of sitedirected mutants containing most of the known resistance patterns but lacking any noise variables as found in clinical samples. Nevertheless, on test set 2, the GAMM R^{2} values were found to be better than for GAOLS, confirming that a better selection of variables as made by GAMM (cf. the above two sections) led to a better performance on unseen data.
Therefore, on the example training set in this article we would favour the GAMMMMI TOP18 model. Based on the performance on test set 2, for the MMI variable estimation refitting using MM may be preferred over MMI1MM.
Conclusions
In this article, we extended our GA variable selection methodology to mixed models which account for clustering in the data. Using crossvalidation, we optimized the GA parameter settings taking also computational efficiency into account. For the workedout example, all settings could be taken equal for GAOLS and GAMM, with exception of goal.fitness for which we used a marginal R^{2} definition. The model parameters for prediction could then be estimated using MMIMM (REML) on the GA solutions obtained from 100 GA runs. When testing LASSO, GAOLS and GAMM on two unseen data sets, all methods had good performance. When imposing a parsimony restriction for better interpretability of the model, the GAMMMMI TOP18 model had better predictive accuracy (R^{2}) than GAOLS and LASSO.
In summary, we belief that GAMMMMI is a direct approach to derive a sparse and interpretable model for making predictions with good accuracy on small data sets with clustered observations and a large number of candidate variables, where chance of overfitting with standard regression techniques is high.
Availability and requirements
Project name: GAMMMMI.
Project home page: http://sourceforge.net/projects/gammmmi webcite.
Operating system: Platform independent.
Programming language: R ≥ 2.15.2, perl, MATLAB.
Other requirements: requires galgo 1.0.11 [23].
License: GNU GPL.
Any restrictions to use by nonacademics: none.
Competing interests
The authors declare that they have no competing interests
Authors’ contributions
KVdB developed the GAMMMMI approach, performed all analyses and wrote the manuscript. GV conceived of the study, assisted in its design, and revised the manuscript. HvV assisted in study design. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank the anonymous reviewers for their constructive comments to improve the manuscript. Financial support from the IAP research network #P7/06 of the Belgian Government (Belgian Science Policy) is gratefully acknowledged.
References

Orelien JG, Edwards LJ: Fixedeffect variable selection in linear mixed models using R^{2} statistics.
Comput Stat Data An 2008, 52:18961907. Publisher Full Text

Schelldorfer J, Bühlmann P, Van de Geer S: Estimation for highdimensional linear mixedeffects models using l1penalization.
Scand J Stat 2011, 38:197214. Publisher Full Text

Taylor JD, Verbyla AP, Cavanagh C, Newberry M: Variable selection in linear mixed models using an extended class of penalties.
Aust N Z J Stat 2012, 54:427449. Publisher Full Text

Hajjem A, Bellavance F, Larocque D: Mixedeffects random forest for clustered data.
J Stat Comput Simul 2012.
[http://dx.doi.org/10.1080/00949655.2012.741599 webcite]

Van der Borght K, Verheyen A, Feyaerts M, Van Wesenbeeck L, Verlinden Y, Van Craenenbroeck E, van Vlijmen H: Quantitative prediction of integrase inhibitor resistance from genotype through consensus linear regression modeling.
Virol J 2013, 10:8. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Verbeke G, Molenberghs G: Linear Mixed Models for Longitudinal Data. New York: Springer; 2000.

Zuur AF, Ieno EN, Walker N, Saveliev AA, Smith GM: Mixed Effects Models and Extensions in Ecology with R. New York: Springer; 2009.

Kutner MH, Nachtsheim CJ, Neter J, Li W: Applied Linear Statistical Models. New York: McGrawHill; 2004.

Tibshirani R: Regression Shrinkage and Selection via the Lasso.

FDA: Isentress (raltegravir) drug label.
2009.
[http://www.accessdata.fda.gov/drugsatfda_docs/label/2009/022145s004lbl.pdf webcite]

Affenzeller M, Winkler S, Wagner S, Beham A: Genetic Algorithms and Genetic Programming – Modern Concepts and Practical Applications. Boca Raton: CRC Press; 2009.

Butz MV: RuleBased Evolutionary Online Learning Systems – A Principled Approach to LCS Analysis and Design. Berlin: Springer; 2006.

Hopgood AA: Intelligent Systems for Engineers and Scientists. Boca Raton: CRC Press; 2001.

Sivanandam SN, Deepa SN: Introduction to Genetic Algorithms. Heidelberg: Springer; 2008.

Michalewicz Z: Genetic Algorithms + Data Structures = Evolution Programs. New York: Springer; 1996.

Edwards LJ, Muller KE, Wolfinger RD, Quaqish BF, Schabenberger O: An R^{2} statistic for fixed effects in the linear mixed model.
Stat Med 2008, 27:61376157. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kramer M: R^{2 }statistics for mixed models. In Proceedings of the 17th annual Kansas State University Conference on Applied Statistics in Agriculture: 2527 April 2005. Manhattan, Kansas. Kansas State University: ; 2005:148160.

Nakagawa S, Schielzeth H: A general and simple method for obtaining R^{2} from generalized linear mixedeffects models.
Methods Ecol Evol 2013, 4:133142. Publisher Full Text

Flack VF, Chang PC: Frequency of selecting noise variables in subset regression analysis: a simulation study.

Lukacs PM, Burnham KP, Anderson DR: Model selection bias and Freedman’s paradox.
Ann Inst Stat Math 2010, 62:117125. Publisher Full Text

Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent.
J Stat Softw 2010, 33:1.
R package version 1.93. [http://CRAN.Rproject.org/package=glmnet webcite]
PubMed Abstract  PubMed Central Full Text 
Belloni A, Chernozhukov V: Least squares after model selection in highdimensional sparse models.
Bernoulli 2013, 19:521547. Publisher Full Text

Trevino V, Falciani F: GALGO: an R package for multivariate variable selection using genetic algorithms.
Bioinformatics 2006, 22:11541156.
R package version 1.0.11. [http://biptemp.bham.ac.uk/vivo/galgo/AppNotesPaper.htm webcite]
PubMed Abstract  Publisher Full Text 
2013. [gaoptim: Genetic Algorithm optimization for realvalued problems]
R package version 1.0. [http://CRAN.Rproject.org/package=gaoptim webcite]

Miller BL, Goldberg DE: Genetic algorithms, tournament selection, and the effects of noise.

Krishnamoorthy K, Mathew T: Statistical Tolerance Regions: Theory, Applications, and Computation. Hoboken, NJ: John Wiley & Sons; 2009.

Young DS: An R package for estimating tolerance intervals.
J Stat Softw 2010, 36:5.
R package version 0.5.2. [http://CRAN.Rproject.org/package=tolerance webcite]