Abstract
Background
A central goal of genomics is to predict phenotypic variation from genetic variation. Fitting predictive models to genomewide and whole genome single nucleotide polymorphism (SNP) profiles allows us to estimate the predictive power of the SNPs and potentially develop diagnostic models for disease. However, many current datasets cannot be analysed with standard tools due to their large size.
Results
We introduce SparSNP, a tool for fitting lasso linear models for massive SNP datasets quickly and with very low memory requirements. In analysis on a large celiac disease case/control dataset, we show that SparSNP runs substantially faster than four other stateoftheart tools for fitting large scale penalised models. SparSNP was one of only two tools that could successfully fit models to the entire celiac disease dataset, and it did so with superior performance. Compared with the other tools, the models generated by SparSNP had better than or equal to predictive performance in crossvalidation.
Conclusions
Genomic datasets are rapidly increasing in size, rendering existing approaches to model fitting impractical due to their prohibitive time or memory requirements. This study shows that SparSNP is an essential addition to the genomic analysis toolkit.
SparSNP is available at http://www.genomics.csse.unimelb.edu.au/SparSNP webcite
Background
One of the challenges raised by recent advances in the genomics of complex phenotypes is the prediction of phenotype given genotype, such as prediction of disease from SNP data. Successful identification of SNPs strongly predictive of disease promises a better understanding of the biological mechanisms underlying the disease, and has the potential to lead to early disease diagnosis and preventative strategies. The question of predictive ability is also closely related to the proportion of phenotypic and genetic variance that can be explained by common SNPs and the lively debate surrounding the “missing heritability” of many complex diseases [1]. To quantify the genetic effect, we must fit a statistical model to all SNPs simultaneously. Lassopenalised models [2] are well suited to this task, since they perform variable selection — some model weights are exactly zero and thus excluded from the model. In this way, lasso models remove the need for screening SNPs based on univariable statistics prior to fitting a multivariable model [3].
However, fitting models to genomewide or wholegenome data is challenging since such studies typically assay thousands to tens of thousands of samples and hundreds of thousands to millions of SNPs. With standard analysis tools, modelling genomewide and whole genome data is either impossible or extremely inefficient. For example, most existing analysis tools require loading the entire dataset into memory prior to fitting the models, which is both timeconsuming and requires large amounts of memory to store the data and fit the models. In order to perform simultaneous modelling of SNP variation across the genome and build predictive models of disease and phenotype, it is clear that there is a need for new tools that are fast, not memory intensive, and easy to use.
Here, we present the tool SparSNP, which is an efficient implementation of lassopenalised linear models. SparSNP can fit lasso models to largescale genomic datasets in minutes using small amounts of memory, outperforming equivalent inmemory methods. Thus, SparSNP makes it practical to analyse massive datasets without the use of specialised computing hardware or cloud computing. SparSNP produces crossvalidated model weights that can be used to select the top predictive SNPs. SparSNP also allows the resulting models to be evaluated for predictive power and phenotypic/genetic variance explained. The main features of SparSNP are:
· implementation of _{ℓ1}penalised linear regression for continuous traits and _{ℓ1}penalised classification for binary traits;
· speed — SparSNP fits models to data with 10^{4} samples and 5×1^{05}SNPs in <10 minutes on standard hardware;
· small (and tunable) amounts of memory are required: ∼1GiB for the datasets analysed here;
· compatibility with PLINK [4] BED (SNPmajor ordering) and FAM files (single phenotype);
· crossvalidation is performed natively, removing the need to manually split datasets;
· convenient external validation if an independent dataset is available;
· conveniently produces a set of models with increasing numbers of SNPs in each model, allowing for model selection based on crossvalidated predictive performance;
· calculates the area under receiveroperatingcharacteristic curves (AUC) and explained phenotypic or genetic variance, in crossvalidation or on validation datasets.
An outline of the SparSNP analysis pipeline is shown in Figure 1. See Additional File 1 for details of the analysis workflow.
Figure 1. SparSNP analysis pipeline. An example pipeline for analysing a SNP discovery dataset with SparSNP and testing the model on a validation dataset. Most of the data preparation and processing can be done with PLINK
Additional file 1. An example of a SparSNP workflow, covering basic quality control, training the model on discovery data, applying the model to validation data, plotting the results, and postprocessing.
Format: PDF Size: 207KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 2. Details of the implementation of SparSNP and other supplementary results.
Format: PDF Size: 232KB Download file
This file can be viewed with: Adobe Acrobat Reader
Results
To assess the performance of SparSNP and compare it with existing methods, we used a celiac disease case/control dataset [5], consisting of N = 11,940 samples from five European populations (Italian, Finnish, two British, and Dutch), with p = 516,504 autosomal SNPs. The data processing and quality control have been described in the original publication.
We performed two types of comparisons, one for computing speed and one for predictive performance. For the speed comparison, we timed the process of fitting the model to data subsets of different sizes. For the predictive comparison, we used crossvalidation for one population (Finns) over a grid of hyperparameters. We compared the following methods:
· SparSNP 0.89^{a}, with _{ℓ1}penalised squared hinge loss, implemented in C(see Additional Files 3 and 4);
Additional file 3. SparSNP v0.89, statically linked version for 64bit Linux x8664.
Format: BZ2 Size: 2.9MB Download file
Additional file 4. SparSNP v0.89, version for 64bit Mac OS X.
Format: BZ2 Size: 304KB Download file
· glmnet 1.7 [6]^{b}, with logistic loss (binomial family), implemented as a Fortran library for [7]. glmnet implements two variants of cyclical coordinate descent, together with warm restarts and active set convergence;
· LIBLINEAR 1.8 [8]^{c}, with _{ℓ1}penalised squared hinge loss (model 5), implemented in C++. LIBLINEAR uses coordinate descent to optimise the loss function;
· LIBLINEARCDBLOCK [9]^{d} with block _{ℓ2}regularised squared hinge loss (model 1), implemented in C++. LIBLINEARCDBLOCK splits the data into blocks of user defined size, loads each one into memory, and performs a blockwise optimisation of the loss function;
· and HyperLasso [10]^{e}, logistic regression with the double exponential (DE) prior (equivalent to lasso), implemented in C++. HyperLasso implements cyclical coordinate descent as well.
All five methods assumed a model that is additive in the minor allele dosage {0, 1, 2}.
Table 1 summarises the relative strengths and weaknesses of the five methods evaluated here, with respect to memory requirements, speed, best AUC in prediction, whether the tool successfully fitted a model to the largest dataset, the number of genetic models available, and ease of use. We now present these results in detail.
Table 1. Comparison of the evaluated methods
Figure 2. Timing experiments. Time (in seconds) for model fitting, over subsamples of the celiac disease dataset, taken as the minimum time over 10 independent runs. The inset panel shows the results for 50,000 SNPs in more detail, note the different scales. For inmemory methods we included the time to read the data into memory. For SparSNP and glmnet we used a penalty grid of size 20, and a maximum model size of 2048 SNPs. LIBLINEAR (denoted “LLL1”) and LIBLINEARCDBLOCK (denoted “LLCDL2”) induced one model with C = 1. LIBLINEARCDBLOCK used m = 50 blocks. For some datasets, glmnet and LIBLINEAR did not complete and these running times are not shown. HyperLasso is not shown since it took much longer to complete than the other methods
Figure 3. Prediction experiments. LOESSsmoothed AUC and explained phenotypic variance (denoted “VarExp”), for the Finnish celiac disease dataset, for increasing model sizes. AUC is estimated over 20×3fold crossvalidation, except for HyperLasso for which we ran only 2×3fold crossvalidation due to the high computational cost. The explained phenotypic variance is estimated from the AUC using the method of [11], assuming a population prevalence of celiac disease K=1%. Note that glmnet, HyperLasso, LIBLINEAR (denoted “LLL1”), and SparSNP used an _{ℓ1}penalised model, whereas LIBLINEARCDBLOCK (denoted “LLCDL2”) used an _{ℓ2}penalised model (non sparse), inducing a model using all 516,504 SNPs, therefore it is shown as a horizontal line across all model sizes. Note that tuning the _{ℓ2}penalty for LIBLINEARCDBLOCK resulted in very similar AUC
SparSNP makes possible rapid, lowmemory analysis of massive SNP datasets
SparSNP consistently outperformed the other methods when fitting models (Figure 2). We ran all methods on random subsets of the celiac disease dataset, consisting of randomly selected subsets of the data with p = {50,000, 250,000, 500,000} SNPs and N = {1000, 5000, 10,000} samples, a total of nine subsets. This process was independently repeated 10 times. Only SparSNP and LIBLINEARCDBLOCK could fit models to datasets with >5000 samples and 250,000 SNPs, and they were the only tools that could fit models to >1000 samples and 500,000 SNPs on a machine with 32GiB RAM. It is important to note that the aforementioned data sizes would be considered quite small by current standards. Also note that in contrast with SparSNP, LIBLINEARCDBLOCK does not implement an _{ℓ1}penalised model but a standard _{ℓ2}penalised support vector machine (SVM), which is not a sparse model, and does not produce solutions over a grid of model sizes; instead, a computationally expensive scheme such as recursive feature elimination (RFE) [12] would be required in order to find sparse models, but we did not use RFE here. Of the remaining methods, LIBLINEAR and glmnet did not complete all experiments due to running out of memory (on a 32GiB RAM machine) or due to the data exceeding the limit on matrix sizes in (a maximum of ^{231}−1 elements). HyperLasso took much longer to complete: ∼ 2 hours for the 1000 sample/500,000 SNP subset and ∼ 69 hours for the 10,000 sample/500,000 SNP subset. Therefore, the timing for HyperLasso is not shown.
We emphasise that these results are for one run over the data — in practice, crossvalidation is used to guide model selection and evaluate the generalisation error of a model. Run times for crossvalidation would be higher yet — 3fold crossvalidation repeated 10 times would take approximately 20 times longer, ∼ 22 and ∼ 4 hours for LIBLINEARCDBLOCK and SparSNP, respectively, over the largest subset (with SparSNP also outperforming LIBLINEARCDBLOCK in terms of prediction as we show in the next section) — making the differences in speed even more important. Also note the difference in the number of models fitted: both SparSNP and glmnet use a warm restart strategy, computing a separate model for each penalty in a grid (we used 20 penalties), resulting in a path of 20 separate models with different sizes, whereas LIBLINEAR, LIBLINEARCDBLOCK, and HyperLasso computed only one model based on one penalty.
SparSNP produces models of better or comparable predictive ability
We used the Finnish subset of the celiac disease dataset (N = 2476 samples, p = 516,504 SNPs) to evaluate predictive performance of the models in 3fold crossvalidation. We measured predictive ability with the area under the receiver operating characteristic curve (AUC) [13], where AUC ranges from 0 (perfectly wrong prediction) to 1 (perfect prediction), with AUC = 0.5 being equivalent to random prediction (no predictive power). AUC also has the probabilistic interpretation as the probability of correctly ranking the risk of the cases higher than the risk for the controls, for a randomly selected pair of cases and controls. From the AUC we also estimated the explained proportion of phenotypic variance [11], assuming a population prevalence for celiac disease of K=1%. We did not evaluate the predictive ability over the entire celiac dataset, as it consists of several populations of different ethnic background, and case/control status may be confounded by effects such as population stratification.
SparSNP induced models with AUC of up to 0.9 and explained phenotypic variance of up to ∼40%(Figure 3), almost identical to that of glmnet, except for small differences at the extremes of the λpath; the differences may be due to the fact that SparSNP and glmnet use different loss functions and have different parameters such as convergence tolerances. LIBLINEAR showed maximum AUC similar to glmnet and SparSNP, but much lower AUC for smaller number of SNPs in the model. LIBLINEARCDBLOCK showed consistently lower AUC over the range of costs used: a grid of 30 costs C∈[1^{0−4},1^{03}]. Varying the costs did not substantially change the AUC. Since LIBLINEARCDBLOCK used an _{ℓ2}SVM, which does not induce sparse models and does not natively produce a range of model sizes, we show results for a model with all 516,504 SNPs, averaged over all penalties.
Due to the high computational cost of running HyperLasso, we were not able to run as comprehensive a grid search; therefore, we performed only two replications of 3fold crossvalidation, using the DE prior with parameter λ={2,4,8,10,12,14,16,18,20} over 10 iterations (10 posterior modes), and averaged the AUC over the modes.
Importantly, while SparSNP achieved AUC better than or comparable to the other approaches, the resources consumed were far from being equal — SparSNP performed 3fold crossvalidation using a total of about 1 GiB of RAM, whereas LIBLINEAR required about 24GiB, and glmnet used up to 27GiB (the total number of samples used in the crossvalidation training phase is ∼ 1650, or 2/3 of the total Finnish subset). Both LIBLINEARCDBLOCK and HyperLasso used low amounts of memory: LIBLINEARCDBLOCK used about 210MiB of RAM (using 50 diskbased blocks), and HyperLasso used a maximum of only 2GiB (roughly the size of the training data), however, it was by far the slowest.
We also evaluated how the SNPs found by each method agreed with each other, when tuned to select 128 SNPs on the entire Finnish dataset (Figure 1 in Additional file 2), Overall, the sparse methods had high correlations of their SNP weights (≥0.86), with lower correlations for LIBLINEARCDBLOCK (0.21–0.23). SparSNP and glmnet shared a high proportion of the 128 nonzero weight SNPs (0.71), whereas LIBLINEARL1 and HyperLasso shared lower proportions of their selected SNPs (0.23–0.43), and LIBLINEARCDBLOCK shared the lowest proportions with the others (0.09–0.12). The differences in the selected SNPs are likely due to differences in the loss functions, differences in the penalisation (_{ℓ1} versus _{ℓ2}), the numerical properties of each optimisation method, and high LD, as the lasso tends to select one SNP of out a group of highly correlated SNPs.
Discussion
Many genetic datasets have assayed thousands of samples over hundreds of thousands of SNPs. Currently, sample collections are expanding in multiple ways: by increasing sample numbers, by imputing millions of SNPs with finescale reference panels, and by performing wholegenome sequencing. There is thus a pressing need for analytical tools which are capable of handling such massive amounts of data. Here, we have presented SparSNP, a tool to rapidly perform phenotype prediction and variance estimation from massive amounts of SNP data.
The main bottleneck in the analysis is the large amounts of RAM required to fit models, which may not be feasible or accessible to many users. SparSNP incorporates multiple computational strategies to minimise the amount of RAM required. Even when such memory is available, the time taken to read the data from disk becomes the bottleneck, rather than the fitting process itself. Thus, the time taken to analyse the data may be long enough to preclude a comprehensive analysis of the data, such as multiple rounds of crossvalidation or experimenting with various model parameters. In contrast, SparSNP makes it possible to rapidly analyse such datasets — 10 replications of 3fold crossvalidation of a 10,000sample/500,000 SNP dataset can be performed in about 2 hours, requiring only ∼ 1GiB RAM. This time can be further reduced by running multiple instances in parallel on a compute cluster. While the celiac disease dataset analysed here is quite large, recent genomewide studies are larger still, involving 1–6 million SNPs, either by direct assay or by imputation from HapMap [14,15] or 1000Genomes [16]. The number of samples in current datasets is larger as well, and likely to continue growing into the hundreds of thousands. For such studies, fitting multivariable models using current methods is not feasible with standard tools. SparSNP is scalable in terms of memory requirements, and yet is faster than comparable approaches, making it suitable for analysing such datasets. Further, SparSNP achieved similar or better predictive ability than other approaches. Our future work will include adding the ability to include external variables such as age and sex and other clinical phenotypes in the model, having multiple genetic models (additive, dominant/recessive, heterozygous), the ability to model multiple phenotypes, and potentially implementing different loss functions such as Cox survival models.
Implementation
SparSNP implements an efficient outofcore version of the cyclical coordinate descent method [6,17] for minimising _{ℓ1}penalised loss functions. Here, we briefly discuss the main steps in th fitting process. See Additional File 2 for the details of the computational procedure.
_{ℓ1}penalised loss functions
The problem of fitting linear models can be cast as minimising a convex loss function L. The squared loss function over N samples in p variables is used for linear regression and is defined as
where are the inputs for the ith sample, is the Nvector of outputs, is the intercept, is a pvector of model weights. and λ≥ is the userchosen penalty. Another loss function useful in classification is squaredhinge loss, which is equivalent to a leastsquares support vector machine with a linear kernel [18], and defined as
where _{yi}∈{−1, + 1}. SparSNP uses the squared hinge loss as the classification model for case/control data.
Outofcore coordinate descent
_{ℓ1}regression is a convex optimisation problem. However, in general, it has no analytical solutions, and must be solved numerically. We use a variant of coordinate descent to numerically minimise the loss function.
In coordinate descent [6,17,19], each variable is optimised with respect to the loss function using a univariable Newton step, while holding the other variables fixed. Since the updates are univariable, computation of the first and second derivatives is fast and simple (we assume that all our loss functions are twicedifferentiable). The _{ℓ1}penalisation is achieved using soft thresholding[17] of each estimated weight
where is the Newton step with respect to _{βj} and S(·,·) is the soft thresholding operator
Model selection
The λpenalty tunes the model complexity, and can be selected in several ways. The simplest way is to leave it fixed at some arbitrary value, however, this may result in suboptimal performance if the number of selected variables is too small or too large. A second way is to prespecify the number of nonzero SNPs required, and then perform binary search for the λpenalty that produces the required number of SNPs [3]; SparSNP does not support this option. The third and recommended way is to use crossvalidation, as implemented in SparSNP, and choose a model or set of models that maximise the crossvalidated AUC. These models can then be tested on an independent validation dataset to get unbiased estimates of AUC. We have found crossvalidation to work well for genetic datasets of moderate to large size such as those used here, such that the training and test subsets are large enough and the case/control phenotypes are roughly balanced in the data. For highly imbalanced classes, other schemes such as stratified crossvalidation may be required; these are currently not implemented in SparSNP.
Experimental setup
We employed two experimental setups, one for timing comparisons and another for comparisons of predictive ability.
Timing experiments
For input, SparSNP used the PLINK BED/FAM files, glmnet read unpacked binary versions of the BED files into (one genotype per byte), LIBLINEAR read text files in LIBSVM sparse format, and HyperLasso read text files in its requisite text format. The time taken to convert the PLINK BED files into the appropriate formats was considerable, but not included in the timings reported here. For SparSNP, the timings include the time to scale the genotypes to zeromean/unitvariance, fit the model, and write the model weights to disk. For glmnet, LIBLINEAR, and HyperLasso, the timings include the time to read the genotypes from disk, fit the model, and write the model to disk where appropriate. For LIBLINEARCDBLOCK, timing included the time to split the data into blocks and fit the model to the data. Since HyperLasso was considerably slower than the other methods, we only ran two simulations with it. The largest models allowed was 2048 nonzero variables (excluding the intercept). For LIBLINEAR, we used one default cost C = 1. LIBLINEARCDBLOCK used m = 50 blocks with warm restarts. For HyperLasso, we used the double exponential (DE) with hyperparameter λ=1, for one iteration.
We ran all timing experiments on a 2.6Ghz dualCPU dualcore AMD Opteron 2218 machine with 32GiB RAM, running 64bit Ubuntu Linux 8.04.4 with local disk drives.
Prediction experiments
For SparSNP and glmnet we used a grid of 20 decreasing lasso penalties λ, such that decreasing penalties induce models with more nonzero variables up to the maximum allowed number of 1024. For LIBLINEAR we used a grid of 30 costs C∈[1^{0−4},1^{03}], since the penalty is expressed as the cost C=1/λ. Inputs for SparSNP and glmnet were scaled to zero mean and unit variance. For LIBLINEAR and LIBLINEARCDBLOCK, inputs were scaled to the range [1, +1] using from LIBSVM.
We ran all prediction experiments on an Intel Xeon X5550 2.67Ghz machine with 48GiB RAM, running Red Hat Enterprise Linux Server 5.6, with networked disk drives.
Data preparation
As with any association method, genotypephenotype associations that are in the data but not truly of biological origin, such as batch effects, may confound the analysis, potentially resulting in the detection of spurious associations and inflation of the apparent predictive ability. The use of an independent validation dataset is highly recommended.
Missing data
For convenience, SparSNP implements simple random imputation for missing genotypes, where missing genotypes are randomly replaced with a genotype {0, 1, 2} (with probability 1/3 each). When the proportion of missingness is small and the genotypes are missing at random (for example, no differential missingness between cases and control), such a simple approach does not substantially affect the predictive ability and does not introduce significant spurious associations. However, when missingness is high or differentiated between cases and controls, spurious associations can arise and we recommend either using PLINK to filter SNPs and samples with high missingness, or alternatively, imputing the missing data using a more sophisticated method such as Beagle [20], IMPUTE [21], or MACH [22].
Confounding effects
SparSNP does not account for possible batch effects, which must be accounted for at the quality control stage. Nor does SparSNP currently account for confounders such as population stratification, admixing, or cryptic relatedness; EIGENSTRAT [23] and PLINK can be used to detect these and to filter the data accordingly.
Genetic Models
SparSNP implements models additive in the minor allele dosage {0, 1, 2}. Other models, such as dominant/recessive models or interaction models are currently not supported.
Applying models to new data
SparSNP produces text files containing the model weights for each SNP, and can be used in prediction mode to read these weights, together with another BED file, to produce predictions for other datasets. Model weights are with respect to the minor allele dosage for the training data, and the reference allele may be different in another dataset, possibly resulting in reversal in the sign of the SNP effect. In addition, both the discovery and validation datasets must contain the same SNPs in the same ordering (marker names are not important). We recommend using PLINK to ensure that both the discovery and validation datasets contain the same SNPs and are encoded using the same reference alleles.
Availability and requirements
Project name: SparSNP Project home page:http://www.genomics.csse.unimelb.edu.au/SparSNP webciteOperating system(s): 64bit Linux and Mac OS X Programming language: NA Other requirements: Bash, License: binaries only, redistribution is allowed,see website. Any restrictions to use by nonacademics:no restrictions
Endnotes
^{a}http://www.genomics.csse.unimelb.edu.au/SparSNP webcite^{b}http://cran.rproject.org/web/packages/glmnet webcite^{c}http://www.csie.ntu.edu.tw/∼cjlin/liblinear webcite^{d}http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/cdblock webcite^{e}http://www.ebi.ac.uk/projects/BARGEN/download/HyperLasso webcite
Abbreviations
SNP, Single nucleotide polymorphism; GiB, Gibibyte (230 bytes); AUC, Area under receiver operating characteristic curve; I/O, Input/output; RAM, Random access memory.
Competing Interests
The authors declare that they have no competing interests.
Author’s contributions
GA, AK, JZ, and MI conceived and designed the experiments. GA implemented the software and performed the experiments. GA, AK, JZ, and MI analysed the data. All authors read and approved the manuscript.
Acknowledgements
Thanks to David van Heel (QMUL) for supplying the celiac disease data, and to the Victorian Life Sciences Computing Initiative (VLSCI) for providing computing facilities under project VR0126. Funding: MI was supported by an NHMRC Postdoctoral Fellowship (no. 637400). This work was supported by the Australian Research Council, and by the NICTA Victorian Research Laboratory. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications, and the Digital Economy, and the Australian Research Council through the ICT Centre of Excellence program. This work was made possible through Victorian State Government Operational Infrastructure Support and Australian Government NHMRC IRIIS.
References

Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases.

Tibshirani R: Regression Shrinkage and Selection via the Lasso.

Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression.

Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for wholegenome association and populationbased linkage analyses.

Dubois PCA, Trynka G, Franke L, Hunt KA, Romanos J, Curtotti A, Zhernakova A, Heap GAR, Ádány R, Aromaa A, Bardella MT, van den Berg LH, Bockett NA, de la Concha EG, Dema B, Fehrmann RSN, FernándezArquero M, Fiatal S, Grandone E, Green PM, Groen HJM, Gwilliam R, Houwen RHJ, Hunt SE, Kaukinen K, Kelleher D, KorponaySzabo I, Kurppa K, Macmathuna P, Mäki M, Mazzilli MC, Mccann OT, Mearin ML, Mein CA, Mirza MM, Mistry V, Mora B, Morley KI, Mulder CJ, Murray JA, Núñez C, Oosterom E, Ophoff RA, Polanco I, Peltonen L, Platteel M, Rybak A, Salomaa V, Schweizer JJ, Sperandeo MP, Tack GJ, Turner G, Veldink JH, Verbeek WHM, Weersma RK, Wolters VM, Urcelay E, Cukrowska B, Greco L, Neuhausen SL, McManus R, Barisani D, Deloukas P, Barrett JC, Saavalainen P, Wijmenga C, van Heel DA: Multiple common variants for celiac disease influencing immune gene expression.

Friedman J, Hastie T, Tibshirani R: Regularization Paths for Generalized Linear Models via Coordinate Descent.

R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2011.

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ: LIBLINEAR: A Library for Large Linear Classification.

Yu HF, Hsieh CJ, Chang KW, Lin CJ: Large linear classification when data cannot fit in memory. In 16th ACM KDD. , ; 2010.

Hoggart CJ, Whittaker JC, Iorio MD, Balding DJ: Simultaneous analysis of all SNPs in genomewide and resequencing association studies.
PLoS Genet 2008, 4::e1000130. Publisher Full Text

Wray NR, Yang J, Goddard ME, Visscher PM: The Genetic Interpretation of Area under the ROC Curve in Genomic Profiling.
PLoS Genet 2010, 6::e1000864. Publisher Full Text

Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector Machines.

Hanley JA, McNeil BJ: The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve.

International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs.

International HapMap 3 Consortium: Integrating common and rare genetic variation in diverse human populations.

1000 Genomes Project Consortium: A map of human genome variation from populationscale sequencing.

Friedman J, Hastie T, Höfling H, Tibshirani R: Pathwise coordinate optimization.

Chang KW, Hsieh CJ, Lin CJ: Coordinate Descent Method for Largescale L2loss Linear Support Vector Machines.

Van der Kooij AJ: Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations.
PhD thesis. Faculty of Social and Behavioural Sciences, Leiden University 2007.
[http://openaccess.leidenuniv.nl/dspace/handle/1887/12096 webcite]

Browning SR, Browning BL: Rapid and accurate haplotype phasing and missingdata inference for wholegenome association studies by use of localized haplotype clustering.

Howie BN, Donnelly P, Marchini J: A Flexible and Accurate Genotype Imputation Method for the Next Generation of GenomeWide Association Studies.
PLoS Genet 2009, 5::e1000529. Publisher Full Text

Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR: MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes.

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, et al.: Principal components analysis corrects for stratification in genomewide association studies.