Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Härtelstr. 16–18, D-04107 Leipzig, Germany

Faculdade de Economia e Gestão & CEGE, Catholic University of Portugal, Rua Diogo Botelho 1327, 4169-005 Porto, Portugal

Abstract

Background

Identification of causal SNPs in most genome-wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, modern, computationally expensive regression methods that consider all markers simultaneously, and thus incorporate dependencies among SNPs, are increasingly employed for SNP selection.

Results

We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs.

Conclusions

Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here are available from

Background

Genome-wide association studies (GWAS) are now routinely conducted to search for genetic factors indicative of or even causally linked to disease. Typically, the aim of such a study is to identify a small subset of single nucleotide polymorphisms (SNPs) associated with a phenotype of interest. From an analysis point of view, the screening for relevant biomarkers is best cast as a problem of statistical variable selection. In GWAS, variable selection is very challenging because the full set of SNPs is often very large while both the number of potentially causal SNPs and the effect of each are very small (e.g.

To date, most GWAS are based on single-SNP analyses where each SNP is considered independently of all others and association with the phenotype is computed using a univariate test statistic such as variants of the

In order to increase statistical efficiency and to exploit the correlation among predictive SNPs several authors have recently started to investigate simultaneous SNP selection using fully multivariate approaches. This was pioneered for GWAS in the seminal paper of

Recently, to address the problem of variable importance and selection under correlation in genomics, we have introduced two novel statistics, the correlation-adjusted

Here, we develop a novel multivariate algorithm for large scale SNP selection using CAR score regression. Specifically, we propose a computationally efficient procedure that allows for shrinkage estimation of CAR scores even for very high-dimensional data sets. Subsequently, we conduct a systematic comparison of state-of-the-art simultaneous SNP selection procedures using data from the GAW17 consortium

Methods

Univariate ranking of SNPs

The basic setup we consider here is a linear regression model for a set of **X** = {

If there is no correlation among SNPs (i.e. **P** =

CAT and CAR score

In many important settings the correlations **P** do not vanish but rather represent additional structure relating the predictors. In the case of SNPs the correlation may be rather large, e.g. due to linkage effects

To this end we have proposed a simple modification of the

where **P**

Correspondingly, in

The squared CAR scores sum up to the squared multiple correlation coefficient

also known as the coefficient of determination or the proportion of variance explained. Because of this decomposition property, CAT and CAR scores allow importance to be assigned not only to individual SNPs but also to groups of SNPs. Moreover, both CAT and CAR scores share a grouping property that leads to similar scores for highly correlated SNPs. In addition, they protect against antagonistic SNPs: if two SNPs are highly correlated and one has a protective and the other a risk effect, then both SNPs are assigned low scores.
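The decomposition, grouping, and antagonism properties can be checked numerically for the smallest non-trivial case of two SNPs. The sketch below is illustrative only and rests on stated assumptions: the CAR score vector is taken to be ω = P^(-1/2) P_XY, where P is the correlation matrix among the predictors and P_XY holds the marginal correlations with the response, and the relation P_XY = P b links marginal correlations to standardized regression coefficients b. The function names are hypothetical and are not part of any package mentioned in this paper.

```python
from math import sqrt

def car_scores_two_snps(rho, r1, r2):
    """CAR scores omega = P^(-1/2) P_XY for two predictors (assumed definition).

    rho is the correlation between the two predictors, r1 and r2 are their
    marginal correlations with the response.
    """
    # P = [[1, rho], [rho, 1]] has eigenvalues 1 + rho and 1 - rho with
    # eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2), hence
    # P^(-1/2) = [[a, b], [b, a]] with a and b as below.
    plus, minus = 1 / sqrt(1 + rho), 1 / sqrt(1 - rho)
    a, b = (plus + minus) / 2, (plus - minus) / 2
    return a * r1 + b * r2, b * r1 + a * r2

def r_squared_two_snps(rho, r1, r2):
    """Squared multiple correlation P_XY^T P^(-1) P_XY for two predictors."""
    return (r1 ** 2 - 2 * rho * r1 * r2 + r2 ** 2) / (1 - rho ** 2)

def marginal_from_coefficients(rho, b1, b2):
    """Marginal correlations P_XY = P b for standardized coefficients b."""
    return b1 + rho * b2, rho * b1 + b2

# Grouping: two highly correlated SNPs with equal marginal correlation
# receive identical CAR scores.
grouped = car_scores_two_snps(0.9, 0.3, 0.3)

# Decomposition: the squared CAR scores sum to the squared multiple correlation.
decomposition_gap = sum(w * w for w in grouped) - r_squared_two_snps(0.9, 0.3, 0.3)

# Antagonism: opposite effects of equal size (+0.5 and -0.5). When the SNPs
# are uncorrelated the scores equal the effects; under high correlation
# both scores shrink towards zero.
uncorrelated = car_scores_two_snps(0.0, *marginal_from_coefficients(0.0, 0.5, -0.5))
antagonistic = car_scores_two_snps(0.9, *marginal_from_coefficients(0.9, 0.5, -0.5))

print(grouped, decomposition_gap)
print(uncorrelated, antagonistic)
```

For more than two SNPs the closed form for P^(-1/2) used here is no longer available, and a general matrix square root is required.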

For model selection using CAT and CAR scores, i.e. for identification of those SNPs that do not contribute to predicting the response

In previous work we have shown for synthetic data as well as for data from metabolomic and gene expression experiments that CAT and CAR scores are effective multivariate criteria for obtaining compact yet highly predictive feature sets. Independently, in the study of

However, with increasing dimension, **P** becomes prohibitively large both to compute and to handle effectively. As a result, in high dimensions direct calculation of CAT and CAR scores using Eq. 1 and Eq. 2 is not possible. Thus, for application in high-dimensional data such as from GWAS an alternative means of computation must be developed.

Computationally efficient calculation of shrinkage estimators of CAT and CAR scores

If the number of observations **P**. A simple shrinkage estimator

where **R**

Using singular value decomposition the empirical correlation matrix can be written **R**

Following **R** using

This implies we only have to compute the matrix power of the **I**

Consequently, Eq. 3 makes it possible to obtain shrinkage estimates of CAT and CAR scores efficiently even in high dimensions, as none of the matrices employed in Eq. 3 is larger than **R**.
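The computational saving can be made explicit with a short derivation. The sketch below uses notation that is assumed here for illustration: d is the number of SNPs, n the number of observations, λ in (0, 1] a shrinkage intensity, the shrinkage correlation estimator has the form λ I_d + (1 − λ) R, and the empirical correlation matrix is factored as R = U M U^T, where U is a d × n matrix with orthonormal columns and M is a symmetric positive semidefinite n × n matrix.

```latex
\begin{align*}
\mathbf{R}_{\text{shrink}}
  &= \lambda \mathbf{I}_d + (1-\lambda)\, \mathbf{U}\mathbf{M}\mathbf{U}^T
   = \lambda \left( \mathbf{I}_d + \mathbf{U}\mathbf{A}\mathbf{U}^T \right),
  \qquad \mathbf{A} = \frac{1-\lambda}{\lambda}\, \mathbf{M}, \\
\mathbf{R}_{\text{shrink}}^{\alpha}
  &= \lambda^{\alpha} \left( \mathbf{I}_d
     - \mathbf{U} \left( \mathbf{I}_n - (\mathbf{I}_n + \mathbf{A})^{\alpha} \right) \mathbf{U}^T \right).
\end{align*}
```

The second line follows because U has orthonormal columns: on the n-dimensional column space of U the matrix I_d + U A U^T acts like I_n + A, and on the orthogonal complement it acts like the identity. Under these assumptions, a power such as α = −1/2 only requires the matrix power of an n × n matrix, and the product with a vector of marginal correlations can be accumulated from right to left without ever forming a d × d matrix.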

Results and discussion

We now compare the proposed CAR score approach to simultaneous SNP selection with competing methods and determine its effectiveness in finding true causal SNPs.

For this purpose we use the mini-exome data set compiled for the GAW17 workshop held 13-16 October 2010 in Boston (

In order to facilitate replication of our results we provide complete R code

GAW17 unrelated data

The compilation and simulation of phenotypes for the GAW17 mini-exome data set is described in detail in

Preprocessing

In the preprocessing of the sequences we first recoded the alleles in the raw data into 0, 1, 2 assuming an additive effects model. Second, we standardized the data matrix to column mean zero and column variance 1. Subsequently, we removed duplicate predictors so that 15,076 unique SNPs remained. The sets of true causal SNPs for Q1 and Q2 each also contained one duplicate, reducing the numbers of unique true SNPs to 38 and 71, respectively. Finally, we filtered out synonymous SNPs, as we are interested only in non-synonymous mutations. The resulting predictor matrix **X** is of size 697 × 8,020, i.e.
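The genotype preprocessing steps can be sketched in a few lines. This is a minimal illustration under assumptions, not the code used in the study: genotypes are assumed to arrive as two-letter strings such as "AG" for biallelic, polymorphic SNPs, the count refers to the minor allele, the variance uses the n − 1 denominator, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def recode_additive(genotypes):
    """Count copies of the minor allele (0, 1 or 2) in each two-letter genotype."""
    allele_counts = Counter("".join(genotypes))
    # The minor allele is the less frequent one; ties are broken alphabetically.
    minor = min(sorted(allele_counts), key=lambda allele: allele_counts[allele])
    return [genotype.count(minor) for genotype in genotypes]

def standardize(column):
    """Scale a column to mean zero and (sample) variance one."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / (n - 1)
    if variance == 0:
        raise ValueError("a constant column cannot be standardized")
    sd = sqrt(variance)
    return [(x - mean) / sd for x in column]

def drop_duplicate_columns(columns):
    """Keep the first occurrence of each distinct column."""
    seen = set()
    unique = []
    for column in columns:
        key = tuple(column)
        if key not in seen:
            seen.add(key)
            unique.append(column)
    return unique

# Toy data: three SNPs (one list per SNP) observed in four individuals.
raw = [
    ["AA", "AG", "AA", "GG"],
    ["CC", "CT", "CC", "TT"],  # same 0/1/2 pattern as the first SNP
    ["GG", "GG", "GT", "GT"],
]
coded = [recode_additive(snp) for snp in raw]
standardized = [standardize(column) for column in coded]
unique = drop_duplicate_columns(standardized)
print(coded)
print(len(unique), "unique SNP columns remain")
```

Duplicates are detected by exact equality, which works here because identical count columns pass through identical arithmetic.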

For preprocessing the response variables Q1, Q2, and Q4 we removed the influence of the three non-genetic covariates sex, age, and smoking by linear regression. The resulting residuals were standardized to mean zero and variance 1 which yielded
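Removing covariate effects by linear regression and standardizing the residuals can be sketched as follows. This is an illustrative standard-library implementation under assumptions (an intercept is included, the variance uses the n − 1 denominator, the covariate and response values are made up, and the function names are hypothetical); it is not the authors' code.

```python
from math import sqrt

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            factor = m[r][i] / m[i][i]
            for c in range(i, n + 1):
                m[r][c] -= factor * m[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def residualize(y, covariates):
    """Least-squares residuals of y on an intercept plus covariates,
    rescaled to mean zero and variance one."""
    design = [[1.0] + list(row) for row in covariates]
    k = len(design[0])
    # Normal equations (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(design, y)]
    n = len(resid)
    mean = sum(resid) / n
    sd = sqrt(sum((e - mean) ** 2 for e in resid) / (n - 1))
    return [(e - mean) / sd for e in resid]

# Made-up example: (sex, age, smoking) for eight individuals and a phenotype.
covs = [(0, 34, 0), (1, 51, 1), (0, 47, 1), (1, 29, 0),
        (0, 62, 0), (1, 40, 1), (0, 55, 1), (1, 38, 0)]
y = [1.2, 2.9, 2.1, 1.0, 2.4, 2.2, 3.0, 0.7]
z = residualize(y, covs)
print([round(v, 3) for v in z])
```

By construction the adjusted response is uncorrelated with each covariate, so any association found with a SNP afterwards cannot be a linear effect of sex, age, or smoking.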

SNP selection methods included in the comparison study

For each of the

• CAR: variable ranking by shrinkage CAR scores

• NEG: regression with normal exponential gamma (NEG) prior

• MCP: regression with MCP penalty

• BOOST: boosting

• LASSO: lasso regression

The corresponding software implementations are listed in Table

• COR: univariate SNP ranking by marginal correlation, and

• RND: random ordering of all SNPs.

| Method | Software | Reference |
|---|---|---|
| CAR | R package | |
| COR | R package | |
| NEG | | |
| MCP | R package | |
| BOOST | R package | |
| LASSO | R package | |

The R packages are available from the R software archive CRAN at

All methods except CAR and COR combine regularization with variable selection. Thus, for determining model sizes for CAR scores and COR we adaptively estimated a threshold from the data using a local FDR cutoff of 0.5 as recommended in

Generally, all software was run with default settings. The regularization parameters required by the NEG, MCP, BOOST and CAR approaches were set to fixed values optimizing the overall performance of each method. Specifically, for CAR and MCP we employed

Relative performance of investigated methods

The aim of this study is to compare simultaneous SNP selection methods with regard to their ability to discover the true known SNPs. For this purpose we investigated the respective SNP rankings and the corresponding true positives, the size of the selected models, and the variability across the 200 repetitions.
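The evaluation quantities named above are simple to compute once each method returns an ordered list of SNPs per repetition. The following sketch is illustrative only: the SNP identifiers, rankings, and model sizes are made up, the function names are hypothetical, and the "inclusive" quartile definition is an assumption about how the interquartile range is computed.

```python
from statistics import median, quantiles

def true_positives_at(ranking, causal, k):
    """Number of causal SNPs among the k top-ranked SNPs."""
    return sum(1 for snp in ranking[:k] if snp in causal)

def average_tp_curve(rankings, causal, max_k):
    """Average true positives over repetitions for model sizes 1..max_k."""
    return [
        sum(true_positives_at(r, causal, k) for r in rankings) / len(rankings)
        for k in range(1, max_k + 1)
    ]

def median_iqr(model_sizes):
    """Median and interquartile range of the selected model sizes."""
    q1, _, q3 = quantiles(model_sizes, n=4, method="inclusive")
    return median(model_sizes), q3 - q1

# Made-up example: two repetitions, four SNPs, two of them causal.
causal = {"s1", "s4"}
rankings = [["s1", "s2", "s4", "s3"], ["s2", "s1", "s3", "s4"]]
curve = average_tp_curve(rankings, causal, max_k=4)
size_summary = median_iqr([20, 25, 18, 22, 30])
print(curve)
print(size_summary)
```

Evaluating a competing ranking at the same k gives a like-for-like comparison between methods that select models of very different sizes.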

In Figure

**Average true positives resulting from SNP rankings of the investigated approaches for phenotype Q1 (top row) and Q2 (bottom row).** For Q1 there are 38 true SNPs and for Q2 71 true SNPs.

**Results** (method, model size as median (IQR), TP) and **Comparisons** (TP for CAR, COR, and RND). For comparison, the last three columns show the average true positives at the specified model size for CAR, COR and RND. The best performing method is shown in bold, the second best in italic.

| Phenotype | Method | Model size, median (IQR) | TP values |
|---|---|---|---|
| Q1 | CAR | 51 (53) | **5.85**; **5.85**; 0.23 |
| Q1 | COR | 176 (108) | **8.99**; 0.88 |
| Q1 | NEG | 1390 (118) | **17.57**; 14.38; 6.60 |
| Q1 | MCP | 20 (5) | **4.19**; 3.95; 0.12 |
| Q1 | BOOST | 53 (5) | **5.91**; 5.50; 0.25 |
| Q1 | LASSO | 37 (31) | **5.21**; 4.89; 0.18 |
| Q2 | CAR | 31 (38) | **2.93**; **2.93**; 0.29 |
| Q2 | COR | 1 (7) | **0.38**; **0.38**; 0.00 |
| Q2 | NEG | 1632 (755) | 20.21; **28.08**; 14.50 |
| Q2 | MCP | 29 (5) | 2.75; **2.82**; 0.28 |
| Q2 | BOOST | 59 (6) | **4.34**; 3.82; 0.59 |
| Q2 | LASSO | 15 (36) | 1.50; **1.97**; 0.14 |

In Table

| Q4 model size | CAR | COR | NEG | MCP | BOOST | LASSO |
|---|---|---|---|---|---|---|
| Median | 34 | 0 | 1900 | 27 | 59 | 1 |
| IQR | 40 | 1 | 2713 | 4 | 6 | 6 |

In further investigation of these results we identified the actual true SNPs recovered by each SNP selection approach. Specifically, we counted how often each of the 38 (Q1) and 71 (Q2) true causal SNPs was found among the 100 top-ranking SNPs across the 200 repetitions available for each phenotype. The result is shown as a heatmap in Figure

**Frequency of occurrence of each true SNP among the top 100 SNPs selected by each approach for phenotype Q1 (top row) and for Q2 (lower row) for the 200 repetitions.** Note that the SNPs are ordered according to the first column.
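The per-SNP recovery frequencies underlying such a heatmap amount to a simple count over repetitions. The sketch below is a hypothetical illustration with made-up identifiers and rankings and a cutoff of 2 instead of 100 to keep the example small; the function name is not taken from any package.

```python
from collections import Counter

def top_k_frequency(rankings, causal, k=100):
    """For each causal SNP, count in how many repetitions it appears
    among the k top-ranked SNPs."""
    counts = Counter({snp: 0 for snp in causal})
    for ranking in rankings:
        counts.update(snp for snp in ranking[:k] if snp in causal)
    return dict(counts)

# Made-up example: three repetitions, three causal SNPs, cutoff k = 2.
causal = {"snpA", "snpB", "snpC"}
rankings = [
    ["snpA", "snpB", "x1"],
    ["snpB", "x2", "snpA"],
    ["x1", "snpA", "x2"],
]
frequency = top_k_frequency(rankings, causal, k=2)
print(sorted(frequency.items()))
```

Causal SNPs that are never recovered keep a count of zero, so they still appear as a row in the resulting summary.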

In Table

| SNP | Frequency | MAF | BETA | Correlation |
|---|---|---|---|---|
| Q1 | | | | 0.014 |
| ARNT \| C1S6533 | 88 | 0.011478 | 0.56190 | |
| FLT1 \| C13S431 | 110 | 0.017217 | 0.74136 | 0.147 |
| FLT1 \| C13S522 | 200 | 0.027977 | 0.61830 | 0.147 |
| FLT1 \| C13S523 | 200 | 0.066714 | 0.64997 | 0.147 |
| FLT1 \| C13S524 | 164 | 0.004304 | 0.62223 | 0.147 |
| KDR \| C4S1877 | 145 | 0.000717 | 1.07706 | 0.111 |
| KDR \| C4S1878 | 101 | 0.164993 | 0.13573 | 0.111 |
| KDR \| C4S1884 | 95 | 0.020803 | 0.29558 | 0.111 |
| VEGFA \| C6S2981 | 69 | 0.002152 | 1.20645 | |
| VEGFC \| C4S4935 | 91 | 0.000717 | 1.35726 | |
| Q2 | | | | 0.008 |
| BCHE \| C3S4869 | 54 | 0.000717 | 1.01569 | 0.001 |
| BCHE \| C3S4875 | 59 | 0.000717 | 1.09484 | 0.001 |
| LPL \| C8S442 | 69 | 0.015782 | 0.49459 | |
| SIRT1 \| C10S3048 | 54 | 0.002152 | 0.83224 | 0.330 |
| SIRT1 \| C10S3050 | 72 | 0.002152 | 0.97060 | 0.330 |
| VNN1 \| C6S5380 | 138 | 0.170732 | 0.24437 | |
| VNN3 \| C6S5441 | 59 | 0.098278 | 0.27053 | 0.066 |
| VNN3 \| C6S5449 | 57 | 0.010043 | 0.66909 | 0.066 |

The last column shows the average absolute correlation among all SNPs for Q1 and Q2 as well as the average absolute correlation for the SNPs belonging to one gene.

The last column in Table

Finally, in Table

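| Q1: proportion | CAR | COR | NEG | MCP | BOOST | LASSO |
|---|---|---|---|---|---|---|
| Common | 0.56 | 0.71 | 0.63 | 0.74 | 0.71 | 0.73 |
| Rare | 0.44 | 0.29 | 0.37 | 0.26 | 0.29 | 0.27 |

| Q2: proportion | CAR | COR | NEG | MCP | BOOST | LASSO |
|---|---|---|---|---|---|---|
| Common | 0.28 | 0.41 | 0.36 | 0.44 | 0.42 | 0.43 |
| Rare | 0.72 | 0.59 | 0.64 | 0.56 | 0.58 | 0.57 |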

Conclusions

Large scale simultaneous SNP selection is a statistically and computationally very challenging task. To this end, we have introduced here a novel algorithm based on CAR score regression that can be applied effectively in high dimensions. Subsequently, in a comparison study we have investigated five multivariate regression-based SNP selection approaches with regard to their ability to correctly recover causal SNPs and corresponding SNP rankings.

We recommend CAR scores as the overall best method: it was the only approach that consistently outperformed the competing multivariate SNP selection procedures in terms of identified true positives, and also the only approach that uniformly improved over simple univariate ranking by marginal correlation. In addition, we have shown that CAR scores are also successful in detecting rare variants, which have recently been recognized to be important indicators for human disease

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

VZ, PDS, and KS jointly developed the algorithm. VZ performed the analyses. VZ and KS wrote the manuscript. All authors read and approved the manuscript.

Acknowledgements

We thank Peter Ahnert, Arndt Groß, Holger Kirsten, Abdul Nachtigaller, and Markus Scholz for helpful discussion. Part of this research was supported by BMBF grant no. 0315452A (HaematoSys project). The Genetic Analysis Workshop is supported by NIH R01 GM031575. Preparation of the Genetic Analysis Workshop 17 simulated exome data set was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (