Medical Genetics Institute, Cedars-Sinai Medical Center, 8635 West Third Street, Los Angeles, CA 90048, USA

Abstract

To analyze multiple single-nucleotide polymorphisms simultaneously when the number of markers is much larger than the number of studied individuals, as is the case in genome-wide association studies (GWAS), we developed the iterative Bayesian variable selection method and successfully applied it to the simulated rheumatoid arthritis data provided by the Genetic Analysis Workshop 15 (GAW15). One drawback of the iterative Bayesian variable selection method is the relatively long running time required to evaluate GWAS data. To improve computing speed, we recently developed a Bayesian classification with singular value decomposition (BCSVD) method. Here we applied the BCSVD method to the rheumatoid arthritis data distributed as GAW16 Problem 1 and demonstrated that it works well for analyzing GWAS data.

Background

Genome-wide association studies (GWAS) evaluate genetic variants throughout the entire genome with the goal of identifying susceptibility genes for diseases or conditions of interest. While a large number (

Methods

The BCSVD method

The BCSVD method used a binary probit model. Assuming that a latent variable is described by a linear regression model, the binary probit model can be expressed as:

Z = Xβ + ε, ε ~ N(0, I_{n}),

where Z_{n × 1} is a vector of latent variables, X_{n × p} is the design matrix, β_{p × 1} is a vector of parameters to be estimated, and I_{n} is an n × n identity matrix.

where L = FD. The i^{th} SNP effect estimated from the raw data is compared with the i^{th} SNP effect estimated from each shuffled data set. Under the null hypothesis of no effect for the i^{th} SNP, the statistic Λ_{i} follows the standard normal distribution when k is large.
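As a hedged illustration of two ingredients mentioned above, the sketch below (i) simulates binary outcomes from a latent-variable probit model and (ii) converts a statistic that is assumed to be standard normal under the null hypothesis into a two-sided p-value and its -log_{10} value. It does not implement BCSVD itself; the function names, the convention that an outcome is 1 when the latent variable is positive, and the choice of a two-sided test are assumptions made for illustration.

```python
import math
import random


def simulate_probit(X, beta, seed=0):
    """Draw binary outcomes from a latent-variable probit model.

    X:    list of rows (each a list of covariate values).
    beta: list of regression coefficients, one per column of X.
    The latent variable is row . beta + N(0, 1) noise; the outcome is 1
    when the latent variable is positive (assumed convention).
    """
    rng = random.Random(seed)
    outcomes = []
    for row in X:
        latent = sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, 1.0)
        outcomes.append(1 if latent > 0 else 0)
    return outcomes


def normal_stat_to_p(stat):
    """Two-sided p-value and -log10(p) for a standard normal statistic."""
    p = math.erfc(abs(stat) / math.sqrt(2.0))
    # Guard against floating-point underflow for very large statistics.
    neg_log10_p = float("inf") if p == 0.0 else -math.log10(p)
    return p, neg_log10_p


# Hypothetical usage: two covariates, five subjects.
X_demo = [[0, 1], [1, 1], [2, 0], [1, 2], [0, 0]]
y_demo = simulate_probit(X_demo, beta=[0.5, -0.2], seed=1)
p_demo, nlp_demo = normal_stat_to_p(3.0)
```

The second function is the step that turns a per-SNP statistic such as Λ_{i} into the -log_{10} scale typically used for plotting association results.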

Association analysis

The evaluation of the BCSVD method for association analysis was performed in two steps. In the first step, we performed a genome-wide single-SNP association analysis using the logistic regression model option in PLINK. The PLINK results served two purposes: comparison with the results from the BCSVD method, and selection of genomic regions. Although the BCSVD method can be applied to whole-genome association data, computer memory is still a limiting factor. We therefore restricted the BCSVD analysis to chromosomal regions selected from the PLINK results.
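To make the single-SNP step concrete, the sketch below fits a logistic regression of case status on one SNP's allele count by Newton-Raphson and reports a Wald p-value. It is an illustration only, not PLINK's implementation; the function name, the additive 0/1/2 coding, and the toy genotype counts are assumptions.

```python
import math


def single_snp_logistic(dosages, status, n_iter=25):
    """Fit logit P(case) = b0 + b1 * dosage by Newton-Raphson.

    dosages: allele counts (0, 1, 2) per subject (assumed additive coding).
    status:  1 for case, 0 for control.
    Returns (b1, standard error of b1, two-sided Wald p-value).
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(dosages, status):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = mu * (1.0 - mu)
            g0 += y - mu          # score for the intercept
            g1 += (y - mu) * x    # score for the SNP effect
            h00 += w              # entries of the 2 x 2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("SNP is monomorphic or the fit is degenerate")
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Information from the last pass; at convergence it matches the final fit.
    se = math.sqrt(h00 / det)
    p = math.erfc(abs(b1 / se) / math.sqrt(2.0))
    return b1, se, p


# Toy data: 40 subjects per genotype, with 10, 20, and 30 cases respectively.
dosages = [0] * 40 + [1] * 40 + [2] * 40
status = [0] * 30 + [1] * 10 + [0] * 20 + [1] * 20 + [0] * 10 + [1] * 30
b1_hat, se_hat, p_hat = single_snp_logistic(dosages, status)
```

In this toy data the log odds of being a case rise by log(3) per allele, so the fitted slope should be close to that value. Running such a test once per SNP is what makes the first step a single-SNP scan, in contrast to the joint evaluation performed by BCSVD.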

Study sample

We used the whole-genome association data of the North American Rheumatoid Arthritis Consortium (NARAC) in GAW16 Problem 1. There were 2,062 subjects in the study, including 868 cases and 1,194 controls. Quality control on genotype data was performed with PLINK software. We eliminated 133,616 SNPs that failed the following quality control criteria: ^{-5 }for Hardy-Weinberg equilibrium (HWE) test, minor allele frequency <1%, or missing data >10%. As a result, 411,464 SNPs were included in the PLINK association analysis.
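The three quality control filters can be sketched as a per-SNP check. This is a minimal illustration rather than the PLINK procedure: it uses a simple 1-df chi-square test for HWE, which may differ from the test PLINK applies, and the function names, the genotype coding, and the default HWE threshold are assumptions.

```python
import math


def hwe_chisq_p(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(genotypes, maf_min=0.01, missing_max=0.10, hwe_p_min=1e-5):
    """genotypes: list of 0/1/2 allele counts, with None for missing calls.

    hwe_p_min is an assumed default; set it to the threshold actually used.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    if 1.0 - len(called) / len(genotypes) > missing_max:
        return False
    counts = [called.count(k) for k in (0, 1, 2)]
    freq = (counts[1] + 2 * counts[2]) / (2 * len(called))
    if min(freq, 1.0 - freq) < maf_min:
        return False
    return hwe_chisq_p(*counts) >= hwe_p_min


# Hypothetical usage on genotype counts that match HWE proportions exactly.
ok = passes_qc([0] * 49 + [1] * 42 + [2] * 9)
```

A SNP is dropped as soon as any one of the three criteria fails, which mirrors the "or" in the description above.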

Imputation

For BCSVD analysis, we first selected chromosomes that have SNPs with ^{-7}. The best SNP on each chromosome that has the smallest ^{2}) < 0.3 were excluded.
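The chromosome selection rule can be sketched as follows: keep the chromosomes that carry at least one SNP below a p-value threshold and record the best SNP on each. The threshold is a caller-supplied parameter, and the function name, the tuple layout, and the SNP identifiers in the usage example are hypothetical.

```python
def best_snp_per_chromosome(results, p_threshold):
    """Select chromosomes with a SNP below p_threshold.

    results: iterable of (chromosome, snp_id, p_value) tuples from a
             single-SNP scan.
    Returns {chromosome: (snp_id, p_value)} for the smallest p-value on
    each chromosome that passes the threshold.
    """
    best = {}
    for chrom, snp_id, p in results:
        if chrom not in best or p < best[chrom][1]:
            best[chrom] = (snp_id, p)
    return {c: hit for c, hit in best.items() if hit[1] < p_threshold}


# Hypothetical scan output; identifiers and p-values are made up.
scan = [
    (6, "snp_a", 1e-20),
    (6, "snp_b", 1e-9),
    (1, "snp_c", 1e-8),
    (2, "snp_d", 1e-3),
]
selected = best_snp_per_chromosome(scan, p_threshold=1e-7)
```

With these made-up values, chromosomes 6 and 1 are retained and chromosome 2 is dropped.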

BCSVD study sample

Analyzing all selected SNPs simultaneously for 2,062 samples requires more memory than our current computers can provide. We therefore generated two data sets based on the imputed data: one had 1,000 subjects (500 cases and 500 controls) randomly selected from the 868 cases and 1,194 controls; the other had 200 subjects (100 cases and 100 controls) randomly selected from those 1,000 subjects.
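The nested sampling scheme can be sketched as below, assuming subjects are identified by simple labels. The function name, the seed, and the identifier format are hypothetical; only the group sizes come from the text.

```python
import random


def nested_balanced_subsamples(case_ids, control_ids,
                               n_large=500, n_small=100, seed=0):
    """Draw a balanced subsample, then a smaller one nested inside it.

    Returns ((large_cases, large_controls), (small_cases, small_controls)).
    """
    rng = random.Random(seed)
    large_cases = rng.sample(case_ids, n_large)
    large_controls = rng.sample(control_ids, n_large)
    # The smaller set is drawn from the already selected subjects only.
    small_cases = rng.sample(large_cases, n_small)
    small_controls = rng.sample(large_controls, n_small)
    return (large_cases, large_controls), (small_cases, small_controls)


# Hypothetical identifiers for 868 cases and 1,194 controls.
cases = [f"case_{i}" for i in range(868)]
controls = [f"ctrl_{i}" for i in range(1194)]
large, small = nested_balanced_subsamples(cases, controls)
```

Drawing the 200-subject set from within the 1,000-subject set means that differences between the two analyses reflect sample size rather than a different group of subjects.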

Results

Step 1. Single SNP association from GWAS

GWAS analysis results from PLINK were summarized in Figure ^{-7}. The best peak was observed on chromosome 6, followed by chromosomes 1, 17, 5, 20, 9, 18, 4, and 10.

**GWAS analysis results of RA data from PLINK**.

Step 2. Evaluating multiple SNPs simultaneously with the BCSVD method

Nine chromosomes (1, 4, 5, 6, 9, 10, 17, 18, 20) were identified that had SNPs with ^{-7}, we used 8 (all except chromosome 4) in the BCSVD analysis due to time limitation and extensive time required for imputation. A total of 18,728 SNPs, with 2037, 1957, 4804, 1940, 1396, 1581, 2258, and 2755 for chromosome 1, 5, 6, 9, 10, 17, 18 and 20, respectively, were evaluated simultaneously in BCSVD analysis for datasets with 200 and the 1,000 samples. The association results were summarized in Figure _{10}(

**Association analysis results from BCSVD method**. a, BCSVD association analysis results for 1,000 subjects. _{10}(_{10}(

Conclusion

The BCSVD method was applied to the RA case-control data from Problem 1 of GAW16 for 8 selected regions. When we evaluated the association between RA affection status and all SNPs in the selected regions simultaneously using BCSVD, significant associations were detected in all 8 chromosomal regions, and the highest peak was observed on chromosome 6, consistent with the PLINK results. Even though the magnitude of significance [-log_{10}(

List of abbreviations used

BCSVD: Bayesian classification with singular value decomposition; GAW: Genetic Analysis Workshop; GWAS: Genome-wide association studies; HWE: Hardy-Weinberg equilibrium; IBVS: Iterative Bayesian variable selection; NARAC: North American Rheumatoid Arthritis Consortium; RA: Rheumatoid arthritis; SNP: Single-nucleotide polymorphism; SVD: Singular value decomposition

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SK participated in the design and analysis and drafted the manuscript. JC participated in data cleaning and GWAS analysis. SR helped with data analysis and manuscript writing. DT helped to manage the data and analysis. JIR helped to draft the manuscript. XG participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This study was supported partially by grants DK046763, GM008243, HL088457, and the Cedars-Sinai Board of Governors Chair in Medical Genetics.

This article has been published as part of