Institute of Information Science, Academia Sinica, Taipei 115, Taiwan

Department of Computer Science, National Taiwan University, Taipei 106, Taiwan

Department of Psychiatry, Washington University, St. Louis, Missouri, USA

Information Science Institute, University of Southern California, Marina del Rey, California, USA

Department of Genetics, Rutgers University, Piscataway, New Jersey, USA

Abstract

Background

Decades of genome-wide association studies (GWAS) have accumulated large volumes of genomic data that can potentially be reused to increase the statistical power of new studies. However, genotyping platforms with different marker sets have been used as biotechnology has evolved, which prevents old and new data from being pooled and compared. For example, to pool data collected with 550K chips with newer data collected with 900K chips, the missing loci must be imputed. Many imputation algorithms have been developed, but the posterior probabilities estimated by those algorithms are not a reliable measure of the quality of the imputation. Recently, many studies have used an imputation quality score (IQS) to measure imputation quality. However, computing the IQS requires knowing the true alleles. When the true alleles are unknown, a previously estimated IQS can be reused only if the population and the imputed loci are identical.

Methods

Here, we present a regression model to estimate IQS that learns from imputation of loci with known alleles. We designed a small set of features, such as minor allele frequencies, distance to the nearest known cross-over hotspot,

Results

We construct a

Conclusion

Reliable estimation of IQS will facilitate integration and reuse of existing genomic data for meta-analysis and secondary analysis. Experiments show that it is possible to use a small number of features to regress the IQS by learning from different training examples of imputation and IQS pairs.

Background

In the past decade, the data sets collected for genome-wide association studies (GWAS) have grown geometrically. Reusing these valuable data in new studies is difficult because they were collected through different study designs and on different platforms. Various imputation algorithms (

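The IQS for each imputed SNP is computed from two scores, the proportion of observed agreement ($P_o$) and the proportion of chance agreement ($P_c$):

$$P_o = \frac{\sum_{i} n_{ii}}{n_{..}}, \qquad P_c = \frac{\sum_{i} n_{i.}\, n_{.i}}{n_{..}^{2}}, \qquad \mathrm{IQS} = \frac{P_o - P_c}{1 - P_c},$$

where the cell counts $n_{ij}$ and the marginal totals $n_{i.}$, $n_{.i}$ and $n_{..}$ are defined in the table of cross-classified genotypes.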

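Marginal cross classification of the genotypes used for the computation of IQS

| Imputed genotypes | True AA | True AB | True BB | Total |
|---|---|---|---|---|
| **AA** | $n_{11}$ | $n_{12}$ | $n_{13}$ | $n_{1.}$ |
| **AB** | $n_{21}$ | $n_{22}$ | $n_{23}$ | $n_{2.}$ |
| **BB** | $n_{31}$ | $n_{32}$ | $n_{33}$ | $n_{3.}$ |
| **Total** | $n_{.1}$ | $n_{.2}$ | $n_{.3}$ | $n_{..}$ |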
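The computation of the observed agreement, the chance agreement and the IQS from the cross-classification table can be sketched as follows. This is a minimal illustration, assuming the rows index the imputed genotypes and the columns the true genotypes, both ordered AA, AB, BB; the function name `iqs` and the example counts are hypothetical and not taken from the study.

```python
def iqs(table):
    """Return (P_o, P_c, IQS) for a 3x3 table of genotype counts.

    table[i][j] is the number of individuals with imputed genotype i and
    true genotype j, with genotypes ordered AA, AB, BB (an assumed layout).
    """
    n_total = sum(sum(row) for row in table)
    if n_total == 0:
        raise ValueError("the table contains no individuals")

    row_totals = [sum(row) for row in table]                     # n_{i.}
    col_totals = [sum(table[i][j] for i in range(3)) for j in range(3)]  # n_{.j}

    # Observed agreement: fraction of individuals on the diagonal.
    p_o = sum(table[i][i] for i in range(3)) / n_total
    # Chance agreement: agreement expected from the marginals alone.
    p_c = sum(row_totals[i] * col_totals[i] for i in range(3)) / n_total ** 2

    if p_c == 1.0:
        # Monomorphic in both true and imputed data: IQS is undefined.
        return p_o, p_c, float("nan")
    return p_o, p_c, (p_o - p_c) / (1.0 - p_c)


# Hypothetical counts for a common SNP (200 individuals).
common = [[90, 5, 0], [5, 40, 5], [0, 5, 50]]
# Hypothetical rare SNP where every individual is imputed as AA.
rare = [[98, 2, 0], [0, 0, 0], [0, 0, 0]]

print(iqs(common))
print(iqs(rare))
```

For the hypothetical common SNP, the concordance is 0.9 and the IQS is about 0.843. For the rare SNP, the concordance is 0.98 but the IQS is 0, because always imputing the major homozygote agrees with the truth no more often than chance. This is the reason a chance-corrected score is preferred over raw concordance.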

Assessment of $P_o$ and $P_c$

However, exhausting all populations and combinations of imputation loci to establish such a database of all useful IQS values may take considerable resources. Here, we develop a computational method to estimate the IQS without known true genotypes. We assess whether it is possible to build a regression model from imputations of SNP sites with known alleles and then use that model, together with additional statistical information, to estimate the IQS for SNPs with unknown alleles. In practice, researchers work with specific sets of variants, and this method will also facilitate the creation of a database of the IQS of those variants.

Methods and materials

**ε-Support vector regression**

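In a multi-dimensional regression problem, we have a data set of pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \mathbb{R}$ is its target value, and the goal is to learn a function $f: \mathbb{R}^d \to \mathbb{R}$ such that $f(x_i)$ approximates $y_i$.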

Many models and algorithms have been developed to search for the parameters

The parameter _{i }

The

Moreover, the parameter

We chose LibSVM _{o }_{c }
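As a small illustration of the idea behind ε-support vector regression, the sketch below computes the textbook ε-insensitive loss, in which prediction errors smaller than ε are ignored. It is not LibSVM's solver and does not reproduce the settings used in this study; the function name and the value `epsilon=0.1` are assumptions chosen only for the example.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Mean epsilon-insensitive loss over paired targets and predictions.

    Errors with absolute value at most epsilon contribute zero; larger
    errors contribute the amount by which they exceed epsilon.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical IQS targets and predictions for two SNPs.
print(epsilon_insensitive_loss([0.5, 0.8], [0.55, 0.5], epsilon=0.1))
```

In the example, the first prediction falls inside the ε tube and costs nothing, while the second misses by 0.3 and contributes 0.2, giving a mean loss of about 0.1.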

Features generation

Other regression models can also be used, but the key to success is to identify a set of variables that influence imputation quality to serve as the input features $x_i$. The features are as follows.

1. Chromosome position: The chromosome on which the SNP is located.

2. Physical position: The position of the imputed SNP in bp.

3. Minor allele frequency (MAF): Previously,

4. B allele frequency: This is derived from the allele signal intensity measurement for each locus of each individual in the raw CEL files. The raw CEL files are available from the HapMap samples

5. MAF in the reference panel: In addition to using the available MAF provided by the annotation file, we also consider the MAF in the reference panel.

6. Ratio of genotypes AA/AB: It is used to indicate the proportion of genotype AA for each imputed SNP in the reference panel.

7. Ratio of genotypes BB/AB: Similar to feature 6, it is used to indicate the proportion of genotype BB for each imputed SNP in the reference panel.

8. Distance to the nearest genotyped SNP: This captures the expectation that imputation quality is better when the nearest genotyped SNP in the inference panel is closer.

9. Distance to the nearest recombination hotspot: The distance to the nearest recombination hotspot also plays an important role in the quality of the imputation. We used the recombination rates and hotspots available in the release version phase II build b35 to GRCh37 from the International HapMap Project

10. The nearest recombination hotspot's recombination rate (cM/Mb, centiMorgans per megabase): This variable is important in the imputation process. The IMPUTE2 program uses it explicitly as a required input for the imputation

11. Posterior probability estimated by the imputation program: This variable is available from the output of the imputation program. The Beagle program provides the genotype probabilities file and the genotype dosage file. We used the mean values of the posterior probabilities estimated for all the individuals in the inference panel.

12. B-allele dosage: Given the posterior genotype probabilities for a SNP (Pr(

It is worth mentioning that, in a statistical correlation analysis, the posterior probability estimated by the imputation program and the B-allele dosage are both highly correlated with the IQS. These features are used in the regression model for the IQS as well as in the regression models for the observed agreement $P_o$ and the chance agreement $P_c$.
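A few of the features listed above can be sketched in code. This is a minimal illustration under stated assumptions: the B-allele dosage uses the standard definition Pr(AB) + 2 Pr(BB); the per-SNP posterior probability is taken as the mean, over individuals, of the largest genotype probability (the text does not specify which probability is averaged); and all function names and example values are hypothetical.

```python
from bisect import bisect_left


def b_allele_dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles given posterior genotype probabilities."""
    return p_ab + 2.0 * p_bb


def mean_dosage(probs):
    """Mean B-allele dosage over individuals; probs is a list of (AA, AB, BB)."""
    return sum(b_allele_dosage(*p) for p in probs) / len(probs)


def mean_max_posterior(probs):
    """Mean of the best-guess genotype probability over individuals (assumed)."""
    return sum(max(p) for p in probs) / len(probs)


def distance_to_nearest(position, sorted_positions):
    """Distance in bp from position to the closest entry of a sorted list.

    The same helper can serve for genotyped SNPs or recombination hotspots
    on the same chromosome.
    """
    if not sorted_positions:
        raise ValueError("no reference positions given")
    i = bisect_left(sorted_positions, position)
    candidates = []
    if i < len(sorted_positions):
        candidates.append(sorted_positions[i] - position)
    if i > 0:
        candidates.append(position - sorted_positions[i - 1])
    return min(candidates)


# Hypothetical posterior probabilities for two individuals at one SNP.
probs = [(0.9, 0.1, 0.0), (0.0, 0.2, 0.8)]
print(mean_dosage(probs), mean_max_posterior(probs))
# Hypothetical genotyped SNP positions (bp) on one chromosome.
print(distance_to_nearest(150, [100, 180, 400]))
```

Keeping the reference positions sorted allows each lookup to run in logarithmic time, which matters when distances are computed for hundreds of thousands of imputed SNPs.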

Data preparation

We prepared three data sets to evaluate the performance of our regression models. These data sets contain genotyping results of samples chosen to cover different ethnic backgrounds and collected in different disease studies. We selected recent data sets genotyped with advanced platforms that cover a large number of SNPs, so that we can flexibly keep the SNPs covered by older, obsolete platforms (which probe fewer SNPs) and hold out the rest for imputation. Because the true genotypes of the held-out SNPs are known, we can use them as the gold standard to evaluate imputation quality and regression.

The Merlion Lung Cancer Study 2 DNA

Regression performance evaluation

We designed scenarios to simulate the imputation of missing SNPs in a data set genotyped using an old platform to the large set of SNPs on the Affymetrix SNP 6.0 array. These scenarios involve a

To create both training and test sets, we divided the SNPs on the Affymetrix SNP 6.0 array into two sets. The first contains the SNPs genotyped on both an older platform and the Affymetrix SNP 6.0 array. This set simulates SNPs with "known" genotypes that are used to impute other SNPs. The second contains the remaining SNPs, which are covered only by the Affymetrix SNP 6.0 array. This set simulates the "missing" SNPs to be imputed.
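The split described above amounts to two set operations on SNP identifiers. The sketch below is only an illustration; the function name and the rsID-style identifiers are hypothetical.

```python
def split_snps(dense_platform_snps, old_platform_snps):
    """Split the SNPs of a dense platform into 'known' and 'missing' sets.

    known:   SNPs present on both the old and the dense platform; their
             genotypes are kept and used as input to imputation.
    missing: SNPs present only on the dense platform; their genotypes are
             held out, imputed, and later compared with the truth.
    """
    dense = set(dense_platform_snps)
    known = dense & set(old_platform_snps)
    missing = dense - known
    return known, missing


# Hypothetical SNP identifiers.
dense = ["rs1", "rs2", "rs3", "rs4"]
old = ["rs2", "rs4", "rs9"]
known, missing = split_snps(dense, old)
print(sorted(known), sorted(missing))
```

SNPs that appear only on the old platform (`rs9` in the example) fall into neither set, since no true genotype from the dense platform is available for them.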

Table

Summary of training set composition for different evaluation scenarios

| Scenario | Ethnic population | Samples | From platform | To platform |
|---|---|---|---|---|
| Scenario 1 | Western European | Lung cancer | Affymetrix 500k | Affymetrix 500k |
| Scenario 2 | Western European | Lung cancer | Illumina 550k | Illumina 550k |
| Scenario 3 | East Asian | Lung cancer | Affymetrix 500k | Affymetrix 500k |
| Scenario 4 | East Asian | Lung cancer | Affymetrix 500k | Affymetrix 500k |

Summary of test set composition for different evaluation scenarios

| Scenario | Ethnic population | Samples | From platform | To platform |
|---|---|---|---|---|
| Scenario 1 | Western European | Lung cancer | Affymetrix 500k | **Affymetrix SNP 6.0** |
| Scenario 2 | Western European | Lung cancer | **Affymetrix 500k** | **Affymetrix SNP 6.0** |
| Scenario 3 | **Western European** | Lung cancer | Affymetrix 500k | **Affymetrix SNP 6.0** |
| Scenario 4 | East Asian | **Oral cancer** | Affymetrix 500k | **Affymetrix SNP 6.0** |

Highlighted fields are the settings that differ from the training set used in the corresponding scenario.

In Scenario 2, we evaluated the generalization performance of our IQS regression model when it was trained using "known" and "missing" SNPs covered by platforms different from those used in testing. We used the WE lung cancer sample again but used the Illumina 550k array instead of the Affymetrix SNP 6.0 array to choose SNPs. There are 41,304 SNPs of the WE lung cancer sample on the Illumina 550k array. After the regression model was constructed, we applied it to the same test set created in Scenario 1.

In Scenario 3, our IQS regression model was applied across different ethnic populations. We used the EA lung cancer sample to create the training set, resulting in 37,611 SNPs of the EA lung cancer sample on the Affymetrix 500k array. The regression model constructed from the EA lung cancer samples was used to predict the IQS of SNPs of the WE lung cancer samples, as in Scenario 1.

Scenario 4 tests whether our regression model can be generalized across samples collected for different disease studies. We used the same training set as in Scenario 3 and used the EA Oral Squamous Cell Carcinoma sample as the test set. This test set also simulates imputation from the Affymetrix mapping 500k array to the Affymetrix SNP 6.0 array and consists of 320,172 SNPs.

For all scenarios, we chose the imputation program Beagle. Beagle is based on the Hidden Markov Model (HMM)

The 1000 Genomes Project samples (August 2010 release) served as the reference panel. As larger reference panels have become available, researchers have become more confident in combining two studies or extending a specific study across different platforms

Results and discussion

Table

Summary of the IQS regression results for each scenario

| Scenario | Mean squared error | Correlation coefficient |
|---|---|---|
| Scenario 1 | 0.0182 | 0.740 |
| Scenario 2 | 0.0174 | 0.748 |
| Scenario 3 | 0.0178 | 0.736 |
| Scenario 4 | 0.0197 | 0.751 |


**IQS regression results.** (A) Scenario 1, evaluating the regression result on the same platform. (B) Scenario 2, evaluating the regression result on different platforms. (C) Scenario 3, evaluating the regression result on a different ethnic population. (D) Scenario 4, evaluating the regression result on samples from a different disease study within the same ethnic population.

The lowest mean squared error (0.0174) was obtained in Scenario 2, where the regression model was trained with SNPs derived from a platform different from that of the test set, suggesting that training with a wider variety of SNPs might allow the model to generalize better. The highest mean squared error (0.0197) was obtained in Scenario 4, where the test samples came from a study of a different disease, although this scenario also had the highest correlation coefficient (0.751). Overall, the performance differences across the scenarios were small.
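The two metrics reported for each scenario can be computed as in the following sketch. It assumes that the reported correlation coefficient is Pearson's r between the true and the predicted scores, which the text does not state explicitly; the function names and the example values are hypothetical.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical true and predicted IQS values for three SNPs.
true_iqs = [0.2, 0.5, 0.9]
pred_iqs = [0.3, 0.5, 0.8]
print(mean_squared_error(true_iqs, pred_iqs), pearson_r(true_iqs, pred_iqs))
```

The two metrics answer different questions: the mean squared error measures how far the predicted scores are from the true ones, while the correlation measures how well the predictions preserve the ranking of SNPs. This is why one scenario can have the largest error and still the largest correlation.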

Tables _{o }_{c}_{c }_{c }_{o }_{c }

Summary of the $P_o$ regression results for each scenario

| Scenario | Mean squared error | Correlation coefficient |
|---|---|---|
| Scenario 1 | 0.00248 | 0.840 |
| Scenario 2 | 0.00249 | 0.838 |
| Scenario 3 | 0.00256 | 0.835 |
| Scenario 4 | 0.00301 | 0.831 |

Summary of the $P_c$ regression results for each scenario

| Scenario | Mean squared error | Correlation coefficient |
|---|---|---|
| Scenario 1 | 0.00062 | 0.990 |
| Scenario 2 | 0.00072 | 0.988 |
| Scenario 3 | 0.00071 | 0.989 |
| Scenario 4 | 0.00099 | 0.984 |

**$P_o$ regression results.** (A) Scenario 1, evaluating the regression result on the same platform. (B) Scenario 2, evaluating the regression result on a different platform. (C) Scenario 3, evaluating the regression result on a different ethnic population. (D) Scenario 4, evaluating the regression result on samples from a different disease study within the same ethnic population.

**$P_c$ regression results.** (A) Scenario 1, evaluating the regression result on the same platform. (B) Scenario 2, evaluating the regression result on a different platform. (C) Scenario 3, evaluating the regression result on a different ethnic population. (D) Scenario 4, evaluating the regression result on samples from a different disease study within the same ethnic population.

We also performed a test to evaluate whether we can use the regression results to filter out false positives in a GWAS. Previously, _{o}

ROC curves at threshold = 0.5. (A) Scenario 1: AUC (predicted IQS) 0.9617, AUC (true imputation accuracy) 0.9718, AUC (predicted imputation accuracy) 0.9354. (B) Scenario 2: AUC (predicted IQS) 0.9739, AUC (true imputation accuracy) 0.9783, AUC (predicted imputation accuracy) 0.9539. (C) Scenario 3: AUC (predicted IQS) 0.9642, AUC (true imputation accuracy) 0.9677, AUC (predicted imputation accuracy) 0.9072. (D) Scenario 4: AUC (predicted IQS) 0.9656, AUC (true imputation accuracy) 0.9758, AUC (predicted imputation accuracy) 0.9223.


ROC curves at threshold = 0.9. (A) Scenario 1: AUC (predicted IQS) 0.8269, AUC (true imputation accuracy) 0.9883, AUC (predicted imputation accuracy) 0.8041. (B) Scenario 2: AUC (predicted IQS) 0.8082, AUC (true imputation accuracy) 0.9848, AUC (predicted imputation accuracy) 0.8030. (C) Scenario 3: AUC (predicted IQS) 0.8230, AUC (true imputation accuracy) 0.9892, AUC (predicted imputation accuracy) 0.7890. (D) Scenario 4: AUC (predicted IQS) 0.8620, AUC (true imputation accuracy) 0.9967, AUC (predicted imputation accuracy) 0.8399.

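One way to obtain an AUC at a given threshold is sketched below. It rests on an assumption that the surviving text does not confirm: a SNP is labeled positive when its true IQS is at or above the threshold, and a score (for example the predicted IQS) is used to rank the SNPs. The AUC is computed as the probability that a positive SNP receives a higher score than a negative one, with ties counted as one half. All names and values are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    This quadratic-time form is meant for illustration; a rank-based
    implementation would be preferable for hundreds of thousands of SNPs.
    """
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_at_threshold(true_iqs, scores, threshold):
    """Label SNPs by true IQS >= threshold (assumed), then score the ranking."""
    labels = [t >= threshold for t in true_iqs]
    return auc(labels, scores)


# Hypothetical true and predicted IQS values for four SNPs.
true_iqs = [0.95, 0.30, 0.70, 0.10]
pred_iqs = [0.90, 0.70, 0.60, 0.50]
print(auc_at_threshold(true_iqs, pred_iqs, 0.5))
print(auc_at_threshold(true_iqs, pred_iqs, 0.9))
```

Raising the threshold shrinks the positive class, so a score that separates well-imputed from poorly imputed SNPs at 0.5 may separate them less well at 0.9, as the AUC values in the captions above indicate.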

Conclusion

We propose a

Our future work includes an effort to extend the feature set to improve the regression performance for predicting $P_o$

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YHH and CNH developed methods and designed the experiments. YHH and CNH drafted the manuscript. JPR, SFS, JLA, YA and JAT participated in the design of the study. JLA and JAT helped to revise the manuscript. CNH was responsible for all aspects of the project.

Acknowledgements

This work was supported in part by NIMH/NIH Grant Number MH068457 (CGSMD).

This article has been published as part of