Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA

Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA

Regenstrief Institute, Indianapolis, Indiana 46202, USA

Division of Clinical Pharmacology, Department of Medicine, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA

Abstract

Background

Genome-wide association studies (GWAS) have identified hundreds of genetic variants associated with complex human diseases, clinical conditions and traits. Genetic mapping of expression quantitative trait loci (eQTLs) is revealing novel functional effects of thousands of single nucleotide polymorphisms (SNPs). In a classical quantitative trait loci (QTL) mapping problem, multiple tests are done to assess whether one trait is associated with a number of loci. In contrast to QTL studies, thousands of traits are measured along with thousands of gene expressions in an eQTL study. For such a study, a huge number of tests have to be performed (

Results

The results show that parametric empirical Bayes (PEB) has an edge over nonparametric empirical Bayes (NPEB). The proposed methodology has been applied to human liver cohort (HLC) data. Our method discovers more significant SNPs at FDR<10% than the previous study by Yang et al. (

Conclusions

In contrast to previously available methods based on p-values, the empirical Bayes method uses the local false discovery rate (lfdr) as the threshold. This method controls the false positive rate.

Introduction

Genome-wide association studies (GWASs) have made remarkable progress in the search for susceptibility genes. In a GWAS, instead of one gene at a time, variation across the entire genome is tested for association with disease risk. GWASs exploit the linkage disequilibrium (LD) relationships among single nucleotide polymorphisms (SNPs), making it possible to assay the genome by testing a finite number of SNPs. To date, the signals that can be discovered through GWAS have not been reported exhaustively. It is important to annotate SNPs with expression information for a better understanding of the genes and mechanisms driving the association. In many situations, there are more common variants truly associated with disease. These variants are highly likely to be expression quantitative trait loci (eQTLs). eQTLs are derived from polymorphisms in the genome that result in measurably different transcript levels. Microarrays are used to measure gene expression levels across genetic mapping populations. For at least a subset of complex disorders, gene expression levels could be used as a surrogate or biomarker for classical phenotypes. The gene underlying an eQTL is considered an excellent candidate for a phenotypic QTL.

eQTL mapping is a statistical technique for locating genomic intervals that are likely to regulate the expression of each transcript, by correlating quantitative measurements of mRNA expression with genetic polymorphisms segregating in a population. In a GWAS, millions of SNPs are tested at once. Associations that initially appear significant must be statistically adjusted to account for the large number of tests performed. Ignoring this correction results in a large number of false positives. The multiple-testing correction, however, sets a very high threshold for genome-wide significance, on the order of

Two closely related inferential procedures for multiple testing are discussed in this work: a frequentist approach based on Benjamini and Hochberg's (

One of the fundamental statistical problems in microarray gene expression analysis is the need to reduce the dimensionality of the transcripts. This can be achieved by identifying genes that are differentially expressed (DE) under different conditions or groups. A regulatory network can be obtained by associating differential expression with the genotypes of molecular markers. A large number of DE genes may influence a certain phenotype even though their relative proportion is very small. It is very important to identify these DE genes from among the recorded genes

The development of the empirical Bayes methodologies that improve the power to detect DE genes essentially reduces to the choice of whether gene-specific effects should be modeled as fixed or random

The paper is organized as follows. In the Methods section we introduce the notation for our additive genetic model along with the notion of false discovery rate (fdr), and we establish the relationship between fdr and empirical Bayes. The Methods section also describes the proposed Expectation/Conditional Maximization Either (ECME) (Liu and Rubin

Methods

In a microarray experiment, we obtain several thousand expression values, one or many for each gene. These experiments offer an unprecedented ability to study gene expression on a large scale. Let us define _{i}i _{j}

where ^{th }percentile of all

When expression measurements between two groups are compared for any transcript, the observations are partitioned into two user-defined groups of sizes

where _{i }_{ij }

where _{ij }

For any transcript and any SNP there are three possible relations: no association, positive association and negative association. Extending the idea of the two-component mixture model, the distribution of the test statistics is modeled by the following mixture model:

where

with

Full Bayesian analysis of (4) will require prior specifications of

Empirical Bayes, false discovery rates (fdr) and local false discovery rate (lfdr)

False discovery rate (fdr) is defined as the expected proportion of rejected null hypotheses that are rejected falsely. Benjamini and Hochberg's

The empirical Bayes approach suggests a local version of the fdr called local false discovery rate (

Analytically,

For the above set up in (3),

and hence

All other parameters will be estimated by EM algorithm assuming

Nonparametric empirical Bayes (NPEB)

The main difference between parametric empirical Bayes (PEB) and nonparametric empirical Bayes (NPEB) is the way in which

ECME algorithm

The EM algorithm is widely used to fit mixture models. In case of

For the

where

and

McLachlan and Krishnan

then marginally,

Following the above definition, the complete data likelihood

where

and

E-Step

To compute the E-step of the proposed algorithm, at the (t+1)th step we need to calculate

where

and

which is the posterior probability that

Similarly,

where

CM-step

In the usual M-step, parameters

and

To get an efficient algorithm, let us partition

**CM-Step 1**. Keeping

**CM-Step 2**. Now fix

Furthermore, to make the algorithm more efficient, after the first CM-step, we replace the E-step with

Simulation study

To assess the proposed methodology, a small-sample simulation study has been performed. It indicates whether the parameters are well estimated and, most importantly, provides information on the false discovery rates.

First we simulated a dominant model with 10,000 transcripts and 10 SNPs. The equivalently expressed (EE) transcripts are generated from N(0,1) after log-transformation. We have simulated the data under three choices of proportions of differentially expressed (DE) transcripts (

**A part of the simulated data for**

The impact of minor allele frequency (MAF) on the distributions under the null has also been studied. Under the null, for a t-distribution, the only parameter to be estimated is its degrees of freedom. The comparison has been made by computing different quantiles for six choices of MAF. The lower quantiles almost overlap with each other. Very small deviations are observed for the upper quantiles (Figure

**Effect of minor allele frequency (MAF) on the null distribution**. Only the upper quantiles (from 80%) are shown, because the lower quantiles show almost no difference.
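The quantile comparison described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: dominant coding with carrier frequency 1 - (1 - MAF)^2 under Hardy-Weinberg equilibrium, N(0,1) expression under the null, a pooled-variance two-sample t-statistic, and six hypothetical MAF values. The function name, sample size and number of transcripts are hypothetical as well.

```python
import numpy as np


def null_t_upper_quantiles(maf, n_samples=200, n_transcripts=5000,
                           probs=(0.80, 0.90, 0.95, 0.99), seed=0):
    """Upper quantiles of null two-sample t-statistics for one MAF value."""
    rng = np.random.default_rng(seed)
    # Assumption: dominant model under Hardy-Weinberg equilibrium, so a
    # sample is a carrier with probability 1 - (1 - maf)^2.
    p_carrier = 1.0 - (1.0 - maf) ** 2
    # Use the expected carrier count, keeping at least 2 samples per group.
    n_carrier = int(round(p_carrier * n_samples))
    n_carrier = min(max(n_carrier, 2), n_samples - 2)

    # Null expression: no transcript depends on the genotype.
    expr = rng.standard_normal((n_transcripts, n_samples))
    a, b = expr[:, :n_carrier], expr[:, n_carrier:]
    n1, n2 = a.shape[1], b.shape[1]
    pooled_var = ((n1 - 1) * a.var(axis=1, ddof=1)
                  + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    t = (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(
        pooled_var * (1.0 / n1 + 1.0 / n2))
    return np.quantile(t, probs)


if __name__ == "__main__":
    # Hypothetical MAF grid; the six values used in the study are not given here.
    for maf in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
        print(maf, np.round(null_t_upper_quantiles(maf), 3))
```

Under these assumptions the pooled-variance statistic follows a t-distribution with n_samples - 2 degrees of freedom whatever the group sizes, so the quantiles should differ across MAF values only by simulation noise.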

For the 10 SNPs, we fitted the null distribution using a balanced permutation method. From each group, 35 randomly selected samples are moved to the other group and the value of the statistic is recorded. This process is repeated 40 times and histograms are plotted. From the histograms, the degrees of freedom of the null distribution for each SNP are estimated. To assess the goodness of fit, Q-Q plots are drawn (Figure

**QQ-plot for eight SNPs**.
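The balanced permutation scheme for one SNP can be sketched as below. This is a minimal sketch under assumptions, not the authors' implementation: the statistic is taken to be a pooled-variance two-sample t-statistic, the degrees of freedom are chosen by maximum likelihood over an integer grid rather than read off a histogram, and all function names are hypothetical. `expr` is a transcripts-by-samples array and `in_group1` a boolean array marking the samples in the first genotype group.

```python
import math

import numpy as np


def pooled_t(expr, in_group1):
    """Pooled-variance two-sample t-statistic for every transcript (row)."""
    a, b = expr[:, in_group1], expr[:, ~in_group1]
    n1, n2 = a.shape[1], b.shape[1]
    pooled_var = ((n1 - 1) * a.var(axis=1, ddof=1)
                  + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(
        pooled_var * (1.0 / n1 + 1.0 / n2))


def balanced_permutation_null(expr, in_group1, n_swap=35, n_perm=40, seed=0):
    """Swap n_swap samples from each group into the other, n_perm times."""
    rng = np.random.default_rng(seed)
    idx1 = np.flatnonzero(in_group1)
    idx2 = np.flatnonzero(~in_group1)
    draws = []
    for _ in range(n_perm):
        labels = in_group1.copy()
        # Group sizes are unchanged because the same number moves each way.
        labels[rng.choice(idx1, size=n_swap, replace=False)] = False
        labels[rng.choice(idx2, size=n_swap, replace=False)] = True
        draws.append(pooled_t(expr, labels))
    return np.concatenate(draws)


def t_loglik(x, df):
    """Log-likelihood of a standard t-distribution with df degrees of freedom."""
    const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return x.size * const - 0.5 * (df + 1) * np.log1p(x ** 2 / df).sum()


def fit_null_df(null_stats, candidates=range(2, 201)):
    """Grid maximum-likelihood estimate of the null degrees of freedom."""
    return max(candidates, key=lambda df: t_loglik(null_stats, df))
```

Calling `fit_null_df(balanced_permutation_null(expr, in_group1))` returns one estimated degrees-of-freedom value for that SNP. Both groups must contain at least `n_swap` samples.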

Parameters of the mixture model (4) are estimated using the proposed ECME algorithm after estimating the null distribution by the permutation method. FDR is then computed under both the proposed parametric empirical Bayes and the nonparametric empirical Bayes setups, and the results are given in Table

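The True FDR Performance of Controlled FDR in EB Models

| **True fraction of DE** | **NPEB, controlled FDR 0.01** | **NPEB, controlled FDR 0.05** | **NPEB, controlled FDR 0.10** | **PEB, controlled FDR 0.01** | **PEB, controlled FDR 0.05** | **PEB, controlled FDR 0.10** |
|---|---|---|---|---|---|---|
| 0.01 | 0.004 | 0.029 | 0.067 | 0.005 | 0.042 | 0.090 |
| 0.05 | 0.006 | 0.041 | 0.079 | 0.006 | 0.045 | 0.094 |
| 0.10 | 0.007 | 0.043 | 0.087 | 0.008 | 0.047 | 0.097 |

NPEB: nonparametric empirical Bayes; PEB: parametric empirical Bayes. Entries are the true FDR at the given controlled FDR level.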

It is evident from the above table that nonparametric empirical Bayes is much more conservative than its parametric alternative. In the parametric setup, the true FDR is very close to the controlled one, whereas for nonparametric empirical Bayes the gap is larger, particularly when the true fraction of DE transcripts is small.
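The comparison of true and controlled FDR can be illustrated with a small simulation. The sketch below rests on simplifying assumptions that differ from the study: a two-group normal model with a known shift stands in for the three-component mixture, and the lfdr is computed from the true densities rather than estimated by ECME. The selection rule (take the largest set whose average lfdr does not exceed the target) and all function names are assumptions of this sketch.

```python
import numpy as np


def select_by_average_lfdr(lfdr, target):
    """Select the largest set of tests whose mean lfdr is <= target."""
    lfdr = np.asarray(lfdr, dtype=float)
    order = np.argsort(lfdr)
    # Running mean of the sorted lfdr values; nondecreasing by construction.
    running_mean = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
    k = int(np.searchsorted(running_mean, target, side="right"))
    selected = np.zeros(lfdr.size, dtype=bool)
    selected[order[:k]] = True
    return selected


def realized_fdr(selected, is_null):
    """Fraction of selected tests that are truly null (0 if none selected)."""
    selected = np.asarray(selected, dtype=bool)
    if not selected.any():
        return 0.0
    return float(np.asarray(is_null, dtype=bool)[selected].mean())


def simulate_two_group(n=10000, frac_de=0.10, shift=3.0, seed=0):
    """Simulate z-scores and their exact lfdr under a two-group normal model."""
    rng = np.random.default_rng(seed)
    is_null = rng.random(n) >= frac_de
    z = rng.standard_normal(n) + np.where(is_null, 0.0, shift)
    # Normal kernels; the common 1/sqrt(2*pi) factor cancels in the ratio.
    f0 = np.exp(-0.5 * z ** 2)
    f1 = np.exp(-0.5 * (z - shift) ** 2)
    lfdr = (1 - frac_de) * f0 / ((1 - frac_de) * f0 + frac_de * f1)
    return lfdr, is_null


if __name__ == "__main__":
    lfdr, is_null = simulate_two_group()
    for target in (0.01, 0.05, 0.10):
        sel = select_by_average_lfdr(lfdr, target)
        print(target, int(sel.sum()), round(realized_fdr(sel, is_null), 3))
```

Because the lfdr here is exact, the realized FDR should track the target closely. With an estimated lfdr, as in the table above, the gap reflects estimation error in the fitted mixture.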

HLC data analysis

We applied the empirical Bayes model to a publicly available sequencing data set. In the current study, we started with liver tissue data from 213 Caucasian samples from a previously described human liver cohort (HLC) (Yang et al. ^{-5},) we are left with 173 samples, 472,000 SNPs and 30,000 expressions.

The distribution of minor allele frequency (MAF) over SNPs is given in the histogram (Figure

**Minor allele frequency (MAF) distribution**. The x-axis covers minor allele frequencies from 25% to 50%.

Conclusion

To compare our result with

Number of eQTL pairs after crossing the threshold of FDR

| **Gene symbol** | **No. of SNPs (FDR<10%)** | **No. of cis-SNP** | **No. of cis-eSNP (FDR<10%) by Yang et al. (2010)** |
|---|---|---|---|
| CYP3A5 | 263 | 62 | 56 |
| CYP2D6 | 264 | 67 | 54 |
| CYP4F12 | 392 | 55 | 46 |
| CYP2E1 | 130 | 45 | 31 |
| CYP2U1 | 549 | 45 | 26 |
| CYP1B1 | 168 | 21 | 13 |
| CYP2C18 | 90 | 13 | 9 |
| CYP4F11 | 169 | 15 | 7 |
| CYP4V2 | 159 | 25 | 3 |
| CYP2F1 | 324 | 10 | 2 |
| CYP39A1 | 448 | 17 | 2 |
| CYP26C1 | 154 | 29 | 1 |
| CYP2C19 | 356 | 7 | 1 |
| CYP2C9 | 413 | 20 | 1 |
| CYP2S1 | 319 | 10 | 1 |
| CYP46A1 | 430 | 7 | 1 |
| CYP4A11 | 461 | 4 | 1 |
| CYP4X1 | 151 | 3 | 1 |

Discussion

In contrast to previously available methods based on p-values, the empirical Bayes method uses the local false discovery rate (lfdr) as the threshold. This method controls the false positive rate. For a particular SNP, the lfdr is computed from the site-specific evidence, whereas the FDR averages over other sites with stronger evidence. FDR has limitations that may lead to misleading inferences in genome studies. In such situations it is better to use lfdr, although it is somewhat harder to estimate than FDR. One computational problem still needs attention: because of the high dimensionality of the data, existing algorithms sometimes fail, which calls for more efficient algorithms. The choice of the FDR threshold is an important deciding factor in such studies. It would be interesting to see how the number of cis-SNPs varies with the FDR threshold. In this way, the FDR criterion can be used to estimate the number of SNPs that need to be considered.
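The contrast between site-specific and averaged evidence can be made explicit in the standard two-group formulation. The notation below is assumed for illustration and is not taken from this paper's equations: \pi_0 is the null proportion, f_0 and F_0 are the null density and distribution function, f and F are those of the mixture, and evidence is taken to increase with z (right tail).

```latex
\mathrm{lfdr}(z) = \frac{\pi_0 f_0(z)}{f(z)}, \qquad
\mathrm{Fdr}(z) = \frac{\pi_0\,[1 - F_0(z)]}{1 - F(z)}
= \frac{\int_z^{\infty} \mathrm{lfdr}(u)\, f(u)\, du}{\int_z^{\infty} f(u)\, du}
= E\left[\mathrm{lfdr}(Z) \mid Z \ge z\right].
```

In this form the tail-area Fdr at z is the average of the lfdr over all sites with evidence at least as strong as z. When the lfdr decreases with z, a site at the selection boundary therefore has an lfdr at least as large as the Fdr of the selected set.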

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work is supported by the U.S. National Institutes of Health grants R01 GM74217 (Lang Li) and AHRQ Grant R01HS019818-01 (Malaz Boustani).

Declarations

The publication costs were funded by the authors through P50 CA113001 (Huang, T.M.), R01 GM088076 (Skaar, T.), R01 HS019818 (Dexter).

This article has been published as part of