Email updates

Keep up to date with the latest news and content from BMC Genomics and BioMed Central.

This article is part of the supplement: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2013): Genomics

Open Access Research

Simultaneous inferences based on empirical Bayes methods and false discovery rates ineQTL data analysis

Arindom Chakraborty12, Guanglong Jiang12, Malaz Boustani3, Yunlong Liu12, Todd Skaar4 and Lang Li124*

Author Affiliations

1 Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA

2 Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA

3 Regenstrief Institute, Indianapolis, Indiana 46202, USA

4 Division of Clinical Pharmacology, Department of Medicine, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA

For all author emails, please log on.

BMC Genomics 2013, 14(Suppl 8):S8  doi:10.1186/1471-2164-14-S8-S8

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/14/S8/S8


Published:9 December 2013

© 2013 Chakraborty et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background

Genome-wide association studies (GWAS) have identified hundreds of genetic variants associated with complex human diseases, clinical conditions and traits. Genetic mapping of expression quantitative trait loci (eQTLs) is providing us with novel functional effects of thousands of single nucleotide polymorphisms (SNPs). In a classical quantitative trail loci (QTL) mapping problem multiple tests are done to assess whether one trait is associated with a number of loci. In contrast to QTL studies, thousands of traits are measured alongwith thousands of gene expressions in an eQTL study. For such a study, a huge number of tests have to be performed (<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M1">View MathML</a>). This extreme multiplicity gives rise to many computational and statistical problems. In this paper we have tried to address these issues using two closely related inferential approaches: an empirical Bayes method that bears the Bayesian flavor without having much a priori knowledge and the frequentist method of false discovery rates. A three-component t-mixture model has been used for the parametric empirical Bayes (PEB) method. Inferences have been obtained using Expectation/Conditional Maximization Either (ECME) algorithm. A simulation study has also been performed and has been compared with a nonparametric empirical Bayes (NPEB) alternative.

Results

The results show that PEB has an edge over NPEB. The proposed methodology has been applied to human liver cohort (LHC) data. Our method enables to discover more significant SNPs with FDR<10% compared to the previous study done by Yang et al. (Genome Research, 2010).

Conclusions

In contrast to previously available methods based on p-values, the empirical Bayes method uses local false discovery rate (lfdr) as the threshold. This method controls false positive rate.

Introduction

Genome-wide association studies (GWASs) have done a remarkable progress in searching for susceptibility genes. In GWAS, instead of one gene at a time, variation across the entire genome is tested for association with disease risk. GWASs exploit the linkage disequilibrium (LD) relationships among single nucleotide polymorphisms (SNPs), making it possible to assay genome by testing a finite number of SNPs. Till date, the signals that can be discovered through GWAS has not been reported exhaustively. It is important to annotate SNPs information on expression for the better understanding of the genes and mechanisms driving the association. In many situations, there are more common variants truly associated with disease. These variants are highly likely to be expression quantitative trait loci (eQTLs). eQTLs are derived from polymorphisms in the genome that result in differential measurable transcript levels. Microarrays are used to measure gene expression levels across genetic mapping populations. For at least a subset of complex disorders, gene expression levels could be used as a surrogate/biomarker for classical phenotypes. The gene underlying the eQTL is considered to be an excellent candidate for phenotypic QTL.

eQTL mapping is a statistical technique to locate genomic intervals, that are likely to regulate the expression of each transcript, by correlating quantitative measurements of mRNA expression with genetic polymorphisms segregating in a population. In a GWAS, millions of SNPs are tested at once. Associations that initially appear to be significant must be statistically adjusted to account for the large number of tests being performed. A large number of false positives will result in if this correction is ignored. The multiple-testing correction, however, sets a very high threshold for genome-wide significance, on the order of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M2">View MathML</a> when a million SNPs are tested. In the vast majority cases, however, association studies have achieved only limited success. Large sample sizes are needed to achieve sufficient statistical power to detect risk alleles with effects weak enough to have escaped detection in the past; the disease risk alleles identified by GWASs so far do have weak effects, each with odds ratios of 1.1 or 1.2 [1].

Two closely related inferential procedures for multiple testing have been discussed in this work-afrequentist approach based on Benjamini and Hochberg's ([2]) false discovery rate procedure, and an empirical Bayes methodology developed in Efron et al. [3,4]. These two methods are not only very closely related, they can be used to support each other. In a classic two-sample problem in a microarray experiment, these approaches have been discussed by Efron and Tibshirani[5]. However, they have considered nonparametric empirical Bayes (NPEB) model. Parametric Bayesian modeling has been considered by Newton et al. [6], Lee et al. [7], Kendziroski et al. [8-10], Gelfond et al. [11]. Hierarchical models like gamma-gamma [6] or lognormal-normal [8] are used quite often in PEB procedures. These models suffer from a serious drawback that the variation is constant among genes. An extension has been done to these models by considering gene specific variations[12]. The application of empirical Bayes has been somehow not very common in literature. The obvious reason is that, experimenters have not brought us many data sets having the parallel structure necessary for empirical Bayes to do its stuff. Because of the recent surge in high-throughput ([13]) technologies and genome projects, many genome studies are now underway. These studies have become a major data generator in the post-genomics era. Empirical Bayes procedures seem to be particularly well-suited for combining information in expression data.

One of the fundamental statistical problems in microarray gene expression analysis is the need to reduce dimensionality of the transcripts. This can be achieved by identifying differentially expressed (DE) genes under different conditions or groups. Regulatory network can be obtained by associating differential expressions with the genotype of molecular markers. It is possible to have a large number of DE genes that influences a certain phenotype while their relative proportion is very small. It is very important to identify these DE genes from among the number of recorded genes [6,7,9,14,15]. Empirical Bayes methods provide a natural approach to reduce the dimensionality significantly [16,17]. Following the empirical Bayes approach DE genes are identified using the posterior probability for differential expression. EB approaches detect a DE gene by sharing information across the whole genome.

The development of the empirical Bayes methodologies that improve the power to detect DE genes essentially reduces to the choice of whether gene-specific effects should be modeled as fixed or random [18]. Both mean and error variance can be of either of these two: fixed or random. Fixed mean and random error variance has been considered by Wright and Simon [19] and Cui et al. [20] whereas Lonnstedt et al. [21], Tai and Speed [22], Lonnstedt and Speed [23] have considered both the parameters to be random. Random mean effect with homogeneous fixed error variance has been considered by Newton et al. [6,24], Kendziroski et al. [9] and Kendziroski et al. [10]. However an extension to this fixed error variance has been considered by Gelfond et al. [11]. They have considered discrete uniform prior for the variance component.

The paper is organized as follows. In the Methods section we introduce the necessary notations for our additive genetic model along with the notions of false discovery rate (fdr). In this section we have tried to establish the relationship between fdr and empirical Bayes. Methods section also describes, the proposed Expectation/Conditional Maximization Either (ECME) (Liu and Rubin [25]) in details. This algorithm generalizes the Expectation-Maximization algorithm with better convergence rate. A simulation study has been performed and described in the Results section. We show that proposed parametric empirical Bayes performs better compared to nonparametric empirical Bayes in terms of controlled fdr. In the Results section, as an application, we have applied the proposed methodology to the Liver Cohort (LHC) dataset. We conclude the article the Discussion section.

Methods

In a microarray experiment, we obtain several thousand expression values, one or many for each gene. These studies offer an unprecedented ability to do large-scale studies of gene expression. Let us define Gii = 1.....l as the genomic marker(i.e. SNP), and Tj(j = 1......J) as the transcripts. The identified eQTLs refer to the significant Gs that are associated with Ts. These associations can be found using a test statistics based on all <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M3">View MathML</a> samples. The genetic model for this association can be one of the three models: dominant, recessive and additive. Under the dominance model, we can have two genotypes for each of the SNPs. However for an additive model, three genotype groups are available. A transcript <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4">View MathML</a> is assumed to be associated with marker <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M5">View MathML</a> if the mean level of expression of transcript <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4">View MathML</a> for one genotype group is different from that of the other genotype group corresponding to that marker. Let <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M6">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M7">View MathML</a> be the group means corresponding to the genotypes <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M5">View MathML</a>. To test the hypothesis <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M8">View MathML</a>, a few test statistics are proposed for microarray data analysis[26]. The present work is based on the statistic proposed by Efron et al. [4]. The test statistic is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M9">View MathML</a>

(1)

where <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M10">View MathML</a> is the usual standard deviations and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M11">View MathML</a> is defined to minimize the difference in the coefficient of variation of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M12">View MathML</a> within classes of genes with approximately equal variance. A drawback of calculating <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M11">View MathML</a> is the computational cost. Note that if <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M13">View MathML</a>, this reduces to usual t-statistic. Here <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M11">View MathML</a> is considered to be 90th percentile of all <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M10">View MathML</a> values (Efron el al. [4]).

When expression measurements between two groups are compared for any transcript, the observations are partitioned into two user defined groups of sizes <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M14">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M15">View MathML</a> with <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M16">View MathML</a>. If there is no significant difference between the group means, the transcript is assumed to be equivalently expressed (EE). On the contrary, if significant difference is observed, the transcript is termed as differentially expressed (DE). For any transcript <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4">View MathML</a> and SNP <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M5">View MathML</a> it may be either of these two: DE or EE. This uncertainty can be modeled by a mixture of two distributions as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M17">View MathML</a>

(2)

where <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M18">View MathML</a> is the mxining proportion of EE transcripts and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M19">View MathML</a> is the proportion of DE transcripts, <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M20">View MathML</a> is a vector parameters involved to characterize the distributions. Let Fi be the minor allele frequency of the ith SNP then we model the distribution of Zij as a mixture model of the form:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M21">View MathML</a>

(3)

where <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M30">View MathML</a> denotes the distribution of Zij for nonzero associations between <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M5">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M29">View MathML</a> denotes the distribution of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M12">View MathML</a> for the zero associations. <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M22">View MathML</a> isdefined as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M23">View MathML</a>

For any transcript and any SNP there may be three possible relations: no association, positive association and negative association. Extending the idea of two component mixture model, the distribution of the test statistics is modeled by the following mixture model:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M24">View MathML</a>

(4)

Where

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M25">View MathML</a>

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M26">View MathML</a>

with <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M27">View MathML</a>. Mixing proportions <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M28">View MathML</a> are nonnegative constantsand sum to one for fixed i. <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M29">View MathML</a> corresponds to distribution for no associationwhereas <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M30">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M31">View MathML</a> correspond to distributions related to positive and negativeassociation respectively. In a recent work, Noma and Matsui [27], have used semiparametric hierarchical mixture model where the distribution of mean expression level of a transcript is considered to be a three-component mixture distribution.

Full Bayesian analysis of (4) will require prior specifications of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M32">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M33">View MathML</a>. However, one can use the massively parallel structure of microarray data to estimate an empirical Bayes estimate of the posterior probability. These huge data motivates to be quite empirical rather than specifying a-priori models in favor of data-based investigations [27].

Empirical Bayes, false discovery rates (fdr) and local false discovery rate (lfdr)

False discovery rate (fdr) is defined as the expected proportion of errors committed by falsely rejecting null hypotheses. Benjamini and Hochberg's [2]fdr criterion has very close relation with the empirical Bayes analysis. This relation improved the connection between Bayesian and frequentist testing theory. The close connection between fdr and the empirical Bayes methodology follows directly from Bayes theorem and this has been established by the "Equivalence theorem"[28]. Tail area rejection regions like <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M34">View MathML</a> are common in the frequentist framework. According to this theorem, if the tail area rejection region is taken to be as large as possible subject to the constraint that the estimated Bayes proportions of false discoveries is less than <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M35">View MathML</a>, then the frequentist expected proportion of false discoveries is also less than <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M35">View MathML</a>.

The empirical Bayes approach suggests a local version of the fdr called local false discovery rate (<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M36">View MathML</a>). The Bayes probability that a transcript <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M4">View MathML</a> for SNP <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M37">View MathML</a> is "EE" given the test statistic <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M38">View MathML</a>, is known as <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M39">View MathML</a> and it is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M40">View MathML</a>

Analytically, <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M41">View MathML</a> is a conditional expectation of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M42">View MathML</a> defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M43">View MathML</a>

For the above set up in (3), <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M44">View MathML</a> represents the local false discovery rate (lfdr) and fdr can be estimated:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M45">View MathML</a>

(5)

and hence

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M46">View MathML</a>

(6)

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M47">View MathML</a> is estimated bypermutation method (Efron et al. [4]) and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M48">View MathML</a> is estimated from the nonnegative constraint

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M49">View MathML</a>

All other parameters will be estimated by EM algorithm assuming <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M50">View MathML</a> to be known. There are some practical difficulties with the lfdr that relies on densities. The estimation of null becomes more problematic in the far tails. It is relatively easier to work with cumulative distribution function than work with densities. Identification of discoveries by lfdr may not be reproducible for a new data. Therefore, even in empirical Bayes framework, fdr should be preferred.

Nonparametric empirical Bayes (NPEB)

The main difference between parametric empirical Bayes (PEB) and nonparametric empirical Bayes (NPEB) is the way in which <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M51">View MathML</a> are treated. In PEB model, the functional form of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M51">View MathML</a> are known, i.e., we have a parametric family of priors. In contrast, the NPEB does not assume the functional form to be known. Though NPEB methods are quite powerful, these are more suitable for large sample analyses. To compute the fdr under NPEB setup, we have followed the algorithm proposed by Efron et al. [4].

ECME algorithm

To fit a mixture model, EM algorithm is widely used. In case of t distribution the mean parameter <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M52">View MathML</a> and variance component <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M53">View MathML</a> can easily be estimated by EM algorithm assuming that degrees of freedom <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M54">View MathML</a> is known. However when <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M54">View MathML</a> is unknown EM still can be used as demonstrated by Lange, Little and Taylor [29]. But this method appears to be very slow (Liu and Rubin [30]) and an extension has been proposed by Meng and Rubin [31] as ECM algorithm. This is a generalization of EM algorithm where the E step remains the same butthe M step is replaced by CM (constrained or conditional maximization) step. ECM algorithm is basically a generalized EM (GEM) as shown by Meng and Rubin [31]. Incidentally, the rate of convergence, in terms of iterations, for this ECM algorithm is slower compared to EM. To overcome this computational problem, Liu and Rubin [30] propose an efficient algorithm ECME which is again an extension of ECM algorithm. Though this is not a GEM, it converges faster.

For the <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M55">View MathML</a> -th SNP, the complete data is defined as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M56">View MathML</a>

where

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M57">View MathML</a>

and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M58">View MathML</a> s are independently distributed gamma variables.

McLachlan and Krishnan [32] have already discussed the application of the EM algorithm for ML estimation in case of single component t distribution. In ECME algorithm, this result has been extended to cover the present set up of a 3-component mixture of t distribution. For the sake of brevity, in this section we omit the suffix ij for all the variables. To define t distribution with mean <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M52">View MathML</a>, variance <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M53">View MathML</a> and degrees of freedom <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M54">View MathML</a>, we proceed as follows:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M59">View MathML</a>

then marginally, <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M60">View MathML</a>.

Following the above definition, the complete data likelihood <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M61">View MathML</a> can be factorized a product of three terms-marginal densities of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M62">View MathML</a> s, the conditional densities of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M63">View MathML</a>, and conditional densities of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M64">View MathML</a>. In notation, the log-likelihood of the complete-data can be expressed as

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M65">View MathML</a>

(7)

where

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M66">View MathML</a>

(8)

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M67">View MathML</a>

(9)

and

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M68">View MathML</a>

(10)

E-Step

To compute the E-step of the proposed algorithm, at (t+1)th step we need to calculate <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M69">View MathML</a>, the current conditional expectation of the complete-data log likelihood function <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M70">View MathML</a>. From equation (4) to (7), we can write

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M71">View MathML</a>

(11)

where

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M72">View MathML</a>

(12)

and

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M73">View MathML</a>

(13)

which is the posterior probability that <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M74">View MathML</a> belongs to the k-th component of the mixture based on current fit <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M75">View MathML</a>.

Similarly,

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M76">View MathML</a>

(14)

Where

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M77">View MathML</a>

(15)

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M78">View MathML</a> is a digamma function and

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M79">View MathML</a>

(16)

CM-step

In usual M-step parameters <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M80">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M81">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M82">View MathML</a> can be estimated by considering equations (10) - (12) independently. The new updates for <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M83">View MathML</a> can be obtained as a closed form solution whereas for <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M81">View MathML</a> an iterative procedure may be used using the following equations:

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M84">View MathML</a>

(17)

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M85">View MathML</a>

(18)

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M86">View MathML</a>

(19)

and <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M87">View MathML</a> is the solution of the following equation

<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M88">View MathML</a>

(20)

To get an efficient algorithm, let us partition <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M89','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M89">View MathML</a> as <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M90','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M90">View MathML</a> where <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M91','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M91">View MathML</a> contains all the parameters except parameters corresponding to degree of freedom of t-distributions. The above M-step is replaced by two CM-steps, as follows.

CM-Step 1. Keeping <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M92','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M92">View MathML</a> fixed, i.e. <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M81">View MathML</a> is fixed at <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M93','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M93">View MathML</a>, maximize <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M94','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M94">View MathML</a> to get <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M95','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M95">View MathML</a>

CM-Step 2. Now fix <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M96','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M96">View MathML</a> at <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M97">View MathML</a> and calculate <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M98','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M98">View MathML</a> by maximizing <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M99','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M99">View MathML</a>

Furthermore to make the algorithm more efficient, after the first CM-step, we replace the E-step with <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M100','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M100">View MathML</a> instead of <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M101','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M101">View MathML</a>.

Simulation study

To assess the proposed methodology, a small sample simulation study has been performed. This gives an idea whether or not the parameters are well estimated and most importantly, they provide information of false discovery rates.

First we simulated a dominant model with 10,000 transcripts and 10 SNPs. The equivalently expressed (EE) transcripts are generated from N(0,1) after log-transformation. We have simulated the data under three choices of proportions of differentially expressed (DE) transcripts (<a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M102','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M102">View MathML</a>). We have taken <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M102','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M102">View MathML</a> to be (0.01, 0.05, 0.10). If the transcript is DE, it has to be generated from N(4,0.5) after log-transformation. The controlled fdr are also assumed to be (0.01, 0.05, 0.10) for these data sets. For <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M103','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M103">View MathML</a>, the simulated data is given in Figure 1.

thumbnailFigure 1. A part of the simulated data for <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M106','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M106">View MathML</a>.

The impact of minor allele frequency (MAF) on the distributions under null has also been studied. Under null, for a t-distribution, the only parameter to be estimated is its degrees of freedom. The comparison has been made by computing different quantiles for six choices of MAFs. For the lower quantiles, they almost overlapped with each other. Very small deviations are observed for upper quantiles (Figure 2).

thumbnailFigure 2. Effect of minor allele frequency (MAF) on the null distribution. Only upper quantiles (from 80%) have been considered as lower quantiles showing almost no difference.

For the 10 SNPs, we fitted the null distribution using permutation method in a balanced way. From each group, randomly selecterd 35 samples are shifted from one group to the other and the value of the statistic is noted. This process is repeated 40 times and histograms are plotted. From the histograms, the degrees of freedom corresponding to the null distribution for eack SNP is estimates. To get an idea about the goodness-of-fit, Q-Q plots are done (Figure 3). These plots show that the null distribution is well approximated by the standardized t-distribution with appropriate degrees of freedom.

thumbnailFigure 3. QQ-plot for eight SNPs.

Parameters related to the mixture model (4) are estimated using proposed ECME algorithm after estimating the null distribution using permutation method. Then FDR is computed under both proposed parametric empirical Bayes and nonparmetic empirical Bayes setup and the result is given in Table 1.

Table 1. The True FDR Performance of Controlled FDR in EB Models

It is evident from the above table that the nonparmateric empirical Bayes is much conservative compared to its parametric alternative. For parametric set up, the true FDR is very much close to the controlled one, whereas, for nonparametric empirical Bayes these values are not so close as the true fraction of DE transcripts increases.

HLC data analysis

We applied the empirical Bayes model to analyze a sequencing data publicly available. In the current study, we have started with liver tissue data of 213 Caucasian samples from apreviously described human liver cohort (LHC) (Yang et al. [33]). To get the genotypes and gene expression profiles, DNA and RNA have been isolated. Illumina platform is used to get the expressions. After putting some filtration (MAF>5%, HWE<10-5,) we are left with 173 samples, 472,000 SNPs and 30,000 expressions.

The distribution of minor allele frequency (MAF) over SNPs is given in the histogram (Figure 4). For all possible SNP-transcript combinations, test statistic, <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M104">View MathML</a> s are computed. We fit the mixture model using the ECME algorithm in R 2.15.1 after estimating the null distribution using permutation method. However, due to high dimension data, it becomes very difficult to fit a mixture model using the proposed algorithm. For the sake of parsimony, we further filtered the data and ECME algorithm is used for only top SNPs with <a onClick="popup('http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M105','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2164/14/S8/S8/mathml/M105">View MathML</a>. For these top SNPs, the mixture model is fitted and estimates are obtained. To compute lfdr and FDR from (5) and (6) respectively, these estimates are used.

thumbnailFigure 4. Minor allele frequency (MAF) distribution. X axis corresponds to minor allele frequency 25% to 50%.

Conclusion

To compare our result with [33], we focus on 18 of the 54 P450 genes used in the study. These are CYP3A5, CYP2D6, CYP4F12, CYP2E1, CYP2U1, CYP1B1, CYP2C18, CYP4F11, CYP4V2, CYP2F1, CYP39A1, CYP26C1, CYP2C19, CYP2C9, CYP2S1, CYP46A1, CYP4A11 and CYP4X1.However our method fails to identify a single SNP with FDR<10% for CYP2R1 and that gene symbol has been excluded from the table (Table 2). It can be seen from the table (Table 2) that for a threshold of 10% FDR number of significant eQTL pairsis4916.Since we have considered only top SNPs, this may be an overestimate. SNPs which are within <1-Mb distance from gene location are defined as cis-SNPs. It is interesting to note that, among these 18 genes, the first five (CYP3A5, CYP2D6, CYP4F12, CYP2E1 and CYP2U1) having more than 40 cis-SNPs. In all cases FDR based analysis results in identifying more cis-SNPs for these 18 genes compared to that of Yang et al. (2010) [33].

Table 2. Number of eQTL pairs after crossing the threshold of FDR

Discussion

In contrast to previously available methods based on p-values, the empirical Bayes method uses local false discovery rate (lfdr) as the threshold. This method controls false positive rate. For a particular SNP, the lfdr is computed for the site-specific evidence whereas the FDR averages over other sites with stronger evidence. There are some limitations of using FDR which may result in misleading inferences in genome studies. In such a situation, it is better to use lfdr which is a bit difficult to estimate compared to FDR.However there is still one computational problem which needs much attention. Due to the high dimensionality in the data, sometimes existing algorithms fail. This necessitates the need to find some more efficient algorithms. The choice of threshold FDR value is an important deciding factor in such studies. It would be interesting to see, how number of cis-SNPs vary with the change in FDR threshold. In this way FDR criterion can be used to estimate number of SNPs that we may need to consider.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work is supported by the U.S. National Institutes of Health grants R01 GM74217 (Lang Li) and AHRQ Grant R01HS019818-01 (MalazBoustani)

Declarations

The publication costs were funded by the authors through P50 CA113001 (Huang, T.M.), R01 GM088076 (Skaar, T.), R01 HS019818 (Dexter).

This article has been published as part of BMC Genomics Volume 14 Supplement 8, 2013: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S8.

References

  1. Liu , Chunyu : Brain expression quantitative trait locus mapping informs genetic studies of psychiatric diseases.

    Neurosci Bull 2011, 27(2):123-133. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

    Journal of the Royal Statistical Society. Series B (Methodological) 1995, 289-300. OpenURL

  3. Efron B, Storey J, Tibshirani R: Microarrays, empirical Bayes methods, and false discovery rates.

    Stanford Technical Report 2001. OpenURL

  4. Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment.

    Journal of the American statistical association 2001, 96(456):1151-1160. Publisher Full Text OpenURL

  5. Efron B, Tibshirani R: Empirical Bayes methods and false discovery rates for microarrays.

    Genetic epidemiology 2002, 23(1):70-86. PubMed Abstract | Publisher Full Text OpenURL

  6. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data.

    Journal of Computational Biology 2001, 8:37-52. PubMed Abstract | Publisher Full Text OpenURL

  7. Lee MLT, Kuo FC, Whitmore GA, Sklar J: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations.

    Proceedings of the National Academy of Sciences 2000, 97(18):9834-9839. Publisher Full Text OpenURL

  8. Kendziorski CM, Zhang Y, Lan H, Attie A: The efficiency of MRNA pooling in microarray experiments.

    Biostatistics 2003, 4:465-477. PubMed Abstract | Publisher Full Text OpenURL

  9. Kendziorski CM, Newton MA, Lan H, Gould MN: On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles.

    Stat Med 2003, 22:3899-33914. PubMed Abstract | Publisher Full Text OpenURL

  10. Kendziroski CM, Chen M, Yuan M, Lan H, Attie AD: Statistical methods for expression quantitative trait loci (eQTL) mapping.

    Biometrics 2006, 62(1):19-27. PubMed Abstract | Publisher Full Text OpenURL

  11. Gelfond JAL, Ibrahim JG, Zou F: Proximity Model for Expression Quantitative Trait Loci (eQTL) Detection.

    Biometrics 2007, 63:1108-1116. PubMed Abstract | Publisher Full Text OpenURL

  12. Lo K, Gottardo R: Flexible empirical Bayes models for differential gene expression.

    Bioinformatics 2007, 23(3):328-335. PubMed Abstract | Publisher Full Text OpenURL

  13. Sánchez-Linares I, Pérez-Sánchez H, Cecilia JM, García JM: High-Throughput parallel blind Virtual Screening using BINDSURF.

    BMC Bioinformatics 2012, 13(Suppl 14):S13. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  14. Bergemann TL, Wilson J: Proportion statistics to detect differentially expressed genes: a comparison with log-ratio statistics.

    BMC Bioinformatics 2011, 12:228. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  15. Ruan L, Yuan M: An Empirical Bayes' Approach to Joint Analysis of Multiple Microarray Gene Expression Studies.

    Biometrics 2011, 67(4):1617-1626. PubMed Abstract | Publisher Full Text OpenURL

  16. Efron B, Morris C: Combining possibly related estimation problems (with discussion).

    Journal of the Royal Statistical Society, Series B 1973, 35:379-421. OpenURL

  17. Efron B, Morris C: Stein's paradox in statistics.

    Scientific American 1977, 236:119-127. Publisher Full Text OpenURL

  18. Bar H, Booth J, Schifano E, Wells MT: Laplace approximated EM microarray analysis: an empirical Bayes approach for comparative microarray experiments.

    Statistical Science 2010, 25(3):388-407. Publisher Full Text OpenURL

  19. Wright GW, Simon RM: A random variance model for detection of differential gene expression in small microarray experiments.

    Bioinformatics 2003, 19:2448-2455. PubMed Abstract | Publisher Full Text OpenURL

  20. Cui X, Hwang JG, Qiu J, Blades NJ, Churchill GA: Improved statistical tests for differential gene expression by shrinking variance components estimates.

    Biostatistics 2005, 6(1):59-75. PubMed Abstract | Publisher Full Text OpenURL

  21. Lönnstedt I, Grant S, Begley G, Speed TP: Microarray analysis of two interacting treatments: a linear model and trends in expression over time. Technical Report, Department of Mathematics, Uppsala University, Sweden; 2001.

  22. Tai YC, Speed TP: A multivariate empirical Bayes statistic for replicated microarray time course data.

    The Annals of Statistics 2006, 34(5):2387-2412. Publisher Full Text OpenURL

  23. Lonnstedt I, Speed T: Replicated microarray data.

    StatisticaSinica 2002, 12:31-46. OpenURL

  24. Newton MA, Kendziorski CM: Parametric empirical Bayes methods for microarrays.

    The analysis of gene expression data: methods and software 2003, 254-271. OpenURL

  25. Liu C, Rubin DB: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence.

    Biometrika 1994, 81(4):633-648. Publisher Full Text OpenURL

  26. Cui X, Churchill GA: Statistical tests for differential expression in cDNA microarray experiments.

    Genome Biol 2003, 4(4):210. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  27. Noma H, Matsui S: The optimal discovery procedure in multiple significance testing: an empirical Bayes approach.

    Statistics in Medicine 2012, 31(2):165-176. PubMed Abstract | Publisher Full Text OpenURL

  28. Efron B: Robbins, empirical Bayes and microarrays.

    The annals of Statistics 2003, 31(2):366-378. Publisher Full Text OpenURL

  29. Lange KL, Little RJ, Taylor JM: Robust statistical modeling using the t distribution.

    Journal of the American Statistical Association 1989, 84(408):881-896. OpenURL

  30. Liu C, Rubin DB: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence.

    Biometrika 1994, 81(4):633-648. Publisher Full Text OpenURL

  31. Meng XL, Rubin DB: Maximum likelihood estimation via the ECM algorithm: A general framework.

    Biometrika 1993, 80(2):267-278. Publisher Full Text OpenURL

  32. McLachlan G, Krishnan T: The EM Algorithm and Extensions.

    Wiley Series in Probability and Statistics 1997. OpenURL

  33. Yang X, Zhang B, Lum PY: Systematic genetic and genomic analysis of cytochrome P450 enzyme activities in human liver.

    Genome Research 2010, 20(8):1020-1036. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL