Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA

Abstract

Background

High-throughput technologies enable the testing of tens of thousands of measurements simultaneously. Identification of genes that are differentially expressed or associated with clinical outcomes invokes the multiple testing problem. False Discovery Rate (FDR) control is a statistical method used to correct for multiple comparisons for independent or weakly dependent test statistics. Although FDR control is frequently applied to microarray data analysis, gene expression is usually correlated, which might lead to inaccurate estimates. In this paper, we evaluate the accuracy of FDR estimation.

Methods

Using two real data sets, we resampled subgroups of patients and recalculated statistics of interest to illustrate the imprecision of FDR estimation. Next, we generated many simulated data sets with block correlation structures and realistic noise parameters, using the Ultimate Microarray Prediction, Inference, and Reality Engine (UMPIRE) R package. We estimated FDR using a beta-uniform mixture (BUM) model, and examined the variation in FDR estimation.

Results

The three major sources of variation in FDR estimation are the sample size, correlations among genes, and the true proportion of differentially expressed genes (DEGs). The sample size and proportion of DEGs affect both magnitude and precision of FDR estimation, while the correlation structure mainly affects the variation of the estimated parameters.

Conclusions

We have decomposed various factors that affect FDR estimation, and illustrated the direction and extent of the impact. We found that the proportion of DEGs has a significant impact on FDR; this factor might have been overlooked in previous studies and deserves more thought when controlling FDR.

Introduction

With the advent of high throughput technologies, research has focused on the systematic genome-wide study of biological systems. Microarray technology has been used to measure the mRNA expression of thousands of genes simultaneously. Concurrently, new statistical methods have been developed to analyze the data generated by these experiments. These methods involve both data preprocessing (background correction, data transformation, normalization, etc.) and specific tools for different types of studies (e.g., class discovery, class prediction, or class comparison).

The canonical class comparison problem involves the identification of lists of DEGs. The evolving consensus

Our study sheds more light on possible reasons for the (lack of) precision in the estimated FDR. Our results provide two concrete examples of this imprecision. First, we look at an example where univariate Cox proportional hazards (CPH) models are used to determine which genes appear to be related to survival. By resampling ~ 100 patients at a time (out of a set of ~ 200 patients), we find that the estimates of the percentage of genes that appear to be related to survival range from 0% to 20%. In a simpler example of univariate t-tests between two groups of samples, we find that the estimate of the percentage of DEGs ranges between 13% and 43%. This range of estimates is much wider than one would anticipate from fitting a distribution based on thousands of

Throughout this paper, we estimate FDR using a BUM model for the distribution of

In many cases, it is difficult to evaluate analytical methods for microarray data because of the complex (and unknown) nature of the underlying biological phenomena. Thus, simulated data sets with known “ground truth” are needed in order to assess the performance of computational algorithms for the analysis of high throughput data. To address this problem, many groups have developed microarray simulation software

We have already introduced a package of microarray simulation software called UMPIRE

Methods

Public data sets

The Affymetrix microarray data were collected as part of a study to predict survival in follicular lymphoma patients

The two-color fluorescent cDNA microarray data were collected as part of a study to identify clinically relevant subtypes of prostate cancer

Simulated data sets

Genes can be correlated when they are involved in the same active biological pathways or are regulated by the same set of factors. We consider the correlation in gene expression to be “clumpy”, meaning that there are gene groups with high correlation within groups but little or no correlation between groups. To mimic this feature, we applied a block structure. Both the block sizes and the correlations within a block vary in order to mimic pathways or networks of different sizes, and loosely or strongly correlated genes within a particular pathway or network. Distinct blocks are assumed to be independent. Please refer to our previous publication for a detailed description of the block structure
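The block structure described above can be illustrated with a minimal sketch. It is not the UMPIRE implementation: it assumes a simple one-factor construction in which every gene in a block shares one standard normal factor, which gives a common pairwise correlation `rho` within the block, and it draws separate blocks independently. The function names `simulate_block` and `pearson` and all parameter values are hypothetical, and only non-negative correlations are covered.

```python
import math
import random

def simulate_block(n_genes, n_samples, rho, rng):
    """Simulate one block in which every pair of genes has correlation rho (0 <= rho < 1)."""
    block = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        shared = rng.gauss(0.0, 1.0)  # factor shared by all genes in the block
        for g in range(n_genes):
            own = rng.gauss(0.0, 1.0)  # gene-specific noise
            # Unit variance overall; covariance between two genes equals rho.
            block[g][s] = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * own
    return block

def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
# Two blocks of different size and strength, drawn independently of each other.
block_a = simulate_block(n_genes=5, n_samples=2000, rho=0.8, rng=rng)
block_b = simulate_block(n_genes=3, n_samples=2000, rho=0.3, rng=rng)

within_a = pearson(block_a[0], block_a[1])  # expected to be near 0.8
within_b = pearson(block_b[0], block_b[1])  # expected to be near 0.3
between = pearson(block_a[0], block_b[0])   # expected to be near 0
print(round(within_a, 2), round(within_b, 2), round(between, 2))
```

The large number of samples is used here only so that the empirical correlations sit close to their targets; with realistic microarray sample sizes the estimates would scatter much more widely.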

Using the UMPIRE package, additive and multiplicative noise were incorporated, and correlated blocks were implemented. We simulated normal samples as a homogeneous population with _{g}

Results

Survival in follicular lymphoma

Dave and colleagues

**Distribution of gene-by-gene p-values from follicular lymphoma data.** Histogram of gene-by-gene

Two-sample t-tests comparing prostate tumor with normal prostate

In order to study the variability of ^{–7} and 0.0434 when using ten samples per group, with a median of 0.00628. The cutoff ranged from 0.0089 to 0.0439, with a median of 0.0229, when using twenty samples per group. If we tried to use the median cutoff across all random resamplings from the full data set, we also found that the effective FDR varied widely from one data set to another (Figure

**FDR and BUM results from prostate data.** **(Top)** Distribution of the effective false discovery rate (FDR) at the median **(Bottom)** Range of beta-uniform mixture (BUM) models of the distribution of

Simulations

The two practical data sets were used to demonstrate the variability in the distribution of p-values and FDR estimation observed in real data. However, due to the limited sample size and unknown ground truth, the practical data sets lack the flexibility needed for testing analytical methods. Thus, realistic simulations were used to disentangle different factors contributing to the variation in FDR estimation.
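To make the estimated quantity concrete, the following sketch draws p-values from a known BUM distribution, refits the model, and evaluates the FDR at a cutoff. It assumes the standard Pounds and Morris form of the BUM density, f(p) = λ + (1 − λ) a p^(a−1), and their upper-bound FDR estimate. The coarse grid search is a stand-in chosen for brevity, not the fitting procedure used in this study, and all function names and parameter values are hypothetical.

```python
import math
import random

def simulate_bum_pvalues(n, a, lam, rng):
    """Draw n p-values from the density lam + (1 - lam) * a * p**(a - 1)."""
    pvals = []
    for _ in range(n):
        u = 1.0 - rng.random()  # lies in (0, 1], so log(p) is always defined
        if rng.random() < lam:
            pvals.append(u)               # uniform component
        else:
            pvals.append(u ** (1.0 / a))  # Beta(a, 1) component via inverse CDF
    return pvals

def bum_loglik(log_p, a, lam):
    """Log-likelihood of the BUM model given log p-values."""
    return sum(math.log(lam + (1.0 - lam) * a * math.exp((a - 1.0) * lp))
               for lp in log_p)

def fit_bum_grid(pvals, steps=20):
    """Coarse maximum-likelihood fit over a grid on (0, 1) x (0, 1)."""
    log_p = [math.log(p) for p in pvals]
    grid = [i / steps for i in range(1, steps)]
    best = None
    for a in grid:
        for lam in grid:
            ll = bum_loglik(log_p, a, lam)
            if best is None or ll > best[0]:
                best = (ll, a, lam)
    return best[1], best[2]

def bum_fdr(tau, a, lam):
    """Upper-bound FDR estimate when calling p <= tau significant."""
    pi_ub = lam + (1.0 - lam) * a  # upper bound on the null proportion
    return pi_ub * tau / (lam * tau + (1.0 - lam) * tau ** a)

rng = random.Random(1)
pvals = simulate_bum_pvalues(3000, a=0.2, lam=0.8, rng=rng)
a_hat, lam_hat = fit_bum_grid(pvals)
print(a_hat, lam_hat, bum_fdr(0.01, a_hat, lam_hat))
```

Repeating the last three lines with different seeds gives a rough sense of how much the fitted parameters, and therefore the FDR at a fixed cutoff, move between data sets even when the p-values are independent.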

We simulated 128 sets of normal and cancer data, using four different sample sizes (

Precision of parameter estimates in the BUM model

Pounds and Morris

for 0 <

Both the mean and the variance of

Boxplot of

Figure

Boxplot of

From

The remaining

We also studied the correlation between

**Boxplot of Pearson correlation between**

False discovery proportion as a function of p-value

We simulated cancer samples with a proportion of DEGs having altered mean expression values. For given (nominal)

One factor that dramatically affects FDP is

**FDP for different block sizes and ѱ.** Solid lines represent the mean FDPs for particular block size and

To illustrate the influence of sample size on FDP estimation, we performed another set of simulations with different sample sizes and block sizes, but fixed

**FDP for different sample sizes and block sizes.** Solid black lines represent the mean FDPs from all simulated data for the same sample size. Dashed lines represent standard deviations of FDPs for different block sizes, which are distinguished by the colors shown in the legend of the bottom-right panel.

Efron’s dispersion variate and the standard deviation of the correlation density

Efron

**Boxplot of Â grouped by sample size and block size**

One novel finding from our study is that the dispersion variate

Boxplot of dispersion variate A under different combinations of block sizes and

We also found that FDP is negatively correlated with the dispersion variate

**Scatter plot of dispersion variate A and FDP.** Colors are used to distinguish results from different

Also following Efron, we calculated the standard deviation of the empirical correlation densities (

**Boxplot of **

Conclusions

From the two concrete examples, we observed a lack of precision in the estimation of FDR. To study the sources of variation in FDR estimation, we simulated microarray data with realistic parameters. In our simulations, a block-wise structure with different block sizes and intra-block correlations is used to mimic the molecular networks or biological functional groups in which large-scale correlation of gene expression arises. A particular block of genes could be transcriptionally active or inactive depending on specific biological conditions. When a block of genes is transcriptionally active, the expression levels follow a multivariate normal distribution with parameters estimated from real microarray data. A certain portion of genes is differentially expressed between normal and cancer samples, where the magnitude of the change follows a gamma distribution, which allows some large changes while the majority have a two-fold change on average. With this setting, we simulated microarray data sets with different sample sizes, correlation structures, portions of negatively correlated genes within a block, and portions of DEGs between the two biological conditions.

For each simulated data set, we recorded the parameters related to FDR estimation. Different portions of negatively correlated genes within a block do not affect the parameter estimates. Thus, the three major sources of variation in FDR estimation are the sample size, the correlation structure, and the portion of DEGs. With a larger sample size, the variances of all parameters decrease because of the increased statistical power. However, the percentage of non-DEGs is always underestimated, even though it approaches the true portion as the sample size grows. A larger block size results in less precise estimates of all the parameters because there are fewer independent measurements. However, the block size does not affect the mean of the parameter estimates. Thus, FDR estimates are less precise with more correlation, but the average FDR estimate is not affected.

Our study suggests that an important factor affecting FDR estimation is the portion of DEGs. With a larger portion of DEGs, the distribution of the test statistic is widened by the larger portion of true positives, resulting in a smaller FDR and more precise FDR estimation.
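The effect of the portion of DEGs on the magnitude of the false discovery proportion can be reproduced in a deliberately simplified setting. The sketch below is an illustration under stated assumptions, not the simulation design of this study: genes are independent, every DEG has the same fixed mean shift instead of a gamma-distributed change, and a two-sided z-test with known unit variance stands in for the t-test. The function name `simulate_fdp` and all parameter values are hypothetical.

```python
import math
import random

def simulate_fdp(n_genes, deg_fraction, n_per_group, shift, cutoff, rng):
    """Return the false discovery proportion among genes with p <= cutoff."""
    n_deg = int(round(n_genes * deg_fraction))
    se = math.sqrt(2.0 / n_per_group)  # std. error of the mean difference (unit variance)
    true_pos = 0
    false_pos = 0
    for g in range(n_genes):
        is_deg = g < n_deg  # the first n_deg genes are truly differentially expressed
        mu = shift if is_deg else 0.0
        normal = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        cancer = [rng.gauss(mu, 1.0) for _ in range(n_per_group)]
        z = (sum(cancer) - sum(normal)) / n_per_group / se
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        if p <= cutoff:
            if is_deg:
                true_pos += 1
            else:
                false_pos += 1
    called = true_pos + false_pos
    return false_pos / called if called else 0.0

rng = random.Random(2)
# Same cutoff, sample size, and effect size; only the portion of DEGs differs.
fdp_few = simulate_fdp(2000, 0.05, 10, 1.0, 0.01, rng)
fdp_many = simulate_fdp(2000, 0.30, 10, 1.0, 0.01, rng)
print(fdp_few, fdp_many)
```

At a fixed cutoff the expected number of false positives changes only slightly, while the number of true positives scales with the portion of DEGs, so the false discovery proportion is lower when more genes are truly differentially expressed.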

In summary, the correlation structure is not the only factor affecting FDR estimation. The portion of DEGs, which varies under different biological conditions, contributes to both the precision and the magnitude of FDR estimation.

List of abbreviations used

FDR: False Discovery Rate; FDP: False Discovery Proportion; UMPIRE: Ultimate Microarray Prediction, Inference, and Reality Engine; BUM: beta-uniform mixture; DEGs: differentially expressed genes; CPH: Cox proportional hazards

Competing interests

The authors have no competing interests.

Authors’ contributions

JZ performed the simulations, gathered the simulation results, and completed the first draft of the manuscript. KRC initiated the project, provided ideas, developed UMPIRE, supervised the progression, and was involved in the manuscript development.

Acknowledgements

This research was supported by NIH/NCI grants P30 CA016672, P50 CA140388, and R01 CA123252.

This article has been published as part of