Faculty of Bioresource Sciences, Akita Prefectural University, Akita 010-0195, Japan

Abstract

Background

Microarray technology has enabled the measurement of comprehensive transcriptomic information. However, each data entry may reflect trivial individual differences among samples and may also contain technical noise. Therefore, the certainty of each observed difference should be confirmed at an early step of the analysis, and statistical tests are frequently used for this purpose. Since microarrays analyze a huge number of genes simultaneously, concerns about multiplicity, i.e., the family-wise error rate (FWER) and the false discovery rate (FDR), have been raised in testing the data. To deal with these concerns, several compensation methodologies have been proposed; these make the tests so conservative that arbitrary tuning of the threshold has been introduced to relax the conditions. Unexpectedly, however, the appropriateness of the test methodologies, the concerns about multiplicity, and the compensation methodologies have not been sufficiently confirmed.

Results

Appropriateness was assessed by checking whether the premises of the methodologies agree with the statistical characteristics of data from two typical microarray platforms. As expected, normality was observed in within-group data differences, supporting the application of t-test and F-test statistics. However, genes displayed their own tendencies in the magnitude of variations, and the distributions of p-values were rather complex. These characteristics are inconsistent with the premises underlying the compensation methodologies, which assume that most of the null hypotheses are true. The evidence also raised doubts about the concerns of multiplicity. In transcriptomic studies, FWER should not be critical, as analyses at higher levels would not be influenced by a few false positives. Additionally, the concerns about FDR are not suitable for the sharp null hypotheses on expression levels.

Conclusions

Therefore, although compensation methods have been recommended to deal with the problem of multiplicity, the compensations are actually inappropriate for transcriptome analyses. Compensations are not only unnecessary, but will increase the occurrence of false negative errors, and arbitrary adjustment of the threshold damages the objectivity of the tests. Rather, the results of parametric tests should be evaluated directly.

Background

Microarray technology has enabled the acquisition of comprehensive quantitative information about mRNA, the transcriptome, in a tissue sample. Because the functions of a cell are primarily determined by expression of the genome, we can assess the state of a cell by examining its transcriptome. However, microarray data may contain irrelevant individual differences as well as noise arising from artifacts of measurement. Indeed, the quality of data generated by microarray assays has been questioned

The test methodologies should be consistent with the data characteristics and the purpose of the test. As with other statistical methods, the principle of a test methodology is based on some assumptions; for accurate analyses, the assumptions should be consistent with the characteristics of the data and the consistency should be checked. Additionally, application of the methodology should be adequate for the purpose of the test

For tests of gene expression levels, parametric methods such as Student’s t-test or analysis of variance (ANOVA) are frequently used. Generally, these methodologies estimate a p-value, which is the probability that a difference at least as large as that observed would occur by chance when no difference actually exists among the populations. If the p-value is less than a predetermined threshold, the observed difference is considered to be significant. In both the t-test and ANOVA, the p-value is calculated by assuming that within-group differences are normally distributed; if this assumption does not hold, we cannot accurately evaluate the observed differences among the groups.

Microarray methodology simultaneously measures the expression levels of a large number of genes, and the expression levels of several genes are frequently analyzed collectively. Accordingly, some concerns related to multiple tests

Multiplicity of tests can increase FWER when we group a set of tests together as a family

As the number of tested subjects increases, FDR, the expected proportion of false positives among the declared positives, may also increase when large numbers of true null hypotheses are expected.

Despite these efforts to find a practical solution, the methodologies inevitably make the tests very conservative, increase the false negatives, and reduce the overall information obtained. To deal with the strictness and to regain some of the information that may be lost, extremely relaxed thresholds for the tests (10-20%) were recommended.

Both FWER and FDR assign high prior probabilities to the null hypotheses, i.e., that the population means are identical. In addition, in a recently published book that featured microarray data, Efron argued that Pr(H_{0}) is high in large-scale inferences, because most of the cases have small, uninteresting, but non-zero differences. This argument may sound useful for gene selection; indeed, his purpose was to "reduce a vast collection of possibilities to a much smaller set of scientifically interesting prospects". However, this is not necessarily consistent with the current demands of microarray data analyses: since many genes have functional relationships, significance can be tested on such cell functions as well. Interesting functions can be easily found and tested by pathway analysis using databases. If the high Pr(H_{0}) scenario unnecessarily increases false negatives, it could limit important information that could be used at higher levels of analysis. Moreover, to negate these small differences, revision of the null hypothesis and the test statistics is required. Nevertheless, Efron did not give any alternative methods, and the complex concept of "interesting" therefore introduced ambiguity in the application of the test. Regardless, in both principle and application, evidence for the estimation of Pr(H_{0}) is critically important.

We note a trend in the transition of the proposed methodologies and applications described above: the conditions that were tightened to deal with the proposed multiplicity have since been relaxed to the point of employing unusual handling of the threshold. While such relaxed application of the test can reduce the number of false negatives, the arbitrariness in choosing both the methodologies and the threshold can damage the objectivity of a test. Indeed, as the transition proceeded, the appropriateness of the premises of the methodologies was never confirmed. Additionally, the suitability of the methodologies for the purpose of the test has been left unexamined. For example, no concrete reason has been proposed to explain why the multiplicity should be considered. As will be discussed below, handling plural test results simultaneously is not a sufficient reason to compensate for multiplicity.

Methods

Data sources

Several sets of Agilent 44K chip data

**List of data IDs used in the figures.** The list of GEO IDs of the data used in the calculations.

Data analysis

The statistical significance of differences in gene expression levels between groups was estimated using the t-test with Welch’s approximation on normalized gene data. It was also estimated by two-way ANOVA on normalized perfect match (PM) data of Affymetrix GeneChips, under the assumption that differences in PM data were the sum of group effects and probe sensitivity.
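As an illustration of the Welch approximation mentioned above, the following is a minimal sketch in Python (standard library only) that computes the test statistic and the Welch-Satterthwaite degrees of freedom for two groups. It is not the author's script; the function name `welch_t` and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    nx, ny = len(x), len(y)
    # Squared standard errors of the two group means (sample variance / n).
    vx = variance(x) / nx
    vy = variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical normalized expression values for one gene in two groups.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```

The two-sided p-value is then the tail probability of a t distribution with `df` degrees of freedom beyond `|t|`; computing it needs a t-distribution function, which the Python standard library does not provide.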

The integrated distribution of gene-wise data variations was compared against the normal distribution using quantile-quantile (QQ) plots. For each gene in the high-calorie-fed group, data normalized within each chip were collected (n = 5). Agilent platform data were selected because, on the Affymetrix GeneChip platform, averaging the data of many PM cells could produce a normal distribution as an artifact of the central limit theorem. The collected data were further z-normalized using their mean and standard deviation (SD) to cancel the differences in expression levels and SDs among genes. The renormalized data were then ranked from 1 to 5 according to the signal intensity among the repeats in each gene. For each rank, the distribution of the renormalized data was presented at the corresponding theoretical quantile by using boxplots. The boxes and bars represent the quartiles, and the whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range from the box.
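The renormalization and ranking steps described above can be sketched as follows in Python (standard library only). This is an illustrative sketch rather than the author's script: the function names are hypothetical, and the plotting positions `(i - 3/8) / (n + 1/4)` used for the theoretical quantiles are an assumption, since the text does not state which positions were used.

```python
from statistics import NormalDist, mean, stdev


def theoretical_quantiles(n, a=0.375):
    """Normal quantiles at plotting positions (i - a) / (n + 1 - 2a).

    The offset a = 3/8 is an assumption, not taken from the paper.
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]


def rankwise_z(genes):
    """Z-normalize each gene's replicates and pool the values by rank.

    genes: list of equal-length lists, one list of replicates per gene.
    Returns one list per rank (lowest first) of z-normalized values.
    """
    n = len(genes[0])
    by_rank = [[] for _ in range(n)]
    for values in genes:
        m, s = mean(values), stdev(values)
        if s == 0:
            continue  # a constant gene cannot be z-normalized
        for rank, v in enumerate(sorted(values)):
            by_rank[rank].append((v - m) / s)
    return by_rank


# Hypothetical data: two genes with five replicates each.
pooled = rankwise_z([[1, 2, 3, 4, 5], [10, 30, 20, 50, 40]])
expected = theoretical_quantiles(5)
```

Each list in `pooled` would be drawn as one box, placed at the matching entry of `expected` on the horizontal axis.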

Within-group SD values among the Agilent chip data were estimated by using normalized z-scores. Within-group SD values among Affymetrix GeneChips, which measure a transcript using multiple PM probes, were estimated as the root mean square of the SDs for the probes. The degree of correlation between the SDs was estimated as Spearman’s ρ using the cor function of the stats package in R.
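For readers without R at hand, the two quantities named above can be sketched in Python (standard library only). This is an illustrative re-implementation, not the author's code: Spearman's ρ is computed as the Pearson correlation of average ranks, and all function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def root_mean_square(sds):
    """Combine probe-wise SDs of one transcript into a single SD estimate."""
    return sqrt(sum(s * s for s in sds) / len(sds))
```

Here `spearman_rho` would be applied to two vectors of gene-wise SDs, one per experimental group, and `root_mean_square` to the probe-wise SDs of a single transcript.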

Data simulation

A virtual dataset was produced to simulate a scenario in which genes share a common level of noise. The virtual dataset was used to estimate within-group standard deviations and p-values. Each imaginary level was generated by summing the group effect, probe sensitivity, and noise component; these components were produced by generating normally distributed random numbers whose SDs were set to be identical to the root mean square of the SDs observed in each of the genes of the real data. The R scripts are available as an additional file.
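The additive model described above can be sketched as follows in Python (standard library only). This is a simplified illustration under stated assumptions, not a translation of the R scripts in the additional file: the function name, the argument names, and the SD values in the example are all hypothetical.

```python
import random


def simulate_gene(n_groups, n_reps, n_probes, sd_group, sd_probe, sd_noise, rng):
    """Simulate PM-like data for one gene as group effect + probe sensitivity + noise.

    Returns a nested list indexed as data[group][replicate][probe].
    """
    group = [rng.gauss(0.0, sd_group) for _ in range(n_groups)]
    probe = [rng.gauss(0.0, sd_probe) for _ in range(n_probes)]
    return [
        [
            [group[g] + probe[p] + rng.gauss(0.0, sd_noise) for p in range(n_probes)]
            for _ in range(n_reps)
        ]
        for g in range(n_groups)
    ]


# Hypothetical settings: 2 groups, 3 replicates, 11 probes, one shared noise SD.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
virtual = [simulate_gene(2, 3, 11, 0.2, 1.0, 0.15, rng) for _ in range(100)]
```

Using the same `sd_noise` for every gene is what encodes the common-noise scenario. In the paper the SDs are taken from the root mean square of the SDs observed in the real data rather than chosen by hand.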

**R scripts.** Scripts used to perform two-way ANOVA and the simulations.

Results

Variation in biological replications obeys normal distribution

The inconvenience of using parametric methods is that their premise assumes a certain distribution of the population, i.e., in cases of t-tests and ANOVAs, data variation should be normally distributed. However, it is possible to confirm the actual distribution of data when considering the potential suitability of methodologies. A gene-wise distribution of variation can be verified by comparing the quantiles of real data with their corresponding theoretical values on a quantile-quantile (QQ) plot (Figure

**The general trend of data variations found in an experimental group of mice fed a standard diet.** The data

The general trend of these distributions will be revealed by integrating the gene-wise QQ plots. The integration was performed using expression data further normalized among individual genes, and then determining the distributions of the renormalized expression data for each rank among individual genes (see Methods). The data distribution for each of the ranks was presented using a box and whisker plot and compared with the theoretical value of normal distribution (Figure

The compensating method and the number of declared positive genes

To determine the effects of the FWER and FDR compensating methodologies, the test results were compensated accordingly, and the numbers of significant genes were compared (Table

Numbers of positive genes found under the indicated conditions

Agilent

Affymetrix

PM data

-Resv.

+Resv.

parametric | 2,104 | 1,969 | 3,338 | 93 | 10,061 | 1,035

Bonferroni | 16 | 5 | 11 | 0 | 4,869 | 179

Holm | 16 | 5 | 11 | 0 | 4,897 | 179

FDR | 230 | 136 | 334 | 0 | 8,680 | 370
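To make the compared compensations concrete, the following Python sketch (standard library only) computes adjusted p-values for the Bonferroni and Holm procedures and for an FDR procedure, then counts the genes declared positive. The text does not say which FDR procedure was used, so the Benjamini-Hochberg step-up method here is an assumption. The function names and the example p-values are hypothetical.

```python
def bonferroni(p):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p)
    return [min(1.0, m * x) for x in p]


def holm(p):
    """Holm step-down adjusted p-values."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])  # ascending p-values
    adjusted = [0.0] * m
    running = 0.0
    for k, i in enumerate(order):
        # The k-th smallest p-value (0-based) is multiplied by m - k;
        # the running maximum keeps the adjusted values monotone.
        running = max(running, min(1.0, (m - k) * p[i]))
        adjusted[i] = running
    return adjusted


def benjamini_hochberg(p):
    """Benjamini-Hochberg step-up adjusted p-values (assumed FDR method)."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of p[i] in ascending order
        running = min(running, p[i] * m / rank)
        adjusted[i] = running
    return adjusted


def count_positives(p, threshold=0.05):
    """Number of tests whose (adjusted) p-value falls below the threshold."""
    return sum(1 for x in p if x < threshold)


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical gene-wise p-values
counts = {
    "parametric": count_positives(raw),
    "Bonferroni": count_positives(bonferroni(raw)),
    "Holm": count_positives(holm(raw)),
    "FDR": count_positives(benjamini_hochberg(raw)),
}
```

Even in this toy case the Bonferroni and Holm counts fall below the uncompensated count, the same direction as in the table. With tens of thousands of genes the multipliers become correspondingly larger.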

Each gene exhibits a unique tendency in stability of expression levels

To select the proper methodology of testing, the noise level of the microarray technique must be known. If data variations are primarily attributed to technical noise, a constant level of noise can be expected among the genes, although the variations observed for each gene will be either over- or underestimated simply by chance. Consequently, a test can be recognized as one of a series of repetitions performed under the same conditions, coinciding with Neyman's perspective, and Pr(H_{0}) must be high. This could be a valid reason to group the whole set of genes in a sample as a family. Conversely, if the microarray assay is sufficiently accurate and shows individual differences between samples, then each gene will exhibit unique tendencies with respect to the stability of expression levels. If this scenario is true, a correlation in the gene-wise variation of different groups will be apparent. In this case, p-values will show some evidence of variation, and grouping of the family would be unnecessary, negating the FWER scenario.

Such a correlation can be evaluated using the standard deviation (SD) within experimental groups; because the data variation is normally distributed (Figure

**Characteristics of within-group SDs found in each gene of mice fed with different diets.** A: Between group comparisons in Agilent chip data for standard (Cy-5) and high-calorie (Cy-3) diets

Distribution of p-values is complex

Distribution of estimated p-values will give important information for selecting suitable methodologies for the test, since the origin of data variation can also be estimated from the distribution. If variations in the data can primarily be attributed to technical noise, which is a suitable case for high Pr(H_{0}) scenario, then the distribution of p-values can be simulated by using random numbers (Figure

**Frequency distribution histograms for P-values.** The distributions for certain combinations of experimental groups in
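The reference case, in which every null hypothesis is true and all variation comes from a common noise level, can be illustrated with a short Python sketch (standard library only). It is a deliberate simplification and not the author's simulation: it uses a two-sample z-test with a known noise SD instead of the t-test or ANOVA, and the function names and settings are hypothetical.

```python
import random
from statistics import NormalDist


def null_p_values(n_genes, n_reps, sd, seed=0):
    """P-values of two-sample z-tests when both groups share one distribution."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sd * (2 / n_reps) ** 0.5  # standard error of the difference of means
    p_values = []
    for _ in range(n_genes):
        a = sum(rng.gauss(0.0, sd) for _ in range(n_reps)) / n_reps
        b = sum(rng.gauss(0.0, sd) for _ in range(n_reps)) / n_reps
        z = (a - b) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))  # two-sided p-value
    return p_values


def histogram(p_values, n_bins=10):
    """Counts of p-values in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


counts = histogram(null_p_values(n_genes=2000, n_reps=5, sd=1.0, seed=1))
```

Under these assumptions the bin counts should be roughly equal, that is, the p-values are approximately uniform. An excess of small p-values, or any more complex shape in real data, therefore departs from the pure-noise scenario.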

Discussion

Variations in the expression levels of each gene within a group were normally distributed (Figure

The gene-wise tendency observed for within-group SDs (Figure

As described above, the main purpose of testing the significance of a gene is to reduce uncertain signals in higher-level analyses. Even if the technical noise is low, individual organisms have biological differences, and some genes may frequently and drastically change their expression levels according to biological requirements. For such genes, the tested data may lack a sufficient number of biological repeats to reveal between-group differences. Such volatility or stability of a gene can be estimated from within-group differences found in the forms of SDs (Figure

Therefore, the suitability of the definition of a family by the gene contents of the microarray data should be reconsidered. Actually, although it is a very crucial decision, there are no fixed rules for how we determine a family

The appropriateness for the concerns of increasing FDR should also be reconsidered. Originally, the concern over FDR was based on the high probability of a true null hypothesis

The idea that compensation is unnecessary would also hold for data obtained by sequencing-based methodologies, such as RNA-seq. Since Pr(H_{0}) would not be uniformly high but a function of the number of reads, the FDR

We should not compensate for multiplicity of tests unless there is a good reason for doing so. It is now obvious that the high Pr(H_{0}) scenario is against the evidence presented here. This means that the currently proposed problems for multiplicity in microarray data, FWER

A far more important problem concerns the design and management of experiments. As discussed, the principal source of noise is individual differences among samples, not the measuring technique. Since experiments are performed with a limited number of replicates, any small differences in experimental conditions among groups can introduce significant biases that may manifest as a global level of false positives. Unfortunately, such experiment-based false positives cannot, in principle, be controlled by any statistical method, since what was observed actually occurred in that experiment. To control for such biases, the handling of samples (e.g., the placement of cages or pots) should be randomized across the experimental groups, so that no group is treated in any specific order.

Conclusions

Microarray analysis is accurate enough to observe individual differences among samples, and performing parametric tests on the results is recommended to confirm the significance of transcriptomic differences among groups. It should be noted that, in most cases, FWER or FDR should not be considered with respect to the tests; these procedures are inappropriate for global transcriptome analyses and will increase false negative errors, eliminating information that would otherwise be obtained. Rather, strict control of false positive errors should be considered in higher levels of analysis, not in the gene-wise tests. A more important source of problems lies in the design and management of the experiment, since any biological differences in conditions among groups will produce biases in the data.

Competing interests

The author declares that he has no competing interests.

Acknowledgements

I would like to thank Dr. N. Mitsuda in AIST for bringing up discussions about this issue and Dr. S. Youssefian for his precious comments on the manuscript.

This article has been published as part of