Bioinformatics Program, Memphis, TN 38152, USA

Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN 38105, USA

Department of Mathematical Sciences, Memphis, TN 38152, USA

Department of Biology, University of Memphis, Memphis, TN 38152, USA

Abstract

Background

Gene expression data are noisy due to technical and biological variability. Consequently, analysis of gene expression data is complex. Different statistical methods produce distinct sets of genes. In addition, selection of the expression p-value (EPv) threshold is somewhat arbitrary. In this study, we aimed to develop novel literature-based approaches to integrate functional information in the analysis of gene expression data.

Methods

Functional relationships between genes were derived by Latent Semantic Indexing (LSI) of Medline abstracts and used to calculate the functional cohesion of gene sets. In this study, literature cohesion was applied in two ways. First, the Literature-Based Functional Significance (LBFS) method was developed to calculate a p-value for the cohesion of differentially expressed genes (DEGs) in order to objectively evaluate the overall biological significance of gene expression experiments. Second, the Literature Aided Statistical Significance Threshold (LASST) method was developed to determine the appropriate expression p-value threshold for a given experiment.

Results

We tested our methods on three different publicly available datasets. LBFS analysis demonstrated that only two experiments were significantly cohesive. For each experiment, we also compared the LBFS values of DEGs generated by four different statistical methods. We found that some statistical tests produced more functionally cohesive gene sets than others. However, no statistical test was consistently better for all experiments. This reemphasizes that a statistical test must be carefully selected for each expression study. Moreover, LASST analysis demonstrated that the expression p-value thresholds for some experiments were considerably lower (p < 0.02 and 0.01), suggesting that the arbitrary p-values and false discovery rate thresholds that are commonly used in expression studies may not be biologically sound.

Conclusions

We have developed robust and objective literature-based methods to evaluate the biological support for gene expression experiments and to determine the appropriate statistical significance threshold. These methods will help investigators extract biologically meaningful insights from high-throughput gene expression experiments more efficiently.

Background

Gene expression data are complex, noisy, and subject to inter- and intra-laboratory variability

Even with reliable gene expression data, statistical analysis of microarray experiments remains challenging to some degree. Jeffery and coworkers found a large discrepancy between gene lists generated by 10 different feature selection methods, including significance analysis of microarrays (SAM), analysis of variance (ANOVA), Empirical Bayes, and t-statistics

FDR is determined by several factors such as proportion of DEGs, gene expression variability, and sample size

Recently, Chuchana et al. integrated gene pathway information into microarray data to determine the threshold for identification of DEGs

A number of groups have developed computational methods to measure functional similarities among genes using annotation in Gene Ontology and other curated databases

Previously, we developed a method which utilizes Latent Semantic Indexing (LSI), a variant of the vector space model of information retrieval, to determine the functional relationships between genes from Medline abstracts

Methods

Gene-document collection and similarity matrix generation

All titles and abstracts of the Medline citations cross-referenced in the mouse, rat and human Entrez Gene entries as of 2010 were concatenated to construct gene-documents and gene-gene similarity scores were calculated by LSI, as previously described

Calculation of literature-based functional significance (LBFS)

This study is an extension of our previous work on gene-set cohesion analysis

**Overview of the LBFS algorithm**. A statistical test was applied to obtain differentially expressed genes (DEGs) from the original labeled (OL) and permuted labeled (PL) samples. Subsets of 50 genes were randomly selected 1000 times from each pool of DEGs. Literature p-values (LPvs) were then calculated for each 50-gene subset. A Fisher's exact test was used to determine whether the proportion of subsets with LPv < 0.05 (called the LCI) in the OL group was significantly different from that obtained from the PL group.
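The sub-sampling and testing steps summarized in this caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the literature p-values are assumed to be supplied by an external cohesion tool, and the Fisher's exact test is assumed to be two-sided because the text asks whether the two proportions are "significantly different".

```python
import random
from math import comb


def draw_subsets(deg_pool, subset_size=50, n_subsets=1000, seed=0):
    """Randomly draw gene subsets from a DEG pool (no repeats within a subset)."""
    rng = random.Random(seed)
    return [rng.sample(deg_pool, subset_size) for _ in range(n_subsets)]


def literature_cohesion_index(lpvs, lpv_cutoff=0.05):
    """LCI: fraction of subsets whose literature p-value is below the cutoff."""
    return sum(p < lpv_cutoff for p in lpvs) / len(lpvs)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def lbfs(lpvs_original, lpvs_permuted, lpv_cutoff=0.05):
    """Compare cohesive-subset counts from original vs. permuted labels."""
    a = sum(p < lpv_cutoff for p in lpvs_original)
    c = sum(p < lpv_cutoff for p in lpvs_permuted)
    b = len(lpvs_original) - a
    d = len(lpvs_permuted) - c
    return fisher_exact_two_sided(a, b, c, d)
```

In use, each subset returned by `draw_subsets` would be scored for literature cohesion to produce one LPv, and the two resulting LPv lists (original and permuted labels) would be passed to `lbfs`.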

Literature aided statistical significance threshold (LASST)

Now suppose a differential expression p-value (EPv) is computed for each probe (probeset) by a proper statistical test. A statistical significance threshold (an EPv cutoff) can be determined by considering the relationship between the EPv and the LCI for a given DEG set. First, a grid of EPv cutoffs is specified, such as 0.001, 0.003, 0.005, 0.01, ⋯, 1, to generate a DEG set at each cutoff value. Next, the LCI is calculated for each DEG set using the sub-sampling procedure described above. Apart from some random fluctuations, the LCI value is typically a decreasing function of the EPv threshold and assumes an L shape (Figure

**Relationship between EPv and LCI**. The fraction of gene sets with LPv < 0.05 (y-axis) was plotted at various expression p-value (EPv) thresholds (x-axis) for 3 different datasets. Inset shows magnified view for EPv < 0.10.

(1) Specify an increasing sequence of EPv statistical significance thresholds α_{1}, ⋯, α_{m} and generate DEG sets at these specified significance levels.

(2) For each DEG set generated in (1), estimate the LCI using the sub-sampling procedure described above, to obtain pairs (α_{i}, L_{i}), i = 1, 2, ⋯, m.

(3) Choose an integer m_{0} (3 by default) and perform two-piece linear fits to the curve as follows: For k = m_{0}, m_{0}+1, ⋯, m-m_{0}, fit a straight line by least squares to the points (α_{j}, L_{j}), j = 1, 2, ⋯, k (the left piece) to obtain its intercept and slope, and fit a second straight line to the points (α_{j}, L_{j}), j = k+1, k+2, ⋯, m (the right piece) to obtain its intercept and slope

(4) Let k* be the first local maximum of V_{k} (k = m_{0}, m_{0}+1, ⋯, m-m_{0}), that is,

(5) Take the k*-th entry of the α sequence specified in (1) as the EPv significance cutoff.
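The five steps above can be sketched in code under one important assumption. The text as given does not define the statistic V_{k}, so the `slope_contrast` function below is a hypothetical stand-in (the absolute difference between the left and right slopes) and should not be read as the authors' criterion. It is passed in as a parameter so that the actual V_{k} can be substituted. All function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def two_piece_fits(alphas, lcis, m0=3):
    """Step (3): for each k = m0..m-m0, fit the left and right pieces."""
    m = len(alphas)
    if m < 2 * m0:
        raise ValueError("need at least 2 * m0 points on the EPv grid")
    return {k: (fit_line(alphas[:k], lcis[:k]),   # points 1..k
                fit_line(alphas[k:], lcis[k:]))   # points k+1..m
            for k in range(m0, m - m0 + 1)}


def slope_contrast(left, right):
    """HYPOTHETICAL stand-in for V_k (not defined in the text above)."""
    return abs(left[1] - right[1])


def first_local_max(values):
    """Index of the first entry not exceeded by its neighbors."""
    last = len(values) - 1
    for i, v in enumerate(values):
        left_ok = i == 0 or v >= values[i - 1]
        right_ok = i == last or v > values[i + 1]
        if left_ok and right_ok:
            return i
    return last  # unreachable for non-empty input; kept as a safeguard


def lasst_cutoff(alphas, lcis, m0=3, score=slope_contrast):
    """Steps (3)-(5): return the EPv cutoff at the first local maximum."""
    fits = two_piece_fits(alphas, lcis, m0)
    ks = sorted(fits)
    scores = [score(*fits[k]) for k in ks]
    k_star = ks[first_local_max(scores)]
    return alphas[k_star - 1]  # k*-th entry of the (1-indexed) alpha sequence
```

The inputs are the pairs (α_{i}, L_{i}) from step (2): `alphas` is the increasing EPv grid and `lcis` holds the LCI estimated at each grid point.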

Microarray data analysis

To test the performance of our approach, we randomly chose three publicly available microarray datasets from Gene Expression Omnibus (GEO): 1) interleukin-2 responsive (IL2) genes

Results

Comparison of various statistical tests using LBFS

The goal of our study was to develop a literature based method to objectively evaluate the biological significance of differentially expressed genes produced by various statistical methods applied to gene expression experiments. Previously, we developed a method and web-tool called Gene-set Cohesion Analysis Tool (GCAT) which determines the functional cohesion of gene sets using latent semantic analysis of Medline abstracts

Literature based functional significance (LBFS) of gene sets generated by four statistical tests for three different microarray experiments.

| Gene list | LCI: PGC-1beta | LCI: IL2 | LCI: ET1 | LBFS: PGC-1beta | LBFS: IL2 | LBFS: ET1 |
|---|---|---|---|---|---|---|
| Welch t-Test | 0.34 | 0.34 | 0.17 | 7.08E-06 | 0.0004 | 0.45 |
| Mann-Whitney | 0.2 | 0.2 | 0.13 | 0.118 | 0.0075 | 1 |
| Student t-Test | 0.38 | 0.38 | 0.1 | 1.24E-07 | 0.071 | 1 |
| Empirical Bayes | 0.4 | 0.19 | 0.05 | 1.36E-08 | 0.11 | 1 |

For comparison, the Literature Cohesion Index (LCI), which is used to calculate LBFS, is displayed for each experiment.

**Number of DE genes (at EPv < 0.05) and the percentage of those genes having abstracts, as generated by different tests for the PGC-1beta, IL2 and ET1 datasets**.


Determination of EPv threshold using LASST

In the above analysis, DEGs were selected using an arbitrary statistical threshold of p < 0.05, as is the case for many published expression studies. However, there is no biological reason for selecting this threshold. Once the appropriate statistical test was chosen by application of LBFS above, we tested whether literature cohesion could be applied to determine the EPv cutoff. We developed another method called Literature Aided Statistical Significance Threshold (LASST), which determines the EPv cutoff by a two-piece linear fit of the LCI curve as a function of EPv, as described in Methods. LASST was applied to p-values produced by Empirical Bayes for the PGC-1beta experiment and by the Welch t-test for the IL2 and ET1 experiments. DEGs were produced at each point on a grid of unequally spaced statistical significance levels (α = 0.001, 0.003, 0.005, ⋯). In computing the LCI, the LPv level was set to 0.05, and the size of the gene subsets from the DEG pool was set to 50 in the sub-sampling procedure, as described in Methods. The LCI of a DEG set was plotted against various α levels of the EPv (Figure

While computing LCIs in the above analysis, the LPv threshold was set at 0.05. We wondered whether different LPv thresholds would affect the LASST results. Therefore, we calculated the LCI at different LPv thresholds: 0.01, 0.03, 0.05, 0.06, 0.08 and 0.1. We found that the shapes of the LCI curves were similar with respect to EPv values (Figure

**Relationship between EPv and LCI at various thresholds**. The LCI at various LPv thresholds ranging from 0.01 to 0.1 (y-axis) was plotted against various EPv thresholds (x-axis) for the PGC-1beta dataset. Inset shows magnified view for EPv < 0.10. The shapes of the curves are similar at various LPv thresholds.

We next compared the LASST results with several popular multiple hypothesis testing correction procedures, along with the unadjusted p-value threshold of 0.05 in a Student's t-test (Table

Number of significant genes identified by Student's t-test after correction for multiple hypothesis testing

| Dataset | # of tests | # of genes with p < 0.05 | Storey pFDR q < 0.1 | BH FDR < 0.1 | Bonferroni FWER < 0.1 | Westfall Young Permutation |
|---|---|---|---|---|---|---|
| IL2 | 20558 | 5001 | 5955 | 3827 | 32 | 95 |
| PGC-1beta | 17633 | 2618 | 1 | 1 | 1 | 1 |
| ET1 | 20477 | 1559 | 0 | 0 | 0 | 0 |
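To make the comparison among correction procedures concrete, the following minimal sketch counts significant genes under three of the rules named in the table: the unadjusted threshold, the Bonferroni correction, and the Benjamini-Hochberg (BH) step-up procedure. The function names are illustrative, and the default level of 0.1 mirrors the table headers. The Storey pFDR and Westfall-Young procedures are not covered here.

```python
def unadjusted_count(pvalues, alpha=0.05):
    """Number of tests with a raw p-value below alpha."""
    return sum(p < alpha for p in pvalues)


def bonferroni_count(pvalues, alpha=0.1):
    """Number of tests passing the Bonferroni FWER bound alpha / m."""
    m = len(pvalues)
    return sum(p <= alpha / m for p in pvalues)


def benjamini_hochberg_count(pvalues, q=0.1):
    """Number of rejections under the BH step-up procedure at FDR level q.

    Sort the p-values, find the largest rank i with p_(i) <= q * i / m,
    and reject the i smallest p-values.
    """
    m = len(pvalues)
    n_reject = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= q * rank / m:
            n_reject = rank
    return n_reject
```

Because the Bonferroni bound shrinks with the number of tests while the BH bound grows with rank, the BH count is never smaller than the Bonferroni count at the same level, which matches the ordering seen for the IL2 dataset.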

Discussion

Although microarray technology has become common and affordable, analysis and interpretation of microarray data remains challenging. Experimental design and quality of the data can severely affect the results and conclusions drawn from a microarray experiment. Using our approach, we found that some datasets (e.g., PGC-1beta) produced more functionally cohesive gene sets than others (e.g., ET1). There can be many biological or technological reasons for the lack of cohesion in any microarray dataset. For instance, it is possible that the experimental perturbation (or signaling pathway) simply did not alter mRNA expression levels in that system as hypothesized. It is also possible that the data are noisy due to technical or biological variations, which result in false differential expression. Although our method will not identify the causes of this variation, it can help in assessment of the overall quality of the experiment and provide feedback to the investigators in order to adjust the experimental procedures. For example, after observing a low LBFS value, the investigator may choose to remove outlier samples or add more replicates into the study design.

It is important to note that a low cohesion value could be due to a lack of information in the biomedical literature. In other words, it is possible that the microarray experiment has uncovered new gene associations which have not been previously reported in the literature. This issue would affect any method that relies on human curated databases or natural language processing of biomedical literature. However, our LSI method presents a unique advantage over other approaches because it extracts both explicit and implicit gene associations, based on weighted term usage patterns in the literature. Consequently, gene associations are ranked based on their conceptual relationships and not specific interactions documented in the literature. Thus, we posit that LSI is particularly suited for analysis of discovery oriented genomic studies which are geared toward identifying new gene associations. Further work is necessary to be able to determine exactly how (whether explicitly or implicitly) a subset of functionally cohesive genes are related to one another in the LSI model.

A major challenge in microarray analysis involves selection of the appropriate statistical tests, which have different assumptions about the data distribution and result in different DEG sets. For instance, parametric methods are based on the assumption that the observations adhere to a normal distribution. The assumption of normality is rarely satisfied in microarray data, even after normalization. Nonparametric methods are distribution free and do not make any assumptions about the population from which the samples are drawn. However, nonparametric tests lack statistical power with small samples, which is often the case in microarray studies. In this study, we found that although the Mann-Whitney nonparametric test identified the largest number of DEGs for the PGC-1beta experiment, the DEGs were not functionally significant (Table

Several groups have developed methods to assess functional cohesion or refine feature selection by incorporating biological information from either the primary literature or curated databases

Assuming that a microarray experiment is of high quality and that an appropriate statistical test has been selected, selection of the expression p-value cutoff still remains arbitrary for nearly all published studies. In our work, we found an inverse relationship between the literature cohesion index and the EPv threshold (Figure

Finally, another major challenge for microarray analysis is the propensity for a high false discovery rate (FDR) caused by multiple hypothesis testing. Corrections for multiple hypothesis testing, including those that control the family-wise error rate (FWER), are often too stringent, which may lead to a large number of false negatives. As with the EPv cutoff concerns above, setting the FDR threshold at levels 0.01, 0.05, or 0.1 does not have any biological meaning

Conclusions

In this study, we developed a robust methodology to evaluate the overall quality of microarray experiments, to compare the appropriateness of different statistical methods, and to determine the expression p-value thresholds using functional information in the biomedical literature. Using our approach, we showed that the quality, as measured by the biological cohesion of DEGs, can vary greatly between microarray experiments. In addition, we demonstrated that the choice of statistical test should be carefully considered because different tests produce different DEGs with varying degrees of biological significance. Importantly, we also demonstrated that procedures that control false positive rates are often too conservative and favor larger DEG sets without considering biological significance. The methods developed herein can better facilitate analysis and interpretation of microarray experiments. Moreover, these methods provide a biological metric to filter the vast amount of publicly available microarray experiments for subsequent meta-analysis and systems biology research.

Abbreviations

ANOVA: analysis of variance; DEGs: differentially expressed genes; EPv: expression p-value; ET1: Endothelin-1 responsive; FDR: False Discovery Rate; GCA: gene-set cohesion analysis; GCAT: Gene-set Cohesion Analysis Tool; GEO: Gene Expression Omnibus; IL2: interleukin-2 responsive; LASST: Literature aided statistical significance thresholds; LBFS: literature-based functional significance; LCI: literature cohesion index; LPv: literature cohesion p-value; LSI: Latent Semantic Indexing; MAQC: Microarray Quality Control; PGC-1beta: PGC-1beta related; SAM: significance analysis of microarrays; SVD: singular value decomposition.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

L. Xu developed the algorithm, carried out the data analyses, performed all of the evaluation and wrote the manuscript. C. Cheng developed the literature aided statistical significance thresholds method and wrote part of the manuscript. E.O. George provided statistical supervision of the study. R. Homayouni conceived, co-developed the methods, supervised the study and wrote the manuscript.

Acknowledgements

We thank Dr. Kevin Heinrich (Computable Genomix, Memphis, TN) for providing the gene-gene association data. This work was supported by The Assisi Foundation of Memphis and The University of Memphis Bioinformatics Program.

This article has been published as part of