Genome Science and Technology Program, University of Tennessee, Knoxville, Tennessee, USA

Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Tennessee, USA

Oak Ridge National Laboratory, Systems Genetics Group, Biosciences Division, Oak Ridge, Tennessee, USA

Department of Animal Science, University of Tennessee, Knoxville, Tennessee, USA

Abstract

Background

Network and clustering analyses of microarray co-expression correlation data often require application of a threshold to discard small correlations, thus reducing computational demands and decreasing the number of uninformative correlations. This study investigated threshold selection in the context of combinatorial network analysis of transcriptome data.

Findings

Six conceptually diverse methods - based on number of maximal cliques, correlation of control spots with expressed genes, top 1% of correlations, spectral graph clustering, Bonferroni correction of p-values, and statistical power - were used to estimate a correlation threshold for three time-series microarray datasets. The validity of thresholds was tested by comparison to thresholds derived from Gene Ontology information. Stability and reliability of the best methods were evaluated with block bootstrapping.

Two threshold methods, number of maximal cliques and spectral graph clustering, used information in the structure of the correlation matrix and performed well in terms of stability. Comparison to Gene Ontology showed that thresholds based on the number of maximal cliques in the co-expression matrix were the most biologically valid. Approaches to improve both methods are suggested.

Conclusion

Threshold selection approaches based on network structure of gene relationships gave thresholds with greater relevance to curated biological relationships than approaches based on statistical pair-wise relationships.

Introduction

To extract gene networks from microarray data, correlations are often used as a measure of gene co-expression. A typical microarray with 20,000 gene probes will produce approximately 200 million pairwise correlations. Correlations closer to zero, below some threshold value, are less likely to be meaningful. Both hard and soft threshold approaches have been applied to biological data. Hard thresholds discard gene pairs with correlation below the threshold, while soft thresholds use the correlation value to weight gene network relationships. Zhang and Horvath

We focus on relevance networks, created by applying a hard threshold to the gene expression correlation matrix
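As a minimal sketch of the hard-threshold idea, assuming a small symmetric correlation matrix held as nested lists, a gene pair is kept as a network edge only when its absolute correlation reaches the threshold. The function names and the toy values are illustrative and do not come from the study; the use of the absolute value and of `>=` rather than `>` are also assumptions.

```python
from itertools import combinations


def n_pairs(n_genes):
    """Number of distinct gene pairs, n(n-1)/2."""
    return n_genes * (n_genes - 1) // 2


def hard_threshold_edges(corr, threshold):
    """Return index pairs (i, j), i < j, whose |correlation| >= threshold.

    Assumption: the sign of the correlation is ignored.
    """
    n = len(corr)
    return [(i, j) for i, j in combinations(range(n), 2)
            if abs(corr[i][j]) >= threshold]


# Toy 4-gene correlation matrix (made-up values for illustration only).
toy_corr = [
    [1.00, 0.95, -0.90, 0.10],
    [0.95, 1.00, 0.20, 0.85],
    [-0.90, 0.20, 1.00, 0.05],
    [0.10, 0.85, 0.05, 1.00],
]
edges = hard_threshold_edges(toy_corr, 0.85)
```

For 20,000 probes, `n_pairs(20000)` gives 199,990,000 pairs, which is the "200 million correlations" order of magnitude quoted in the Introduction.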

Current approaches to threshold selection are typically statistically based, and do not fully reflect the connectivity of the data

Some studies used an arbitrary threshold correlation such as 0.80

However, using connectivity of the data to derive thresholds has been suggested. Langston et al.

Here, two threshold selection methods based on correlation graph structure are compared with common statistically based methods. The graph-based methods used spectral properties

Methods

Datasets

Three yeast

Software

Software written by Langston and colleagues (University of Tennessee) was used, including Datagen version 1.4a for computing correlations, maximal clique enumeration code version 2.0.1

Threshold Estimation

Six conceptually different approaches were evaluated:

1) The number of maximal cliques was calculated at each potential correlation threshold, starting at r = 0.99. The threshold was lowered in steps of 0.01, and the number of maximal cliques increased as more connections among genes were admitted. When the number of maximal cliques increased to two times (Maximal Clique-2) or three times (Maximal Clique-3) its value at the previous step, that correlation was chosen as the threshold.

2) For each potential threshold correlation value, spectral graph theory

3) Correlations of control spots with all other genes on the array were calculated, creating a null distribution. The 99th percentile correlation value (absolute value) of this distribution gave the threshold.

4) The top 1% of all correlations (absolute value) among genes was used to estimate a threshold

5) A p-value for every correlation was computed, testing whether the correlation was zero (Fisher's z-transformation). The threshold estimate was the correlation value corresponding to the critical Bonferroni p-value, 0.05/number of correlations. This threshold removes any correlations that are not statistically different from zero.

6) Statistical power calculations were used to find the correlation value that gave an 80% chance of rejecting the null hypothesis, H0: correlation = 0. The Type I error rate in these calculations was Bonferroni-adjusted to correct for multiple testing.
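Methods 5 and 6 above can be sketched with the standard Fisher z approximation, in which atanh(r) multiplied by sqrt(n - 3) is treated as standard normal under the null hypothesis for n arrays. This is a hedged illustration rather than the authors' code: the function names are hypothetical, a two-sided test is assumed, and the power calculation uses the usual normal approximation that ignores the negligible opposite tail.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def bonferroni_r_threshold(n_arrays, n_correlations, alpha=0.05):
    """Smallest |r| significant at the Bonferroni-adjusted level.

    Assumption: two-sided test of H0: correlation = 0 via Fisher's z.
    """
    alpha_adj = alpha / n_correlations
    # Use the lower tail to keep precision when alpha_adj is tiny.
    z_crit = -_STD_NORMAL.inv_cdf(alpha_adj / 2.0)
    return math.tanh(z_crit / math.sqrt(n_arrays - 3))


def power_r_threshold(n_arrays, n_correlations, alpha=0.05, power=0.80):
    """True |r| detectable with the requested power at the adjusted level.

    Assumption: normal approximation, atanh(r) * sqrt(n - 3) = z_crit + z_power.
    """
    alpha_adj = alpha / n_correlations
    z_crit = -_STD_NORMAL.inv_cdf(alpha_adj / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.tanh((z_crit + z_power) / math.sqrt(n_arrays - 3))
```

Under these assumptions the power-based threshold is always higher than the Bonferroni threshold for the same dataset, and both fall as the number of arrays grows.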

Further details on computing these threshold estimation methods are in the Additional file
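The maximal clique method (method 1) can be illustrated on a small scale. The sketch below is not the clique enumeration software used in the study: it is a plain Bron-Kerbosch enumeration with pivoting, suitable only for toy graphs. The function names, the `min_size` default that ignores isolated genes, and the choice to skip steps where the previous count is zero are all assumptions.

```python
from itertools import combinations


def build_adjacency(corr, threshold):
    """Adjacency sets for the graph with an edge where |r| >= threshold."""
    n = len(corr)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if abs(corr[i][j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def count_maximal_cliques(adj, min_size=2):
    """Count maximal cliques with at least min_size vertices."""
    count = 0

    def expand(r_size, p, x):
        nonlocal count
        if not p and not x:
            # Current clique cannot be extended, so it is maximal.
            if r_size >= min_size:
                count += 1
            return
        # Pivot on the vertex with the most neighbours among candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r_size + 1, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(0, set(adj), set())
    return count


def clique_growth_threshold(corr, factor=2, start_hundredths=99):
    """Scan thresholds 0.99, 0.98, ... and return the first threshold at
    which the clique count is at least `factor` times the previous count.

    Returns None if no such step is found.
    """
    previous = None
    for h in range(start_hundredths, -1, -1):
        t = h / 100
        current = count_maximal_cliques(build_adjacency(corr, t))
        if previous and current >= factor * previous:
            return t
        previous = current
    return None


# Toy 4-gene matrix (made-up values): one edge appears at 0.98, a second at 0.97.
toy_corr = [
    [1.0, 0.985, 0.1, 0.1],
    [0.985, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.975],
    [0.1, 0.1, 0.975, 1.0],
]
```

Here `factor=2` corresponds to Maximal Clique-2 and `factor=3` to Maximal Clique-3. Exhaustive clique enumeration grows quickly with graph size, so real genome-scale matrices need specialised software such as that cited in the Software section.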

**Methodology for Threshold Estimation**. Details on the six threshold estimation methods are presented in a computationally oriented manner.


Performance Evaluation

Performance of the threshold estimation methods was evaluated by comparison to a biologically based Gene Ontology (GO) threshold. The GO data used were from gene_ontology_edit.obo.2008-05-01.gz. The biological meaning of each correlation bin (in 0.01 increments) was the average of the functional similarity scores for all gene pairs within that bin. Functional similarity for a pair of genes was defined as log(n/N)/log(2/N), where n is the number of genes in the lowest GO category that contained both genes, and N is the total number of genes annotated for the organism. The formula normalizes functional similarity to a 0 to 1 range; a value of 1 means the GO category contained only the two genes being considered (perfect similarity). The GO threshold estimate was defined as the correlation at which the change in average functional similarity exceeded the median change plus half its standard deviation, thus identifying where biological information begins to accumulate.
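The functional similarity score and the GO threshold rule can be written out as a short sketch. The function names and toy numbers are hypothetical. Two readings of the text are assumed here: that "its standard deviation" refers to the standard deviation of the bin-to-bin changes, and that bins are scanned from low to high correlation with the first bin exceeding the cutoff taken as the threshold.

```python
import math
import statistics


def functional_similarity(n_shared, n_total):
    """log(n/N) / log(2/N): 1 when only the two genes share the category,
    0 when the shared category contains every annotated gene."""
    if n_total <= 2 or not 2 <= n_shared <= n_total:
        raise ValueError("need 2 <= n_shared <= n_total and n_total > 2")
    return math.log(n_shared / n_total) / math.log(2 / n_total)


def go_threshold(bin_correlations, bin_avg_similarity):
    """First correlation bin whose increase in average similarity exceeds
    median(change) + 0.5 * stdev(change).

    Assumption: bins are ordered by increasing correlation.
    """
    changes = [b - a for a, b in
               zip(bin_avg_similarity, bin_avg_similarity[1:])]
    cutoff = statistics.median(changes) + 0.5 * statistics.stdev(changes)
    for corr, change in zip(bin_correlations[1:], changes):
        if change > cutoff:
            return corr
    return None


# Toy bins (made-up averages): similarity starts to rise at 0.94.
toy_bins = [0.90, 0.91, 0.92, 0.93, 0.94, 0.95]
toy_similarity = [0.10, 0.10, 0.11, 0.10, 0.20, 0.35]
```

The score depends only on the ratio of the shared category size to the annotated gene count, so a pair sharing a small, specific category scores near 1 and a pair sharing only a broad category scores near 0.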

To study stability of the methods, 10,000 block bootstrap samples were created by sampling arrays with replacement from each block. Blocks were defined to be 2 or 3 adjacent time periods, such that each block contained 3 or 4 arrays. Block bootstrapping was necessary to preserve as much as possible the time-course dependency structure of the experiments
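A minimal sketch of the block bootstrap described above, assuming arrays are identified by labels and already grouped into blocks of adjacent time periods. The labels, block layout, and function names are hypothetical. Each bootstrap sample draws, with replacement, as many arrays from a block as the block contains, and keeps the blocks in time order.

```python
import random


def block_bootstrap_sample(blocks, rng):
    """Resample arrays with replacement within each block, preserving
    the order of the blocks."""
    sample = []
    for block in blocks:
        sample.extend(rng.choices(block, k=len(block)))
    return sample


def block_bootstrap(blocks, n_samples, seed=0):
    """Generate n_samples block bootstrap samples (seeded for repeatability)."""
    rng = random.Random(seed)
    return [block_bootstrap_sample(blocks, rng) for _ in range(n_samples)]


# Hypothetical layout: seven arrays in two blocks of adjacent time periods.
example_blocks = [
    ["t1_a", "t1_b", "t2_a"],
    ["t3_a", "t3_b", "t4_a", "t4_b"],
]
samples = block_bootstrap(example_blocks, n_samples=5)
```

In a full analysis each resampled set of arrays would be used to recompute the correlation matrix and re-estimate the threshold, and the mean and standard deviation of those estimates would summarise stability.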

Results

Functional similarity scores for the three datasets are displayed in Figure

**Change in GO functional similarity score across correlation values**. Lines represent Anoxia dataset (solid line), Reoxygenation dataset (dashed line) and Alpha dataset (dotted line).

Estimated threshold for each method by dataset, with methods sorted by the sum of absolute deviations from the GO functional similarity threshold.

| **Method** | **Anoxia** | **Reoxygenation** | **Alpha** | **Absolute deviations from GO threshold** |
| --- | --- | --- | --- | --- |
| GO Functional Similarity | 0.97 | 0.92 | 0.85 | |
| Spectral Clustering | 0.93 | **0.97**^{a} | **0.89** | 0.04+0.05+0.04 = 0.13 |
| Maximal Clique-2 | 0.90 | 0.91 | 0.74 | 0.07+0.01+0.11 = 0.19 |
| Power | 0.88 | **0.94** | **0.96** | 0.09+0.02+0.11 = 0.22 |
| Bonferroni adjustment | 0.85 | **0.93** | **0.95** | 0.12+0.01+0.10 = 0.23 |
| Control-Spot | 0.93 | 0.83 | 0.70 | 0.04+0.09+0.15 = 0.28 |
| Maximal Clique-3 | 0.87 | 0.89 | 0.60 | 0.10+0.03+0.25 = 0.38 |
| Top 1 Percent | 0.81 | 0.81 | 0.72 | 0.16+0.11+0.13 = 0.40 |

^{a}Thresholds above the GO functional similarity threshold are in bold.
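The ranking criterion in the table can be recomputed directly from the reported thresholds. The sketch below copies the published values and sums, for each method, the absolute differences from the GO functional similarity thresholds; the variable and function names are illustrative.

```python
GO_THRESHOLD = {"Anoxia": 0.97, "Reoxygenation": 0.92, "Alpha": 0.85}

ESTIMATES = {
    "Spectral Clustering": {"Anoxia": 0.93, "Reoxygenation": 0.97, "Alpha": 0.89},
    "Maximal Clique-2": {"Anoxia": 0.90, "Reoxygenation": 0.91, "Alpha": 0.74},
    "Power": {"Anoxia": 0.88, "Reoxygenation": 0.94, "Alpha": 0.96},
    "Bonferroni adjustment": {"Anoxia": 0.85, "Reoxygenation": 0.93, "Alpha": 0.95},
    "Control-Spot": {"Anoxia": 0.93, "Reoxygenation": 0.83, "Alpha": 0.70},
    "Maximal Clique-3": {"Anoxia": 0.87, "Reoxygenation": 0.89, "Alpha": 0.60},
    "Top 1 Percent": {"Anoxia": 0.81, "Reoxygenation": 0.81, "Alpha": 0.72},
}


def sum_abs_deviation(estimate, reference):
    """Sum over datasets of |estimate - reference|, rounded to 2 decimals
    to match the precision of the reported thresholds."""
    return round(sum(abs(estimate[d] - reference[d]) for d in reference), 2)


deviations = {m: sum_abs_deviation(e, GO_THRESHOLD) for m, e in ESTIMATES.items()}
ranked = sorted(deviations, key=deviations.get)
```

Sorting by this sum reproduces the row order of the table, from Spectral Clustering (0.13) to Top 1 Percent (0.40).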

Estimated thresholds obtained by each method are listed in Table

The estimated threshold derived for selected methods for each dataset is compared to bootstrap distributions in Table

Summary of bootstrap results compares the estimated threshold with the bootstrap distribution for the four selected methods.

| **Method** | **Dataset** | **Estimated Threshold** | **Bootstrap Mean** | **Difference**^{a} | **Bootstrap Standard Deviation** |
| --- | --- | --- | --- | --- | --- |
| **Maximal Clique-2** | Anoxia | 0.90 | 0.91 | -0.01 | 0.015 |
| | Reoxygenation | 0.91 | 0.93 | -0.02 | 0.009 |
| | Alpha | 0.74 | 0.78 | -0.04 | 0.057 |
| **Spectral Clustering** | Anoxia | 0.93 | 0.95 | -0.02 | 0.012 |
| | Reoxygenation | 0.97 | 0.97 | 0.00 | 0.011 |
| | Alpha | 0.89 | **0.95 | -0.06 | 0.017 |
| **Top 1 Percent** | Anoxia | 0.81 | 0.83 | -0.02 | 0.011 |
| | Reoxygenation | 0.81 | 0.84 | -0.03 | 0.016 |
| | Alpha | 0.72 | **0.79 | -0.07 | 0.027 |
| **Control Spot** | Anoxia | 0.93 | 0.95 | -0.02 | 0.015 |
| | Reoxygenation | 0.83 | **0.90 | -0.07 | 0.034 |
| | Alpha | 0.70 | **0.82 | -0.12 | 0.043 |

^{a}Estimated threshold minus bootstrap mean.

** Estimated threshold is more than 2 standard deviations from bootstrap mean.

Discussion

The two network-based methods, Maximal Clique-2 and Spectral Clustering, performed very well in terms of bootstrap stability and biological validity. Although the Maximal Clique-2 method gave thresholds close to, and always below, the biological threshold, it had slightly higher bootstrap standard deviations. The robustness of the Maximal Clique-2 algorithm could be enhanced by excluding smaller cliques in the graph, for example cliques of size 3. Spectral Clustering thresholds were on average closer to the biological thresholds, but too often exceeded them. However, if all thresholds for Spectral Clustering had been lowered by 0.05, it would clearly have been the best method. Further fine-tuning of the parameters in the algorithm (size of sliding window, different tolerance levels for cluster formation) may improve the method's validity. In a recent paper, Almendral and Díaz-Guilera

The results from this study complement the work of Zhang and Horvath

The GO similarity measure of biological validity we have used, however, is by no means perfect and is just one way of quantifying biological information. Khatri and Draghici

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BRB wrote code for the analyses, summarized results, and drafted the paper. All authors were involved in study design, and read and approved the final manuscript.

Acknowledgements

This research has been supported in part by the National Institutes of Health under grants P01DA015027-01, R01HD052472-02, R01MH074460-01, U01AA013512 and U01AA013641-04 and by the UT-ORNL Science Alliance. Dr. E.J. Chesler was supported by NIAAA Integrative Neuroscience Initiative on Alcoholism under grants U01AA13499 and U24AA13513. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Additional support was provided by the University of Tennessee Genome Science and Technology program. John Eblen, Andy Perkins, Gary Rogers and Yun Zhang helped with basic issues of algorithm synthesis. Drs. Bing Zhang and Roumyana Yordanova provided valuable comments on certain aspects of this study.