Open Access Open Badges Methodology article

Maximizing capture of gene co-expression relationships through pre-clustering of input expression samples: an Arabidopsis case study

F Alex Feltus12*, Stephen P Ficklin2, Scott M Gibson3 and Melissa C Smith3

Author Affiliations

1 Department of Genetics & Biochemistry, Clemson University, 105 Collings Street, Clemson, SC 29634, USA

2 Plant and Environmental Sciences, Clemson University, 105 Collings Street, Clemson, SC 29634, USA

3 The Holcombe Department of Electrical and Computer Engineering, Clemson, SC 29634, USA

For all author emails, please log on.

BMC Systems Biology 2013, 7:44  doi:10.1186/1752-0509-7-44

The electronic version of this article is the complete one and can be found online at:

Received:18 October 2012
Accepted:14 May 2013
Published:5 June 2013

© 2013 Feltus et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



In genomics, highly relevant gene interaction (co-expression) networks have been constructed by finding significant pair-wise correlations between genes in expression datasets. These networks are then mined to elucidate biological function at the polygenic level. In some cases networks may be constructed from input samples that measure gene expression under a variety of different conditions, such as for different genotypes, environments, disease states and tissues. When large sets of samples are obtained from public repositories it is often unmanageable to associate samples into condition-specific groups, and combining samples from various conditions has a negative effect on network size. A fixed significance threshold is often applied also limiting the size of the final network. Therefore, we propose pre-clustering of input expression samples to approximate condition-specific grouping of samples and individual network construction of each group as a means for dynamic significance thresholding. The net effect is increase sensitivity thus maximizing the total co-expression relationships in the final co-expression network compendium.


A total of 86 Arabidopsis thaliana co-expression networks were constructed after k-means partitioning of 7,105 publicly available ATH1 Affymetrix microarray samples. We term each pre-sorted network a Gene Interaction Layer (GIL). Random Matrix Theory (RMT), an un-supervised thresholding method, was used to threshold each of the 86 networks independently, effectively providing a dynamic (non-global) threshold for the network. The overall gene count across all GILs reached 19,588 genes (94.7% measured gene coverage) and 558,022 unique co-expression relationships. In comparison, network construction without pre-sorting of input samples yielded only 3,297 genes (15.9%) and 129,134 relationships. in the global network.


Here we show that pre-clustering of microarray samples helps approximate condition-specific networks and allows for dynamic thresholding using un-supervised methods. Because RMT ensures only highly significant interactions are kept, the GIL compendium consists of 558,022 unique high quality A. thaliana co-expression relationships across almost all of the measurable genes on the ATH1 array. For A. thaliana, these networks represent the largest compendium to date of significant gene co-expression relationships, and are a means to explore complex pathway, polygenic, and pleiotropic relationships for this focal model plant. The networks can be explored at Finally, this method is applicable to any large expression profile collection for any organism and is best suited where a knowledge-independent network construction method is desired.

Gene network; Arabidopsis; Systems biology; Relevance network


Cellular processes underlying expression of complex traits have a major impact on health, agriculture, and our understanding of general biology. The gene co-expression network (GCN) models coordinated gene expression across a series of input data sets such as microarrays [1,2]. In a GCN, nodes represent genes, and edges describe significant gene co-expression relationships. The GCN also exhibits properties common to most naturally occurring networks such as scale-free, small world and hierarchical topology [3,4]. Due to the availability of large quantities of publically available expression data and the relative ease of construction, GCNs have been constructed for a broad array of organisms including human [2,5,6], yeast [7-9], Arabidopsis [10-13], rice [14,15], maize [16], potato [17] and many more. These networks have elucidated gene sets involved in varied biological systems including cell wall biosynthesis [13], mouse weight [18], and complex trait expression [19-22].

A GCN is constructed by performing pair-wise correlation analysis of every gene on the array. Typically, Pearson’s Correlation Coefficient (PCC) is used, and a large n x n matrix of PCC values is obtained where n is the number of measurable genes. An analytical method is then employed to identify the level at which correlation values should be thresholded to yield biologically meaningful co-expression relationships. Often, co-expression networks are built for analysis of differentially connected genes between two or more experimental conditions (environment, disease state, genotype, or tissue type) [23-25]. In these cases networks are constructed separately for each condition and the context of the connectivity can then be examined to identify modules (or gene sets) putatively causal for phenotypic traits. Genes with known causality can be used to guide selection of modules.

In some cases, global co-expression networks are created using a large number of samples from publicly available repositories—the goal being to mine knowledge across the compendium of samples not previously identified through individual experiments. For thousands or tens-of-thousands of samples two challenges occur. First, classification of samples into conditions becomes manually intractable and sample descriptions are often insufficient to automate proper classification. Therefore, conditional context is often ignored, samples are combined and similarity measurements are calculated across all samples. However, this treats all genes and samples (with varying conditions) equally when calculating similarity scores. As the number of samples with different conditions increases in the input dataset, the number of significant co-expression relationships decreases [26]. Therefore, networks built from a set of samples with a large number of conditions tend to be small. Cheng and Church recognized the illogic of weighting all genes and samples the same for similarity calculation and proposed biclustering of expression data [27] to generate biclusters of genes with similar expression in similar contexts (conditions). Many types of biclustering methods have been developed [28]. However, for large disparate sample sets, the difficulty in classifying samples into conditional groups makes biclustering difficult.

A second challenge for network construction is identification of a proper significance threshold. Many methods have been employed for significance thresholding. These include ad hoc methods [1,29-31], permutation testing [5], linear regression [13], rank-based methods [32,33], Fisher’s test of homogeneity [34], spectral graph theory [35], Random Matrix Theory (RMT) [36,37], Partial Correlation and Information Theory (PCIT) [26], methods that use topological properties [38], and supervised machine learning [39,40]. In some cases a constant threshold is applied to the entire network. While significant relationships can be found using a constant threshold, a dynamic threshold is better suited for context-dependent co-expression variability. Dynamic thresholding can increase sensitivity and decrease false-positives. Rank-based methods, PCIT and supervised machine learning methods all employ a form of dynamic thresholding.

To address the challenges of large sample, large conditional network construction, we use a method of segregating input samples into groups before network construction. Without knowledge of sample conditions, we approximate expression conditions by pre-clustering of input gene-expression profiles into groups, and apply dynamic significance thresholding through independent network construction for each sample group. We refer to the network from each group as a Gene Interaction Layer (GIL). The GIL compendium represents an attempt to maximize the capture of all possible interactions across all samples (Figure  1). The GIL compendium also allows for gene and gene-gene interactions to exist in multiple GILs and therefore provides a framework for the analysis of intersecting biological pathways and potentially pleiotropic interactions. To demonstrate the effectiveness of this approach, we generated a GIL collection for the model plant Arabidopsis thaliana (Arabidopsis) by pre-clustering 7,105 input samples. Our results indicate that the Arabidopsis GIL compendium represents a dramatic improvement in capture of gene co-expression relationships.

thumbnailFigure 1. Deconstructing a global network into gene interaction layers (GILs). Gene expression datasets can be used to construct a co-expression network in total (global) or sorted by expression pattern into groups prior to network construction (GILs).


Arabidopsis GILs were constructed by partitioning 7,105 publicly available Affymetrix ATH1 microarray RNA expression samples using a network construction pipeline [41] which implements Random Matrix Theory (RMT) for biological signal thresholding [9]. Samples were RMA [42] normalized prior to pre-clustering. Additionally, a typical “global” network was constructed using all normalized input samples. The global network comprised 3,297 nodes and 129,134 edges (average degree, <k> = 78) representing 15.9% of the known Arabidopsis genes as measured by the ATH1 platform. As mentioned previously, this small network size is expected given the mixture of multiple sample conditions. To improve capture of measurable genes and co-expression relationships, we partitioned the input samples into K groups with similar expression patterns using the K-means clustering algorithm [43]. Each cluster of samples were then RMA normalized again within their respective groups and co-expression networks were built for each cluster.

The effects of K size on the GIL collection

The choice of K size in k-means clustering should have an impact on the topological properties of each GIL. To quantify these differences we performed K-means clustering nine times at K sizes of 30, 60, 70, 80, 90, 120, 150, 180, and 210, yielding 990 total groups. The average array count for each K size ranged from 236.8 (K=30) to 33.8 (K=210) arrays. Changes in the various topological properties including clustering co-efficient, closeness, betweenness, page rank, etc., between different K sizes can be found in Additional file 1: Figure S1. In summary, relative to the global network, the average number of nodes in each group was lower and ranged from 2,090 (K=30) to 1,819 (K=150) (Figure  2). The average number of edges in each group was much lower relative to the global network and declined with increasing K size ranging from 12,265 (K=30) to 6,421 (K=210) (Figure  2).

Additional file 1: Figure S1. Additional topological metrics of the GIL constructed from varied partitioning sizes. Figure S2. Functional enrichment in MCL modules. Figure S3. Module discovery in GIL collections. Figure S4. Keyword assignment and shared node sorting of the K90 GIL collection.

Format: PPT Size: 540KB Download file

This file can be viewed with: Microsoft PowerPoint ViewerOpen Data

thumbnailFigure 2. Basic topological metrics of GILs constructed from varied partitioning sizes. Primary graph shows average edge (dashed line) and average node (solid line) counts for GILs constructed from K-means of cluster size K. Inset graph shows average connectivity (<k>, black circles) for GILs constructed from K-means of cluster size K.

To test if the degree of pre-clustering of samples had a negative effect on network topology, we randomly re-assigned 50% of the samples among the groups and reconstructed the networks. This re-assignment occurred at K sizes of 30, 50, 60, 70, 80, 90, and 120. Sample re-assignment had a noticeable effect on scale-free behavior as exhibited by a decrease fit to the Kronecker scale-free graph model [44] and decrease in average scaling exponent γ (Figure  3). The scaling exponent for the global network was 1.36. Therefore, the randomization of samples between pre-clustered groups seemed to generate networks with properties more similar to the “global” network. However, despite differences in fit to the Kronecker scale-free graph model and to the scaling exponent, all networks exhibited scale-free behavior; therefore each K size generated networks that appear realistic.

thumbnailFigure 3. GILs are scale free at varied partition sizes. Primary graph shows log-likelihood fit to a Kronecker scale-free graph model at k-means cluster sizes K. Clusters were randomized at 0% (solid bars) and 50% (striped bars) prior to network construction. Inset graph shows average scaling exponent (<γ>) for GILs constructed from K-means of cluster size K at 0% (solid line) and 50% (dashed line) randomization.

Identification of the best K size

To identify the best K size, we examined the network from each K group for total gene capture, functional annotation enrichment, and module count. First, we examined total node capture among the GILs from each K group. While the global network captured 15.9% of possible genes measured by the array platform, the total gene space captured across all networks of a K group increased with increasing K size (Figure  4). At K=180, for example, 99.6% of all measurable genes were captured at least once. Second, we modularized the networks using MCL [45] and tested for function term enrichment (Bonferroni p < 0.001) for each module relative to whole genome background using Gene Ontology (GO) terms [46], KEGG pathways [47], Interpro protein domains [48], Plant Ontology (PO) terms [49], and AraCyc pathways [50]. In general, the number of total enriched terms tended to increase along with K size (Additional file 1: Figure S2). However, the number of unique enriched terms tended to plateau at K=90. Third, we investigated modular behavior by examining the average number of modules for each K. For this test we used a second module detection algorithm called link-community detection [51]. The link-community method allows for nodes to be present in more than one module which is more representative of shared genes found in intersecting pathways. The average number of link-community modules (LCMs) per GIL decreased as K size increased while the number of MCL modules rose slightly (Additional file 1: Figure S3). Therefore, to balance a maximum number of nodes, edges, functional enrichment, and high module count representing both coverage (MCL) and multi-functionality (LCM), K=90 was chosen as the K-means clustering size for the Arabidopsis GIL collection.

thumbnailFigure 4. Co-expression interaction capture. Primary graph shows total unique node (solid line) and edge (dashed line) accumulation in all GIL collections at various cluster sizes, K. Dotted line at 23,244 nodes is the maximum number of genes measured by ATH1 array. Inset graph shows the success rate of possible GILs constructed from cluster size, K.

Summary of the K 90 GIL collection

In summary, the K90 GIL collection consisted of 86 GILs and contained a total of 19,588 genes (94.7% measured gene coverage) and 558,022 unique co-expression relationships. Counting the experiments contained in the K90 GIL collection revealed that each GIL was comprised of multiple NCBI GEO experiments (μ = 78.9 arrays; σ=53.7; Additional file 2: Table S1) derived from a blend of GEO series (μ=12.3; σ=10.7), indicating that GILs were not simply the product of segregating datasets into the original experiment groups. After module detection using the Markov Clustering (MCL; [45]) and link-community [51] methods, we circumscribed 38,234 MCL modules, encapsulating 94.7% of measurable genes and 26,570 link-community modules capturing 71.0% of genes, each with a median occurrence of six genes per module (min = 2 and max = 707) (Table  1). Four GILs (43, 46, 52, 70) in the K90 collection did not construct due to high levels of correlation between genes. For MCL modules, we identified 867 enriched function terms (657 unique) within the global network while the K90 GIL collection yielded 37,614 enriched function terms (4,789 unique) — a 7.2-fold improvement in representation of unique functional terms in the GILs.

Additional file 2: Table S1. GEO Experiment Assignments to K90 GILs.

Format: XLSX Size: 161KB Download fileOpen Data

Table 1. K90 GIL Collection Overview

To test the amount of unique function captured by k-means presorting, we compared the K=90 GIL set to networks constructed by randomizing the input samples at 50% and 100%, leaving cluster number and sizes intact. The total number of unique nodes (gene capture) decreased from 19,588 (at 0% randomization) to 13,568 (at 100% randomization), yet the number of unique edges increased from 558,023 (at 0%) to 705,399 (at 100%). Interestingly, this higher level of connectivity between smaller numbers of genes resulted in more MCL modules ranging from 39,931 (at 0%) to 47,623 (at 100%) as well as more total enriched function terms ranging from 37,614 (at 0%) to 64,236 (at 100%) (Additional file 1: Figure S2). However, many of these enriched functions were re-captured terms since the number of unique enriched terms was reduced from 4,789 (at 0%) to 3,518 (at 100%). Thus, K-means pre-sorting of datasets prior to network construction appears to have a positive effect on capture of unique biological function.

Exploring GIL function

Gene expression samples in the K90 GIL collection were not segregated by specific experimental conditions, but rather segregated using knowledge-independent pre-clustering. Therefore, annotation of the experimental conditions of samples may help describe the biological context in each GIL. Experimental descriptions from the samples in GEO confer limited information but common keywords from these descriptions can provide valuable insight. Therefore, we attempted to annotate each GIL for biological context by using highly represented keywords from the GEO experiment descriptions. We expected two GILs would be more similar in function if they shared genes between modules. Therefore, GILs were ordered using average linked hierarchical clustering with shared node count between GIL LCMs as the similarity metric (Additional file 1: Figure S4). This procedure resulted in groups of GILs that showed non-random similarity in assigned keywords (p < 0.05) for keywords “leaf”, “root”, ”seed”, “seedling”, “iron”, and “mutant” (significant keywords are indicated as red boxes in Additional file 1: Figure S4). However, overlap in keywords is evident between GILs indicating that GILs are not solely comprised of samples from a single experimental condition.

To determine if the GIL keyword assignment was biologically relevant, GILs were tested for an increase in relevant, enriched functional terms (Columns in Additional file 1: Figure S4). First, GILs were distributed into keyword and non-keyword groups for five different keywords (Figure  5). For example, 10 GILs with the “auxin” keyword were placed in an auxin group while the remaining 76 GILs were placed in a non-auxin group. Analogous groups were constructed for “leaf”, “light/dark”, “root” and “stress”. The number of specific enriched words from functional terms (e.g. *auxin*, *photo*, *stress*, *ribosome*; * = wildcard) found to be functionally enriched in LCMs were counted in both keyword and non-keyword groups. The ratio of these enriched terms between keyword and non-keyword groups can be found in Figure  5. For the auxin group, there was a 6.78 fold increase of enriched functional terms that contained “auxin”. In addition, we performed functional term enrichment on all genes in a GIL as final indicator of GIL context (Additional file 3: Table S2).

thumbnailFigure 5. Functional Term Enrichment Counting in Keyword Sorted GIL Sets. The occurrence of keyword sub-patterns (auxin, photo, stress, ribosome) in LCM module enriched term descriptions was counted in GILs sorted by presence of high frequency keywords. Bars show the ratio of keywords within the GIL to GILs without the keyword. GIL identifier is shown in italics.

Additional file 3: Table S2. Functional Enrichment Results for Global, K90 GIL Networks.

Format: XLSX Size: 171KB Download fileOpen Data


The partitioning method we describe dramatically increases the capture of significant co-expression relationships by segregating expression profiles into similar groups prior to correlation calculation. In effect, this approach is a dynamic significance thresholding strategy where a local hard threshold is determined for each GIL separately using the correlation matrix of each GIL. It is different from other dynamic thresholding methods which perform thresholding using the global correlation matrix. With pre-clustering of samples, correlation calculations occur locally for each GIL, allowing for existing threshold detection methods, such as RMT, to be used in a dynamic way. Naturally, there will be some false positives that surpass any thresholding method, but the lack of knowledge of true positive co-expression relationships make sensitivity and specificity difficult to quantify.

A second benefit for pre-clustering is reduction of complexity that results from mixing of samples from various conditions. It has been shown that as the number of conditions in a sample set increase, the distribution curve of correlation changes such that most correlations are centered around zero [26]. Thus, there are fewer high correlations and fewer significant co-expression relationships, and hence smaller networks as more conditions are present. Pre-sorting of samples groups them by similar overall expression levels, reduces the complexity of the dataset and mimics separation of samples by condition.

A third benefit to un-supervised pre-sorting of samples is that limited human understanding does not constrain the network. For example, grouping a set of samples into a control group (condition #1) and a disease state group (condition #2), may overlook changes in gene expression due to differences in genotype, tissue type and environmental conditions between the various individuals in the study. Even for carefully designed studies, confounding effects due to unrecognized or immeasurable conditions may hide subtle expression relationships. By pre-sorting without bias towards specific conditions, the underlying expression levels of each sample dictate membership in a group. The resulting GIL can therefore be multi-conditional which is more realistic than the idea that co-expression within a GIL is specific to a single condition. As mentioned previously, we do not see a single condition present for a GIL, although we did see some evidence that some GILs may be more representative of certain biological contexts.

Fourth, the GIL collection provides a novel framework for exploration of gene co-expression. We do not combine the co-expression relationships from all GILs into a single global network--GILs remain as separate entities. The modules within a single GIL provide a unique set of relationships for a unique biological function. A module in one GIL with a similar set of genes as a module in another GIL may have different co-expression relationships. This redundancy provides a realistic framework where genes and pathways are represented in different condition contexts. While the exact biological context of each GIL is unknown, differential connectivity between similar modules can indicate how co-expression changes between genes of interest. It is clear from our attempts to annotate the GILs that more research is required to assign conditions to the GILs and place genes and gene interactions into experimental contexts.

A disadvantage to K-means clustering for pre-sorting of input samples is that samples can only belong to a single GIL. It seems realistic to believe that the gene expression in one sample could share similarity with samples from several GILs. Biclustering draws on this idea that gene co-expression is a product of overlapping genes and samples. Also, Ruan et. al. constructed a sample co-expression network where samples were nodes and edges existed when samples shared similar expression patterns [52]. They show that module detection of a “sample network” yields modules with samples that share a similar biological context (e.g. lymphoma cell types). It seems logical to therefore use a link-community algorithm for detecting sample modules in such a sample network, from which we could therefore construct GILs where samples can be members of multiple GILs. Additionally, we used K-means clustering to segregate the expression datasets, but other clustering approaches, such as that proposed by Ruan et. al, or other commonly used clustering techniques, could also be effective in pre-sorting of samples.

Finally, is the A. thaliana interactome represented by the K90 GIL compendium comprehensive? While we detected 558,022 unique gene interactions, there were 1,584,378 total interactions, as well as gene modules that intersect through shared nodes between GILs. While an accurate protein coding gene count for A. thaliana has yet to be determined, a recent tally identified 27,416 loci and 35,386 transcript variants (TAIR10 build). The ATH1 microarray platform interrogates 20,677 of these known genes albeit with little transcript variant discriminatory power. The K90 GIL collection captured 19,588 loci in 558,022 temporal gene interactions. Extrapolating to include genes not measured by the array, we estimate the unique Arabidopsis co-expression interactome to be a minimum of 589,045 binary interactions. While this number is certain to increase with higher resolution RNA expression measurements (e.g. deep RNAseq sampling) obtained under additional experimental conditions, it does provide a baseline model of the RNA co-expression interactome.

An interactome estimate on a 105 scale is not unprecedented for A. thaliana. A recent A. thaliana study estimated the baseline protein interaction space to be 299,000 ± 79,000 (μ ± SD) without accounting for protein isoforms. This estimate was based on approximately 2% (2,661 peptides) of the predicted interactome [53]. While not statistically significant, 2.1% (237/11,374 edges) of the PPI interactions overlapped with the K90 GIL collection. It is intriguing that a PPI network predicted to represent 2% of the interactome overlaps with 2% of interactome modeled by the GIL collection. We hypothesize that the capture of interactome space represented in the K90 GIL collection approaches the maximal interaction space of the measured genes across all the input samples.


Our results indicate that pre-clustering transcriptome measurements by expression similarity prior to co-expression network construction captures more significant co-expression relationships than networks constructed without pre-clustering. In summary, we believe that this approach is simple, reduced-bias, and practical to reduce noise in large expression dataset collections. The enhanced resolution afforded by the GIL collection provides a more holistic platform for improved identification of gene modules that can be interrogated for novel biochemical pathways, used to assign new genes to known pathways, predict gene sets causal for important complex traits, and indicate pleiotropic relationships within and between modules derived from specific GILs in the GIL collection. The A. thaliana K90 GIL compendium described here is therefore the most comprehensive and we believe the most natural model of the gene co-expression interactome for a plant species. The method should be applicable to any organism where a large number of gene expression profiling datasets have been generated.


Partitioning expression datasets by K-means clustering

Expression measurements used for network construction were obtained from publicly available Affymetrix Arabidopsis ATH1 Genome Array experiments available in NCBI GEO (platform GPL198). A total of 7,158 samples were obtained in November 2011. RMA normalization [42] was performed for all samples together using the command-line utility of RMAExpress ( webcite). Sample outlier detection was performed using the arrayQualityMetrics [54] tool for Bioconductor [55]. Samples that failed two of the three outlier tests were removed from the dataset. The normalized expression file, an n × m matrix of expression values with 22,746 rows of probe sets (no control) and 7,105 columns of samples (Additional file 2: Table S1), was clustered by similarity of expression values using K-means clustering via the kmeans function of R. K-means clustering was repeated for different values of K at 30, 60, 70, 80, 90, 120, 150, 180, and 210.

Co-expression network construction

Each cluster of samples derived from the K-means clustering was used to construct individual networks—one for each cluster. Additionally, a “global” network was constructed consisting of all available samples. The network construction process was performed on each dataset cluster which included: 1) RMA normalization of the clustered datasets and not all possible arrays; 2) removal of control probes; 3) outlier sample removal; 4) removal of ambiguous probe sets; 6) calculation of a similarity matrix of pair-wise Pearson correlation values between all probe sets; 7) use of Random Matrix Modeling (RMT) [9] to identify a significant threshold to cull the similarity matrix; 8) conversion of the thresholded ATH1 probeset similarity matrix to an Arabidopsis gene network; and 9) module detection. Details specific to each step are further described.

The software package RMAExpress ( webcite) was used to normalize each cluster individually. The samples of each cluster were provided as input, and the Chip Description File (CDF) was obtained from the Affymetrix website. A file containing an n x m expression matrix of normalized expression values, where n is the number of probe sets and m is the number of samples, was generated for each cluster. The rows of the expression matrix were then reduced in size through the removal of control probes using an in-house Perl script. Outlier samples within the cluster itself were identified using the arrayQualityMetrics [54] tool for Bioconductor [55]. The columns of the expression matrix were then reduced in size through removal of the outlier samples using an in-house Perl script. The rows of the expression matrix were then reduced again to remove any ambiguous probesets. Ambiguous probe sets are those that potentially hybridize to multiple genes in the Arabidopsis genome. Ambiguous probe sets were identified using megablast from the NCBI BLAST [56] package to align a FASTA file of probe sequences obtained from the Affymetrix website with the Arabidopsis coding sequences (CDS) obtained from the TAIR website (Lamesch et. al. 2012). Parameters for megablast set the word size to 25 (-W 25, the length of the probe sequence), and disabled low-complexity filtering (-F F). An in-house Perl script counted the probe to gene mappings and identified ambiguous probe sets. Pearson correlation coefficients were calculated using optimized C-code and an m x m similarity matrix was produced, where m is the number of remaining probe sets. Random Matrix Theory, which employs threshold detection using the nearest neighbor spacing distribution of eigenvalues, was used to identify an appropriate threshold for filtering the similarity matrix [9]. Optimized C-code, produced in-house, was used for RMT thresholding. An adjacency matrix is effectively produced by ignoring values (or setting values to zero) in the similarity matrix that are less than the threshold. Using an in-house Perl script, a network file was constructed, and edges in the network correspond to probe sets in the adjacency matrix that have a non-zero correlation value. Using mappings of probe set to TAIR gene models derived earlier with megablast, the probe set-based network was converted to a gene-based network. Edge lists for all K90 GILs and the Global network can be found in Additional file 4 and online at Finally, modules, or sets of closely linked genes, were identified in each of the networks using the link community method [51] which circumscribes edges into communities using the ‘linkcomm’ binary version 1.0-4 in R [57] or MCL modules using the ‘mcl’ binary version 11-294 ( webcite). Gene assignment to modules can be found in Additional file 5: Table S3.

Additional file 4. Coexpression networks. From-To Edge Lists for Global, K90 GILs with Correlation Values.

Format: ZIP Size: 6.6MB Download fileOpen Data

Additional file 5: Table S3. Gene Assignments to Global, K90 GIL Modules.

Format: XLSX Size: 5.6MB Download fileOpen Data

Functional enrichment analysis

Functional enrichment was performed for each of the modules from each GIL. Functional terms used for enrichment include Gene Ontology (GO) [46], InterPro [58], AraCyC [50], Plant Ontology (PO) [49], and KEGG [47] terms. Mappings of GO, PO, AraCyC, and InterPro terms to genes were obtained from the TAIR website (Lamesch et. al. 2012). KEGG terms were obtained by uploading Arabidopsis coding sequences (CDS) to the KEGG/KAAS server which maps KEGG terms using a homology-based method [59]. Functional enrichment was then performed using an in-house Perl script that functions similar to the online DAVID tool [60,61]. The tool uses a Fisher’s Exact test to look for significant over-representation of terms within a module versus the TAIR10 genes represented on the ATH1 platform as background. Module functional enrichment results (Bonferroni adjusted p < 0.001) can be found in Additional file 6: Table S4. Network functional enrichment results (Bonferroni adjusted p < 0.001) can be found in Additional file 3: Table S2.

Additional file 6: Table S4. Functional Enrichment Results for Global, K90 GIL Modules.

Format: XLSX Size: 6.6MB Download fileOpen Data

Network topology analysis

Network topology parameters were determined using the Netstat, Centrality, and Kronfit ‘example applications’ implementing the Stanford Network Analysis Project (SNAP) library Ver. 2011-12-31 ( webcite). The scaling exponent (γ) was estimated using the ‘’ function in the igraph R package version 0.5.4 ( webcite) under 64-bit R version 2.10.1.

Semantic analysis

First, we counted all words in each GEO experiment description. After filtering for uninformative words from a custom dictionary, we listed the top ten words found in each GIL collection. These words were then manually inspected for RNA source (e.g. “anther” “leaf”), treatment (e.g. “light” ”dark”), “stress”, and “mutant”. These frequent keywords were then assigned to the respective GIL (Additional file 1: Figure S4; Additional file 7: Table S5).

Additional file 7: Table S5. Filtered Keywords from GEO Experiment Descriptions Used in GIL Annotation.

Format: XLSX Size: 1.1MB Download fileOpen Data

Competing interests

The authors declare no competing interests.


This work was supported by the following grant from the National Science Foundation: MCB-0820345 to F.A.F.


  1. Aoki K, Ogata Y, Shibata D: Approaches for extracting practical information from gene co-expression networks in plant biology.

    Plant Cell Physiol 2007, 48(3):381-390. PubMed Abstract | Publisher Full Text OpenURL

  2. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks.

    Proc Natl Acad Sci U S A 2000, 97(22):12182-12186. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Barabasi AL, Ravasz E, Vicsek T: Deterministic scale-free networks.

    Physica A 2001, 299:559-564. Publisher Full Text OpenURL

  4. Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization.

    Nat Rev Genet 2004, 5(2):101-113. PubMed Abstract | Publisher Full Text OpenURL

  5. Carter SL, Brechbuhler CM, Griffin M, Bond AT: Gene co-expression network topology provides a framework for molecular characterization of cellular state.

    Bioinformatics 2004, 20(14):2242-2250. PubMed Abstract | Publisher Full Text OpenURL

  6. Aggarwal A, Guo DL, Hoshida Y, Yuen ST, Chu KM, So S, Boussioutas A, Chen X, Bowtell D, Aburatani H, Leung SY, Tan P: Topological and functional discovery in a gene coexpression meta-network of gastric cancer.

    Cancer Res 2006, 66(1):232-241. PubMed Abstract | Publisher Full Text OpenURL

  7. Conant GC, Wolfe KH: Functional partitioning of yeast co-expression networks after genome duplication.

    PLoS Biol 2006, 4(4):e109. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Bhan A, Galas DJ, Dewey TG: A duplication growth model of gene expression networks.

    Bioinformatics 2002, 18(11):1486-1493. PubMed Abstract | Publisher Full Text OpenURL

  9. Carlson MR, Zhang B, Fang Z, Mischel PS, Horvath S, Nelson SF: Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks.

    BMC Genomics 2006, 3:7. OpenURL

  10. Ma S, Gong Q, Bohnert HJ: An Arabidopsis gene network based on the graphical Gaussian model.

    Genome Res 2007, 17(11):1614-1625. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Mao L, Van Hemert J, Dash S, Dickerson J: Arabidopsis gene co-expression network and its functional modules.

    BMC Bioinformatics 2009, 10(1):346. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  12. Loraine A: Co-expression analysis of metabolic pathways in plants.

    Methods Mol Biol 2009, 553:247-264. PubMed Abstract | Publisher Full Text OpenURL

  13. Persson S, Wei H, Milne J, Page GP, Somerville CR: Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets.

    Proc Natl Acad Sci U S A 2005, 102(24):8633-8638. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Ficklin SP, Luo F, Feltus FA: The Association of Multiple Interacting Genes with Specific Phenotypes In Rice (Oryza sativa) Using Gene Co-Expression Networks.

    Plant Physiology 2010, 154(1):13-24. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Lee TH, Kim YK, Pham TT, Song SI, Kim JK, Kang KY, An G, Jung KH, Galbraith DW, Kim M, Yoon UH, Nahm BH: RiceArrayNet: a database for correlating gene expression from transcriptome profiling, and its application to the analysis of coexpressed genes in rice.

    Plant Physiol 2009, 151(1):16-33. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Ficklin SP, Feltus FA: Gene coexpression network alignment and conservation of gene modules between two grass species: maize and rice.

    Plant Physiology 2011, 156(3):1244-1256. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Massa AN, Childs KL, Lin H, Bryan GJ, Giuliano G, Buell CR: The transcriptome of the reference potato genome Solanum tuberosum Group Phureja clone DM1-3 516R44.

    Plos One 2011, 6(10):e26801. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S: Integrating genetic and network analysis to characterize genes related to mouse weight.

    PLoS Genet 2006, 2(8):e130. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  19. Plaisier CL, Horvath S, Huertas-Vazquez A, Cruz-Bautista I, Herrera MF, Tusie-Luna T, Aguilar-Salinas C, Pajukanta P: A systems genetics approach implicates USF1, FADS3, and other causal candidate genes for familial combined hyperlipidemia.

    PLoS Genet 2009, 5(9):e1000642. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  20. Lee I, Ambaru B, Thakkar P, Marcotte EM, Rhee SY: Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana.

    Nature Biotechnology 2010, 28(2):149-U114. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. Mutwil M, Usadel B, Schutte M, Loraine A, Ebenhoh O, Persson S: Assembly of an Interactive Correlation Network for the Arabidopsis Genome Using a Novel Heuristic Clustering Algorithm.

    Plant Physiology 2010, 152(1):29-43. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Chan EK, Rowe HC, Corwin JA, Joseph B, Kliebenstein DJ: Combining genome-wide association mapping and transcriptional networks to identify novel genes controlling glucosinolates in Arabidopsis thaliana.

    PLoS Biol 2011, 9(8):e1001125. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  23. Oldham MC, Horvath S, Geschwind DH: Conservation and evolution of gene coexpression networks in human and chimpanzee brains.

    Proc Natl Acad Sci U S A 2006, 103(47):17973-17978. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Fukushima A, Nishizawa T, Hayakumo M, Hikosaka S, Saito K, Goto E, Kusano M: Exploring Tomato Gene Functions Based on Coexpression Modules Using Graph Clustering and Differential Coexpression Approaches.

    Plant Physiology 2012, 158(4):1487-1502. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  25. Torkamani A, Dean B, Schork NJ, Thomas EA: Coexpression network analysis of neural tissue reveals perturbations in developmental processes in schizophrenia.

    Genome Research 2010, 20(4):403-412. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  26. Reverter A, Chan EK: Combining partial correlation and an information theory approach to the reversed engineering of gene co-expression networks.

    Bioinformatics 2008, 24(21):2491-2497. PubMed Abstract | Publisher Full Text OpenURL

  27. Cheng Y, Church GM: Biclustering of expression data.

    Proc Int Conf Intell Syst Mol Biol 2000, 8:93-103. PubMed Abstract OpenURL

  28. Prelic A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data.

    Bioinformatics 2006, 22(9):1122-1129. PubMed Abstract | Publisher Full Text OpenURL

  29. Tsaparas P, Marino-Ramirez L, Bodenreider O, Koonin EV, Jordan IK: Global similarity and local divergence in human and mouse gene co-expression networks.

    BMC Evol Biol 2006, 6:70. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  30. Jordan IK, Marino-Ramirez L, Wolf YI, Koonin EV: Conservation and coevolution in the scale-free human gene coexpression network.

    Mol Biol Evol 2004, 21(11):2058-2070. PubMed Abstract | Publisher Full Text OpenURL

  31. Reverter A, Ingham A, Lehnert SA, Tan SH, Wang Y, Ratnakumar A, Dalrymple BP: Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer.

    Bioinformatics 2006, 22(19):2396-2404. PubMed Abstract | Publisher Full Text OpenURL

  32. Stuart J, Segal E, Koller D, Kim S: A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules.

    Science 2003, 302(5643):249-255. PubMed Abstract | Publisher Full Text OpenURL

  33. Wolfe CJ, Kohane IS, Butte AJ: Systematic survey reveals general applicability of “guilt-by-association” within gene coexpression networks.

    BMC Bioinformatics 2005, 6:227. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  34. Nayak RR, Kearns M, Spielman RS, Cheung VG: Coexpression network based on natural variation in human gene expression reveals gene interactions and functions.

    Genome Res 2009, 19(11):1953-1962. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  35. Perkins AD, Langston MA: Threshold selection in gene co-expression networks using spectral graph theory techniques.

    BMC Bioinformatics 2009, 10(Suppl 11):S4. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  36. Luo F, Yang Y, Zhong J, Gao H, Khan L, Thompson DK, Zhou J: Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory.

    BMC Bioinformatics 2007, 8:299. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  37. Jalan S, Solymosi N, Vattay G, Li B: Random matrix analysis of localization properties of gene coexpression network.

    Phys Rev E Stat Nonlin Soft Matter Phys 2010, 81(4 Pt 2):046118. PubMed Abstract OpenURL

  38. Elo LL, Jarvenpaa H, Oresic M, Lahesmaa R, Aittokallio T: Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process.

    Bioinformatics 2007, 23(16):2096-2103. PubMed Abstract | Publisher Full Text OpenURL

  39. Puelma T, Gutierrez RA, Soto A: Discriminative local subspaces in gene expression data for effective gene function prediction.

    Bioinformatics 2012, 28(17):2256-2264. PubMed Abstract | Publisher Full Text OpenURL

  40. Bassel GW, Glaab E, Marquez J, Holdsworth MJ, Bacardit J: Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets.

    Plant Cell 2011, 23(9):3101-3116. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  41. Gibson SM, Ficklin SP, Isaacson S, Feltus FA, Smith MC: Massive-Scale Gene Co-expression Network Construction and Robustness Testing using Random Matrix Theory.

    PLoS ONE 2013.



  42. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data.

    Biostatistics 2003, 4(2):249-264. PubMed Abstract | Publisher Full Text OpenURL

  43. Hartigan JA, Wong MA: Algorithm AS 136: A K-Means Clustering Algorithm.

    Journal of the Royal Statistical Society Series C (Applied Statistics) 1979, 28(1):100-108. OpenURL

  44. Leskovec J, Chakrabarti D, Kleinberg J, Faloutsos C, Ghahramani Z: Kronecker Graphs: An Approach to Modeling Networks.

    Journal of Machine Learning Research 2010, 11:985-1042. OpenURL

  45. Hwang W, Cho YR, Zhang A, Ramanathan M: A novel functional module detection algorithm for protein-protein interaction networks.

    Algorithms Mol Biol 2006, 1:24. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  46. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology.

    The Gene Ontology Consortium. Nature Genetics 2000, 25(1):25-29. OpenURL

  47. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment.

    Nucleic Acids Res 2008, 36(Database issue):D480-484. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  48. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF: InterPro: the integrative protein signature database.

    Nucleic Acids Res 2009, 37(Database issue):D211-215. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  49. Avraham S, Tung CW, Ilic K, Jaiswal P, Kellogg EA, McCouch S, Pujar A, Reiser L, Rhee SY, Sachs MM, Schaeffer M, Stein L, Stevens P, Vincent L, Zapata F, Ware D: The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations.

    Nucleic Acids Res 2008, 36(Database issue):D449-454. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  50. Mueller LA, Zhang P, Rhee SY: AraCyc: a biochemical pathway database for Arabidopsis.

    Plant Physiol 2003, 132(2):453-460. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  51. Ahn YY, Bagrow JP, Lehmann S: Link communities reveal multiscale complexity in networks.

    Nature 2010, 466(7307):761-764. PubMed Abstract | Publisher Full Text OpenURL

  52. Ruan J, Dean AK, Zhang W: A general co-expression network-based approach to gene expression analysis: comparison and applications.

    BMC Syst Biol 2010, 4:8. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  53. AIMC: Evidence for network evolution in an Arabidopsis interactome map.

    Science 2011, 333(6042):601-607. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  54. Kauffmann A, Gentleman R, Huber W: arrayQualityMetrics–a bioconductor package for quality assessment of microarray data.

    Bioinformatics 2009, 25(3):415-416. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  55. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics.

    Genome Biol 2004, 5(10):R80. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  56. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

    Nucleic Acids Res 1997, 25(17):3389-3402. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  57. Kalinka AT, Tomancak P: linkcomm: an R package for the generation, visualization, and analysis of link communities in networks of arbitrary size and type.

    Bioinformatics 2011, 27(14):2011-2012. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  58. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, Zdobnov EM: The InterPro database, an integrated documentation resource for protein families, domains and functional sites.

    Nucleic Acids Res 2001, 29(1):37-40. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  59. Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M: KAAS: an automatic genome annotation and pathway reconstruction server.

    Nucleic Acids Res 2007, 35(Web Server issue):W182-185. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  60. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery.

    Genome Biol 2003, 4(5):P3. PubMed Abstract | BioMed Central Full Text OpenURL

  61. da Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.

    Nat Protoc 2009, 4(1):44-57. PubMed Abstract | Publisher Full Text OpenURL