Abstract
Background
Gene coexpression networks are often constructed by computing some measure of similarity between expression levels of gene transcripts and subsequently applying a highpass filter to remove all but the most likely biologicallysignificant relationships. The selection of this expression threshold necessarily has a significant effect on any conclusions derived from the resulting network. Many approaches have been taken to choose an appropriate threshold, among them computing levels of statistical significance, accepting only the top one percent of relationships, and selecting an arbitrary expression cutoff.
Results
We apply spectral graph theory methods to develop a systematic method for threshold selection. Eigenvalues and eigenvectors are computed for a transformation of the adjacency matrix of the network constructed at various threshold values. From these, we use a basic spectral clustering method to examine the set of genegene relationships and select a threshold dependent upon the community structure of the data. This approach is applied to two wellstudied microarray data sets from Homo sapiens and Saccharomyces cerevisiae.
Conclusion
This method presents a systematic, databased alternative to using more artificial cutoff values and results in a more conservative approach to threshold selection than some other popular techniques such as retaining only statisticallysignificant relationships or setting a cutoff to include a percentage of the highest correlations.
Background
The construction of gene coexpression networks is often a necessary step in a bioinformatic analysis of microarray gene expression data. Studies have shown that genes showing a similar pattern of expression, those sharing edges in a coexpression network, tend to have similar function [1]. This principle, often referred to as "guiltbyassociation" is the idea that motivates many microarray studies. With new highthroughput sequencing technologies currently being used for digital gene expression applications, gene coexpression networks promise to continue to find wide utility in genomewide association studies and other computational analyses.
These networks are constructed by computing some similarity value between gene transcripts based upon their expression values over a set of samples. Nodes in the network represent transcripts while edges are weighted by these similarity values. A threshold is often applied to the resulting networks to retain only the most biologically significant relationships. This threshold application step is a major juncture in which errors can be introduced in the form of both false negatives and false positives. By setting this threshold too high, important relationships can be lost. Likewise, we must be sure to remove connections that do not represent "real" relationships. This task is difficult since the range of thresholds representing real biological relationships that also avoid overfiltering can be narrow.
Some of the many methods that have been applied to the threshold selection problem in various types of networks are using an arbitrary threshold [2], retaining only the top x percent of the strongest relationships [3], permutation testing [4], and filtering based upon control spot correlations [5] or the statistical significance of the relationships [57]. The method presented here makes use of initial spectral graph theorybased clusterings to help identify an appropriate threshold. Combinatorial methods such as those described in [5] will be used to analyze the final gene coexpression network, and such methods often require significant computational resources. We can justify the expense of this initial clustering by the computational resources saved by picking a suitable threshold in advance, especially one that removes most nonbiologicallyrelevant relationships, which will significantly decrease computational requirements. We know that spectral graph theory methods can give us important information on the structure of a graph, such as the number of connected components, information about random walks in the graph, and a bound on the graph diameter [8]. Various spectral methods have also been employed to identify clusters of related vertices [912]. It is these spectral clustering methods that we believe can contribute toward selecting a biologicallyrelevant threshold in coexpression networks. A more detailed initial analysis is presented in [13].
Results and discussion
Spectral properties and algebraic connectivity
We introduce a method for threshold selection based upon the spectrum of the graph at varying thresholds. That is, the eigenvalues and eigenvectors of a transform of the graph's adjacency matrix. We applied this method to yeast cell cycle data [14] and human expression values collected over many different tissue types [15]. It has been shown that the number of connected components of a network can be identified using the spectrum of the network [8]. Ding et al. observed that "nearlydisconnected" portions can also be identified by examination of the eigenvector associated with the smallest nonzero eigenvalue of the network, often called the Fiedler vector [10]. The ability to find the nearlydisconnected pieces allows us to identify those nodes sharing a wellconnected, or dense, cluster.
For this study, we analyzed the spectrum of the largest connected component in networks constructed at increasingly stringent thresholds. Figure 1, which illustrates the number of vertices belonging to the largest connected component, shows that the largest component often contains a majority of the network nodes. The exception occurs at high thresholds where the network becomes very sparse. It can be shown that the multiplicity of the zero eigenvalue is equal to the number of connected components in the graph.
Figure 1. Connected components. The percentage of network nodes contained within the largest connected component for both yeast and human data sets.
Therefore, when analyzing only the spectrum of the largest component, the smallest eigenvalue will be equal to zero while the remaining eigenvalues will be nonzero. We will use the common notation of calling this smallest nonzero eigenvalue the algebraic connectivity of the component and refer to it as λ_{1}. Figure 2 shows the algebraic connectivity for the two data sets studied. Lower connectivity values indicate the presence of nearlydisconnected components [10]. λ_{1 }reaches a minimum in yeast at t = 0.82, though it remains at a relatively low level (less than 0.05) from t = 0.75 to t = 0.87. For human, the connectivity values are much more variable, though a minimum is obtained at 0.85.
Figure 2. Algebraic connectivity. Algebraic connectivity measured at various thresholds for coexpression networks on both yeast and human data sets. Very high connectivity values falling at the extreme upper thresholds were omitted to keep the scale of the chart from overwhelming the value of other observations.
Spectral clustering
Many spectral clustering methods exist, with possibly the simplest being a spectral bipartitioning of the network such as that described in [16]. In that case, the eigenvector associated with λ_{1}, which we will refer to as v_{1}, is sorted and nodes are partitioned into two groups based upon the magnitude of their associated eigenvector value. In [10], the authors showed that sorting the eigenvector associated with λ_{1 }in ascending order often produces a step functionlike plot. They also showed that the steps in such a plot delineate transitions from one nearlydisconnected component to another. Since each eigenvector value is associated with a node in the network, individual nodes can be assigned to a cluster based upon the steps in the eigenvector values. This method allows a finer partitioning than the spectral bipartitioning methods and precludes the need for recursive application of the partitioning method. This particular spectral clustering method is particularly amenable to the threshold selection problem due to its ability to identify clusters of various sizes and because it is not necessary to specify the number of partitions desired.
A sliding window method, illustrated in Figure 3, was used to identify transitions from one cluster to another. Since these transitions are often not immediate, but occur over the span of several eigenvector values, a simple comparison of adjacent positions is not sufficient. Therefore, we compute the difference of eigenvector values some constant distance apart. Here we used a window size of five positions, which was observed to correctly identify most steps in the eigenvector plot.
Figure 3. Sliding window. An example of a sliding window comparison to detect transitions between wellconnected components.
Employing the principle of guiltbyassociation, we know that weaker relationships should connect functionally dissimilar portions of the network. Therefore as the threshold is increased, these portions will become less connected to one another, resulting in a likely increase in "nearlydisconnected" components. We select the threshold value that maximizes the number of these components and thus minimizes the number of edges connecting these pieces. Figure 4 shows the number of clusters identified at various thresholds for both data sets. Based upon the number of clusters, the spectral graph theorybased method identified potential threshold values of 0.78 in the coexpression network on yeast data and 0.83 in the human network.
Figure 4. Number of clusters. The number of clusters identified by the spectral method in yeast and human coexpression networks at various thresholds.
We can see in the previous figure that the number of clusters identified by the spectral method subsequently decreases as we proceed past the selected threshold. This is likely due to a decreasing network size overall, as well as individual clusters falling below the minimum cluster size. Similarly, the algebraic connectivity shown in Figure 2 shows an associated increase at the upper end of the threshold range due to the very small size of the largest component at these thresholds. For example, at the t = 0.98 threshold in yeast data (not shown), the network consists of only two nodes, with a single edge connecting them for a 100% edge density.
Figure 5 shows the steplike structures found for two thresholds in yeast data. At the t = 0.78 threshold identified by the spectral method, as discussed above, the steps are not as clearly delineated as at the t = 0.84 threshold, also shown. While the step functions are more defined at the higher threshold, the number of nodes remaining in the network has greatly decreased and the remaining clusters have become too small to surpass our minimum cluster size requirement.
Figure 5. Sorted eigenvectors. Eigenvector values associated with the smallest nonzero eigenvalue in yeast coexpression networks at thresholds of (a) 0.78 and (b) 0.84. For each vertex, the associated eigenvector value is plotted.
Combinatorial analysis
Paracliques [17] were computed for coexpression networks generated at the selected thresholds for both the human and yeast data sets. The Paraclique algorithm, based upon solving the complete clique problem [18], is often more appropriate for microarray data than using the basic clique method. Due to the noise inherent in such data, a small number of edge weights can drop just below the network threshold. Paraclique corrects such a situation by allowing vertices to be added to the paraclique if they are adjacent to at least g of the original clique members. For most of our analyses (except the comparison with known clusters of coexpressed yeast genes described below), we set g = 1. The Paraclique algorithm performs this adjustment while still retaining the benefits of clique such as being an unsupervised method, identifying only the densest subgraphs, and possessing a natural resistance to false positives.
In the yeast coexpression network at the t = 0.78 threshold chosen by the spectral method, 93 paracliques were found with the largest containing 21 gene transcripts. At the t = 0.55 threshold identified by choosing the top one percent of correlations, we found 636 paracliques with a maximum size of 93. The human network produced many more and larger paracliques, with 497 paracliques and the largest one containing 78 transcripts at the more conservative threshold of 0.83. The human network constructed over all tissues and replicates at the lower threshold of 0.65 contained 2, 843, 536 edges, and the Paraclique run extended for almost 2.9 hours on an Intel Pentium 4 EM64T 3.4 GHz processor. This graph contained 1283 paracliques, with the largest having 324 members.
Comparison with other results
Traditional methods
We examined the difference between the networks generated at thresholds selected by the spectral method, retaining only the strongest one percent of relationships, and filtering by statistical significance at the p <0.05 and p <0.01 levels. The statistical significance results assume all data points are present for every pair of transcripts, which may not be the case. Table 1 shows results from each one of these methods, with the "Adj. p <0.05" and "Adj. p <0.01" columns containing significance values after adjustment for multiple tests. For both data sets, the eigenvectorbased method selected a higher threshold than the other methods. While this shows that the spectral method excludes relationships that would otherwise be considered statistically significant, available computational resources and tractability of the problems involved often indicate the need to reduce the network size. For example, the network from human data at a threshold of 0.22, which would correspond to p <0.05 if all replicate tissues were averaged, has 106, 629,395 edges on 22, 283 vertices. Also, while correlations at this low magnitude would be categorized as significant, they are still rather weak and should be excluded to further reduce the false positive rate.
Table 1. Threshold values. Threshold values computed by various methods for yeast and human coexpression networks.
Table 2 lists the number of vertices and edges contained in both the yeast and human graphs at the threshold generated by the statistical significance method at both the p <0.05 and p <0.01 levels adjusted for multiple tests, as well as the spectral approach and the method of retaining the top one percent of correlations. Since the set of edges at a higher threshold is a subset of the edges at each of the lower thresholds, it is easy to see the number of relationships filtered out by each subsequent increase in threshold value. While the yeast data set is small and does not pose a significant challenge to the computation resources available, we can see that graphs constructed on the human data set become very large as the threshold is decreased. We will see that the problem becomes difficult to solve on a single processor even at the threshold selected by the relatively conservative method of retaining the top one percent of relationships.
Table 2. Vertex and edge counts. The numbers of vertices and edges in graphs constructed at thresholds identified by various methods.
To correct for multiple tests, we apply the method described in [5]. For example, the α = 0.05 significance level was divided by the number of transcripts on the array. The normal quantile function was used, followed by an inverse Fisher's z' transformation to determine the associated correlation value. This adjustment for multiple comparisons increases the standard pvalue slightly, though the significance level threshold is still very low. Such large sample sizes (n = 82, yeast; n = 158, human) tend to translate into low correlation values required for significance, even with adjustment.
Previous studies
Other spectral techniques have also previously found utility in addressing in the network threshold problem. Nearest neighbor eigenvalue spacing was used in [19] to employ random matrix theory methods for threshold selection. Here, the authors analyzed the eigenvalues of the network by examining the distribution of spacings between successive eigenvalues and determined the point at which this spacing distribution transitioned from Poisson to Gaussian Orthogonal Ensemble (GOE). [19] also studied the yeast dataset described in [14] and found that the transition began at t = 0.62 and was complete by t = 0.77. For this yeast data set, the identification of the t = 0.77 point corresponds approximately to our result of t = 0.78.
Much information is provided about the coexpression of yeast genes over the cell cycle in [14]. We compared Paraclique results with seven of the clusters of genes identified by the authors that had similar expression levels over the cell cycle. Paracliques were enumerated at the 0.78 threshold identified by the spectral method, with additional vertices being added to the paraclique if they were adjacent to at least three of the original clique members. A preliminary examination of the Paraclique results uncovered several paracliques containing portions of these clusters of genes known to be coexpressed over the cell cycle, according to [14]. A summary of the results is given in Table 3. Note that all of these comparisons were performed on an abbreviated set of cluster genes present in the heatmaps in the [14] manuscript. Single paracliques were found to contain genes from both the histone and CLB2 clusters, with 8/9 histone genes accounted for and 10/36 CLB2 genes. Paracliques also contained CLN2 and Y' cluster genes (25/58 and 19/27, respectively) though genes from each of these two clusters spanned two distinct paracliques. None of the genes from the MET, MCM, or SIC1 clusters were found. The conservative threshold value of 0.78 along with the stringency of the paraclique algorithm likely precluded the appearance of the MET, MCM, and SIC1 cluster genes. The method here was able to identify several known cell cycleregulated genes, particularly those [14] identified as forming the "tightest cluster", the histones. A more comprehensive set of comparisons utilizing all of the coexpressed genes identified in [14] as well as other sources of known coexpressed cycle cycle genes will be necessary to draw any significant conclusions. In [20], the author examined the spectral threshold selection method along with other approaches in a bootstrap analysis on three yeast data sets. The study found that the spectral threshold method produced thresholds of 0.93, 0.97, and 0.89 on yeast anoxia and reoxygenation [21] and yeast alphafactor arrest [14] data sets, respectively. Networks constructed at these thresholds contained maximum cliques of sizes 73, 17, and 15.
Table 3. Comparison with known coexpressed yeast genes
Functional comparisons
Due to the nature of the data set analyzed, genes existing in dense regions of the human coexpression network will be those that show the same pattern of expression over many tissue types, though not necessarily over or underexpressed in a single tissue type. Similarity in many samples is likely required to drive correlations to a significant level. Similarly, genes identified to be in paracliques in the yeast data set are those that vary together throughout the cell cycle. We used the GO Slim Mapper at the Saccharomyces Genome Database (SGD) [22] and Ingenuity Pathways Analysis (Ingenuity Systems, http://www.ingenuity.com webcite) to analyze some resulting paracliques in yeast and human, respectively.
In the yeast networks, we examined the biological process gene ontology category for the three largest paracliques and identified categories for which more than three genes appeared. At the t = 0.78 threshold, these paracliques were of size 21, 17, and 15. For the largest paraclique, nine of the 21 genes had unknown molecular function; 7, hydrolase activity category; 6, helicase activity; 3, RNA binding. The second paraclique showed categories of DNA binding, enzyme regulator activity, and hydrolase activity. All genes in the third appeared in the structural molecule activity category, and five in RNA binding. The three largest paracliques at the lower threshold of t = 0.55 identified by the top one percent of correlations method were of size 93, 53, and 37, respectively. Many more of these genes were found to have unknown molecular function (40, 13, and 17). The first also contained genes related to hydrolase activity, RNA binding, helicase activity, transferase activity, and nucleotidyltransferase activity. Those with more than three members in the second paraclique were transferase activity, DNA binding, hydrolase, enzyme regulator activity, protein binding, and protein kinase activity. Protein binding, hydrolase, and RNA binding were identified in the third paraclique.
For human paraclique results, we also examined the three largest paracliques at the thresholds identified by the spectral method and the "top 1%" method. At the t = 0.83 threshold, the first paraclique matched five networks containing more than three of the paraclique members. These included networks related to cellular organization, gene expression, genetic disorder, drug metabolism, and cell signaling, for example. The second paraclique matched only three networks, all related to protein synthesis. Similarly, the third network aligned with only two networks with a match of more than one gene. Both of these were related to reproductive systems development and disease, respectively, among other functions. The t = 0.65 threshold produced a maximum paraclique size of 324 which matched 14 networks with more than three genes in common, with the most enriched being related to posttranscriptional modification. The second largest paraclique matched 13 networks ranging from cellular assembly and organization, genetic disorder, to inflammatory disease, and many others. Similarly, the third paraclique matched nine networks, mostly having some relation to cancer, though some were annotated with cellular development, posttranslational modification, and developmental/genetic disorder, for example.
For yeast results, while paracliques computed at the higher threshold of t = 0.78 are understandably smaller, fewer genes are unidentified based upon their biological process. In one case, all of the genes in a paraclique fell into the same category. In paracliques on networks constructed at both the high and low thresholds, genes belonged to a wide variety of biological processes, and largely the same categories appeared within the three largest paracliques in both groups. IPA results on the three largest human paracliques shows that lower thresholds result in a larger number of networks matching the paraclique transcripts. These networks seem to be annotated with a larger range of functions compared to the relatively few networks identified at the higher thresholds. In this sense, it is possible that the higher threshold values produce paracliques that are more specific to a particular network or function, allowing us to examine the results at a finer granularity. Of course, analyzing only the three largest paracliques does not give enough information to draw definitive conclusions, and it is likely that some of the actual biological networks involved or genes belonging to these networks will have been lost by using a more stringent threshold.
Threshold effect on coexpression networks
With respect to an individual paraclique, an increase in correlation threshold can have have at least two effects, and possibly both of these: a decrease in the number of genes contained within a paraclique, or the splitting of a paraclique into two or more disjoint paracliques. These new disjoint pieces may contain additional genes that were not present in the original paraclique due to the smaller number of genes with which a new gene would have to share a connection. Both of these cases are possibly desirable when a large paraclique encompasses genes participating in a variety of biological functions. If that large paraclique is split into multiple disjoint pieces of highly connected genes, or genes connected to the paraclique at a lower correlation level are excluded, only the core set of genes putatively involved in a more focused set of biological functions or pathways remain.
We decided to identify occurrences of each of these cases in the human data set due to the availability of the rich annotation information for human results available within IPA. Using the combinatorial methods described above can become intractable at the very low threshold values corresponding to large numbers of vertices and edges identified by the statistical significance methods. Therefore, we performed a pairwise comparison between paracliques computed at the two highest threshold values selected by all of the methods studied. The degree of overlap between each paracliques in the graph constructed by choosing the highest one percent of correlations (0.65) and each of those identified at the higher spectral threshold (0.83) was found. This allowed us to determine which paracliques at the higher threshold possibly correspond to those in the graph at the lower threshold. Note that due to the nature of the Paraclique algorithm, there is not necessarily a onetoone correspondence between every paraclique in the first set with one or more paracliques in the second set.
Table 4 illustrates the case in which the number of transcripts in a paraclique was decreased when moving from a lower to a higher threshold value. A paraclique containing 24 transcripts computed at the higher threshold was found to be completely contained within a paraclique of size 47 at the lower threshold. While the paraclique at the higher threshold was much smaller, both sets of transcripts mapped to approximately the same set of IPA focus molecules, and therefore matched similar network functions. However, due to the reduced input size, the associated pvalues for enrichment in many of the biological functions were reduced (not shown). After mapping transcript IDs to gene symbols, it appeared that the increase in threshold excluded mostly a few uncharacterized genes from the paraclique.
Table 4. IPA networks from two paracliques
There are cases in which annotations for a large paraclique can be convoluted and hard to interpret. Figure 6 shows a paraclique containing 51 transcripts (representing 36 genes) at the 0.65 threshold. This paraclique shares all of the 17 transcripts (12 genes) present in a paraclique at the higher 0.83 threshold. While the larger set of genes matched five topscoring network functions in IPA, the smaller set matched only a single network related to cellular development, hematological disease, and cell morphology which was also the topscoring network at the 0.65 level. The ability to analyze these gene sets at finer levels of granularity greatly increases the confidence with which we can interpret the results.
Figure 6. Paraclique containment. The large green paraclique of 51 transcripts, A, was computed at the 0.65 threshold. Paraclique B, identified from the graph at a 0.83 threshold, contained 17 transcripts. After converting to gene symbols, A and B had an intersection of size 12 genes. The remaining genes from paraclique A may be present in other, smaller paracliques found at the 0.83 level. IPA showed that paraclique A matched several possible networks, while the smaller paraclique B matched only a single network associated with cellular development, hematological disease, and cell morphology.
The large paraclique at the left in Figure 7 was identified at the 0.65 threshold and was found to be "split" into two main components at the 0.83 threshold. While both of these components contain mostly ribosomal protein transcripts, 1170 edges were lost between the two groups by raising the threshold. Since the Paraclique method at the selected stringency requires that all but one connection exist between all paraclique members, the larger paraclique was decomposed into the two main components with a high proportion of genes overlapping with the original paraclique (54/54 and 35/42, respectively) on the right of the figure as well as several smaller pieces with an overlap of between 1 and 13 transcripts. The average correlation of the remaining edges within the two smaller paracliques was around 0.90.
Figure 7. Paraclique decomposition. The large paraclique on the left, identified at the 0.65 one percent threshold, contained 154 gene transcripts. 150 of these were contained in the core clique C1. Sixteen paracliques were found at the 0.83 spectral threshold with an intersection with this large paraclique of at least one transcript. The largest of these, labeled C and D, were of size 54 and 42, respectively. All 54 genes in paraclique C were contained in the paraclique on the left, while the intersection with D was of size 35. Edges may exist between members of the different paracliques, but are not shown for readability. Dotted lines indicate that not all connections are present between a gene and the core clique, since the paraclique requires all but one possible connection between a gene and the core maximal clique.
Conclusion
We have presented a systematic threshold selection method that makes use of spectral graph theory techniques. We have shown that in the selected data sets this method results in a more conservative approach to threshold selection than both the test of statistical significance at p <0.01 and including only the highestweighted 1% of edges, in terms of the number of relationships retained for further analysis. We believe that the primary strength of the spectral graph theorybased method presented here is that it is a systematic method for threshold selection. Both the statistical significance method and the percentage cutoff method can be adjusted to produce graphs that prove to be tractable in a combinatorial analysis and contain fewer false positives, but the need for an arbitrary cutoff value is still present in these methods. The spectral approach attempts to move beyond the need for employing these arbitrary thresholds and computes a cutoff value based upon the underlying community structure of the data rather than merely sample size or the relative distribution of correlation values. We have also shown that for the yeast cell cycle data studied, this method produces results in agreement with a previous study making use of methods from random matrix theory. Functional comparisons between networks constructed at the threshold selected by the spectral method and the method of choosing the top one percent of correlations show that the networks built at the lower threshold are often time consuming to analyze and in the yeast data set, many of the paraclique members fall into the unknown biological process category while other genes span several other GO categories. At the higher threshold, fewer of these genes fail to be categorized based upon the gene ontology. For human data, fewer networks were identified as being enriched in the paracliques, making interpretation of the results easier.
Future work may include adapting more advanced spectral clustering methods such as the kway partitioning methods described in [9,11,12] for use in threshold selection. We also plan to investigate the use of the metric of modularity [23], which serves as a quantitative measure of the proportion of intracluster edges, as a guide for determining an optimal threshold. Both of these features can be incorporated into a future graphical user interfacebased software package that can be applied to general microarray data sets to perform a spectral analysis for determining an appropriate threshold.
Methods
Microarray data sets
We studied the publiclyavailable Homo sapiens and Saccharomyces cerevisiae microarray data sets described in [15] and [14], respectively. The former contains expression values from a panel of seventynine different tissue types in human measured on Affymetrix HG_U133A gene expression microarrays at the Genomics Institute of the Novartis Research Foundation (GNF). Data was downloaded from the NCBI Gene Expression Omnibus website http://www.ncbi.nlm.nih.gov/geo/ webcite as raw CEL files and subsequently preprocessed and normalized using the R statistical software package version 2.6.1 [24] and the justRMA() function of the affy version 1.12.2 [25] Bioconductor [26] package. The latter contains expression from baker's yeast samples collected over a time period to measure changes during the cell cycle and was downloaded from the author's webpage in tab delimited format.
Network construction and representation
We constructed a gene coexpression network at increasingly stringent thresholds by beginning with a complete graph with vertices representing gene transcripts. The Pearson productmoment correlation coefficient was computed between each pair of transcripts with at least 10 data observations in common and used to weight the appropriate network edge. A highpass filter was subsequently applied to the absolute value of each edge weight, removing those edges with an absolute weight less than some threshold t. As t proceeded from 0.70 to 0.95, a coexpression network was constructed at each threshold value. Traditional nonspectral methods were used to identify connected components within the network and extract the largest for spectral analysis. The resulting unweighted graph G = (V, E) can be represented by its adjacency matrix, given by
We define a transform of the adjacency matrix, the Laplacian of the graph G, as in [8] by
where deg(i) denotes the degree of vertex i. The benefit of the Laplacian matrix is that both adjacency and degree information is readily available.
Eigenvalue and eigenvector computation
We aim to solve the eigenvalue problem on the Laplacian matrix defined above. Using notation similar to [10], this involves solving the system of equations
resulting in the eigenvalues
and associated eigenvectors
where n is the number of nodes in the component being analyzed.
The linear algebra software package MATLAB version R2008b (The Mathworks, Inc., http://www.mathworks.com webcite) was used to compute approximations to selected eigenvalues and eigenvectors of the filtered correlation network. Using the sparse matrix operations native to MATLAB and the eigs() function, the two smallest eigenvalues and their associated eigenvectors were computed. The eigenvector associated with the secondsmallest eigenvalue λ_{1 }was extracted and sorted in increasing order.
Cluster detection
The detection of "gaps" in the ordered set of eigenvector values was performed using a sliding window technique. The sliding window compares two eigenvector values windowsize positions apart, where windowsize was chosen to be five for this study. When these two values are significantly different, then the beginning of a new cluster is indicated. In this case, we define a significant difference to be greater than m + , where m is the median of all differences in positions windowsize apart and s is the standard deviation of this set of values. To prevent the many small partitions that often occur at extremely high thresholds from overwhelming the results, identified partitions less than some minimum size, in this case 10 nodes, were discarded.
Paraclique extraction
The graph theoretical algorithm Paraclique, developed by Michael A. Langston's team at the University of Tennessee and described in [17], was employed to extract dense sets of genes from resulting coexpression networks. Paraclique begins by finding a clique, or completely connected subgraph, of maximum size in the network. The maximum clique is augmented with genes connected to all but g of the clique members, with g = 1 in this case. This dense subgraph is removed from the network and the process repeats until no new paracliques larger than some minimum size can be found. For comparisons with known yeast coexpression networks, we set g = 3, which was found to incorporate more of the known coexpressed genes without significantly increasing the number of other genes present. We required the base maximum clique size to consist of at least five members for the comparison with previous yeast coexpression studies and three for all other human and yeast results.
Functional comparisons
Functional comparisons were performed using the Saccharomyces Genome Database GO Slim Viewer http://www.yeastgenome.org webcite and Ingenuity Pathways Analysis software (Ingenuity Systems, http://www.ingenuity.com webcite) for yeast and human networks, respectively.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
ADP and MAL designed the project. ADP performed the experiments, analyzed results, and drafted the paper. MAL assisted with revisions. Both authors reviewed and approved the final manuscript.
Acknowledgements
Thanks to Bhavesh Borate for helpful discussions on threshold selection in gene coexpression networks. ADP was supported by a new faculty startup package from the Department of Computer Science and Engineering, Bagley College of Engineering, and the Office of Research and Economic Development at Mississippi State University.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/10?issue=S11.
References

Wolfe CJ, Kohane IS, Butte AJ: Systematic survey revals general applicability of "guiltbyassociation" within gene coexpression networks.
BMC Bioinformatics 2005., 6(79) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Freeman TC, Goldovsky L, Brosch M, van Dongen S, Mazière P, Grocock RJ, Freilich S, Thornton J, Enright AJ: Construction, visualization, and clustering of transcription networks from microarray expression data.
PLoS Computational Biology 2007, 3(10):e206. Publisher Full Text

Ala U, Piro RM, Grassi E, Damasco C, Silengo L, Oti M, Provero P, Cunto FD: Prediction of human disease genes by humanmouse conserved coexpression analysis.
PLoS Computational Biology 2008, 4(3):e1000043. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks.
Proceedings of the National Academy of Sciences of the United States of America 2000, 97(22):1218212186. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Voy BH, Scharff JA, Perkins AD, Saxton AM, Borate B, Chesler EJ, Branstetter LK, Langston MA: Extracting gene networks for lowdose radiation using graph theoretical algorithms.
PLoS Computational Biology 2006, 2(7):e89. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets.
Genome Res 2004, 14:10851094. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Moriyama M, Hoshida Y, Otsuka M, Nishimura S, Kato N, Goto T, Taniguchi H, Shiratori Y, Seki N, Omata M: Relevance network between chemosensitivity and transcriptome in human hepatoma cells.
Molecular Cancer Therapeutics 2003, 2:199205. PubMed Abstract  Publisher Full Text

Chung FRK: Spectral Graph Theory.
Regional Conference Series in Mathematics, Providence: American Mathematical Society 1994., 92

Alpert CJ, Kahng AB, Yao SZ: Spectral partitioning with multiple eigenvectors.
Discrete Applied Mathematics 1999, 90(1–3):326. Publisher Full Text

Ding CHQ, He X, Zha H: A spectral method to separate disconnected and nearlydisconnected web graph components.
Proceedings of the Seventh ACM International Conference on Knowledge Discovery and Data Mining: 26–29 August 2001; San Francisco 2001.

Ng AY, Jordan MI, Weiss Y: On spectral clustering: analysis and an algorithm.
Advances in Neural and Information Processing Systems: 3–8 December 2001; Vancouver 2001.

Ruan J, Zhang W: Identifying network communities with a high resolution.

Perkins AD: Addressing challenges in a graphbased analysis of highthroughput biological data. PhD thesis. University of Tennessee, Department of Electrical Engineering and Computer Science; 2008.

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccaromyces cerevisiae by microarray hybridization.
Molecular Biology of the Cell 1998, 9(12):32733297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human proteinencoding transcriptomes.
Proceedings of the National Academy of Sciences of the United States of America 2004, 101(16):60626067. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shi J, Malik J: Normalized cuts and image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence 2000, 22(8):888905. Publisher Full Text

Chesler EJ, Langston MA: Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data.
RECOMB Satellite Workshop on Systems Biology and Regulatory Genomics: 2–4 December 2005; San Diego 2005.

Garey MR, Johnson DS: Computers and Intractability: A Guide to the Theory of NPCompleteness. New York: W. H. Freeman; 1979.

Luo F, Yang Y, Zhong J, Gao H, Khan L, Thompson DK, Zhou J: Constructing gene coexpression networks and predicting functions of unknown genes by random matrix theory.
BMC Bioinformatics 2007, 8:299. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Borate B: Comparative Analysis of Thresholding Algorithms for Microarrayderived Gene Correlation Matrices. In Master's thesis. The University of Tennessee; 2008.

Lai LC, Kosorukoff AL, Burke PV, Kwast KE: Metabolicstatedependent remodeling of the transcriptome in response to anoxia and subsequent reoxygenation in Saccharomyces cerevisiae.
Eukaryotic Cell 2006, 5(9):14681489. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

SGD Project: Saccharomyces Genome Database. [http://www.yeastgenome.org] webcite

Newman MEJ, Girvan M: Finding and evaluating comunity struture in networks.

R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2008.

Gautier L, Cope L, Bolstad BM, Irizarry RA: affy – analysis of Affymetrix GeneChip data at the probe level.
Bioinformatics 2004, 20(3):307315. PubMed Abstract  Publisher Full Text

Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics.
Genome Biology 2004, 5:R80. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text