Abstract
Background
The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets.
Results
Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network not only generalizes correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood based tests for assessing the significance of an edge after controlling for clustering. We present a novel MajorizationMinimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bipartite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM).
Conclusions
The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood based significance tests for edge counts, c) it directly models higherorder relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust.
Keywords:
Network decomposition; Modelbased clustering; MM algorithm; Propensity; Network conformityBackground
The research of this article was originally motivated by two types of network models: correlation networks and multigraphs. After reviewing these special network models, we describe how structural insights gained from them can be used to tackle research questions arising in the study of general networks specified by network adjacencies and more generally to unsupervised learning scenarios modeled by similarity measures.
Background: adjacency matrix and multigraphs
Networks are used to describe the pairwise relationships between n nodes (or vertices). For example, we use networks to describe the functional relationships between n genes. We consider networks that are fully specified by an n × n adjacency matrix A = (A_{ij}), whose entry A_{ij} quantifies the connection strength from node i to node j. For an unweighted network, A_{ij} equals 1 or 0, depending on whether a connection (or link or edge) exists from node i to node j.
For a weighted network, A_{ij} equals a real number between 0 and 1 specifying the connection strength from node i to node j. For an undirected network, the connection strength A_{ij} from i to j equals the connection strength A_{ji} from j to i. In other words, the adjacency matrix A is symmetric. For a directed network, the adjacency matrix is typically not symmetric. Unless we explicitly mention otherwise, we will deal with undirected networks. In this paper the diagonal entries A_{ii} of the adjacency matrix A have no special meaning. We arbitrarily set them equal to 1 (the maximum adjacency value); other authors set them equal to 0 [1].
In an (unweighted) multigraph, the adjacencies A_{ij} = n_{ij} are integers specifying the number of edges between two nodes. A general similarity matrix (whose entries are nonnegative real numbers possibly outside [0,1]) can be interpreted as a weighted multigraph. In each of the network types, the connectivities
are important statistics pertinent to finding highly connected hubs. In an unweighted network (a graph), k_{i} is the degree of node i.
Background: correlation and coexpression networks
Network methods are frequently used to analyze experiments recording levels of transcribed messenger RNA. The gene expression profiles collected across samples can be highly correlated and form modules (clusters) corresponding to protein complexes, organelles, cell types, and so forth [24]. It is natural to describe these pairwise correlations in network language. The intense interest in coexpression networks has elicited a number of new models and statistical methods for data analysis [3,511], with recent applied research focusing on differential network analysis and regulatory dysfunction [12,13].
A correlation network is a network whose adjacency matrix A = (A_{ij}) is constructed from the correlations between quantitative measurements summarized in an m × n data matrix X = (x_{ij}). The m rows of X correspond to sample measurements (subjects), and the n columns of X correspond to network nodes (genes). The jth column x_{j} of X serves as a node profile across the m samples. A correlation network adjacency matrix is constructed from the pairwise correlations Corr(x_{i},x_{j}) in either of two ways. An unweighted gene coexpression network is defined by thresholding the absolute values of the correlation matrix. A weighted adjacency matrix is a continuous transformation of the correlation matrix. For reasons explained in [5,14], it is advantageous to define the adjacency A_{ij} between two genes i and j as a power β ≥ 1 of the absolute value of their correlation coefficient; thus,
Weighted gene coexpression networks have found many important medical applications, including identifying brain cancer genes [14], characterizing obesity genes [15,16], understanding atherosclerosis [17], and locating the differences between human and chimpanzee brains [9]. One of the important steps of weighted correlation network analysis is to find network modules, usually via hierarchical clustering. Each module (cluster) is then represented by the module eigengene defined by a certain singular value decomposition (SVD). Suppose Y denotes the expression data of a single module (cluster) after the appropriate columns of X have been extracted and standardized to have mean 0 and variance 1. The SVD of Y is the decomposition Y=UDV^{t}, where the columns of U and V are orthogonal, D is a diagonal matrix with nonnegative diagonal entries (singular values) presented in descending order, and the superscript t indicates a matrix or vector transpose. The sign of the dominant singular vector u_{1} (the first column of U) is fixed by requiring a positive average correlation with the columns of Y; u_{1} is referred to as the module eigenvector or eigengene. One can show that u_{1} is an eigenvector of the m×m sample correlation matrix corresponding to the largest eigenvalue. The eigenvector u_{1} explains the maximum amount of variation in the columns of Y.
Let d_{i} be the ith singular value of Y. The eigenvector factorizability
measures how well a network factors [18]. This measure is very similar to the proportion of variation explained, . One can prove [18] that when EF(u_{1}) ≈ 1, the correlation matrix Y approximately factors as
In coexpression networks, modules are often approximately factorizable [18,19]. For a network comprised of multiple modules, it should come as no surprise that when the eigenvector factorizabilities of all modules are close to 1, the correlation network factors as
where is the module eigenvector of the module containing i, measures the intramodular connectivity (module membership) of node i with respect to its module, and measures the similarity between clusters c_{i} and c_{j}. The quantity
is called the module membership measure or conformity [18,19].
Unlike general networks, correlation networks allow assessment of the statistical significance of an edge (via a correlation test) and generate informative network statistics such as the module eigenvector. But correlation network methods can only be applied to model the correlations between numeric variables. An open question is whether correlation network methods can be generalized to general networks by defining a suitable decomposition of a general network similarity measure. In the following, we will address this question.
Results and discussion
CPBA is a sparse approximation of a similarity measure
Consider a general n×n symmetric adjacency matrix A, for example one generated by a multigraph. Because the diagonal entries of A are irrelevant, A is determined by its upperdiagonal entries. We now describe a lowrank matrix approximation to A based on partitioning the n nodes into K clusters labeled 1,…,K. Motivated by (Eq. 2), our approximation of a general similarity relies on three main ingredients. The first is a cluster assignment indicator c = (c_{i}) whose entry c_{i} equals a when node i belongs to cluster a. The cluster label a = 0 is special and is reserved for singleton nodes outside any of the “proper” clusters. The clusters are required to be nonempty except for the improper cluster 0. The second ingredient is a K×Kcluster similarity matrix R = (r_{ab}) whose entries quantify the relationships between clusters. The third and final ingredient is the propensity vector p = (p_{i}) whose components quantify the tendency (propensity) of the various nodes to form edges. The goal of cluster and propensity based approximation (CPBA) is to construct an approximation to A by optimally choosing the cluster assignment indicator c, the cluster similarity matrix R, and the propensity vector p. CPBA assumes that the adjacency matrix A_{ij} can be approximated as
The righthand side with parameters can be interpreted as a sparse parametrization of the lefthand side with parameters. In a weighted correlation network, the propensity p_{i} of node i is approximately kME_{i}^{β}. The cluster similarity r_{ab}, defined by the correlation between eigengenes, is an intuitive measure of the interactions between modules. The diagonal entries r_{aa} of R are identically 1.
Objective functions for estimating CPBA
In practice, CPBA parameters c, p, and R of a general similarity are unknown and must be estimated by optimizing a suitably defined objective function. In this article, we describe estimation methods that are based on optimizing two superficially different objective functions. Our first objective is just the squared Frobenius matrix norm
Our second objective is the Poisson loglikelihood
Our later multigraph example interprets Poisson(c,p,R) in this traditional sense. The functional form of the Poisson loglikelihood even applies when the A_{ij} are noninteger. The factorial A_{ij}!, which is irrelevant to maximization in any case, can then be defined via the gamma function. In practice maximization of the Poisson loglikelihood and minimization of the Frobenius norm yield very similar numerical updates.
In the Methods section, we describe a powerful MM algorithm for optimizing the objective functions and estimating its parameters. We now pause and briefly describe a few major applications. First, the sparse parametrization can be used to derive relationships between network statistics; our previous research highlights this possibility [18,19]. For example, the connectivity index (Eq. 1) can be approximated by
Second, since our optimization algorithms also strive to choose the best cluster assignment indicator c, they naturally give rise to clustering algorithms. Cluster reassignment is carried out node by node in a sequential fashion. For the sake of computationally efficiency, all parameters are fixed until node reassignment has stabilized. If parameters are updated as each node is visited, then the computational overhead seriously hinders analysis of networks with ten thousand nodes. Our limited experience suggests that more frequent reestimation of parameters is less likely to end with an inferior optimal configuration. Hence, the tradeoff is complex.
Other major uses depend on the underlying model. In the Frobenius setting, the model can be used to generalize conformitybased decomposition of a network as shown in Example 2. In the Poisson loglikelihood setting, our model suggest a new clustering procedure. In contrast to other clustering procedures, the CPBA models provide a means of relating clusters to each other via the cluster similarities r_{ab}. Furthermore, likelihood based objective functions permit statistical tests for assessing the significance of an edge. For example, in the multigraph model, the significance of the number of connections between two nodes can be ascertained by comparing the observed number of connections to the expected number of connections under the Poisson model. Finally, likelihood based objective functions provide a rational basis for estimating the number of clusters in a data set.
In the following three examples, we illustrate how to generalize a variety of network models to include clustering.
Example 1: Generalizing the random multigraph model
We recently explored a random multigraph model [20] that allows multiple edges to form between two nodes and edges to form with different probabilities. Edges still form independently. Under the random multigraph model, each node i is assigned a propensity p_{i}. The random number of edges between nodes i and j is then assumed to follow a Poisson distribution with mean p_{i}p_{j}. This model relies entirely on propensities and ignores cluster similarities. We will refer to it as the Pure Propensity Poisson Model (PPP) to avoid confusion with CPBA. Thus, the PPP loglikelihood is
where A_{ij}=n_{ij} is the number of edges between nodes i and j. While future work could explore alternatives to the Poisson distribution, it is attractive for several reasons. First, it is the simplest model that gives the requisite flexibility. Second, a Poisson random variable accurately approximates a sum of independent Bernoulli random variables. A binomial distribution also serves this purpose, but it imposes a hard upper bound on the number of successes. Third, the Poisson model is convenient mathematically since it yields nice MM updates in maximum likelihood estimation of the model parameters [20]. Fourth, a likelihood formulation permits testing for statistically significant connections between nodes.
Although the parametrization (Eq. 8) of PPP is flexible and computationally tractable, it ignores cluster formation. To address this limitation, we propose to exploit the CPBA parametrization. This extension is natural because many large multigraphs appear to be made up of smaller subnetworks, often referred to as modules, that are highly connected internally and only sparsely connected externally. For example, consider a coauthorship multigraph where an edge is placed between two scientists whenever they coauthor an article. Scientists working at the same institution and in the same department tend to be highly connected. Similarly, scientists tend to collaborate with other scientists working on the same research topics. Cluster structure is also inherent in biology. For instance, genes often function in pathways, and proteins often cluster in evolutionary families. Thus, when a network exhibits clustering, the propensity to form connections within a cluster is usually higher than the propensity to form connections between clusters. This phenomenon cannot be modeled using our original PPP model [20] and provides the motivation for injecting cluster similarity into the multigraph model. Our hope is that the CPBA based multigraph model will better account for differences in intracluster and intercluster connections and lead to better identification of significant connections. In the absence of an explicit model for clustering, the PPP model is likely to falter on a dataset that exhibits clustering. The most likely result is a host of significant connections between nodes in the same cluster since they all exhibit more edges than expected by chance. These types of significant connections are often uninteresting. In the above mentioned coauthorship network, the cluster structure may reflect institutional affiliations. In this case, it may be more interesting to identify pairs of researchers who publish more (or less) than is expected based on their workplace location.
To keep the number of parameters to a minimum, the cluster similarity matrix R = (r_{ab}) is assumed to be symmetric with a unit diagonal. Thus, our new random multigraph model, CPBA, adds just parameters for K clusters. As previously postulated, the number of edges between nodes i and j in clusters c_{i} and c_{j} is Poisson distributed with mean .
Example 2: Generalizing the conformitybased decomposition of a network
To demonstrate the value in our clustering model and tap into the wealth of data on weighted networks [21], we propose a clustering extension. Because weighted networks by definition have edge weights in [0,1], we drop the Poisson assumption and instead minimize the Frobenius criterion (Eq. 5). A major benefit of this model is that it generalizes the conformitybased decomposition of a network [21]. An adjacency matrix A = (A_{ij}) is exactly factorizable if and only if there exists a vector f = (f_{i}) with nonnegative elements such that
for all i≠j. In this setting, f_{i} is often called the conformity of node i. Although the term factorizable network was first proposed in [19], numerous examples of these types of networks can be found in the literature. A physical model for experimentally determined proteinprotein interactions is exactly factorizable [22]. In that model, the affinityA_{ij} between proteins i and j is the product of conformities f_{i} = exp( − α_{i}), where α_{i} is the number of hydrophobic residues on protein i. Since it can be shown that f is uniquely defined if the network contains n ≥ 3 nodes and all A_{ij} > 0 [19,21], it is easy to see that the propensity vector matches the conformity vector, p = f, when all r_{ab} = 1. Even when a network is not factorizable, our method can estimate conformities while simultaneously clustering the nodes into more factorizable modules. In addition, the entries of the cluster similarity matrix R can be interpreted as adjacencies between modules. Thus, the cluster similarity matrix represents a network whose nodes are networks themselves. In correlation network applications, we proposed a similar measure [23], and for gene networks we defined a measure of the probability of overlap between gene enrichment categories. Although these measures are useful in their respective contexts, they cannot be generalized to other networks. In contrast, by incorporating cluster similarity into our model, we have a standard way of calculating these measures for any type of network.
MM algorithm and R software implementation
Our software implementation of CPBA is freely available in the R package PropClust. On a laptop with a 2.4 GHz i5 processor and 4 GB of RAM, PropClust can estimate the parameters for 1000 nodes for a given cluster assignment in 0.1 seconds. For 3000 nodes, the same analysis takes 1 second. In practice, initial clusters are never perfect and must be reconfigured as well. PropClust adopts a block descent (or ascent) strategy that alternates cluster reassignment and parameter reestimation until clusters stabilize. Block descent takes under 10 rounds on average if initial cluster assignments are good. Note that all parameters are fixed in cluster reassignment, and all clusters are fixed in parameter reestimation. Furthermore, both steps decrease the value of the objective function. Early versions of PropClust reestimated parameters as each node was moved. This tactic proved to be too computationally burdensome on largescale problems despite its slightly better performance in finding optimal clusters.
Judicious choice of the initial clusters is realized by a divideandconquer strategy. First, hierarchical clustering coupled with dynamic branch cutting [24] is used to cluster nodes into manageable blocks of userdefined maximum size, for instance at most 1000 nodes each. Second, the CPBA algorithm is applied to each block to arrive at clusters within blocks. Our coexpression network application shows that this initialization procedure works well even in large data sets. Another way to accelerate clustering is exploit parallel processing in the MM algorithm. Parallelization of the MM algorithm is easily achieved since the parameters separate in the surrogate function and updating the propensities via (Eq. 12) and (Eq. 15) requires only the previous parameter values. Cluster reassignment avoids continuous optimization altogether and is very fast.
Simulated clusters in the Euclidean plane
Our first simulated dataset suggests a geometric interpretation of propensities and cluster similarities. For this dataset we simulated four distinct clusters on the Euclidean plane by sampling from a rotationally symmetric normal distribution with covariance matrix I and means corresponding to the four cluster centers shown in Figure 1A. The numbers of points in the four clusters were 50, 100, 150, and 200, respectively. The adjacency between two points is defined as 1−[dist / max(dist)]^{2}, where dist denotes Euclidean distance between the points and max(dist) denotes the maximum distance between any two points. Thus as depicted in Figure 1B, points closer together have a higher adjacency than those further apart. As anticipated, the MM algorithm provided the correct cluster assignments. Figure 1C also makes it evident that the estimated propensity of a point is significantly correlated to the Euclidian distance between the point and its cluster’s center. This result is expected since a connectivity k_{i} is related to a propensity p_{i} through equation (Eq. 7). Within a module, connectivity is also related to its cluster’s center through the formula
Figure 1. Four clusters were simulated in the Euclidean plane by sampling from the rotationally symmetric normal distribution with means corresponding to the different cluster centers and variance matrix I. The numbers of points in the clusters were 50, 100, 150, and 200 for the black, red, green, and blue clusters, respectively. A) A plot of the points is shown colored by cluster. B) Heatmap that colorcodes the ordered adjacency matrix, calculated using the formula A(i,j) = 1 − [Euclidean.Distance(i,j)/ max(Euclidean.Distance(i,j))]^{2}. In this plot red indicates a high adjacency, and green indicates a low adjacency. As expected, the adjacency within clusters is very high, and the adjacency between the blue and black clusters is the lowest since they are the furthest apart. C) The scatter plot between propensity (yaxis) and whole network connectivity (row sum of the adjacency matrix, Eq. 7) shows that the propensity is related to the distance between a point and its cluster’s center (given Eq. 10) in this example. D) Scatter plot between cluster similarity (yaxis) calculated using CPBA and the Euclidean distance between cluster centers (xaxis) shows a perfect negative correlation (1).
where n is the number of nodes in the cluster, x_{i} is the position of node i, and is the position of the cluster center [21]. This formula also explains why there is a separate line for each cluster in Figure 1C. Finally, Figure 1D shows that the cluster similarity r_{kl} of clusters k and l is significantly correlated to the distance between the centers of k and l. In summary, a propensity can be viewed as a measure of the centrality of a node, while a cluster similarity reflects the distance between two cluster centers.
Simulated gene coexpression network
To illustrate how CPBA generalizes to weighted correlation networks, we simulated gene expression data using the simulateDatExpr5Modules function in the WGCNA package in R [25]. Given the simulated expression data, we calculated Pearson’s correlation coefficient for each pair of genes and formed an adjacency matrix. Applying CPBA based clustering to the simulated data led to clusters that overlap perfectly with the simulated clusters. As Figure 2 depicts, the estimated propensities p_{i} are very significantly correlated to the node connectivities k_{i}. This strong relationship reflects (Eq. 7). Furthermore, as seen in Figure 2, cluster similarity is significantly correlated to true eigengene adjacency, namely, . In both simulations several different cluster assignment initializations were tried and all led to the same, correct, result.
Figure 2. Gene expression simulation results. Gene expression data were simulated using the simulateDatExpr5Modules function under the WGCNA package in R. An adjacency matrix was then calculated from the Pearson correlation coefficients for the expression levels of each pair of genes. These plots reveal the relationship between the intramodular propensity and the true module membership, kME in (Eq. 3), first in all the clusters combined (top left) and then in each of the five clusters individually. Note the strong correlation and significant pvalue in all cases.
Real gene coexpression network application to brain data
In this real data example, we demonstrate that CPBA generalizes weighted correlation network analysis and can deal with fairly large data sets. The human brain expression data in question were measured on the Affymetrix U133A platform [4]. Following Oldham et al. 2008 [4], we restrict our analysis to the roughly 10^{4} probes that were highly expressed in brain tissue. The biological modules discovered by Oldham et al. 2008 [4] via WGCNA are fairly well understood and correspond to cell types such as astrocytes, oligodendrocytes, and neurons enriched for specific cell markers. In reanalyzing these data, we defined initial clusters as sketched in our discussion of the R software implementation of CPBA. This strategy obviates the need to prespecify the number of clusters present in a data set. The results of PropClust are depicted in the second color band of Figure 3A. Overall, we find that CPBA yields modules very similar to those identified by WGCNA. The overlap with the well annotated modules of Oldham et al. [4] shows that the two clustering procedures yield meaningful and nearly equivlaent modules. CPBA has the advantage of giving cluster similarities. Figure 3B shows that eigengene based network adjacency (defined as the correlation between eigengenes raised to the softthresholding power 4) is highly correlated (r=0.93) with the cluster similarity parameter calculated by CPBA. For genes within a given module, Figures 3CE demonstrate that the node propensities estimated under CPBA are highly correlated with the module membership measures kME raised to the soft thresholding power 4. Finally, Figures 3IJ show that the connectivities k_{i} in the correlation network are highly correlated (r=0.96) with the connectivities calculated under the CPBA approximation and with the corresponding CPBA propensities (r=0.88). Figure 3K shows that there is a high correlation (r=0.93) between CPBA based connectivity (i.e. the row sum of the CPBA matrix) and the propensity parameter.
Figure 3. Human brain expression data illustrate how CPBA can be interpreted as a generalization of WGCNA. A) Hierarchical cluster tree based on WGCNA. Color bands show the WGCNA modules (first band), CPBA modules identified by propensity clustering (second band), and the modules identified by Oldham et al[9]. CPBA yields modules very similar to those identified by WGCNA. The overlap with the well annotated modules of Oldham et al[4] confirms that these clustering procedures yield meaningful modules. B) The intermodular adjacency calculated using CPBA (yaxis) is stronly correlated (r = 0.93) with its WGCNA counterpart, the correlation between eigengenes raised to the soft thresholding power. C) For nodes restricted to module 1 (turquoise in the color bands in panel A), CPBA propensity is highly correlated with its WGCNA counterpart, the module membership, kME (Eq. 3) raised to the soft thresholding power. D) and E) show analogous scatter plots for modules 2 (blue) and 3 (brown), respectively. F) The coexpression network exhibits approximate scale free topology (SFT). Specifically, the xaxis corresponds to equal width bins of the logarithm (base 10) of the connectivity (Eq. 1), and the yaxis reports the corresponding logarithm of the frequency. The approximate straight line relationship (linear model fitting index R^{2}=0.91) indicates that SFT fits very well. G) evaluates SFT for CPBA connectivity defined by the righthand side of Eq. 7. H) evaluates SFT for the propensity p_{i} only. I) The CPBA connectivity (yaxis) is highly correlated (r = 0.96) with connectivity k_{i} in the correlation network (xaxis). Genes are colored according to module assignment (PropClust color band in panel A. J) There is a high correlation (r = 0.88) between k_{i} (xaxis) and propensity (yaxis). K) There is a high correlation (r = 0.93) between CPBA based connectivity (xaxis) and propensity (yaxis).
These results demonstrate that CPBA is roughly equivalent to WGCNA in a typical coexpression network. We expect that CPBA will also be helpful in understanding network topology. For example, Figure 3F shows that the weighted coexpression network satisfies the approximate scalefree topology (SFT) property. Future research should aim to characterize the general fit of CPBA parameters to the SFT property. In this example, the CPBA based connectivities and propensities shown in Figures 3G and 3H agree well.
OMIM disease and gene networks
Here we present an application that is not amenable to correlation network models but is arguably well suited for multigraph models. Specifically, we consider a bipartite multigraph between genes and diseases based on curated data from the reference Online Mendelian Inheritance of Man (OMIM), which tracks published links between diseases and corresponding genes [26]. These data were previously studied in detail by Goh et al. [27], who showed that diseases and their associated genes are related at higher levels of cellular and organ function. In the current application we validated their functional clustering using the CPBA model.
Following Goh et al. [27], we analyzed the data in two ways. First we created a disease network by placing an edge between two diseases for each gene they were both linked to. Only the links labeled as high quality by OMIM were considered. This construction yielded a multigraph of 2552 diseases with 1401 diseases connected to at least one other disease. We created a second, complementary multigraph by placing an edge between two genes for each disease they were both linked to. For this multigraph, there were 4045 genes with 1978 genes connected to at least one other gene. As suggested by the Medical Subject Headings (MeSH) list [28], we applied the CPBA model with K=10 clusters for the gene network and K = 14 clusters for the disease network, leaving out irrelevant categories.
We categorized the diseases using MeSH with little success. Nearly half of the diseases (47%) were not mapped to any category, and another 36% were mapped to multiple categories. Using the clustering obtained from the CPBA analysis of the disease network, we looked at whether any MeSH categories were overrepresented in a cluster. Ignoring diseases present in multiple MeSH categories, we found 8 significant categories at P < 0.01, including neoplasms, musculoskeletal diseases, and eye diseases (See Table 1). Although significant results were obtained, only a handful of diseases in each cluster contributed to the statistic. Upon closer inspection of the clusters, we found that many seemingly welldefined diseases were either not mapped or multiply mapped. For example, the eye disease cluster contains morning glory disc anomaly, coloboma, best macular dystrophy, conerod retinal dystrophy, and iris hypoplasia which are all clearly eye diseases, but not classified as such by MeSH. The eye disease cluster is depicted in Figure 4.
Table 1. Overrepresented MeSH categories in the disease network
Figure 4. OMIM disease network. The intramodular connections between the nodes of the eye disease cluster are shown. Diseases are colored based on their MeSH categories, with diseases categorized as eye diseases (colored green), diseases linked to multiple categories (colored grey), and diseases that were not found (colored white). Note that more nodes should have been classified into the eye cluster by MeSH based on the name alone. Primary examples of this include retinitis pigmentosa, conerod dystrophy, retinal dystrophy, and microcornia. In spite of the failure of green labeling, these nodes were correctly classified by CPBA. Node and font sizes are proportional to a disease’s propensity.
Additionally, we found 540 significant connections between diseases at P < 0.01 and 148 significant connections at P < 0.001. The top 10 connections are listed in Table 2. The disease pair Adrenoleukodystrophy and Zellweger syndrome came in first; these two diseases are two of only three peroxisome biogenesis diseases belonging to the Zellweger spectrum [29]. It is also interesting to look for highly connected hub clusters, namely, clusters with high similarity to several other clusters. To define a measure of cluster connectivity, one can use the row sum of the cluster similarity matrix R. The neoplasm cluster has the highest row sum and is the cluster with the highest cluster connectivity. This makes sense given the complexity and diversity of cancers within the cluster.
Table 2. Disease network top 20 significant connections CPBA
Looking at the complementary gene network, we checked for overrepresentation of Gene Ontology (GO) terms using BinGO on Cytoscape [30]. We found that each cluster had an overrepresentation of many GO terms. In the cluster with the wellknown tumor suppressor protein TP53, we found 875 statistically significant GO terms at P < 0.01. Of these, 585 terms are still significant at P < 0.001 after accounting for multiple testing. The top 10 GO terms include both positive and negative regulation of cellular and biological processes, regulation of cell proliferation, anatomical structure development, regulation of apoptosis, and others that are clearly associated with TP53. Finally, we found 1316 significant connections between genes at P < 0.01 and 418 at P < 0.001. The top 10 connections are listed in Table 3. Many of these gene pairs are known to interact from other supporting evidence. For example, interaction between the top ranking pair, Hemoglobin Alpha 1 globin chain (HBA1) and Hemoglobin Subunit Beta (HBB), is confirmed by their cocrystal structure in xray crystallography [31] and by automated yeast twohybrid (Y2H) interaction mating [32]. Figure 5 depicts the full gene network derived from OMIM.
Table 3. Gene network top 20 significant connections CPBA
Figure 5. OMIM Gene Network. Genes are colored based on their cluster membership, and node size is proportional to a gene’s propensity. This view was achieved with a springembedded layout in Cytoscape using the number of edges between two genes as weights. Note that CPBA based clustering identifies modules of highly interconnected nodes.
Empirical comparison of edge statistics
In this section we compare our current CPBA model with our original Pure Propensity Poisson (PPP) model on two real datasets: the OMIM disease network and the complimentary OMIM gene network. On the whole we find that the CPBA model produces more plausible Pvalues for the edgecount tests. Conditioning on clusters enables CPBA to detect significant intercluster connections often missed by the PPP model. It also produces more reasonable Pvalues within clusters since propensities are not artificially deflated by the lack of connections between nodes from different clusters. We now consider how these trends play out in the OMIM disease network and the OMIM gene network.
In the disease network we find that, among the 20 most significant connections under the CPBA model, 5 are intercluster connections (See Table 2). Under the PPP model in contrast, none of the 20 most significant connections link different CPBA clusters (See Table 4). In fact, none of the top 50 connections of the PPP model occur between different CPBA clusters. The significant connection between Usher syndrome and retinitis pigmentosa would have gone completely unnoticed under the PPP model. This would be a shame because retinitis pigmentosa is a major symptom of Usher syndrome [33]. Another missed intercluster pairing, Waardenburg syndrome and Craniofacialdeafnesshand syndrome, also deserves recognition since both syndomes involve deafness and common facial features [34,35].
Table 4. Disease network top 20 significant connections PPP model
Comparing the intracluster connections, we find that CPBA and PPP produce similar results, with 8 connections present in both lists. However, the Pvalues of these connections differ sharply under the two models. Since the PPP model essentially assumes a single cluster, estimated propensities trend lower in response to the lack of connections between nodes from different clusters. This results in lower means for the Poisson distributions and more extreme Pvalues. This phenomenon is especially evident in the pairing between Adrenoleukodystrophy and Zellweger syndrome; in the CPBA model the test for excess edges has −Log_{10}(P) = 8.57, whereas in the PPP model −Log_{10}(P) = 12.06.
The same story holds for the gene network. Among the 20 most significant connections under CPBA, 7 are intercluster connections (Table 3). Under the PPP model the corresponding number is 0 (Table 5). One of the more interesting missed connections occurs between BDNF (brainderived neurotrophic factor) and HTR2A (seratonin receptor 2A). Both genes are associated with attention in schizophrenia [36]. As for intracluster connections, all intracluster connections found in the CPBA list are also found in the PPP list. However, the Pvalues for the most significant pair (HBB and HBA) differ by almost 5 orders of magnitude.
Table 5. Gene network top 20 significant connections PPP model
To summarize, the CPBA model was able to find significant intercluster edge counts that the PPP model missed. Indeed, the PPP model was unable to find a single signficant intercluster pair in either data set. Although conditioning on clusters resulted in less impressive intracluster Pvalues, the CPBA model was still able to detect most of the significant intracluster pairings found by the PPP model. Figure 6 provides a scatterplot of −Log_{10}(P) for all significant connections obtained under either model. Points are colored red if they represent a pairing within a cluster and black if they represent a pairing between different clusters. The figure justifies our contentions that the CPBA model is more sensitive to intercluster connections and less sensitive to intracluster connections than its less nuanced competitor. So while there will be fewer significant intracluster connections, they will arguably be more interesting. Most likely these virtues of the CPBA model carry over to other data sets.
Figure 6. OMIM CPBA versus PPP Analysis. Scatterplot of the Log_{10}(P) values obtained from analysis of OMIM using 14 and 10 clusters versus a single cluster for the Disease network and Gene network respectively. Note that the points are colored based on whether they come from a pair within a cluster(red) or between two clusters(black). This is very telling as it shows that by conditioning on the clustering, CPBA is able to increase its sensitivity in finding intercluster pairs while at the same time toning down that same trait in intracluster pairs.
Simulations for evaluating edge statistics
To drive home the last point, we took a block diagonal adjacency matrix containing 1’s in its diagonal blocks and 0’s in its offdiagonal blocks and introduced a few offblock connections. In our initial matrix with three diagonal blocks of 100, 200, and 500 nodes, we changed 60 offblock entries from 0’s to 1’s. Each pair of node sets accounted for 20 of these switches. We then analyzed the modified matrix under both the CPBA and PPP models. Figure 7 plots −Log_{10}(P) versus true adjacencies for the modified entries. Based on its identification of clusters, the CPBA model yields a better fit to the data. Comparison of −Log_{10}(P) values under the two models shows that CPBA is more adept at finding significant intercluster connections. The evidence from the receiver operating characteristic (ROC) curve is very convincing on this point. The area under the ROC curve for the CPBA model was 0.95 compared to just 0.38 for the PPP model.
Figure 7. Simulated CPBA versus PPP Analysis. Scatterplot of the −Log_{10}(P) values versus the true adjacency values obtained from 0/1 block diagonal matrix by resetting a few other entries from 0 to 1. These changed values are shown along with the resulting −Log_{10}(P) values obtained using CPBA and PPP.
Hidden relationships between fortune 500 companies
To illustrate the utility of CPBA in a nonbiological setting, we briefly describe a multigraph model of crosscompany management. Specifically, we took the Fortune 500 Companies of 2011 and put an edge between two companies for each shared member on their boards of directors. The original data is found in Freebase [37]. As discussed below, the use of the Bayesian Information criterion (BIC) or the Akaike Information criterion for estimating clusters is problematic. For example, the BIC suggests an optimal number of clusters K around 10, while the AIC gives a less plausible value of K > 20. In the following, we assume K = 10 clusters. It is noteworthy that most companies do not cluster into groups of related industries. This makes sense because conflict of interest norms preclude companies in the same field from sharing board members. Overt clustering is consequently discouraged.
Based on the underlying probability model, we ascertained the significance of the edge counts for company pairs. Table 6 lists the 10 most significant connections under the 10cluster model. Several connections stand out. The significant pairing between Fidelity National Financial and Fidelity National Information Services is rather obvious. The same holds for the pairing between Autozone and AutoNation Inc. Other connections are less obvious. The pairing between General Motors and DuPont may reflect the fact that Pierre du Pont, the founder of DuPont, at one point owned a third of all General Motors stock. This remained true until federal antitrust prosecutors filed suit, and the Supreme Court ruled against DuPont, forcing the company to dispose all of its GM shares in 1961 [38]. Although the shares are gone, it seems that some ties persist.
Table 6. Fortune 500 top 10 significant connections
Relationship to other network models and future research
Because so much effort has been devoted to the mathematical and statistical explication of complex networks, we can only touch on the relationship of the CPBA parametrization to other network models. Complex networks can be described by random graphs (the Erdös and Rényi model [39]), smallworld models (the Watts Strogatz model [40]), scalefree networks (the preferential attachment model of Barabasi and Albert [41,42]), and other growing random network (GRN) models. These models involve graphs rather than multigraphs, so the number of edges per node pair equals 0 or 1. The CPBA has interesting ramifications for random graphs with arbitrary connectivity distributions [43]. If the edges are placed randomly in a network with such connectivities, then the probability P_{ij} of observing an edge between nodes i and j is exactly factorizable. In fact, , where k_{i} is the connectivity (degree) of node i[42,44]. Thus, P_{ij} can be well approximated by CPBA with propensities and cluster similarities r_{ab}=1. The Erdös and Rényi (ER) model, which assumes uniform edge probabilities, is too restrictive for realistic networks. The CPBA parametrization adapts well to random graphs if we replace the mean edge count with the edge formation probabilities
This reformulation of the model is consistent with construction of an MM algorithm for parameter estimation [45]. Future research should explore the topological properties of such models.
Growing random networks (GRNs) are also of interest since many networks grow by the continuous addition of new nodes and exhibit preferential attachment. Thus, the likelihood of connecting to a node depends on the node’s current connectivity [4144,46,47]. At each time step of a growing random network [44], a new node is added, and a directed edge to one of the earlier nodes is created. This growing network has a directedtree graph topology whose basic elements are nodes connected by directed edges. In general, the topology of a general GRN is determined by the connection kernel A_{k}, which is the probability that a newlyintroduced node forms an edge to an existing site with k edges (k−1 incoming and 1 outgoing). Future research could explore how to define a connection kernel (or more generally a GRN) so that the resulting network can be well approximated using the CPBA of the adjacency matrix. The BarabasiAlbert (BA) model is an important special case of a GRN [41,42] that leads to a scalefree network. In the BA model, the degree of a node satisfies the powerlaw (or scalefree) distribution
For homogeneous connection kernels, A_{k}∼k^{ν}, and scale free networks only arise if ν = 1 [44]. Future research could explore whether the adjacency matrix of the BA model can be well approximated using the CPBA. Toward this end it may useful to observe that the probability P_{ij} of finding an edge between nodes i and j in the BA model is given by [42,44]
which, importantly, assumes that node i with connectivity k_{i} was added later to the growing network than node j (implying that k_{i} < k_{j}). In view of this temporal assumption, P_{ij} is not symmetrical in i and j; it also contains no parameters to capture clustering. Thus, there is no obvious relationship between the BA model and the CPBA approximation of a network. Future research can investigate how to parameterize preferential attachment so that the resulting probability P_{ij} of finding an edge fits well to the CPBA.
Relationship to other clustering methods
Although the MM algorithm that estimates the CPBA parameters naturally generates a clustering method, CPBA is not just another clustering method. Our applications highlight the utility of the parameter estimates and the resulting likelihood based tests. CPBA not only provides a sparse parametrization of a general similarity matrix, but it also identifies hub nodes and clusters and enables significance tests for excess edges (between nodes) and shared similarities (between clusters). We do not claim that CPBA based clustering outperforms existing clustering methods in the simple task of clustering.
Substitutes for CPBA clustering include hierarchical clustering, partitioning around medoids [48], spectral clustering [49], mixture models [50], component models [51], and many more [5256]. Because CPBA can be interpreted as a generalization of weighted correlation network methods, there is no need to invoke it instead of WGCNA when it comes to coexpression network applications. In modeling relationships between quantitative variables, one can use a host of other methods, for example sparse Gaussian graphical models [57,58], Bayesian networks, and structural equation models. CPBA is not meant to replace these powerful approaches for modeling relationships between quantitative variables. Its main attraction is that it applies to a general similarity measure. Since input data sometimes consists of a similarity (or dissimilarity) measure, CPBA fills a useful niche.
Conclusions
The current paper introduces the CPBA model (cluster and propensity based approximation) for general similarity measures and sketches an efficient MM algorithm for estimation of the CPBA parameters. These advances will prove valuable in dissecting networks involving functional or evolutionary modules. The CPBA model is attractive for several reasons. First, it invokes relatively few parameters while providing sufficient flexibility for modeling observed similarities. Second, the cluster similarity parameters are good at revealing higherorder relationships between clusters. The row sum of the cluster similarity matrix can be used to define a cluster connectivity measure and to identify hub clusters such as the neoplasm hub in the disease network. Third, the CPBA model naturally generalizes network approximations that are already part of scientific practice, namely, the propensity based approach in multigraph models, the conformity decomposition in weighted networks, and the eigenvector based approximation in correlation networks. Fourth, the connections to the MM algorithm make the model well adapted to highdimensional optimization. Fifth, the Poisson multigraph version of the model enables assessment of the statistical significance of edge counts and similarities between clusters. Sixth, likelihoodbased models such as the Poisson multigraph model provide a rational basis for estimating the number of clusters. While it is beyond our scope to evaluate different methods for estimating the number of clusters in a data set, it is worth mentioning that our R implementation allows users to initialize clusters via hierarchical clustering. This tactic obviates the need to prespecify the number of clusters.
Using simulated clusters in the plane and simulated coexpression networks, we demonstrate that CPBA generalizes existing methods. The planar examples show how a propensity can be intuitively seen as a measure of a node’s closeness to its cluster’s center and how a cluster similarity can be seen as a measure of proximity between two clusters. The simulated gene expression dataset exposes the CPBA model’s close ties to the previously studied concepts of intramodular connectivity, module eigengenes, and eigengene adjacency. Our analysis of real gene expression data reassures us that CPBA clustering results are similar to those of a benchmark method used in coexpression network analysis. The CPBA propensity parameters mirror the module eigengene based connectivity kME, and the cluster similarity measures mimic the network eigengenes. In our view, the main value of the CPBA model lies in generalizing correlation network methods.
To illustrate the versatility of CPBA, we applied it to the gene and disease networks of OMIM. The evidence that CPBA identifies biologically meaningful clusters is readily apparent in the significant enrichment of MeSH categories in the disease clusters and in the significant enrichment of GO terms in the gene clusters. While many other clustering procedures could have been used, CPBA has the advantage of dealing with dissimilarity measures as opposed to numeric input variables. It also provides Poisson likelihood based significance tests for edge counts (either pairs of genes and or pairs of diseases) that respect the underlying cluster structure. Finally, the row sums of the cluster similarity measure can be used to define hub clusters, and the estimated propensities can be used to define hub nodes. As we hoped, there were biologically meaningful ties between significantly connected pairs of genes and diseases. Several of these biologically plausible explanations are discussed in the text.
Although our examples are mainly biological, one can apply CPBA in many other contexts. For example, we employed CPBA to highlight shared board members among the Fortune 500 companies. This example illustrates how significant connections mirror the underlying ties between nodes. The edge count significance test suggests that the antitrust suit against GM and DuPont was no accident. To its credit, CPBA not only generalizes correlation network methods to general similarity matrices, but it also provides a valuable extension of random multigraph methods to weighted and unweighted multigraph data. CPBA is not just another clustering procedure but offers unique test statistics that permit identification of hub clusters and significant edge counts. We anticipate that the CPBA model will prove attractive to a wide range of scientists.
Methods
Maximizing the Poisson loglikelihood based objective function
Our algorithm for maximizing the Poisson loglikelihood (Eq. 6) given a clustering assignment c combines block ascent and the MM principle [5961]. Clustering proceeds by reassigning each node in turn until clusters stabilize. It may take several cycles through the nodes to achieve stability. Reassignment fixes parameters and selects the assignment with the highest loglikelihood. In the Poisson loglikelihood (Eq. 6), we take
where is the cluster similarity between clusters c_{i} and c_{j}, p_{i} is the propensity of node i, and A_{ij}=n_{ij} is the number of connections between nodes i and j.
To optimize the objective function for a given cluster assignment, we employ block ascent and alternate updating R and p. Fixing p, it is possible to to solve for the best cluster similarity parameters r_{ab} exactly. Indeed, setting the partial derivative
equal to zero and solving for yields the simple update.
We expect the estimated r_{ab} to occur within the unit interval [0,1] because edge formation is enhanced within clusters.
To update the propensity vector p with R fixed, we turn to an MM algorithm [5961]. The MM principle says we should minorize the objective function by a surrogate function and maximize the surrogate function. This action drives the objective function uphill. One function minorizes another at a point p_{m} if it is tangent to the other function at p_{m} and falls below it elsewhere. The arithmeticgeometric mean inequality
is the key to minorizing the Poisson loglikelihood. Substituting the righthand side for p_{i}p_{j} in the loglikelihood (Eq. 6) gives a surrogate function with parameters separated and leads directly to the propensity updates
In practice, this MM algorithm may require an excessive number of iterations to converge. To accelerate convergence, we employ a QuasiNewton extrapolation specifically designed for highdimensional problems (Methods and [62]). The overall ascent algorithm (outer iterations) on R and p may also be slow to converge. It can also be accelerated by the same extrapolation scheme. Accelerating both inner and outer iterations gives a fast numerically stable procedure for estimating R and p for c fixed.
Minimizing the Frobenius norm based objective function
Minimization of the Frobenius objective function (Eq. 5) employs block descent and again alternates updating R and p. In this case setting the partial derivative
equal to zero and solving for yields the simple update
To update p for R fixed, we again rely on the MM principle. However, since we now seek to minimize the objective function, we majorize it. This is accomplished by first expanding it in the form
In majorization, one is allowed to work piecemeal. Thus, we majorize the term involving (p_{i}p_{j})^{2} by the earlier arithmeticgeometric mean inequality
taking into account squares. The term involving −p_{i}p_{j} can be majorized by the inequality x ≥ 1+ lnx for x ≥ 0 in the form
Substituting upper bounds side for (p_{i}p_{j})^{2} and −p_{i}p_{j} in the expanded objective function (Eq. 14) gives a surrogate function with parameters separated and leads directly to the propensity updates
As in the Poisson case, acceleration is advisable for both inner MM iterations and the outer block descent iterations. The same QuasiNewton extrapolation [62] is pertinent and gives a fast numerically stable procedure for estimating R and p for c fixed.
Model initialization
Initial cluster assignment
Many algorithms exist for creating initial cluster assignments [56]. For most datasets these assignments only affect the time to convergence and not the converged solution. Our R software implements hierarchical clustering and does not require prespecifying the number of clusters. More specifically, our software applies average linkage hierarchical clustering with dynamic branch cutting [24]. Dissimilarities are set equal to 1 minus similarities.
Initial propensities
One way to initialize propensities is to assume a single cluster and estimate propensities as suggested in our earlier work [20]. An alternative in the Frobenius model is to initialize p_{i} by the sum of the connections of node i divided by the square root of the sum of all connections [21],
This initialization can be motivated by showing that the above equation holds if (equivalently, the network consists of a single cluster) and . While the assumption of perfect cluster similarity is unrealistic, it leads to initial values that work well in practice. For the Poisson model the analog is
Cluster similarity parameters
Because the block updates (Eq. 11) and (Eq. 13) for the cluster similarity parameters only depend on cluster assignment and propensities, it is natural to use those updates for initialization as well.
Clustering algorithm
1. Choose the objective function (Frobenius or Poisson).
2. Initialize the cluster assignment, for example, via hierarchical clustering.
3. Initialize the propensity vector p by (Eq. 16) or (Eq. 17) and the cluster similarity matrix R by (Eq. 11) or (Eq. 13).
4. Parameter Estimation: Given cluster assignments, reestimate parameters through the updates (Eq. 11) and (Eq. 12) or (Eq. 13) and (Eq. 15). Declare convergence when the objective function changes by less than a threshold, say 10^{−5}.
5. Cluster Reassignment:
(a) Randomly permute the nodes.
(b) For each node taken in order, try all possible cluster reassignments for the node.
(c) Assign the node to the cluster that leads to the biggest improvement in the objective function.
(d) Repeat step 5 until no nodes are reassigned.
6. Repeat steps 4 and 5 until no nodes are reassigned.
7. (Optional) Repeat steps 1 5 for other cluster numbers and use a cluster number estimation procedure for choosing the number of clusters.
QuasiNewton acceleration
In this section we briefly review a QuasiNetwon acceleration method described more fully in [62]. Newton’s method seeks a root of the equation 0 = x−F(x), where F(x) is a smooth map. For CPBA this is the algorithm map summarized by Equations (11) and (12) for Poisson updates or Equations (13) and (15) for Frobenius updates. Because the function G(x) = x − F(x) has differential dG(x) = I − dF(x), Newton’s method iterates according to
QuasiNetwon acceleration approximates dF(x_{n}) by a lowrank matrix M and explicitly forms the inverse (I − M)^{−1}.
Construction of M relies on secants. We can generate a secant by taking two iterates of the algorithm starting from the current iterate x_{n}. If we are close to the optimal point x_{∞}, then we have the linear approximation
where M = dF(x_{∞}). We abbreviate the secant requirement as Mu = v, where u = F(x_{n})−x_{n} and v = F∘F(x_{n})−F(x_{n}). To improve the approximation of M, one can use several secant constraints Mu_{i} = v_{i} for i = 1,…,q. These are expressed in matrix form as Mu = v. For our purposes the value q=6 works well.
Provided U has full column rank q, the minimum of the strictly convex function subject to the constraints Mu=v is attained at M=V(U^{t}U)^{−1}U^{t}[62]. Fortunately, a variant of ShermanMorrison formula [63] implies that the matrix I−M = I−V(U^{t}U)^{−1}U^{t} has the explicit inverse
Thus, the quasiNewton acceleration can be expressed as
This update involves inversion of the small q × q matrix U^{t}U − U^{t}V; all other operations reduce to matrix times vector multiplications.
Estimating the number of clusters
Estimating the number of clusters is the Achilles heel of cluster analysis. While this topic is beyond our scope, it is worth mentioning that an advantage of model based approaches is that likelihood criteria can be brought to bear. Since adding clusters entails more parameters, it is tempting to use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to estimate the number of cluster in the Poissom model [64,65]. Both of these criteria balance the tradeoff between the number of parameters and the fit of the model. Specifically these methods choose the number of clusters K that minimize AIC = − 2 ln(L) + 2c or BIC = − 2 ln(L) + c ln(n), respectively, where c is the number of parameters, L is the likelihood, and n is the sample size. We caution the reader that AIC and BIC may be inappropriate for the present task because both criteria invoke strong assumptions. For example, AIC is derived by assuming a regular model, for instance, a linear model with Gaussian noise. Hence, AIC may be inappropriate for models with latent variables such as cluster labels. BIC may be inappropriate because our approach is frequentist rather than Bayesian. A review of the limitations and utility of these criteria can be found in [66].
Ethical approval
This article involved publicly available human data sets which are completely anonymized. This study is therefore exempted from requiring ethics approval. No animal data were used. We fully comply with the Declaration of Helsinki and the “Animal Research: Reporting In Vivo Experiments” (ARRIVE) guidelines.
Availability and requirements
Project name:PropClust R package Project home page:http://www.genetics.ucla.edu/labs/horvath/PropClust webcite Operating system(s): Platform independent Programming language: R Licence: GNU GPL 3 The propensity based clustering method propensityDecomposition is implemented in the R package PropClust. The package also contains the function CPBAdecomp for carrying out the propensity decomposition of a network.
Abbreviations
BA: Barabasi Albert; CPBA: Cluster and propensity based approximation; ER: Erdos renyi; GO: Gene ontology; GRN: Growing random network; kME: Connectivity based on the module eigenvector or eigengene; MeSH: Medical subject headings; MM: Minorization maximization or majorization minimization; PPP: Pure propensity poisson; SFT: Scale free topology.
Competing interests
The authors declare that they have no conflict of interest.
Authors’ contributions
JR, SH, and KL jointly developed the methods and wrote the article. JR carried out the analysis and implemented the software. PL helped with the R software implementation and carried out the example analysis on empirical expression data. All authors read and approved the final manuscript.
Acknowledgements
This research was supported in part by United States Public Health Service grants GM53275, HG006139, MH59490, and UCLA CTSI Grant UL1TR000124.
References

von Luxburg U: A tutorial on spectral clustering.
Stat Comput 2007, 17(4):395416. Publisher Full Text

Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proc Natl Acad Sci U S A 1998, 95(25):1486314868. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Stuart JM, Segal E, Koller D, Kim SK: A genecoexpression network for global discovery of conserved genetic modules.
Science 2003, 302(5643):249255. PubMed Abstract  Publisher Full Text

Oldham M, Konopka G, Iwamoto K, Langfelder P, Kato T, Horvath S, Geschwind D: Functional organization of the transcriptome in human brain.
Nat Neurosci 2008, 11(11):12711282. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhang B, Horvath S: A general framework for weighted gene coexpression network analysis.

Huang Y, Li H, Hu H, Yan X, Waterman M, Huang H, Zhou X: Systematic discovery of functional modules and contextspecific functional annotation of human genome.
Bioinformatics 2007, 23(13):i222i229. PubMed Abstract  Publisher Full Text

Horvath S, Zhang B, Carlson M, Lu K, Zhu S, Felciano R, Laurance M, Zhao W, Shu Q, Lee Y, Scheck A, Liau L, Wu H, Geschwind D, Febbo P, Kornblum H, Cloughesy T, Nelson S, Mischel P: Analysis of oncogenic signaling networks in Glioblastoma identifies ASPM as a novel molecular target.
Proc Natl Acad Sci USA 2006, 103(46):1740217407. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Carlson M, Zhang B, Fang Z, Mischel P, Horvath S, Nelson SF: Gene connectivity, function, and sequence conservation: predictions from modular yeast coexpression networks.
BMC Genomics 2006, 7(7):40. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Oldham M, Horvath S, Geschwind D: Conservation and evolution of gene coexpression networks in human and chimpanzee brains.
Proc Natl Acad Sci U S A 2006, 103(47):1797317978. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen L, EmmertStreib F, Storey J: Harnessing naturally randomized transcription to infer regulatory relationships among genes.

Keller M, Choi Y, Wang P, Belt Davis D, Rabaglia M, Oler A, Stapleton D, Argmann C, Schueler K, Edwards S, Steinberg H, Chaibub Neto E, Kleinhanz R, Turner S, Hellerstein MK, Schadt E, Yandell B, Kendziorski C, Attie A: A gene expression network model of type 2 diabetes links cell cycle regulation in islets with diabetes susceptibility.
Genome Res 2008, 18(5):706716. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dawson J, Ye S, Kendziorski C: An empirical bayesian framework for discovering differential coexpression.

de la Fuente A: From ‘differential expression’ to ‘differential networking’ identification of dysfunctional regulatory networks in diseases.
Trends Genet 2010, 26(7):326333. PubMed Abstract  Publisher Full Text

Horvath S, Zhang B, Carlson M, Lu K, Zhu S, Felciano R, Laurance M, Zhao W, Shu Q, Lee Y, Scheck A, Liau L, Wu H, Geschwind D, Febbo P, Kornblum H, TF C, Nelson S, Mischel P: Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a novel molecular target.
Proc Natl Acad Sci U S A 2006, 103(46):1740217407. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ghazalpour A, Doss S, Zhang B, Plaisier C, Wang S, Schadt E, Thomas A, Drake T, Lusis A, Horvath S: Integrating genetics and network analysis to characterize genes related to mouse weight.

Fuller T, Ghazalpour A, Aten J, Drake T, Lusis A, Horvath S: Weighted gene coexpression network analysis strategies applied to mouse weight.
Mamm Genome 2007, 18(67):463472. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gargalovic PS, Gharavi NM, Clark MJ, Pagnon J, Yang WP, He A, Truong A, BaruchOren T, Berliner JA, Kirchgessner TG, Lusis A J: The unfolded protein response is an important regulator of inflammatory genes in Endothelial cells.
Arterioscler Thromb Vasc Biol 2006, 26(11):24902496.
[http://atvb.ahajournals.org/cgi/content/abstract/26/11/2490 webcite]
PubMed Abstract  Publisher Full Text 
Horvath S, Dong J: Geometric interpretation of gene coexpression network analysis.
PloS Comput Biol 2008, 4:8. Publisher Full Text

Dong J, Horvath S: Understanding network concepts in modules.
BMC Syst Biol 2007, 1:24. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ranola J, Ahn S, Sehl ME, Smith D, Lange K: A Poisson model for random multigraphs.
Bioinformatics 2010, 26(16):20042011. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Horvath S: Weighted Network Analysis: Applications in Genomics and Systems Biology. 1 edition. New York: Springer; 2011.

Deeds E, Ashenberg O, Shakhnovich E: A simple physical model for scaling in proteinprotein interaction networks.
Proc Natl Acad Sci U S A 2006, 103(2):311316. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Langfelder P, Horvath S: Eigengene networks for studying the relationships between coexpression modules.
BMC Syst Biol 2007, 1:54. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the dynamic tree cut library for R.
Bioinformatics 2007, 24(5):71920. PubMed Abstract  Publisher Full Text

Langfelder P, Horvath S: WGCNA: an R package for weighted correlation network analysis.
BMC Bioinformatics 2008, 9:559. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

McKusickNathans Institute of Genetic Medicine JHU: Online mendelian inheritance in man, OMIM®. [http://omim.org/ webcite]

Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL: The human disease network.
Proc Natl Acad Sci 2007, 104(21):86858690.
[http://www.pnas.org/content/104/21/8685.abstract webcite]
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Rogers F: Medical subject headings.
Bull Med Libr Assoc 1963, 51:114116. PubMed Abstract  PubMed Central Full Text

Steinberg S, Dodt G, Raymond G, Braveman N, Moser A, Moser H: Peroxisome biogenesis disorders.
Biochemica et Biophysica Acta  Mol Cell Res 2006, 1763(12):1733. Publisher Full Text

Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks.
Bioinformatics 2005, 21(16):34483449.
[http://bioinformatics.oxfordjournals.org/content/21/16/3448.abstract webcite]
PubMed Abstract  Publisher Full Text 
Shaanan B: Structure of human Oxyhaemoglobin at 2.1 a resolution.
J Mol Biol 1983, 171:3159. PubMed Abstract  Publisher Full Text

Stelzl Uea: A human proteinprotein interaction network: a resource for Annotating the Proteome.
Cell 2005, 122(6):957968. PubMed Abstract  Publisher Full Text

U.S. National Institute of Health’s informational page on Usher syndrome [http://www.nidcd.nih.gov/health/hearing/pages/usher.aspx webcite]

U.S. National Institute of Health’s informational page on Waardenburg syndrome [http://www.ncbi.nlm.nih.gov/pubmedhealth/PMH0002401/ webcite]

U.S. National Institute of Health’s informational page on Craniofacialdeafnesshand syndrome [http://ghr.nlm.nih.gov/condition/craniofacialdeafnesshandsyndrome webcite]

Alfmova M, Lezhelko T, Golimbet V, Korovaltseva G, Lavrushkina O, Kolesina N, Frolova L, Muratova A, Abramova L, Kaleda V: Investigation of association of the brainderived neurotrophic factor (BDNF) and a serotonin receptor 2A (5HTR2A) genes with voluntary and involuntary attention in schizophrenia.
Zh Nevrol Psikhiatr Im S S Korsakova 2008, 108(4):6269. PubMed Abstract

Freebase open data resource [http://www.freebase.com/ webcite]

DuPont Wikipedia entry [http://en.wikipedia.org/wiki/DuPont webcite]

Watts D, Strogatz S: Collective dynamics of ‘smallworld’ networks.
Nature 1998, 393(6684):440442. PubMed Abstract  Publisher Full Text

Barabasi A, Albert R: Emergence of scaling in random networks.
Science 1999, 286(5439):509512. PubMed Abstract  Publisher Full Text

Albert R, Barabasi A: Statistical mechanics of complex networks.
Rev Mod Phys 2002, 74:4797. Publisher Full Text

Newman MEJ, Strogatz SH, Watts DJ: Random graphs with arbitrary degree distributions and their applications.
Phys Rev E 2001, 64:026118.
[http://link.aps.org/doi/10.1103/PhysRevE.64.026118 webcite]

Krapivsky PL, Redner S, Leyvraz F: Connectivity of growing random networks.
Phys Rev Lett 2000, 85:46294632.
[http://link.aps.org/doi/10.1103/PhysRevLett.85.4629 webcite]
PubMed Abstract  Publisher Full Text 
Lange K: Numerical Analysis for Statisticians. New York: Springer; 2010.

Strogatz SH: Exploring complex networks.
Nature 2001, 410(6825):268276.
[http://dx.doi.org/10.1038/35065725 webcite]
PubMed Abstract  Publisher Full Text 
Durrett R: Random Graph Dynamics. New York: Cambridge University Press; 2006.

Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley and Sons, Inc; 1990.

Zhou D, Burges CJC: Spectral clustering and transductive learning with multiple views. In Proceedings of the 24th international conference on Machine learning, ICML ’07. New York: ACM; 2007:11591166.
[http://doi.acm.org/10.1145/1273496.1273642 webcite]

Newman MEJ, Leicht EA: Mixture models and exploratory analysis in networks.
Proc Natl Acad Sci U S A 2007, 104(23):95649569. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sinkkonen J, Aukia J, Kaski S: Component models for large networks.

Hofman JM, Wiggins CH: Bayesian approach to network modularity.
Phys Rev Lett 2008, 100(25):258701. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kemp C, Tenenbaum JB, Griffiths TL, Yamada T, Ueda N: Learning systems of concepts with an infinite relational model. In AAAI. United States: AAAI Press; 2006:381388.

Airoldi EM, Blei DM, Fienberg SE, Xing EP: Mixed membership stochastic blockmodels.
J Mach Learn Res 2008, 9:19812014. PubMed Abstract  PubMed Central Full Text

Newman M: Modularity and community structure in networks.
PNAS 2006, 103:85778582. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schaeffer SE: Graph clustering.
Comput Sci Rev 2007, 1:2764.
[http://www.sciencedirect.com/science/article/pii/S1574013707000020 webcite]
Publisher Full Text 
Yin J, Li H: A sparse conditional gaussian graphical model for analysis of genetical genomics data.
Ann Appl Stat 2011, 5(4):26302650. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

XulviBrunet R, Li H: Coexpression networks: graph properties and topological comparisons.
Bioinformatics 2010, 26(2):205214. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hunter D, Lange K: A tutorial on MM algorithms.
Am Stat 2004, 58:3037. Publisher Full Text

Wu T, Lange K: The MM alternative to EM.
Stat Sci 2010, 25(4):492505. Publisher Full Text

Zhou H, Alexander D, Lange K: A quasiNewton acceleration for highdimensional optimization algorithms.
Stat Comput 2011, 21:261273.
[http://dx.doi.org/10.1007/s1122200991663 webcite] http://dx.doi.org/10.1007/s1122200991663 webcite
PubMed Abstract  Publisher Full Text  PubMed Central Full Text 
Lange K: Numerical Analysis for Statisticians. New York: SpringerVerlag; 1999.

Akaike H: A new look at the statistical model identification.
Automatic Control, IEEE Trans 1974, 19(6):716723. Publisher Full Text

Schwarz G: Estimating the dimension of a model.
Ann Stat 1978, 6(2):461464.
[http://www.jstor.org/stable/2958889 webcite]
Publisher Full Text 
Watanabe S: Algebraic Geometry and Statistical Learning Theory. Cambridge: Cambridge University Press; 2009.