Biomathematics, University of California, Los Angeles, CA, USA

Human Genetics, UCLA, Los Angeles, CA, USA

Biostatistics, UCLA, Los Angeles, CA, USA

Statistics, UCLA, Los Angeles, CA, USA

Abstract

Background

The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets.

Results

Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network not only generalizes correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood based tests for assessing the significance of an edge after controlling for clustering. We present a novel Majorization-Minimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bi-partite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM).

Conclusions

The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood based significance tests for edge counts, c) it directly models higher-order relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust.

Background

The research of this article was originally motivated by two types of network models: correlation networks and multigraphs. After reviewing these special network models, we describe how structural insights gained from them can be used to tackle research questions arising in the study of general networks specified by network adjacencies and more generally to unsupervised learning scenarios modeled by similarity measures.

Background: adjacency matrix and multigraphs

Networks are used to describe the pairwise relationships between nodes. A network on n nodes can be represented by its **adjacency matrix** **A** = (a_{ij}).

For a **weighted network**, a_{ij} equals a real number between 0 and 1 specifying the connection strength between nodes i and j. For an undirected network, the connection strength a_{ij} from i to j equals the connection strength a_{ji} from j to i; in other words, **A** is symmetric. For a directed network, the adjacency matrix is typically not symmetric. Unless we explicitly mention otherwise, we will deal with undirected networks. In this paper the diagonal entries a_{ii} are irrelevant and will be ignored.

In an **(unweighted) multigraph**, the adjacencies a_{ij} = a_{ji} are non-negative integers specifying the number of edges between two nodes. A general similarity matrix (whose entries are non-negative real numbers possibly outside [0,1]) can be interpreted as a **weighted multigraph**. In each of the network types, the connectivities

k_{i} = Σ_{j ≠ i} a_{ij}

are important statistics pertinent to finding highly connected hubs. In an unweighted network (a graph), k_{i} is the degree of node i.
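As a concrete illustration (this sketch is not part of the PropClust software), the connectivities are simply diagonal-free row sums of the adjacency matrix:

```python
# Connectivity k_i = sum over j != i of a_ij: row sums of the
# adjacency matrix with the diagonal excluded. Works for weighted
# networks and for multigraphs (integer edge counts) alike.
def connectivities(A):
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

# Toy 3-node multigraph: two edges between nodes 0 and 1,
# one edge between nodes 1 and 2.
A = [[0, 2, 0],
     [2, 0, 1],
     [0, 1, 0]]
print(connectivities(A))  # -> [2, 3, 1]
```

In an unweighted graph this reduces to the usual node degree.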

Background: correlation- and co-expression networks

Network methods are frequently used to analyze experiments recording levels of transcribed messenger RNA. The gene expression profiles collected across samples can be highly correlated and form modules (clusters) corresponding to protein complexes, organelles, cell types, and so forth.

A correlation network is a network whose adjacency matrix **A** = (a_{ij}) is constructed from the pairwise correlations of numeric variables, e.g. a_{ij} = |cor(x_{i}, x_{j})|^{β} for a soft thresholding power β ≥ 1.

Weighted gene co-expression networks have found many important medical applications, including identifying brain cancer genes. In the following, **Y** denotes the expression data of a single module (cluster) after the appropriate columns of the full expression matrix have been selected.

Let x_{i} be the i-th column (gene expression profile) of **Y**. The eigenvector factorizability

EF(E_{1}),

defined in terms of the first eigenvector E_{1} of the correlation matrix, measures how well the network factors. When EF(E_{1}) ≈ 1, the correlation matrix of **Y** approximately factors as cor(x_{i}, x_{j}) ≈ cor(x_{i}, E_{1}) cor(x_{j}, E_{1}).

In co-expression networks, modules are often approximately factorizable, i.e.

a_{ij} ≈ f_{i} f_{j},

where f_{i} and f_{j} depend only on nodes i and j, respectively. The quantity f_{i} is called the module membership measure or conformity of node i.

Unlike general networks, correlation networks allow assessment of the statistical significance of an edge (via a correlation test) and generate informative network statistics such as the module eigenvector. But correlation network methods can only be applied to model the correlations between numeric variables. An open question is whether correlation network methods can be generalized to general networks by defining a suitable decomposition of a general network similarity measure. In the following, we will address this question.
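As a point of reference for the decomposition question raised above, a weighted correlation network adjacency can be sketched in a few lines; this toy illustration uses a conventional soft-thresholding power β = 6, which is an assumption of the sketch rather than a value fixed by this article:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_adjacency(profiles, beta=6):
    """Soft-thresholded adjacency a_ij = |cor(x_i, x_j)|^beta."""
    n = len(profiles)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i][j] = abs(pearson(profiles[i], profiles[j])) ** beta
    return A

# Two proportional profiles and one anti-correlated profile:
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
A = correlation_adjacency(profiles)
print(round(A[0][1], 4))  # -> 1.0  (|cor| = 1 between proportional profiles)
```

Note that the absolute value makes the anti-correlated pair just as strongly connected, which is the standard unsigned network convention.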

Results and discussion

CPBA is a sparse approximation of a similarity measure

Consider a general n × n similarity matrix **A** = (a_{ij}) with non-negative entries. Since the diagonal entries of **A** are irrelevant, we ignore them. The CPBA approximates each similarity by

a_{ij} ≈ p_{i} p_{j} r_{c_{i}c_{j}}.

The right-hand side involves the propensity p_{i} of node i and the cluster assignment c_{i} of node i; the propensity plays a role analogous to the module membership measure |cor(x_{i}, E_{1})|^{β} in correlation networks. The cluster similarity r_{ab} quantifies the similarity between clusters a and b. By convention, the diagonal entries r_{aa} of the cluster similarity matrix **R** are identically 1.
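To make the parametrization concrete, the following sketch (illustrative only, not the PropClust implementation) assembles the CPBA approximation from a propensity vector **p**, a cluster assignment **c**, and a cluster similarity matrix **R**:

```python
def cpba_matrix(p, c, R):
    """CPBA approximation: entry (i, j) is p_i * p_j * r_{c_i c_j};
    diagonal entries are irrelevant and set to zero."""
    n = len(p)
    return [[p[i] * p[j] * R[c[i]][c[j]] if i != j else 0.0
             for j in range(n)]
            for i in range(n)]

# Four nodes in two clusters; within-cluster similarity is 1 by
# convention (diagonal of R), between-cluster similarity is 0.2.
p = [2.0, 1.0, 1.5, 0.5]
c = [0, 0, 1, 1]
R = [[1.0, 0.2],
     [0.2, 1.0]]
A_hat = cpba_matrix(p, c, R)
print(A_hat[0][1], round(A_hat[0][2], 6))  # -> 2.0 0.6
```

Within-cluster entries reduce to the pure product p_{i} p_{j}, while between-cluster entries are damped by the cluster similarity.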

Objective functions for estimating CPBA

In practice, the CPBA parameters **c**, **p**, and **R** must be estimated from data. Our first objective function is the Frobenius norm of the approximation error,

Σ_{i<j} ( a_{ij} − p_{i} p_{j} r_{c_{i}c_{j}} )^{2}.

Our second objective is the Poisson log-likelihood

L(**p**, **R**, **c**) = Σ_{i<j} [ a_{ij} ln(p_{i} p_{j} r_{c_{i}c_{j}}) − p_{i} p_{j} r_{c_{i}c_{j}} − ln(a_{ij}!) ].

Our later multigraph example interprets a_{ij} as the realization of a Poisson random variable with mean p_{i} p_{j} r_{c_{i}c_{j}}.
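Both objective functions are straightforward to evaluate; the following toy sketch (not the package implementation) spells them out for a small multigraph, including the log-factorial term of the Poisson likelihood for completeness:

```python
import math

def frobenius_objective(A, p, c, R):
    """Sum over i < j of (a_ij - p_i p_j r_{c_i c_j})^2."""
    n = len(p)
    return sum((A[i][j] - p[i] * p[j] * R[c[i]][c[j]]) ** 2
               for i in range(n) for j in range(i + 1, n))

def poisson_loglik(A, p, c, R):
    """Sum over i < j of a_ij ln(mu_ij) - mu_ij - ln(a_ij!),
    where mu_ij = p_i p_j r_{c_i c_j} (assumed strictly positive)."""
    n = len(p)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            mu = p[i] * p[j] * R[c[i]][c[j]]
            total += A[i][j] * math.log(mu) - mu - math.lgamma(A[i][j] + 1)
    return total

A = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
p = [1.0, 2.0, 1.0]
c = [0, 0, 0]
R = [[1.0]]
print(round(frobenius_objective(A, p, c, R), 3))  # -> 2.0
```

With a single cluster and r_{11} = 1, the Poisson objective collapses to the pure propensity case discussed in Example 1 below.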

In the Methods section, we describe a powerful MM algorithm for optimizing the objective functions and estimating the model parameters. We now pause and briefly describe a few major applications. First, the sparse parametrization can be used to derive relationships between network statistics; our previous research highlights this possibility.

Second, since our optimization algorithms also strive to choose the best cluster assignment indicator **c**, they naturally give rise to clustering algorithms. Cluster reassignment is carried out node by node in a sequential fashion. For the sake of computational efficiency, all parameters are held fixed until node reassignment has stabilized. If parameters are updated as each node is visited, then the computational overhead seriously hinders analysis of networks with ten thousand nodes. On the other hand, our limited experience suggests that more frequent re-estimation of parameters is less likely to end with an inferior configuration. Hence, the tradeoff is complex.

Other major uses depend on the underlying model. In the Frobenius setting, the model can be used to generalize the conformity-based decomposition of a network, as shown in Example 2. In the Poisson log-likelihood setting, our model suggests a new clustering procedure. In contrast to other clustering procedures, the CPBA models provide a means of relating clusters to each other via the cluster similarities r_{ab}. Furthermore, likelihood based objective functions permit statistical tests for assessing the significance of an edge. For example, in the multigraph model, the significance of the number of connections between two nodes can be ascertained by comparing the observed number of connections to the expected number of connections under the Poisson model. Finally, likelihood based objective functions provide a rational basis for estimating the number of clusters in a data set.
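For instance, the edge-count test can be sketched as an exact Poisson upper-tail computation; the mean mu below would come from the fitted CPBA parameters, and the 0.01 cutoff is purely illustrative:

```python
import math

def poisson_upper_tail(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean): a one-sided test for
    an excess of edges between two nodes."""
    # Sum the complementary lower tail exactly.
    lower = sum(math.exp(-mean) * mean ** k / math.factorial(k)
                for k in range(observed))
    return 1.0 - lower

# Under CPBA the expected edge count between nodes i and j is
# mu = p_i * p_j * r_{c_i c_j}; suppose mu = 0.5 and we observe 4 edges.
pval = poisson_upper_tail(4, 0.5)
print(pval < 0.01)  # -> True
```

Observing four edges where half an edge is expected is thus flagged as significant, which is exactly the kind of excess the OMIM analyses below look for.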

In the following three examples, we illustrate how to generalize a variety of network models to include clustering.

Example 1: Generalizing the random multigraph model

We recently explored a random multigraph model in which each node i is assigned a propensity p_{i}. The random number of edges between nodes i and j is Poisson distributed with mean p_{i} p_{j}. This model relies entirely on propensities and ignores cluster similarities. We will refer to it as the Pure Propensity Poisson Model (PPP) to avoid confusion with CPBA. Thus, the PPP log-likelihood is

L(**p**) = Σ_{i<j} [ a_{ij} ln(p_{i} p_{j}) − p_{i} p_{j} − ln(a_{ij}!) ],

where a_{ij} = a_{ji} is the number of edges between nodes i and j.

Although the parametrization (Eq. 8) of PPP is flexible and computationally tractable, it ignores cluster formation. To address this limitation, we propose to exploit the CPBA parametrization. This extension is natural because many large multigraphs appear to be made up of smaller sub-networks, often referred to as modules, that are highly connected internally and only sparsely connected externally. For example, consider a co-authorship multigraph where an edge is placed between two scientists whenever they co-author an article. Scientists working at the same institution and in the same department tend to be highly connected. Similarly, scientists tend to collaborate with other scientists working on the same research topics. Cluster structure is also inherent in biology. For instance, genes often function in pathways, and proteins often cluster in evolutionary families. Thus, when a network exhibits clustering, the propensity to form connections within a cluster is usually higher than the propensity to form connections between clusters. This phenomenon cannot be modeled using our original PPP model.

To keep the number of parameters to a minimum, the cluster similarity matrix **R** = (r_{ab}) is taken to be symmetric with unit diagonal, so only its off-diagonal entries must be estimated.

Example 2: Generalizing the conformity-based decomposition of a network

To demonstrate the value of our clustering model and tap into the wealth of data on weighted networks, consider a weighted network whose adjacency matrix **A** = (a_{ij}) is approximately factorizable,

a_{ij} ≈ f_{i} f_{j}

for all i ≠ j. The quantity f_{i} is often called the **conformity** of node i. The notion of a **factorizable network** was first proposed in earlier work. A related notion is the **affinity** a_{ij} between proteins i and j in random models of protein-protein interaction networks, where f_{i} = exp( − h_{i}) and h_{i} is the number of hydrophobic residues on protein i. One can show that the conformity vector **f** is uniquely defined if the network contains at least three nodes.

MM algorithm and R software implementation

Our software implementation of CPBA is freely available in the R package PropClust. On a laptop with a 2.4 GHz i5 processor and 4 GB of RAM, PropClust can estimate the parameters for 1000 nodes for a given cluster assignment in 0.1 seconds. For 3000 nodes, the same analysis takes 1 second. In practice, initial clusters are never perfect and must be re-configured as well. PropClust adopts a block descent (or ascent) strategy that alternates cluster re-assignment and parameter re-estimation until clusters stabilize. Block descent takes under 10 rounds on average if initial cluster assignments are good. Note that all parameters are fixed during cluster re-assignment, and all clusters are fixed during parameter re-estimation. Furthermore, both steps improve the value of the objective function. Early versions of PropClust re-estimated parameters as each node was moved. This tactic proved to be too computationally burdensome on large-scale problems despite its slightly better performance in finding optimal clusters.
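The single-node reassignment step can be sketched as follows; this is a toy Python version of the strategy with parameters held fixed during the sweep (the real implementation is in Fortran 95), assuming strictly positive propensities and cluster similarities:

```python
import math

def node_loglik(i, cluster, A, p, c, R):
    """Poisson log-likelihood terms involving node i if it were placed
    in `cluster`, with propensities p and similarities R held fixed."""
    total = 0.0
    for j in range(len(p)):
        if j == i:
            continue
        mu = p[i] * p[j] * R[cluster][c[j]]
        total += A[i][j] * math.log(mu) - mu
    return total

def reassign_clusters(A, p, c, R):
    """One sequential sweep of single-node reassignment; each node
    moves to its best cluster. Returns True if any node moved."""
    moved = False
    for i in range(len(p)):
        best = max(range(len(R)),
                   key=lambda a: node_loglik(i, a, A, p, c, R))
        if best != c[i]:
            c[i] = best
            moved = True
    return moved

# Two tight pairs {0, 1} and {2, 3}; node 2 starts in the wrong cluster.
A = [[0, 3, 0, 0], [3, 0, 0, 0], [0, 0, 0, 3], [0, 0, 3, 0]]
p = [1.0, 1.0, 1.0, 1.0]
R = [[1.0, 0.1], [0.1, 1.0]]
c = [0, 0, 0, 1]
reassign_clusters(A, p, c, R)
print(c)  # -> [0, 0, 1, 1]
```

A second sweep makes no further moves, which is the stabilization criterion mentioned above.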

Judicious choice of the initial clusters is realized by a divide-and-conquer strategy. First, hierarchical clustering coupled with dynamic branch cutting supplies preliminary clusters; propensity clustering then refines these initial assignments.

Simulated clusters in the Euclidean plane

Our first simulated dataset suggests a geometric interpretation of propensities and cluster similarities. For this dataset we simulated four distinct clusters on the Euclidean plane by sampling from a rotationally symmetric normal distribution with covariance matrix **I** and means corresponding to the four cluster centers shown in the accompanying figure.


**Four clusters were simulated in the Euclidean plane by sampling from the rotationally symmetric normal distribution with means corresponding to the different cluster centers and variance matrix I.** The numbers of points in the clusters were 50, 100, 150, and 200 for the black, red, green, and blue clusters, respectively. **A**) A plot of the points is shown colored by cluster. **B**) Heatmap that color-codes the ordered adjacency matrix, calculated from the squared Euclidean distances between points (Eq. 10). In this plot red indicates a high adjacency, and green indicates a low adjacency. As expected, the adjacency within clusters is very high, and the adjacency between the blue and black clusters is the lowest since they are the furthest apart. **C**) The scatter plot of propensity (y-axis) against whole network connectivity (row sum of the adjacency matrix, Eq. 7; x-axis) shows that the propensity reflects the distance between a point and its cluster’s center (Eq. 10) in this example. **D**) Scatter plot of cluster similarity calculated using CPBA (y-axis) against the Euclidean distance between cluster centers (x-axis) shows a perfect negative correlation (−1).

where **x**_{i} and **x**_{j} denote the planar coordinates of points i and j.

Simulated gene co-expression network

To illustrate how CPBA generalizes to weighted correlation networks, we simulated gene expression data with a known module structure. The estimated propensities p_{i} are very significantly correlated with the node connectivities k_{i}. This strong relationship reflects (Eq. 7). Furthermore, as seen in the accompanying figure, the estimated cluster similarities parallel the eigengene adjacencies between modules.


**Gene expression simulation results.** Gene expression data were simulated as described in the text, and CPBA estimates are compared with their correlation network counterparts.

Real gene co-expression network application to brain data

In this real data example, we demonstrate that CPBA generalizes weighted correlation network analysis and can deal with fairly large data sets. The human brain expression data in question were measured on the Affymetrix U133A platform, and we restricted the analysis to the roughly 10^{4} probes that were highly expressed in brain tissue. The biological modules discovered by Oldham et al. 2008 were largely recovered by propensity clustering. The connectivities k_{i} in the correlation network are highly correlated (r=0.96) with the connectivities calculated under the CPBA approximation and with the corresponding CPBA propensities (r=0.88). The accompanying figure presents the detailed comparison.


**Human brain expression data illustrate how CPBA can be interpreted as a generalization of WGCNA.** **A**) Hierarchical cluster tree based on WGCNA. Color bands show the WGCNA modules (first band), CPBA modules identified by propensity clustering (second band), and the modules identified by Oldham et al. (third band). **B**) The intermodular adjacency calculated using CPBA (y-axis) is strongly correlated with its WGCNA counterpart, the eigengene adjacency (x-axis). **C**) For nodes restricted to module 1 (turquoise in the color bands in panel **A**), CPBA propensity is highly correlated with its WGCNA counterpart, the module membership kME (Eq. 3) raised to the soft thresholding power. **D**) and **E**) show analogous scatter plots for modules 2 (blue) and 3 (brown), respectively. **F**) The co-expression network exhibits approximate scale free topology (SFT). Specifically, the x-axis corresponds to equal width bins of the logarithm (base 10) of the connectivity, and the high linear model fit (R^{2}=0.91) indicates that SFT fits very well. **G**) evaluates SFT for CPBA connectivity defined by the right-hand side of Eq. 7. **H**) evaluates SFT for the propensity p_{i} only. **I**) The CPBA connectivity (y-axis) is highly correlated (r = 0.96) with the connectivity k_{i} in the correlation network (x-axis). Genes are colored according to module assignment (PropClust color band in panel **A**). **J**) There is a high correlation (r = 0.88) between k_{i} (x-axis) and propensity (y-axis). **K**) There is a high correlation (r = 0.93) between CPBA based connectivity (x-axis) and propensity (y-axis).

These results demonstrate that CPBA is roughly equivalent to WGCNA in a typical co-expression network. We expect that CPBA will also be helpful in understanding network topology. For example, panels **F** through **H** of the accompanying figure relate approximate scale free topology to the CPBA connectivity and propensity.

OMIM disease and gene networks

Here we present an application that is not amenable to correlation network models but is arguably well suited for multigraph models. Specifically, we consider a bipartite multigraph between genes and diseases based on curated data from the Online Mendelian Inheritance in Man (OMIM) database, which tracks published links between diseases and corresponding genes.

Following Goh et al., we projected the bipartite multigraph onto two ordinary multigraphs: a disease network, in which two diseases receive one edge for each gene they share, and a complementary gene network, in which two genes receive one edge for each disease they share.

We categorized the diseases using MeSH with little success. Nearly half of the diseases (47%) were not mapped to any category, and another 36% were mapped to multiple categories. Using the clustering obtained from the CPBA analysis of the disease network, we checked whether any MeSH categories were overrepresented in a cluster. Ignoring diseases present in multiple MeSH categories, we found 8 significant categories, listed in the table below.

| **Name** | **MeSH num.** | **−Log**_{10}**(P)** |
| --- | --- | --- |
| Hemic & lymphatic diseases | C15 | 8.32 |
| Eye diseases | C11 | 7.78 |
| Cardiovascular diseases | C14 | 4.23 |
| Nervous system diseases | C10 | 4.04 |
| Neoplasms | C4 | 3.37 |
| Musculoskeletal diseases | C5 | 2.91 |
| Endocrine system diseases | C19 | 2.04 |
| Congenital, hereditary, & neonatal diseases & abnormalities | C16 | 2.03 |


**OMIM disease network.** The intramodular connections between the nodes of the eye disease cluster are shown. Diseases are colored based on their MeSH categories: diseases categorized as eye diseases are colored green, diseases linked to multiple categories grey, and diseases that were not found white. Note that, judging by name alone, more nodes should have been classified into the eye category by MeSH. Primary examples include retinitis pigmentosa, cone-rod dystrophy, retinal dystrophy, and microcornea. Despite lacking a green MeSH label, these nodes were correctly classified by CPBA. Node and font sizes are proportional to a disease’s propensity.

Additionally, we found 540 significant connections between diseases; the 20 most significant are listed in the table below. To define a notion of **cluster connectivity**, one can use the row sum of the cluster similarity matrix **R**. The neoplasm cluster has the highest row sum and is therefore the cluster with the highest cluster connectivity. This makes sense given the complexity and diversity of cancers within the cluster.

| **Rank** | **Disease 1** | **Disease 2** | **C1** | **C2** | **−Log**_{10}**(P)** |
| --- | --- | --- | --- | --- | --- |
| 1 | Zellweger syndrome | Adrenoleukodystrophy | 14 | 14 | 8.57 |
| 2 | Muscular dystrophy-dystroglycanopathy (limb-girdle) | Muscular dystrophy-dystroglycanopathy (congenital) | 2 | 2 | 7.05 |
| 3 | Ullrich congenital muscular dystrophy | Bethlem myopathy | 14 | 14 | 6.48 |
| 4 | Iminoglycinuria | Hyperglycinuria | 14 | 14 | 6.48 |
| 5 | Alport syndrome | Hematuria | 14 | 14 | 5.31 |
| 6 | Colorblindness | Blue cone monochromacy | 14 | 14 | 5.31 |
| 7 | Refsum disease | Zellweger syndrome | 14 | 14 | 5.05 |
| 8 | Usher syndrome | Retinitis pigmentosa | 8 | 6 | 5.04 |
| 9 | Seckel syndrome | Microcephaly | 14 | 14 | 4.96 |
| 10 | Leukoencephalopathy with vanishing white matter | Ovarioleukodystrophy | 14 | 14 | 4.96 |
| 11 | Omenn syndrome | Severe combined immunodeficiency | 14 | 14 | 4.90 |
| 12 | Tuberous sclerosis | Lymphangioleiomyomatosis | 14 | 14 | 4.60 |
| 13 | Cone-rod dystrophy | Macular degeneration | 6 | 10 | 4.60 |
| 14 | Bronchiectasis with or without elevated sweat chloride | Pseudohypoaldosteronism | 11 | 11 | 4.47 |
| 15 | Leri-Weill dyschondrosteosis | Langer mesomelic dysplasia | 14 | 14 | 4.10 |
| 16 | Multiple pterygium syndrome | Myasthenic syndrome | 14 | 14 | 4.00 |
| 17 | Craniofacial-deafness-hand syndrome | Waardenburg syndrome | 3 | 11 | 3.77 |
| 18 | Nicotine addiction | Epilepsy | 3 | 8 | 3.76 |
| 19 | Hirschsprung disease | Pheochromocytoma | 11 | 2 | 3.70 |
| 20 | Langer mesomelic dysplasia | Short stature | 14 | 14 | 3.62 |

Looking at the complementary gene network, we checked for overrepresentation of Gene Ontology (GO) terms using BinGO in Cytoscape. The 20 most significant gene-gene connections under the CPBA model are listed in the table below.

| **Rank** | **Gene 1** | **Gene 2** | **Cluster 1** | **Cluster 2** | **−Log**_{10}**(P)** |
| --- | --- | --- | --- | --- | --- |
| 1 | HBB | HBA1 | 2 | 2 | 9.05 |
| 2 | SHOXY | SHOX | 10 | 10 | 7.36 |
| 3 | BDNF | HTR2A | 5 | 4 | 7.07 |
| 4 | SH2B3 | JAK2 | 2 | 8 | 7.05 |
| 5 | TSC2 | TSC1 | 10 | 10 | 6.28 |
| 6 | FOXC1 | PITX2 | 7 | 7 | 5.73 |
| 7 | MAPT | PSEN1 | 4 | 6 | 5.66 |
| 8 | OPN1MW | OPN1LW | 10 | 10 | 5.58 |
| 9 | COL4A4 | COL4A3 | 10 | 10 | 5.58 |
| 10 | RAG2 | RAG1 | 10 | 10 | 5.56 |
| 11 | SCNN1G | SCNN1B | 5 | 5 | 5.25 |
| 12 | HBB | KLF1 | 2 | 10 | 5.09 |
| 13 | COL6A1 | COL6A3 | 10 | 10 | 5.08 |
| 14 | COL6A2 | COL6A3 | 10 | 10 | 5.08 |
| 15 | SLC6A19 | SLC36A2 | 10 | 10 | 5.08 |
| 16 | SLC6A20 | SLC36A2 | 10 | 10 | 5.08 |
| 17 | SLC6A20 | SLC6A19 | 10 | 10 | 5.08 |
| 18 | COL6A2 | COL6A1 | 10 | 10 | 5.08 |
| 19 | GPC3 | OFD1 | 8 | 7 | 4.75 |
| 20 | LTBP2 | CYP1B1 | 10 | 7 | 4.73 |


**OMIM Gene Network.** Genes are colored based on their cluster membership, and node size is proportional to a gene’s propensity. This view was achieved with a spring-embedded layout in Cytoscape using the number of edges between two genes as weights. Note that CPBA based clustering identifies modules of highly interconnected nodes.

Empirical comparison of edge statistics

In this section we compare our current CPBA model with our original Pure Propensity Poisson (PPP) model on two real datasets: the OMIM disease network and the complementary OMIM gene network. On the whole we find that the CPBA model produces more plausible P-values for the edge-count tests. Conditioning on clusters enables CPBA to detect significant intercluster connections often missed by the PPP model. It also produces more reasonable P-values within clusters since propensities are not artificially deflated by the lack of connections between nodes from different clusters. We now consider how these trends play out in the OMIM disease network and the OMIM gene network.

In the disease network we find that, among the 20 most significant connections under the CPBA model, 5 are intercluster connections (see the table of disease pairs above). The corresponding 20 most significant connections under the PPP model, all of them intracluster, are listed in the table below.

| **Rank** | **Disease 1** | **Disease 2** | **C1** | **C2** | **−Log**_{10}**(P)** |
| --- | --- | --- | --- | --- | --- |
| 1 | Muscular dystrophy-dystroglycanopathy (limb-girdle) | Muscular dystrophy-dystroglycanopathy (congenital) | 2 | 2 | 13.31 |
| 2 | Zellweger syndrome | Adrenoleukodystrophy | 14 | 14 | 12.06 |
| 3 | Leber congenital amaurosis | Retinitis pigmentosa | 6 | 6 | 10.12 |
| 4 | Neuropathy | Charcot-Marie-Tooth disease | 12 | 12 | 8.99 |
| 5 | Blood group | Malaria | 13 | 13 | 8.76 |
| 6 | Ullrich congenital muscular dystrophy | Bethlem myopathy | 14 | 14 | 8.57 |
| 7 | Iminoglycinuria | Hyperglycinuria | 14 | 14 | 8.57 |
| 8 | Usher syndrome | Deafness | 8 | 8 | 8.48 |
| 9 | Hemolytic uremic syndrome | Macular degeneration | 10 | 10 | 8.24 |
| 10 | Bronchiectasis with or without elevated sweat chloride | Pseudohypoaldosteronism | 11 | 11 | 7.75 |
| 11 | Refsum disease | Zellweger syndrome | 14 | 14 | 7.14 |
| 12 | Meckel syndrome | Joubert syndrome | 6 | 6 | 7.08 |
| 13 | Omenn syndrome | Severe combined immunodeficiency | 14 | 14 | 6.99 |
| 14 | Left ventricular noncompaction | Cardiomyopathy | 12 | 12 | 6.97 |
| 15 | Mitochondrial complex I deficiency | Leigh syndrome | 2 | 2 | 6.85 |
| 16 | Alport syndrome | Hematuria | 14 | 14 | 6.70 |
| 17 | Colorblindness | Blue cone monochromacy | 14 | 14 | 6.70 |
| 18 | Atrial fibrillation | Long QT syndrome | 2 | 2 | 6.64 |
| 19 | Cone-rod dystrophy | Retinitis pigmentosa | 6 | 6 | 6.56 |
| 20 | Microphthalmia with coloboma | Microphthalmia | 6 | 6 | 6.46 |

Comparing the intracluster connections, we find that CPBA and PPP produce similar results, with 8 connections present in both lists. However, the P-values of these connections differ sharply under the two models. Since the PPP model essentially assumes a single cluster, estimated propensities trend lower in response to the lack of connections between nodes from different clusters. This results in lower means for the Poisson distributions and more extreme P-values. This phenomenon is especially evident in the pairing between Adrenoleukodystrophy and Zellweger syndrome; in the CPBA model the test for excess edges has −Log_{10}(P) = 8.57, while under the PPP model −Log_{10}(P) = 12.06.

The same story holds for the gene network. Among the 20 most significant connections under CPBA, 7 are intercluster connections (see the gene table above). The 20 most significant connections under the PPP model, listed in the table below, are all intracluster.

| **Rank** | **Gene 1** | **Gene 2** | **Cluster 1** | **Cluster 2** | **−Log**_{10}**(P)** |
| --- | --- | --- | --- | --- | --- |
| 1 | HBB | HBA1 | 2 | 2 | 13.87 |
| 2 | SHOXY | SHOX | 10 | 10 | 10.15 |
| 3 | SDHD | SDHB | 5 | 5 | 9.96 |
| 4 | SCNN1G | SCNN1B | 5 | 5 | 9.27 |
| 5 | RAG2 | RAG1 | 10 | 10 | 8.34 |
| 6 | TSC2 | TSC1 | 10 | 10 | 8.14 |
| 7 | SDHC | SDHB | 5 | 5 | 7.79 |
| 8 | FOXC1 | PITX2 | 7 | 7 | 7.54 |
| 9 | OPN1MW | OPN1LW | 10 | 10 | 7.43 |
| 10 | COL4A4 | COL4A3 | 10 | 10 | 7.43 |
| 11 | GDF6 | GDF3 | 7 | 7 | 7.29 |
| 12 | TERC | TERT | 9 | 9 | 7.20 |
| 13 | CISH | TIRAP | 4 | 4 | 7.12 |
| 14 | GDNF | RET | 5 | 5 | 7.04 |
| 15 | COL6A1 | COL6A3 | 10 | 10 | 6.94 |
| 16 | COL6A2 | COL6A3 | 10 | 10 | 6.94 |
| 17 | SLC6A19 | SLC36A2 | 10 | 10 | 6.94 |
| 18 | SLC6A20 | SLC36A2 | 10 | 10 | 6.94 |
| 19 | SLC6A20 | SLC6A19 | 10 | 10 | 6.94 |
| 20 | COL6A2 | COL6A1 | 10 | 10 | 6.94 |

To summarize, the CPBA model was able to find significant intercluster edge counts that the PPP model missed. Indeed, the PPP model was unable to find a single significant intercluster pair in either data set. Although conditioning on clusters resulted in less extreme intracluster P-values, the CPBA model was still able to detect most of the significant intracluster pairings found by the PPP model. The accompanying figure plots the −Log_{10}(P) values of the edge-count tests under the two models against each other.


**OMIM CPBA versus PPP Analysis.** Scatterplot of the −Log_{10}(P) values of the edge-count tests under the CPBA and PPP models for the OMIM networks.

Simulations for evaluating edge statistics

To drive home the last point, we took a block diagonal adjacency matrix containing 1’s in its diagonal blocks and 0’s in its off-diagonal blocks and introduced a few off-block connections. In our initial matrix with three diagonal blocks of 100, 200, and 500 nodes, we changed 60 off-block entries from 0’s to 1’s. Each pair of node sets accounted for 20 of these switches. We then analyzed the modified matrix under both the CPBA and PPP models. The accompanying figure plots the resulting −Log_{10}(P) values of the edge-count tests under the two models.


**Simulated CPBA versus PPP Analysis.** Scatterplot of the −Log_{10}(P) values of the edge-count tests under the CPBA and PPP models for the modified block diagonal matrix.

Hidden relationships between fortune 500 companies

To illustrate the utility of CPBA in a non-biological setting, we briefly describe a multigraph model of cross-company management. Specifically, we took the Fortune 500 Companies of 2011 and put an edge between two companies for each shared member on their boards of directors. The original data are found in Freebase.

Based on the underlying probability model, we ascertained the significance of the edge counts for company pairs. The table below lists the ten most significant pairs, together with their observed numbers of shared board members (edges).

| **Rank** | **Company 1** | **Company 2** | **−Log**_{10}**(P)** | **Edges** |
| --- | --- | --- | --- | --- |
| 1 | U.S. Bancorp | Ecolab | 6.01 | 4 |
| 2 | PetSmart | Dean Foods | 4.53 | 3 |
| 3 | Sempra Energy | Aecom Technology Corp. | 4.39 | 3 |
| 4 | General Motors | DuPont | 4.07 | 3 |
| 5 | Cardinal Health | Aon Corp. | 4.07 | 3 |
| 6 | Lockheed Martin | Monsanto | 4.07 | 2 |
| 7 | Fidelity National Financial | Fidelity National Inf. Services | 4.06 | 2 |
| 8 | Hewlett-Packard | News Corporation | 3.89 | 2 |
| 9 | AutoZone | AutoNation, Inc. | 3.8 | 3 |
| 10 | United Technologies Corporation | PACCAR | 3.74 | 2 |

Relationship to other network models and future research

Because so much effort has been devoted to the mathematical and statistical explication of complex networks, we can only touch on the relationship of the CPBA parametrization to other network models. Complex networks can be described by random graphs (for instance, the Erdös and Rényi model) or by models with prescribed degree distributions. In the latter, the probability p_{ij} of observing an edge between nodes i and j is proportional to the product k_{i} k_{j}, where k_{i} is the connectivity (degree) of node i. Such probabilities p_{ij} can be well approximated by CPBA with propensities p_{i} proportional to the connectivities k_{i} and all cluster similarities r_{ab}=1. The Erdös and Rényi (ER) model, which assumes uniform edge probabilities, is too restrictive for realistic networks. The CPBA parametrization adapts well to random graphs if we replace the Poisson mean edge count with a corresponding edge formation probability.

This reformulation of the model is consistent with the construction of an MM algorithm for parameter estimation.

Growing random networks (GRNs) are also of interest since many networks grow by the continuous addition of new nodes and exhibit preferential attachment. Thus, the likelihood of connecting to a node depends on the node’s current connectivity. This behavior is captured by the connection kernel A_{k}, which is the probability that a newly-introduced node forms an edge to an existing node with k links.

For homogeneous connection kernels, A_{k} ~ k^{ν}, and scale free networks only arise if ν = 1. In the GRN model, the probability p_{ij} of finding an edge between nodes i and j takes a product form that, importantly, assumes node j was added to the growing network later than node i (t_{i} < t_{j}). In view of this temporal assumption, p_{ij} is not symmetric in i and j. Despite this caveat, the probability p_{ij} of finding an edge fits well to the CPBA.

Relationship to other clustering methods

Although the MM algorithm that estimates the CPBA parameters naturally generates a clustering method, CPBA is not just another clustering method. Our applications highlight the utility of the parameter estimates and the resulting likelihood based tests. CPBA not only provides a sparse parametrization of a general similarity matrix, but it also identifies hub nodes and clusters and enables significance tests for excess edges (between nodes) and shared similarities (between clusters). We do not claim that CPBA based clustering outperforms existing clustering methods in the simple task of clustering.

Substitutes for CPBA clustering include hierarchical clustering, partitioning around medoids, and related partitioning methods.

Conclusions

The current paper introduces the CPBA model (cluster and propensity based approximation) for general similarity measures and sketches an efficient MM algorithm for estimation of the CPBA parameters. These advances will prove valuable in dissecting networks involving functional or evolutionary modules. The CPBA model is attractive for several reasons. First, it invokes relatively few parameters while providing sufficient flexibility for modeling observed similarities. Second, the cluster similarity parameters are good at revealing higher-order relationships between clusters. The row sum of the cluster similarity matrix can be used to define a cluster connectivity measure and to identify hub clusters such as the neoplasm hub in the disease network. Third, the CPBA model naturally generalizes network approximations that are already part of scientific practice, namely, the propensity based approach in multigraph models, the conformity decomposition in weighted networks, and the eigenvector based approximation in correlation networks. Fourth, the connections to the MM algorithm make the model well adapted to high-dimensional optimization. Fifth, the Poisson multigraph version of the model enables assessment of the statistical significance of edge counts and similarities between clusters. Sixth, likelihood-based models such as the Poisson multigraph model provide a rational basis for estimating the number of clusters. While it is beyond our scope to evaluate different methods for estimating the number of clusters in a data set, it is worth mentioning that our R implementation allows users to initialize clusters via hierarchical clustering. This tactic obviates the need to pre-specify the number of clusters.

Using simulated clusters in the plane and simulated co-expression networks, we demonstrate that CPBA generalizes existing methods. The planar examples show how a propensity can be intuitively seen as a measure of a node’s closeness to its cluster’s center and how a cluster similarity can be seen as a measure of proximity between two clusters. The simulated gene expression dataset exposes the CPBA model’s close ties to the previously studied concepts of intramodular connectivity, module eigengenes, and eigengene adjacency. Our analysis of real gene expression data reassures us that CPBA clustering results are similar to those of a benchmark method used in co-expression network analysis. The CPBA propensity parameters mirror the module eigengene based connectivity measure kME.

To illustrate the versatility of CPBA, we applied it to the gene and disease networks of OMIM. The evidence that CPBA identifies biologically meaningful clusters is readily apparent in the significant enrichment of MeSH categories in the disease clusters and in the significant enrichment of GO terms in the gene clusters. While many other clustering procedures could have been used, CPBA has the advantage of dealing with dissimilarity measures as opposed to numeric input variables. It also provides Poisson likelihood based significance tests for edge counts (either pairs of genes or pairs of diseases) that respect the underlying cluster structure. Finally, the row sums of the cluster similarity measure can be used to define hub clusters, and the estimated propensities can be used to define hub nodes. As we hoped, there were biologically meaningful ties between significantly connected pairs of genes and diseases. Several of these biologically plausible explanations are discussed in the text.

Although our examples are mainly biological, one can apply CPBA in many other contexts. For example, we employed CPBA to highlight shared board members among the Fortune 500 companies. This example illustrates how significant connections mirror the underlying ties between nodes. The edge count significance test suggests that the antitrust suit against GM and DuPont was no accident. To its credit, CPBA not only generalizes correlation network methods to general similarity matrices, but it also provides a valuable extension of random multigraph methods to weighted and unweighted multigraph data. CPBA is not just another clustering procedure but offers unique test statistics that permit identification of hub clusters and significant edge counts. We anticipate that the CPBA model will prove attractive to a wide range of scientists.

Methods

Maximizing the Poisson log-likelihood based objective function

Our algorithm for maximizing the Poisson log-likelihood (Eq. 6) given a clustering assignment **c** combines block ascent and the MM principle.

Under the model, the expected number of connections between nodes i and j is p_{i}p_{j}r_{c_{i}c_{j}}, where c_{i} and c_{j} are the cluster assignments of nodes i and j, p_{i} is the propensity of node i, r_{ab} is the similarity between clusters a and b, and x_{ij} = x_{ji} is the observed number of connections between nodes i and j.
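In code, the expected counts p_{i}p_{j}r_{c_{i}c_{j}} and the resulting Poisson log-likelihood (up to the additive factorial constant) can be evaluated directly. The sketch below is illustrative, not the package implementation; the function name `cpba_loglik` is hypothetical.

```python
import numpy as np

def cpba_loglik(x, c, p, R):
    """Poisson log-likelihood of the CPBA model, summed over pairs i < j,
    up to an additive constant (the factorial terms).

    x : symmetric (n, n) array of edge counts
    c : length-n integer array of cluster labels
    p : length-n array of propensities
    R : (K, K) symmetric cluster similarity matrix
    """
    # Expected counts lambda_ij = p_i p_j r_{c_i c_j}
    mu = np.outer(p, p) * R[np.ix_(c, c)]
    iu = np.triu_indices(len(p), 1)       # each pair once
    return np.sum(x[iu] * np.log(mu[iu]) - mu[iu])
```

For example, two nodes in one cluster with propensities 1 and 2 and r = 1 give an expected count of 2 for the single pair.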

To optimize the objective function for a given cluster assignment, we employ block ascent and alternate updating **R** and **p**. With the propensities fixed, each r_{ab} can be updated in closed form (Eq. 11) by setting the partial derivative of the log-likelihood with respect to r_{ab} equal to zero and solving for r_{ab}.

We expect the estimated r_{ab} to occur within the unit interval [0,1] because edge formation is enhanced within clusters.

To update the propensity vector **p** with **R** fixed, the arithmetic-geometric mean inequality

−p_{i}p_{j} ≥ −½[(p_{j}^{n}/p_{i}^{n})p_{i}^{2} + (p_{i}^{n}/p_{j}^{n})p_{j}^{2}],

where p^{n} denotes the current iterate, is the key to minorizing the Poisson log-likelihood. Substituting the right-hand side for −p_{i}p_{j} in the log-likelihood (Eq. 6) gives a surrogate function with parameters separated and leads directly to the propensity updates (Eq. 12).

In practice, this MM algorithm may require an excessive number of iterations to converge. To accelerate convergence, we employ a Quasi-Newton extrapolation specifically designed for high-dimensional problems (Methods). It is advisable to accelerate both the inner MM iterations updating **p** and the outer block ascent iterations alternating updates of **R** and **p**.
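To make the block ascent concrete, the following sketch implements one sweep under the model λ_{ij} = p_{i}p_{j}r_{c_{i}c_{j}}: a closed-form ratio update for each r_{ab} and an MM propensity update of the form p_i ← [p_i Σ_j x_ij / Σ_j r_{c_i c_j} p_j]^{1/2}, which follows from the arithmetic-geometric mean minorization. This is an illustrative reimplementation, not the package's Fortran code; consult (Eq. 11) and (Eq. 12) for the authoritative updates.

```python
import numpy as np

def poisson_block_ascent_step(x, c, p, R):
    """One block-ascent sweep for the Poisson CPBA model (illustrative).

    x : symmetric (n, n) array of edge counts, zero diagonal
    c : length-n integer array of cluster labels
    p : length-n array of current propensities
    R : (K, K) symmetric cluster similarity matrix
    Returns updated (p, R).
    """
    K = R.shape[0]
    # Closed-form update of r_ab: observed over expected pair totals.
    R_new = np.zeros_like(R)
    for a in range(K):
        for b in range(a, K):
            mask = np.outer(c == a, c == b)
            if a == b:
                mask = np.triu(mask, 1)   # count each within-cluster pair once
            num = (x * mask).sum()
            den = (np.outer(p, p) * mask).sum()
            R_new[a, b] = R_new[b, a] = num / den if den > 0 else 0.0
    # MM propensity update from the AM-GM minorization:
    # p_i <- sqrt(p_i * sum_j x_ij / sum_j r_{c_i c_j} p_j)
    Rcc = R_new[np.ix_(c, c)]             # r_{c_i c_j} for every pair
    np.fill_diagonal(Rcc, 0.0)            # exclude self-pairs
    denom = (Rcc * p).sum(axis=1)
    p_new = np.sqrt(p * x.sum(axis=1) / denom)
    return p_new, R_new
```

A quick sanity check: with unit propensities and block-structured counts, the update reproduces the within- and between-block mean counts as r_{ab} and leaves the propensities at their fixed point.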

Minimizing the Frobenius norm based objective function

Minimization of the Frobenius objective function (Eq. 5) employs block descent and again alternates updating **R** and **p**. With the propensities fixed, each r_{ab} is updated (Eq. 13) by setting the corresponding partial derivative of the objective function equal to zero and solving for r_{ab}.

To update **p** for fixed **R**, we again invoke the MM principle, now in its majorization form.

In majorization, one is allowed to work piecemeal. Thus, we majorize the term involving (p_{i}p_{j})^{2} by the earlier arithmetic-geometric mean inequality, taking squares into account:

(p_{i}p_{j})^{2} ≤ ½[(p_{j}^{n}/p_{i}^{n})^{2}p_{i}^{4} + (p_{i}^{n}/p_{j}^{n})^{2}p_{j}^{4}].

The term involving −p_{i}p_{j} can be majorized by the inequality

−p_{i}p_{j} ≤ −p_{i}^{n}p_{j}^{n}[1 + ln(p_{i}p_{j}) − ln(p_{i}^{n}p_{j}^{n})],

a consequence of the concavity of the logarithm.

Substituting these upper bounds for (p_{i}p_{j})^{2} and −p_{i}p_{j} in the expanded objective function (Eq. 14) gives a surrogate function with parameters separated and leads directly to the propensity updates (Eq. 15).

As in the Poisson case, acceleration is advisable for both the inner MM iterations and the outer block descent iterations. The same Quasi-Newton extrapolation applies to the alternating updates of **R** and **p**.
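Analogously, one Frobenius block-descent sweep can be sketched as follows. The propensity update p_i ← [p_i³ Σ_j a_ij r_ij p_j / Σ_j (r_ij p_j)²]^{1/4} used here is derived from the two majorizations just described; it is an illustration under those assumptions, not the package code, and (Eq. 13) and (Eq. 15) remain the authoritative updates.

```python
import numpy as np

def frobenius_block_descent_step(a, c, p, R):
    """One block-descent sweep for the Frobenius objective
    sum_{i<j} (a_ij - p_i p_j r_{c_i c_j})^2 (illustrative sketch).

    a : symmetric (n, n) similarity matrix, zero diagonal
    c, p, R : cluster labels, propensities, cluster similarity matrix
    """
    K = R.shape[0]
    pp = np.outer(p, p)
    # Least-squares update of r_ab with p fixed.
    R_new = np.zeros_like(R)
    for s in range(K):
        for t in range(s, K):
            mask = np.outer(c == s, c == t)
            if s == t:
                mask = np.triu(mask, 1)   # each within-cluster pair once
            den = ((pp * mask) ** 2).sum()
            R_new[s, t] = R_new[t, s] = (a * pp * mask).sum() / den if den > 0 else 0.0
    # MM propensity update from the two majorizations:
    # p_i <- [ p_i^3 * sum_j a_ij r_ij p_j / sum_j (r_ij p_j)^2 ]^{1/4}
    Rcc = R_new[np.ix_(c, c)]
    np.fill_diagonal(Rcc, 0.0)
    num = (a * Rcc * p).sum(axis=1)
    den = ((Rcc * p) ** 2).sum(axis=1)
    p_new = (p ** 3 * num / den) ** 0.25
    return p_new, R_new
```

When the similarity matrix satisfies the model exactly, a_ij = p_i p_j r_{c_i c_j}, both updates leave the parameters unchanged, confirming the fixed-point property of the sweep.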

Model initialization

Initial cluster assignment

Many algorithms exist for creating initial cluster assignments. Our implementation allows users to initialize clusters via hierarchical clustering.

Initial propensities

One way to initialize propensities is to assume a single cluster and estimate propensities as suggested in our earlier work. In effect, one initializes p_{i} by the sum of the connections of node i, suitably normalized (Eq. 16).

This initialization can be motivated by showing that the above equation holds exactly when the similarity x_{ij} factors as the product p_{i}p_{j} for every pair of nodes.
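A minimal version of such a single-cluster initialization is sketched below. The particular normalization shown, dividing each row sum by the square root of the total connection count, is one simple choice; it recovers the propensities exactly when x_{ij} = p_{i}p_{j} for all pairs i, j including i = j. The function name is hypothetical.

```python
import numpy as np

def init_propensities(x):
    """Single-cluster propensity initialization: p_i proportional to the
    row sum (connectivity) of node i.

    Exact when x_ij = p_i p_j for all i, j, since then the row sum is
    p_i * sum(p) and the grand total is (sum(p))^2.
    """
    k = x.sum(axis=1)             # connectivity of each node
    return k / np.sqrt(k.sum())   # p_i = x_i. / sqrt(x_..)
```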

Cluster similarity parameters

Because the block updates (Eq. 11) and (Eq. 13) for the cluster similarity parameters only depend on cluster assignment and propensities, it is natural to use those updates for initialization as well.

Clustering algorithm

1. Choose the objective function (Frobenius or Poisson).

2. Initialize the cluster assignment, for example, via hierarchical clustering.

3. Initialize the propensity vector **p** by (Eq. 16) or (Eq. 17) and the cluster similarity matrix **R** by the block updates (Eq. 11) or (Eq. 13).

4. Parameter Estimation: Given cluster assignments, re-estimate parameters through the updates (Eq. 11) and (Eq. 12) or (Eq. 13) and (Eq. 15). Declare convergence when the objective function changes by less than a threshold, say 10^{−5}.

5. Cluster Reassignment:

(a) Randomly permute the nodes.

(b) For each node taken in order, try all possible cluster reassignments for the node.

(c) Assign the node to the cluster that leads to the biggest improvement in the objective function.

(d) Repeat steps (a) through (c) until no nodes are reassigned.

6. Repeat steps 4 and 5 until no nodes are reassigned.

7. (Optional) Repeat steps 1–5 for other cluster numbers and use a cluster number estimation procedure to choose the number of clusters.
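The reassignment sweep of step 5 can be sketched as follows, with the model-specific objective passed in as a black box. This skeleton is illustrative, not the package's Fortran implementation; it is written for an objective to be maximized (negate the Frobenius criterion to use it there), and it modifies the label vector in place.

```python
import numpy as np

def reassignment_sweep(x, c, K, objective, rng=np.random.default_rng(0)):
    """One pass of step 5: visit nodes in random order and move each node
    to the cluster that most improves the objective.

    x : (n, n) similarity or count matrix
    c : length-n integer label array, modified in place
    K : number of clusters
    objective : callable (x, c) -> value to maximize
    Returns True if any node changed cluster.
    """
    changed = False
    for i in rng.permutation(len(c)):
        old = c[i]
        best_k, best_val = old, None
        for k in range(K):      # step 5(b): try every cluster for node i
            c[i] = k
            val = objective(x, c)
            if best_val is None or val > best_val:
                best_k, best_val = k, val
        c[i] = best_k           # step 5(c): keep the best reassignment
        changed = changed or best_k != old
    return changed
```

Steps 5(d) and 6 then amount to repeating the sweep, alternated with parameter re-estimation, until it returns False.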

Quasi-Newton acceleration

In this section we briefly review a Quasi-Newton acceleration method described more fully elsewhere. An MM algorithm iterates a smooth map **x**^{n+1} = F(**x**^{n}), and its fixed point **x**^{∞} satisfies the equation **0** = **x** − F(**x**).

Quasi-Newton acceleration approximates Newton's method for solving this fixed-point equation. Because computing and inverting the exact differential dF(**x**) is impractical in high dimensions, we approximate dF(**x**) by a low-rank matrix **M**.

Construction of **M** relies on secants. We can generate a secant by taking two iterates of the algorithm starting from the current iterate **x**: the difference vectors **u** = F(**x**) − **x** and **v** = F∘F(**x**) − F(**x**) approximately satisfy the secant condition **Mu** = **v**, where **M** = dF(**x**^{∞}). Collecting q such secant pairs over successive iterations yields the matrices **U** = (**u**_{1},…,**u**_{q}) and **V** = (**v**_{1},…,**v**_{q}).

Provided **U** has full column rank, the choice **M** = **V**(**U**^{T}**U**)^{−1}**U**^{T} satisfies the secant conditions **MU** = **V** in the least-squares sense.

Thus, the quasi-Newton acceleration can be expressed as

**x**^{n+1} = F(**x**^{n}) − **V**(**U**^{T}**U** − **U**^{T}**V**)^{−1}**U**^{T}[**x**^{n} − F(**x**^{n})].

This update involves inversion of the small q × q matrix **U**^{T}**U** − **U**^{T}**V** rather than a matrix of the dimension of **x**.
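One accelerated step of this secant-based scheme can be sketched as follows; this is an illustration of the idea under the assumptions above (function name hypothetical), not the production code. For a linear map F, q secants spanning the space make the step land exactly on the fixed point.

```python
import numpy as np

def qn_accelerate(F, x, q=2):
    """One quasi-Newton accelerated step for the fixed-point map F,
    using q secant pairs built from ordinary iterates."""
    # Collect secants u_k = F(x_k) - x_k and v_k = F(F(x_k)) - F(x_k).
    U, V = [], []
    xk = x
    for _ in range(q):
        fx = F(xk)
        U.append(fx - xk)
        V.append(F(fx) - fx)
        xk = fx
    U = np.column_stack(U)
    V = np.column_stack(V)
    fx = F(x)
    # Newton step for 0 = x - F(x) with dF approximated by V(U^T U)^{-1}U^T;
    # only the small q x q matrix U^T U - U^T V is inverted.
    small = U.T @ U - U.T @ V
    return fx - V @ np.linalg.solve(small, U.T @ (x - fx))
```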

Estimating the number of clusters

Estimating the number of clusters is the Achilles heel of cluster analysis. While this topic is beyond our scope, it is worth mentioning that an advantage of model based approaches is that likelihood criteria can be brought to bear. Since adding clusters entails more parameters, it is tempting to use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to estimate the number of clusters in the Poisson model.
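For a fitted Poisson CPBA model these criteria take their usual form. The sketch below assumes a parameter count of n propensities plus K(K+1)/2 cluster similarities and uses the number of node pairs as the BIC sample size, which is one of several reasonable conventions; the function name is hypothetical.

```python
import numpy as np

def cpba_aic_bic(loglik, n_nodes, K):
    """AIC and BIC for a fitted K-cluster Poisson CPBA model.

    Parameter count d: n propensities plus K(K+1)/2 cluster similarities.
    BIC sample size: the number of node pairs n(n-1)/2.
    """
    d = n_nodes + K * (K + 1) // 2
    n_pairs = n_nodes * (n_nodes - 1) // 2
    aic = -2.0 * loglik + 2 * d
    bic = -2.0 * loglik + d * np.log(n_pairs)
    return aic, bic
```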

Ethical approval

This article involved publicly available human data sets that are completely anonymized. This study is therefore exempt from requiring ethics approval. No animal data were used. We fully comply with the Declaration of Helsinki and the “Animal Research: Reporting In Vivo Experiments” (ARRIVE) guidelines.

Availability and requirements

**Project name:**
**Project home page:**
**Operating system(s):** Platform independent
**Programming language:** R
**Licence:** GNU GPL 3

The propensity based clustering method is implemented in Fortran 95 and bundled in a freely available R package.

Abbreviations

BA: Barabási–Albert; CPBA: Cluster and propensity based approximation; ER: Erdős–Rényi; GO: Gene Ontology; GRN: Growing random network; kME: Connectivity based on the module eigenvector or eigengene; MeSH: Medical Subject Headings; MM: Minorization-maximization or majorization-minimization; PPP: Pure propensity Poisson; SFT: Scale-free topology.

Competing interests

The authors declare that they have no conflict of interest.

Authors’ contributions

JR, SH, and KL jointly developed the methods and wrote the article. JR carried out the analysis and implemented the software. PL helped with the R software implementation and carried out the example analysis on empirical expression data. All authors read and approved the final manuscript.

Acknowledgements

This research was supported in part by United States Public Health Service grants GM53275, HG006139, MH59490, and UCLA CTSI Grant UL1TR000124.