Abstract
Background
It is of great importance to identify molecular processes and pathways that are involved in disease etiology. Although there has been an extensive use of various highthroughput methods for this task, pathogenic pathways are still not completely understood. Often the set of genes or proteins identified as altered in genomewide screens show a poor overlap with canonical disease pathways. These findings are difficult to interpret, yet crucial in order to improve the understanding of the molecular processes underlying the disease progression. We present a novel method for identifying groups of connected molecules from a set of differentially expressed genes. These groups represent functional modules sharing common cellular function and involve signaling and regulatory events. Specifically, our method makes use of Bayesian statistics to identify groups of coregulated genes based on the microarray data, where external information about molecular interactions and connections are used as priors in the group assignments. Markov chain Monte Carlo sampling is used to search for the most reliable grouping.
Results
Simulation results showed that the method improved the ability of identifying correct groups compared to traditional clustering, especially for small sample sizes. Applied to a microarray heart failure dataset the method found one large cluster with several genes important for the structure of the extracellular matrix and a smaller group with many genes involved in carbohydrate metabolism. The method was also applied to a microarray dataset on melanoma cancer patients with or without metastasis, where the main cluster was dominated by genes related to keratinocyte differentiation.
Conclusion
Our method found clusters overlapping with known pathogenic processes, but also pointed to new connections extending beyond the classical pathways.
Background
Highthroughput experimental techniques provide objective views of the molecular changes that occur in cells during disease progression. A widely used tool when trying to understand the biological meaning of the observed changes is pathway analysis. In pathway analysis, the list of significantly altered genes or molecules is mapped onto usually precompiled pathways. A wide range of methods for finding predefined pathways overrepresented by significant genes have been developed and are steadily used, e.g. [13]. However, this strategy for identifying the important disease pathways is limited, as pathways are not fixed and change with context [4]. As a consequence, many of the significantly altered genes or molecules fall outside of the expected pathways [5].
An alternative approach, which is fundamentally different from the pathway analysis just described, is a purely datadriven approach, in which the experimental data is used without any guidance from prior knowledge about molecular interactions. The goal is then to discover hidden patterns of correlation between genes reflecting the complex processes and pathways that underlie cellular metabolism and physiology [6]. The most common approach for this task is unsupervised clustering. This class of methods was of the first to be applied to microarray data [7], and several approaches have been proposed, e.g hierarchical and kmeans clustering, Prediction Around Medoids (PAM) [8], modelbased clustering via Markov chain Monte Carlo (MCMC), e.g [9,10], tight clustering [11] and clustering via networks [12]. Another way of learning molecular relationships from microarray data is to use Bayesian network (BN) methodology. Instead of finding groups of correlated molecules, BN methodology infers a network of (direct) interacting molecules. BNs were first proposed for microarray data in 2000 [13], and have also been much applied.
Along with the development of such unsupervised methods, there has been a steady development of prior information about molecular interactions. In addition to the mentioned pathways, there are many databases containing pairwise connections, such as proteinprotein interactions, transcription factors binding to genes and protein sequence similarities. Such information is maybe a more reliable source of information than pathways, as the latter lack proper definitions, and are as already mentioned heavily context dependent.
Bayes’ formula is well suited for taking prior information into account during inference, and says
The goal of the analysis is the posterior probability p (MD), i.e. the probability that the model (M) is correct given the data (D). In our setting, the model M represents a clustering or a network configuration, whereas D represents the microarray data. Many methods using BN methodology to incorporate informative priors have been proposed quite recently, e.g [1417]. However, even with the help of prior information, inferring Bayesian networks for large numbers of genes remains computationally challenging, and the number of genes that can be included is usually smaller than one hundred. Inference about groups or clusters of genes is a less daunting task, as there are fewer group/cluster configurations than network configurations. For cluster analysis, there has been some, but not many, suggestions for how to make use of prior knowledge, e.g [18]. Many Bayesian modelbased clustering methods have been proposed, e.g. [9,10,19,20] but common to all of these is that they use noninformative priors. Bayesian clustering with truly informative priors has to our knowledge not previously been proposed.
In the present paper, we develop a novel method for clustering using Bayesian statistics and informative priors making use of Markov chain Monte Carlo (MCMC) sampling. We name our method MCIP (MCMC Clustering with Informative Priors). In our method, coregulation patterns in the microarray data are used together with prior knowledge to find groups or modules of interacting genes. Specifically, we propose a method which searches for an optimal partitioning of the genes into functional modules, where module assignment is based on both microarray data and prior knowledge. The prior value for assigning genes to the same module is retrieved from databases containing gene pairs with a previously reported interaction or connection. Markov chain Monte Carlo (MCMC) sampling is used to search for the most reliable modules. Having found the modules, we generate subnetworks consisting of the prior pairs within each module, hypothesizing that these pairs represent the direct interactions. In the presented work we use prior information in forms of proteinprotein interaction data, transcription factor binding predictions and protein sequence similarities. We apply our method to simulated data as well as two realworld microarray data sets, one in heart failure and one in melanoma cancer.
Methods
Assume we have a genomic dataset D, e.g. a microarray dataset, with measurements of n genes. We want to use D together with prior knowledge about molecular interactions to find the most likely
partitioning into groups or modules of functionally related genes. Now assume that
the n genes represent K different modules, where K is not known in advance. Let M = {(i_{m},j_{m},p_{m}),m ∈ {1,…,q}} be the collection of q gene pairs for which prior knowledge exist, where
Now let g = (g_{1},…,g_{n}) be a representation of group/module membership for each of the n genes, where g_{j }∈ {1,…,K}. We want to find the most likely division g into modules given the data as well as the prior knowledge, or formally, find the g maximizing P(g ∣ D,M). Using Bayes’ formula, we have
Marginal likelihood given grouping
We start by finding the marginal likelihood given the subdivision into different groups (modules/clusters) and the prior information, i.e. P (D∣g,M). Assume the data in group k follow a multivariate normal distribution with covariance matrix Σ_{k}, whose prior is inverse Wishart f (Ψ_{k},m_{k}) where Ψ_{k }describes the prior covariance matrix and m_{k }describes the sharpness of the prior. Assume also that all prior information about the data is contained in M, and that genes belonging to different groups are independent of each other. Consequently, we can let each group k have prior covariance matrix Ψ_{k }= ρ_{k}I_{k}, where I_{k }is a n_{k }× n_{k }identity matrix and n_{k }is the number of genes in group k. We thus have that
where N is the number of rows in D (number of individuals), X_{k }is the data matrix for genes belonging to group k and Γ_{i }() is the multivariate gamma function. If the gene expressions are standardized to have mean values equal to zero and variances equal to one, it is natural to choose ρ_{k }= 1 and m_{k }= n_{k }+ 2, as this yields expected values of 1 for the variances, i.e. for the diagonal elements. The expectancy of the inverse Wishart distribution is defined only for m_{k }> n_{k }+ 1, so this is a minimalistic way of coding for the knowledge that the data has been standardized.
Probability of grouping given prior information
Next, we want to find P (g ∣ M), i.e. the probability of grouping g given the prior information M. Assume we have n genes and K ∈ {1,…,n} groups. Let N (n,K) denote the number of possible subdivisions of n genes into K groups. A new gene can be inserted into one of the K existing groups, or as its own, singlemembered, group, so N (n,K) can be defined recursively by
where N (K,K) = 1. Let N_{ij }(n,K) = K · N (n  2,K) + N (n  2,K  1) = N (n  1,K) be the number of subdivisions of n genes into K groups where i and j are in the same group. Let
Now let M = {(i_{m},j_{m},p_{m}),m ∈ {1,…,q}} be the set of q pairs for which prior knowledge exist, where we define p_{m }as the prior probability of forcing gene i_{m }and j_{m }to belong to the same group. Consider first the situation where we have no prior knowledge, i.e. M_{0 }= ∅. Let (K ,n) denote that we have K groups of n objects, and let P (K ∣ M_{0}) = 1/n, i.e equal prior probability for the number of groups. For a particular grouping g = (g_{1},…,g_{n}) that has a total of K_{g }groups, let P (g ∣ M_{0},K_{g})) = I (K = K_{g})/N (n,K_{g}), i.e. equal probability for all groupings for a given number of groups K_{g}. This implies that the probability of grouping g given no prior knowledge is
We denote this probably measure
We will now specify a probability measure which has the same structure as the baseline prior, except that i and j are forced to be in the same group and will thus be handled as a unit. To do this, consider the situation where we have one pair of genes that belong to the same group with probability one. The prior knowledge is then defined as M_{1,1 }= {(i,j),p_{1 }= 1}. We then have in total N_{ij }(n) = N(n  1) possible subdivisions and n  1 possible number of groups. The number of subdivisions into K groups is N_{ij }(n,K) = N (n  1,K). This means that
where I is the indicator function, so
Call this probability measure
Then consider the situation where we have one pair of genes that belong to the same
group with probability p_{1 }≠ 1. We now define the prior knowledge as M_{1 }= {(i,j),p_{1 }(≠ 1)}. The idea is that
Now let’s generalize to the situation where we have q pairs of genes with existing prior knowledge, i.e. we have M = {(i_{m},j_{m},p_{m}),m ∈ {1,…,q}} with
and thus
Since we have
and
Note that this expression increases exponentially with the number of prior pairs q. In order to avoid computational cost increasing exponentially with the number of prior pairs, we developed a Monte Carlo estimation of P(g ∣ M), described in Additional file 1.
Additional file 1. Calculations and parallel tempering description. Calculations of expressions used in the MCMC algorithm and description of the use of parallel tempering.
Format: PDF Size: 79KB Download file
This file can be viewed with: Adobe Acrobat Reader
For a given grouping, the baseline prior contributes to the overall probability of a grouping with an additive factor (1  p_{1})(1  p_{2}) ⋯ (1p_{q}) P (gM_{0}) (see Eq. 2 and the expression for the baseline prior P(gM_{0})). This implies that the baseline prior will contribute to the probability of two genes being clustered. Define P_{b} as the probability for two genes being in the same group given the baseline prior M_{0}. It is important that this probability is nonzero, in order to allow for the possibility of two arbitrary genes being in the same group whether they are in the set of prior pairs or not. If not, it would be impossible to group gene pairs that are not in the set of specified prior pairs. However, this implies that the probability of grouping two genes (i_{m}, j_{m}) in a prior pair will be be the mixture p_{m }+ (1  p_{m})P_{b}, as there is a nonzero probability that the genes are connected, even if they should not be connected according to the prior information. For instance, if the baseline prior for two arbitrary genes to be connected is 0.1 and pair number m has prior enforcement probability p_{m }= 0.8, the total prior probability for the pair (i_{m},j_{m}) to be connected will be 0.82.
There will be a baseline probability for two arbitrary genes to be connected. We have
become aware that with equal probability for all number of groups and equal probability
for each grouping given the number of groups, the baseline for any two genes to be
connected will be dependent on the total number of genes, n. In general, we get that the prior for groupings gives that the baseline probability,
P_{b }for two genes to be connected when there are K groups will be P(i and j in the same group ∣ K groups) = N (n  1,K)/N(n,K). The total probability will then be
Removing cycles
The set of priors might contain cycles, which can easily occur, e.g if both direct and indirect connections are included. We have developed an algorithm for detecting cycles, and if such are found, the prior pair with the smallest prior probability (smallest extra connection probability) is removed, as this connection is interpreted as a result of the rest of the cycle. The reason for excluding cycles in the prior is that with cycles, the pairwise specification of prior probabilities of pairs could be misleading. As an example, let us assume the prior is specified so that there is a 0.8 prior probability of a pairing between gene A and B (prior pair 1), between B and C (prior pair 2) and between A and C (prior pair 3). Then the probability for a forced connection between A and B (thus disregarding the baseline probability) will be P(A and B forced to be in the same group) = P(prior pair 1 enforced) + P(prior pair 1 not enforced)P(prior pair 2 enforced)P(prior pair 3 enforced) = 0.8 + 0.2*(0.8*0.8) = 0.928. Thus in order for the pairwise prior probabilities to be interpreted as the probability for two pairs to be forced to be connected, cycles should be avoided.
Markov chain Monte Carlo procedure for integrative networks
We use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution P (D ∣ g,M), as the analytical solution is not known. We will use the Metropolis Hastings algorithm, which is the most general version of MCMC. Specifically, we propose the following algorithm:
• Start with a random grouping, called g.
• Sample N groupings g based on the following scheme:
– Propose either (each with probability 1/3)
∗ A partitioning of one of the groups (except when K = n)
∗ A merging of two of the groups (except when K = 1)
∗ A transition of one of the genes to another group (except when K = 1)
– Call the new grouping g_{new}.
– Accept g_{new }with the following probability:
where P (D ∣ g) is given by (1), P (g ∣ M) is given by (2) and Q(g_{x}g_{y}) is the probability of proposing the group configuration g_{x }from the configuration g_{y}, and is given by (35) in the next section.
In order to avoid convergence to local maxima, we implemented parallel tempering [21], as described in Additional file 1.
Inferring clusters from MCMC samples
The above MCMC procedure results in a series of samples g. These samples can be used to calculate the posterior similarity matrix (PSM), in
which each entry (i,j) gives the proportion p_{ij }of samples gene i and gene j occur together in the same cluster. We infer clusters from the PSM using the minbinder function in the mcclust R library [22]. Specifically, minbinder makes use of hierarchical clustering with the PSM as distance matrix together with
the cuttree function to produce cluster configurations with the number of clusters ranging from
1 to L, where L is a userspecified maximum. Now letting I_{K }(i,j) be the indicator for whether gene i and j are in the same cluster for the configuration using K clusters (based on hierarchical clustering and the cuttree function), and using absolute difference as loss function, the posterior expected
loss e_{K }for K clusters is calculated in minbinder as
Proposal distribution
Let n, as before, be the total number of genes, and let n_{k }be the number of genes in group k and n_{s }be the number of singlemembered groups. If the number of groups K equals 1, the only allowed option is splitting into two new groups. If K = n, we can only have a merging of two genes into a new group. If 1<K < n, all three types of moves are allowed (splitting and merging groups, and moving a gene into another group). Each type of move then has probability 1/3.
First, consider the situation where g → g ^{′ }is due to a splitting of group k. For the configuration g, we then either have 1 < K < n, and a splitting is occurring with probability 1/3, or K = 1, and this type of move is occurring with probability 1. The probability of getting the new grouping g^{′ }is then the product of the probability of having a split within group k, which is n_{k}/(n  n_{s}), the probability of the actual splitting of j into two specific new subgroups, which is 1/N (n_{k},2), and the probability of having a split (rather than a merge or a transition of one of the genes), which is 1/3 · I (1 < K < n) + I (K = 1). That is, we have,
Next, consider the situation where g → g^{′ }is due to a merging of group k and l. A merging happens with probability one if the number of groups is equal to n and with probability 1/3 if 1 < K < n. Two groups, l and k, are chosen by first sampling a random gene and finding which group the gene belongs to, then picking another random gene and resampling as long as that other gene belongs to the same group. The probability of merging group l and group k has thus a proposal probability P(merge l and k) = P(choose l and k ∣ merge)P(merge) = [P(choose k first)P(choose l second ∣ k chosen first) + P(choose l first)P(choose k second ∣ l chosen first)] P(merge). That is, we have
Finally, consider the situation where g → g^{′ }is the result of moving a gene in group l to group k. A move of one gene from group l to group k is proposed by sampling a random gene and resampling if the gene itself constitutes a singlemembered group. Another random gene not belonging to group l is then sampled, defining another group k. A random element from group l is then assigned group identity k. The proposal probability thus becomes P(move the kth gene of group l to group k) = P(choose group l to move from and group k to move to and choose ith element of group l ∣ move)P(move) = P(choose group l)P(choose group k ∣ group l)P(choose ith element from group l)P(move) = (n_{l}/(n  n_{s}))(n_{k}/(n  n_{l}))(1/n_{l})(1/3) I (1 < K < n). Thus, we have
Computational aspects
The likelihood estimation involves the determinant function, which is O (n^{3}) in the number n of genes in each module. In our method, the most computationally expensive calculation is the exact calculation of the prior probability P(gM). This calculation is exponential in the number of prior pairs. Often, as is the case for the data used in this paper, the number of pairs for which there exist prior knowledge, can be substantial. To deal with such situations, we developed an approximate estimate of the prior probability based on Monte Carlo simulations. Comparisons on moderately large number of priors suggested that although less accurate, a Monte Carlo estimate of the prior gives results close to the results obtained when calculating the prior exactly (Additional file 2). For larger number of priors, this can not be tested, but experience shows stable results, suggesting that the stochastic nature of the algorithm do not seriously affect the results. With 100 genes and 100 samples and 200 priors, and using 800 for the Monte Carlo prior calculation, it took our method approximately 10 minutes to run 10000 samples on an 64 GB RAM, 16 physical CPU cores compute node.
Additional file 2. Monte Carlo estimation of prior. Evaluation of the precision of the Monte Carlo estimates of the prior compared to the exact calculation.
Format: PDF Size: 419KB Download file
This file can be viewed with: Adobe Acrobat Reader
Prior information
We will use three sources of prior information: proteinprotein interactions, transcription factor binding predictions as well as protein sequence similarity calculations. Gene pairs in these databases have evidence of being functionally related. The strength of the evidence is represented by some type of score for each database. In our MCMC algorithm the prior information is formulated in terms of the prior probability of forcing two genes to belong to the same group. We will use the logit transformation to transform the scores into the unit interval, thus enabling them to be used as a probability measure. Further details are given below for each prior type. There is very little overlap between the prior information databases. In cases where a gene pair is found in more than one database, the highest prior probability will be used.
Protein interaction data
Proteinprotein interactions (PPI) are important for the majority of biological functions. PPI databases contain lists of proteins pairs for which there exist some evidence for interaction. The evidence comes from various types of studies, ranging from highthroughput methods, more traditional lowthroughput proteomics studies, as well as in silico predictions based on known interactions [23]. iRefIndex [23] is a consolidated protein interaction database comprising information from nine wellknown interaction databases. One of the features of the database is the socalled lpr (lowest PMID reuse) score, which is the lowest number of unique interactions that are supported by one of the interaction’s PubMed identifiers (PMID). A low lpr indicates that the interaction is more likely to rely on lowthroughput methods than on highthroughput methods. As the former is recognized as more reliable than the latter, the lpr score can be used as a measure of the reliability of the interaction. We will use a logit transformation taking x to 1/(1 + exp (  x)) of the inverse of the lpr score to obtain a probability measure (between zero and one). For example, if the lpr score is one, p_{m }becomes 0.73, if the lpr score is zero, p_{m }becomes 0.5.
Transcription factor bindings
Transcription factors (TF) are proteins that bind to specific DNA sequences, thereby controlling the expression levels of the corresponding genes. Binding preferences of many transcription factors are known and characterized by a sequence binding motif. However, binding affinity does not depend entirely on the match of the sequence to the motif, but will rely also on sequence specific features. Ernst et al.[24] developed TF binding predictions, where they first found a general binding preference of the sequences based on 28 local genomic features, including histone modifications, conservation and melting temperature. Then, they combined this general binding preference score with motif information for specific transcription factors to improve prediction of genes bound by the factor. We will make use of these scores, and again, we will apply the logit transformation to transform the scores to values between 0 and 1.
Protein sequence similarity
Proteins with similar sequences are likely to be functionally related as the proteins may be expressed by paralogous genes (having common ancestry) or by genes that are selected to have the same function. For instance, two homologous proteins can be phosphorylated by the same kinase, thus playing roles in the same signaling pathway. One additional feature of using protein homology data in this setting is that the function of proteins for which the function is unknown can be learned by borrowing information from their protein homologs. We will calculate percent similarity between each of the human proteins in RefSeq using BLAST [25].
Results
Simulated data
To evaluate the performance of our method we used simulated gene expression data generated according to [26]. In our study, we used a total of five clusters of genes C^{∗ }= (C_{1},…,C_{5}) with dimension N (denoted d in [26]) samples. Cluster sizes n_{c}(c = 1,…,5) were generated from n_{c }∼ 2 × Poisson (λ). Expression values in cluster C_{c }were generated using a hierarchical lognormal model as in [26]:
• A vector of cluster template for cluster C_{c }was created with four periods of constant expression of size m_{1},m_{2},m_{3} and m_{4}. The sizes m_{k}, k = 1,…,4, was from a uniform distribution such that
• Sample variability and gene variability: Sample variability
• We repeated steps 1 and 2 to generate five clusters. We used parameter values of μ = 6, σ = 1, σ_{s }= 1, σ_{0 }= 1, and λ = 10.
As in [26], extra variation was introduced to evaluate robustness of clustering methods against potential random errors introduced from experimental procedures, such as sample acquisition, labeling hybridization and scanning. To each element of the logtransformed expression matrix we added a a random error from a normal distribution with mean zero and standard deviation (SD) equal to 0, 1 and 2. In addition, the sample size was varied, using N = 10, 100 and 1000. For each of these nine scenarios, 50 datasets were generated.
For each dataset, three scenarios for the available prior information were used. In (A), we assumed that no prior information was used. In (B), we assumed priors pairs were available, where 20% where misspecified, i.e. 20% of the gene pairs had members belonging to different groups. In the last scenario (C), all pairs were assumed to be correctly specified (the members belonged to the same group). Prior values were generated from a uniform U(0.5,1) distribution.
We compared our method with five wellknown clustering methods for which a software already exist, namely hierarchical clustering, kmeans clustering, Partitioning Around Medoids (PAM) [8], Modelbased clustering (Mclust) [9] and tight clustering [11]. For our method, after a burnin period generating 10K samples, we generated 10K samples from which each 100th sample was chosen. For all methods except ours, the number of clusters were estimated using the Gap index [27] (our method implicitly searches for the optimal number of clusters). For our method, clusters were inferred by minimizing the posterior expected loss based on the MCMC samples as described in the Methods section. The number of clusters estimated by the GAP index as well as our method is shown by boxplots in Additional file 3: Figure S2. The Rand index, defined as the proportion of concordant gene pairs in two partitions among all possible gene pairs, was used as evaluation measure. Specifically, we used the adjusted Rand index [28], which is standardized to have expected value zero when the partitions are randomly generated and takes maximum value one if two partitions are perfectly identical. Unlike the other methods, tight clustering produces clusters where some genes are not allocated to any cluster. In the calculation of the Rand index, only the allocated genes are considered.
Additional file 3: Figure S2. Number of estimated clusters. GAP index is used for all methods except ours, which inherently finds the number of clusters. hclust = hierarchical clustering, kmeans = k  means clustering, PAM = Prediction Around Medoids, Mclust = modelbased clustering, tight = tight clustering, MCIPA, is our method (MCMC Clustering using Informative Priors), but with no priors used, MCIPB is our method using priors with 20% of the priors misspecified, and MCIPC is our method with all prior pairs correctly specified.
Format: PDF Size: 11KB Download file
This file can be viewed with: Adobe Acrobat Reader
The results are shown in Figure 1. We see that when all pairs are correctly specified (scenario C), our method (MCIP) was at least as good as all other methods, and superior to the other methods for the smallest sample size (N = 10). When 20% of the priors were misspecified (scenario B), the performance was better than our method without using priors (scenario A), as well as hierarchical clustering, which was overall the second best method. We note that Mclust had a very variable performance, and that tight clustering was performing very poorly for large sample sizes. In order to further investigate the effect of misspecifications of the priors on model performance, we calculated the adjusted Rand index for increasing proportion of misspecifications. Additional file 4: Figure S1 shows that about 40% misspecifications were allowed, in the sense that this corresponded to the use of no prior information. We also note that there was a correspondence between number of estimated clusters (Additional file 3: Figure S2) and performance (Figure 1). Especially for small sample sizes (N = 10), the number of clusters found by maximizing the GAP index, as well as with our method without the use of priors, quite often yielded many more clusters than the true number of clusters (equal to five). This bias was much less evident for our method with the use of priors (MCIPB and MCIPC). Additional file 5: Figure S3 shows the performance after fixing the number of clusters to the true number of clusters (five) for all methods except our method, which inherently finds the number of clusters. The figure shows that poor performance, especially seen for tight clustering (N = 100 and N = 1000) and Mclust (N = 100), was not only due to bias in the estimation of number of clusters, as these methods also performed poorly after fixing the number of clusters.
Figure 1. Results on simulated data. We have used the simulation scheme proposed by [26] sampling approximately 100 genes per data set. Sample sizes (N) were varied using N = 10, 100 and 1000, as well as an extra variation (SD) added to each element in the expression matrix, using SD = 0, 1 and 2. hclust = hierarchical clustering, kmeans = k  means clustering, PAM = Prediction Around Medoids, Mclust = modelbased clustering, tight = tight clustering, MCIPA, is our method (MCMC Clustering using Informative Priors), but with no priors used, MCIPB is our method using priors with 20% of the priors misspecified, and MCIPC is our method with all prior pairs correctly specified.
Additional file 4: Figure S1. Investigation of the effect of misspecified priors on model performance. The figure shows boxplots of adjusted Rand index values for 50 simulated datasets using the simulation setup described in the Results section, using extra variation SD of 0, 1 and 2, and number of individuals N of 10, and 0.
Format: PDF Size: 12KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 5: Figure S3. Evaluation of performance in terms of adjusted Rand index following the simulation scheme of [26]. The simulation setup is identical to the one generating Figure 1, except that for this figure the number of clusters where fixed to the true number of clusters (five). hclust = hierarchical clustering, kmeans = kmeans clustering, PAM = Prediction Around Medoids, Mclust = modelbased clustering, tight = tight clustering, MCIPA, is our method (MCMC Clustering using Informative Priors), but with no priors used, MCIPB is our method using priors with 20% of the priors misspecified, and MCIPC is our method with all prior pairs correctly specified.
Format: PDF Size: 12KB Download file
This file can be viewed with: Adobe Acrobat Reader
Heart failure data
We used the data described in [29], consisting of microarray gene expression measurements from fourteen mice subjected to aortic banding and five sham operated mice. Aortic banding leads to increased left ventricular pressure. To compensate for the increased load, gene expression changes occur resulting in myocardial remodeling, involving hypertrophy of cardiomyocytes. Ultimately, the cardiac hypertrophy might lead to development of heart failure. We based our network analysis on the most differentially expressed genes between aortic banding and sham. To find differentially expressed genes we performed ttests between the two groups, using log2 expression values, before multiple testing correction was performed using the method of [30]. We used a false discovery rate (FDR) cutoff of 5%, and among these genes we picked the 400 with largest fold change (up or down). We looked up connections between these genes and assigned prior probabilities for the pairs based on the prior databases described in the previous section. For each of the three prior databases, there were several hundred pairs where both genes were represented in our inputlist. As the use of so many priors pairs was too computationally demanding for our method, we picked the 50 top scoring pairs for each of the prior types. We applied our MCMC algorithm using altogether 150000 Monte Carlo samples, with the first 50000 samples used for burnin (i.e these first samples were discarded and only used to ascertain a stable starting point in the MCMC sampling). We applied parallel tempering as described in Additional file 1. As we here had more than hundred prior pairs, we approximated P(g ∣ M) with the Monte Carlo estimator (1) defined in Additional file 1, using K = 800, as this value gave stable results within a reasonable computation time. Clusters were inferred by minimizing the posterior expected loss based on the posterior similarity matrix, which was calculated from the collection of each of the 100th MCMC sample after the burnin period.
Table 1 summarizes the clusters and Additional file 6: Figure S4 displays the clusters as (fully connected) networks (all nodes in the same cluster are connected) using Cytoscape [31]. There is one one large module of mainly upregulated genes (aorta banding vs sham), and one smaller module of both up and downregulated genes. In order to investigate these modules more thoroughly we applied Gene Ontology [32] analysis using the R/Bioconductor package GOstats [33]. Additional file 7 shows the most significantly altered GO categories in each of the modules. The top GO term of the larger module was extracellular region (p = 3e34, FDR qvalue = 5e31), and many of the other modules were related to this term (proteinaceous extracellular matrix, extracellular space, extracellular matrix structural constituent, collagen). In the smaller module many GO terms were related to carbohydrate metabolism (polysaccharide metabolic process, carbohydrate biosynthetic process). Figure 2 contains a subset of the larger network, showing prior pairs occurring within the main module. We also applied our method without the use of priors, as well as kmeans clustering. For the latter, the Gap index was used to find the number of clusters K. Both these methods gave one dominant cluster and two smaller clusters. GO analyses of the main clusters are given in Additional files 8 and 9.
Table 1. Summarizing the inferred heart failure clusters
Additional file 6: Figure S4. The results of applying our method to the most differentially expressed genes between aorta banding and sham in the microarray heart failure data. Red node color means upregulated in aorta banding vs sham, green color downregulated.
Format: PDF Size: 483KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 7. GO results MCIP with priors, heart failure data. Results of Gene ontology analysis using GOstats [33] of heart failure clusters found using our method with priors.
Format: PDF Size: 40KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 2. Heart failure result networks. Network comprised of prior pairs within the main module. Red node color means upregulated in aorta banding vs sham, green color downregulated. Red edges depict known proteinprotein interactions, green edges transcription factor bindings, and blue edges illustrate protein sequence homologies. Text on proteinprotein interaction edges denote type of interactions (MI), lowest pmid reuse (lpr) and number of publications (np), text on transcription factor binding edges denote prediction score, and text on protein homology edges denote sequence similarity.
Additional file 8. GO results MCIP without priors, main cluster, heart failure data. Results of Gene ontology analysis of main heart failure cluster found using our method without priors.
Format: PDF Size: 39KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 9. GO results Kmeans clustering, main cluster, heart failure data. Results of Gene ontology analysis of main heart failure cluster found using Kmeans clustering.
Format: PDF Size: 40KB Download file
This file can be viewed with: Adobe Acrobat Reader
Melanoma cancer data
Metastatic melanoma is a deadly disease while nonmetastatic melanoma and other cutaneous tumor types are usually cured with surgical removal of the primary tumors. To find network of genes differentially expressed between metastatic and nonmetastatic tumors we used data from [34], which included microarray gene expressions from 47 metastatic and 40 nonmetastatic tumor samples of patients with various cutaneous tumors. As for the heart failure data, we used the 400 most differentially expressed genes (based on fold change) for which Benjamini Hochberg FDR <0.05, and found gene pairs in the PPI, the TF and the sequence similarity database where both genes were represented in our input list. We also here approximated P(g ∣ M) with the Monte Carlo estimator (1) of the Additional file 1, K = 800 samples. Again, result clusters were inferred by minimizing the posterior expected loss based on the posterior similarity matrix calculated from each 100th of 150K MCMC samples generated after a burnin period of 50K samples.
Table 2 summarizes the clusters/modules and Additional file 10: Figure S5 shows the modules as (fully connected) networks. Figure 3 shows prior pairs within each module. The top ten GO categories from GOstats analysis on each module are shown in Additional file 11. We note that the top three GO categories of the largest module were epidermis development (p =6e32, FDR q value =8e29), cornified envelope (p =2e18, FDR qvalue =3e15) and keratinization (p =1e14, FDR qvalue =3e15). We also applied our method without the use of priors and kmeans clustering to the melanoma data. GO analyses of the main clusters are given in Additional files 12 and 13.
Table 2. Summarizing the inferred melanoma cancer clusters
Additional file 10. Figure S5. The results of applying our method to the most differentially expressed genes between metastatic and nonmetastatic melanoma cancer patients. Red node color means upregulated in metastatic melanoma, green color downregulated.
Format: PDF Size: 204KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 3. Melanoma cancer result networks. Prior pairs within the largest module. Red node color means upregulated in metastatic melanoma, green color downregulated. Red edges illustrate known proteinprotein interactions, green edges transcription factor bindings, and blue edges protein sequence homologies. Text on proteinprotein interaction edges denote type of interactions (MI), lowest pmid reuse (lpr) and number of publications (np), text on transcription factor binding edges denote prediction score, and text on protein homology edges denote sequence similarity.
Additional file 11. GO results MCIP with priors, melanoma data. Results of Gene ontology analysis of melanoma clusters found using our method with priors.
Format: PDF Size: 54KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 12. GO results MCIP without priors, main cluster, melanoma cancer data. Results of Gene ontology analysis of main melanoma cluster found using our method without priors.
Format: PDF Size: 40KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 13. GO results Kmeans clustering, main cluster, melanoma cancer data. Results of Gene ontology analysis of main melanoma cluster found using Kmeans clustering.
Format: PDF Size: 24KB Download file
This file can be viewed with: Adobe Acrobat Reader
Method evaluation
In order to evaluate our method we made use of literature reported interactions that occurred in abstracts of articles labeled with the Medical Subject Headings (MeSH, http://www.nlm.nih.gov/mesh/ webcite) term left ventricular hypertrophy for the heart failure clusters (summarized in Table 1 and displayed in Additional file 6: Figure S4) and the MeSH term melanoma for the melanoma clusters (summarized in Table 2 and displayed in Additional file 10: Figure S5). We applied gene pairs with pvalues smaller than 5% only, using the method described in [35]. We will refer to these interactions as "true" interactions. The application of our method together with minimization of the posterior expected loss led to an inferred clustering. By considering whether the genes of each possible gene pair occurred in the same group or not in the inferred clustering, we were able to calculate the sensitivity (proportion of gene pairs in the literature network also occurring together in the same group in the inferred clustering), the specificity (proportion of gene pairs not in the literature network that were also not in the same clustering group), the positive predictive value (proportion of genes occurring in the same group also occurring in the literature network). From the sensitivities and specificities the Area Under Curve (AUC) was also calculated. Table 3 shows performance measures for the heart failure and the melanoma data, respectively, using our method both with and without priors, and compared to the results of kmeans clustering. For kmeans clustering the optimal number of clusters was found using the Gap index.
Table 3. Comparing performance of our method (MCIP) with and without the use of priors to k means clustering
Discussion
Using simulated data we compared our method to established clustering methods as kmeans and hierarchical clustering. When no prior information is provided our method did not offer any gains over other standard approaches, except for the N = 10, SD = 0 regime. When most priors were correctly specified (80% or 100%), our method was as least as good as the bestperforming established methods for large samples, and superior to the same methods for small sample sizes. We believe that the majority of the priors will be correctly specified in most cases. The reason for this is that there will usually exist so many prior pairs that one would restrict oneself to only those with the strongest evidence. Of course, a previously shown connection might still not be real in the current situation, e.g. if the connection has been discovered in a different tissue type than the one under study. However, this problem will be limited if one is analyzing a set of genes differentially expressed between two conditions, as many of these will be coexpressed, and thus also correlated.
The Gene Ontology analysis of the heart failure network showed that many of the top ranked GO categories for the larger module were related to the extracellular matrix. Awareness has emerged regarding the extracellular matrix changes taking place in myocardial remodeling [36,37]. When investigating Figure 2, we see that the larger module contains many collagens (type I alpha 1, type III alpha 1, type IV alpha 1, type V alpha 1, type VI alpha 1, type XII alpha 1, type XIV alpha 1 and type XVI a), which are important structural components of the extracellular matrix. The figure shows that these molecules were primarily connected by protein sequence homology (blue lines). Moreover, other molecules thought to be parts of the extracellular matrix were identified in the same module (thrombospondin 3 and 4, fibromodulin, versican, biglycan, fibronectin, elastin). For example, fibronectin has been demonstrated to play a role in the organizing of collagen type I [38]. The module also contained enzymes and hormonal factors known to process the extracellular matrix structural proteins (matrix metallopeptidase 2, transforming growth factor beta 2 and beta 3, latent transforming growth factor beta binding protein 2, lysyl oxidase, TIMP metallopeptidase inhibitor 1). For instance, matrix metallopeptidase 2 is known to be important for the breakdown of collagen type I [39]. Finally, the extracellular matrix related transcription factor early growth response 3 was a hub in this module, indicating a particular importance in this pathological process. Altered metabolism has been described in remodeled and failing hearts. E.g. reduced fatty acid oxidation and an increase in glucose utilization have been reported [4042]. The smaller module contained several genes involved in carbohydrate metabolic processes (6phosphofructo2kinase, phosphorylase kinase gamma 1, aldolase B fructosebisphosphate, fructose bisphosphatase 2). Taken together, the two modules confirm regulatory and signaling events and structural alterations known to be involved in myocardial remodeling. We believe that the complete result network, shown in Additional file 6: Figure S4, contains many other important, not yet discovered, interactions. However, a detailed investigation of all these connections is beyond the scope of this paper.
The melanoma network was derived from the data of Riker et al.[34]. As the present paper represents a reanalysis of the data, the outcome of an analysis with similar input data should be expected to generate somewhat similar output data. The main difference in approach is that while the Riker paper primarily performs a direct Gene Ontology analysis, the current approach attempts to combine multiple sources of information prior to performing any functional analysis. In this way, the resultant network modules will represent subnets of relevance with respect to the combined prior information, thus presenting local information clusters with higher likelihood of coherence in function for each cluster. Riker et al. observed differentially expressed genes involved in keratinocyte differentiation and epidermal development, including loricrin (LOR), involucrin (IVL), keratin 5 (KRT5) and plakophilin 1 (PKP1), suggesting a loss of epidermal characteristics, in highlighting the desquamation process. In our largest module (module 1) we found many genes related to keratinocyte differentiation and epidermal development (see Additional file 11, Table 1). When looking at Figure 3, where gene pairs with prior information within each module were connected, we note that keratins and kallikreins were connected through a set of interacting genes that included desmocollin 1 and 3 (DSC1 and DSC3), desmoglein 1 and 3 (DSG1 and DSG3), desmoplakin (DSP) and corneodesmosin (CDSN). Many of the genes in this path are related to corneodesmosomes (KLK5 and KLK7: kallikrein 5 and 7, DSG1, DSG3, DSC1, DSC3, DSP, IVL, LOR and keratin 10). Corneodesmosomes are located at the junctional structures that mediate corneocyte cohesion [43], underlining keratinocyte dedifferentiation. The individual connections in this path are supported by previous findings; Caubet et al. [43] showed that KLK5 and KLK7 degrade the desmosomal proteins DSC1 and DSG1. Smith et al.[44] showed that when epidermal cells differentiate, DSC1 and DSG1 are bound by PKP1 and DSP, thereby enhancing anchorage to keratin intermediate filaments. Interactions between plakophilins, desmoplakins and keratins have also been shown by other authors, e.g. [4547]. Interestingly, the transcription factor RARrelated orphan receptor A (RORA) was downregulated, and placed as a hub in this module. RORA has recently been indicated as a breast tumor suppressor [48], as well as having a role in keratinocyte differentiation [49]. Taken together, Figure 3 suggests that the loss of epidermal characteristics seen in the metastatic individuals is mediated by a downregulation of the RORA transcription factor.
When comparing the GO analysis of the main clusters from our method with the use of priors (Additional file 7 and 11) to the GO analysis of the main clusters using our method without priors (Additional files 8 and 9) and kmeans clustering (Additional files 9 and 13), we note that many of the same GO categories were showing up. But the pvalues were overall much smaller for our method with the use of priors, than the other two, indicating that our method has better sensitivity of finding functionally related gene groups. We note that the path found in the melanoma data linking the RORA transcription factor to the keratins via the kallikreins and the desmosomal proteins, would not have been discovered by our method without priors or kmeans clustering, as none of the these methods gave clusterings where all these gene subgroups were contained in one cluster.
Method evaluation using the literaturederived network showed that adding prior knowledge yielded improved AUC. The AUC was improved from 0.62 to 0.65 for the heart failure data and from 0.6 to 0.64 for the melanoma data. The positive predicted values (PPV) were also improved (0.02 to 0.023 for the heart failure data and 0.024 to 0.03 for the melanoma data). We note that these values are very small, which is because there are very few gene pairs with cocitation (only around one in a hundred of all possible pairs) compared to the number of gene pairs clustered together (more than half of the pairs for the heart failure data, for instance). It is important to be aware that the possibility of improvement is limited by the degree of overlap between the list of pairs in the prior interaction databases and list of pairs in the literature reference network. Many of the pairs in the literature databases were not found in the interaction databases. In fact this was the case for about 90% for these data. Such interactions are, when evaluated on the literature network, treated as false positives, but might of course be true positives. Of the 90% of the pairs in the literature network that were not in the prior databases, the evidence for grouping the genes together was based solely on the correlation in the microarray data. Another limitation of the literature network is that it itself might contain false positives. Genes that are mentioned in the same PubMed abstract are not necessarily connected.
We have here focused on finding groups of connected genes, and not on discovering direct interactions. Direct interactions are often inferred using Bayesian networks (BN). As mentioned in the Background section, the BN formalism allows for incorporation of prior knowledge, and BN methods for genomic data has indeed been proposed. However, constructing largescale networks using BN methodology is very difficult as the number of possible configurations is about exponential in the number of genes. Formally this has been shown to be an NPComplete problem. By instead focusing on groups, we heavily reduce the number of possible configurations. Thus, our method can handle many more genes than BN methods. This is relevant, as microarray data may involve several hundred regulated genes. One way of using our method is to apply it as an initial step prior to a BN analysis; If the number of genes is too large to apply BN methodology to the full set, our method can be used to first find smallersized, independent sets of genes, followed by separate BN analysis on each of the subsets. The result of such an analysis will be similar to the networks shown in Figures 2 and 3, where prior pairs within each cluster are shown. We believe that the connections shown in Figures 2 and 3 are reliable and robust, as they display connections between genes that are coregulated and that have a previously shown connection. Using BN on each cluster could lead to an improvement, as novel interactions can be detected based on strong correlations in the data, and is one possibility for future methodological development.
Along with the development of the primarily datadriven methods, various methods focusing more on the prior information have been proposed, e.g. [5,50,51]. In these approaches, the authors made use of the observed changes in the experimental data, e.g. the most differentially regulated genes, while the interactions among these genes were derived from prior information only. For example, Huang et al.[5] applied the socalled Prizecollecting Steiner tree to select nodes and interactions. As we speak, the prior knowledge about molecular interactions is rapidly increasing. For instance, the data generated by the largescale consortium project Encode [52] represent huge possibilities for extending the source of prior knowledge about transcriptional regulation. Obviously, methods enabling the use of prior knowledge when constructing genetic networks will increase their relevance in line with this development.
Conclusions
We have presented a novel method for finding modules of interacting molecules, representing biological pathways or processes, based on microarrays or other transcriptomic data. The method is a clustering approach, which is a commonly used technique for analyzing transcriptomic data. Unlike previous approaches using Bayesian statistics to infer clusters, e.g. [9,10], we used informative (as opposed to noninformative) priors. The prior information was assumed to be in form of pairs of connected genes, along with a connection score, as this enabled us to capture a lot of relevant prior information, like proteinprotein interactions, transcription factor binding information and protein sequence similarity measurements. Using simulated data our method showed improved ability of identifying correct groups compared to traditional clustering, especially for small sample sizes, and for situations where most priors were correctly specified. When applying the method to realworld microarray data in heart failure and melanoma cancer we found clusters overlapping with known pathogenic processes, but where subnetworks of prior pairs within each cluster pointed to new connections extending beyond the classical disease pathways.
Availability
The method is implemented in C++ and R and is available from: http://folk.uio.no/trondr/gene_corr webcite, as well as a tool in the Galaxy [53] based Genomic HyperBrowser [54], see http://hyperbrowser.uio.no/dev2 webcite under assorted tools →MCMC gene corr. The data sets supporting the results of this article are available from the Gene Expression Omnibus repository: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36074 webcite (heart failure data) and http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7553 webcite (melanoma cancer data).
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SN and TR developed the methodology. SN, TR and VN implemented the software. SN, TR, TC, VN, JB, BS and TT generated and collected data used in this paper. SN, TR, TC, JB, GC and EH analyzed the results. SN, TR, JB, GC and EH wrote the article. All authors participated in discussions, read and approved the final manuscript.
Acknowledgements
We thank Ian Donaldson for help using the iRefIndex database, Torbjørn Rognes for providing sequence homology scores using BLAST, Jason Ernst for giving us TF binding predictions, Kjetil Klepper for providing mappings of PWM ids to gene symbols, Kai Trengereid and Morten Johansen for incorporating the software into the Genomic HyperBrowser, and John Quackenbush and his lab for hosting internship of SN during which the idea of this project was conceived. We also would like to thank the anonymous reviewers for their detailed and constructive comments. The work of SN was supported by the SouthEastern Norway Regional Health Authority.
References

Alexa A, Rahnenfuhrer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure.
Bioinformatics 2006, 22(13):16001607. PubMed Abstract  Publisher Full Text

Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, Haussler D, Stuart JM: Inference of patientspecific pathway activities from multidimensional cancer genomics data using PARADIGM.
Bioinformatics 2010, 26(12):i237i245. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.
Nat Protoc 2009, 4(1):4457. PubMed Abstract  Publisher Full Text

Werner: Bioinformatics applications for pathway analysis of microarray data.
Curr Opin Biotechnol 2008, 19:5054. PubMed Abstract  Publisher Full Text

Huang SsC, Fraenkel E: Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks.
Sci Signal 2009, 2(81):ra40. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Djebbari A, Quackenbush J: Seeded Bayesian networks: constructing genetic networks from microarray data.
BMC Syst Biol 2008, 2:57. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Allison DB, Gadbury GL, Heo MS, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data.
Comput Stat Data Anal 2002, 39(1):120. Publisher Full Text

Kaufman L, Rousseeouw P: Statistical Data Analysis Based on the L1norm and Related Methods. North Holland: Amsterdam; 1987.

Fraley C, Raftery A: MCLUST: software for modelbased cluster analysis.
J Classif 1999, 16(2):297306. Publisher Full Text

Booth JG, Casella G, Hobert JP: Clustering using objective functions and stochastic search.
J Roy Stat Soc B Stat Meth 2008, 70(1):119139. Publisher Full Text

Tseng G, Wong W: Tight clustering: a resamplingbased approach for identifying stable and tight patterns in data.
Biometrics 2005, 61(1):1016. PubMed Abstract  Publisher Full Text

Lee H, Hsu A, Sajdak J, Qin J, Pavlidis P: Coexpression analysis of human genes across many microarray data sets.
Genome Res 2004, 14(6):10851094. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Friedman N, Linial M, Nachman I, Pe’er D: Using Bayesian networks to analyze expression data.
J Comput Biol 2000, 7:601620. PubMed Abstract  Publisher Full Text

Werhli A, Husmeier D: Reconstructing gene regulatory networks with bayesian networks by combining expression data with multiple sources of prior knowledge.

Geier F, Timmer J, Fleck C: Reconstructing generegulatory networks from time series, knockout data, and prior knowledge.
BMC Syst Biol 2007, 1:11. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Mukherjee S, Speed TP: Network inference using informative priors.
Proc Natl Acad Sci U S A 2008, 105(38):1431314318. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gao S, Wang X: Quantitative utilization of prior biological knowledge in the Bayesian network modeling of gene expression data.
BMC Bioinformatics 2011, 12:359. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Segal E, Shapira M, Regev A, Pe’er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their conditionspecific regulators from gene expression data.
Nat Genet 2003, 34(2):166176. PubMed Abstract  Publisher Full Text

Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Modelbased clustering and data transformations for gene expression data.
Bioinformatics 2001, 17(10):977987.
[3rd Georgia TechEmory International Conference on Bioinformaticsin Silico Biology: Bioinformatics after the Human Genome, ATLANTA, GEORGIA, NOV 15–18, 2001]
PubMed Abstract  Publisher Full Text 
Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles.
Bioinformatics 2002, 18(9):11941206. PubMed Abstract  Publisher Full Text

Geyer C: MarkovChain MonteCarlo maximumlikelihood. In Computing Science and Statistics. Edited by Keramidas EM, Fairfax Station VA. Interface Foundation of North America Inc; 1991:156163.

mcclust: Process an MCMC Sample of Clusterings. 2012.
[http://CRAN.Rproject.org/package=mcclust webcite]. [R package version 1.0]

Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein interaction database with provenance.
BMC Bioinformatics 2008, 9:405. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Ernst J, Plasterer HL, Simon I, BarJoseph Z: Integrating multiple evidence sources to predict transcription factor binding in the human genome.
Genome Res 2010, 20(4):526536. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
Nucleic Acids Res 1997, 25(17):33893402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC: Evaluation and comparison of gene clustering methods in microarray analysis.
Bioinformatics 2006, 22(19):24052412. PubMed Abstract  Publisher Full Text

Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic.
J Roy Stat Soc B Stat Meth 2001, 63(2):411423. Publisher Full Text

Hubert L, Arabie P: Comparing partitions.
J Classif 1985, 2:193218. Publisher Full Text

Skrbic B, Bjørnstad JL, Marstein HS, Carlson CR, Sjaastad I, Nygård S, Bjørnstad S, Christensen G, Tønnessen T: Differential regulation of extracellular matrix constituents in myocardial remodeling with and without heart failure following pressure overload.
Matrix Biol 2013, 32(2):133142. PubMed Abstract  Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the false discovery rate  a practical and powerful approach to multiple testing.

Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: A software environment for integrated models of biomolecular interaction networks.
Genome Res 2003, 13(11):24982504. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

The Gene Ontology Consortium: Gene ontology: tool for the unification of biology.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association.
Bioinformatics 2007, 23(2):257258. PubMed Abstract  Publisher Full Text

Riker AI, Enkemann SA, Fodstad O, Liu S, Ren S, Morris C, Xi Y, Howell P, Metge B, Samant RS, Shevde LA, Li W, Eschrich S, Daud A, Ju J, Matta J: The gene expression profiles of primary and metastatic melanoma yields a transition point of tumor progression and metastasis.
BMC Med Genomics 2008, 1:13. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Adamic L, Wilkinson D, Huberman B, Adar E: A literature based method for identifying genedisease connections. In CSB2002: IEEE Computer Society Bioinformatics Conference. IEEE Computer Society; 2002:109117.
[1st International IEEEComputerSociety Bioinformatics Conference (CSB2002), Stanford univ, Stanford, CA, Aug 14–16, 2000]

Bowers SLK, Banerjee I, Baudino TA: The extracellular matrix: at the center of it all.
J Mol Cell Cardiol 2010, 48(3, SI):474482. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fomovsky GM, Thomopoulos S, Holmes JW: Contribution of extracellular matrix to the mechanical properties of the heart.
J Mol Cell Cardiol 2010, 48(3, SI):490496. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sottile J, Shi F, Rublyevska I, Chiang HY, Lust J, Chandler J: Fibronectindependent collagen I deposition modulates the cell response to fibronectin.
Am J Physiol Cell Physiol 2007, 293(6):C1934C1946. PubMed Abstract  Publisher Full Text

Stearns M: Alendronate blocks TGFbeta 1 stimulated collagen 1 degradation by human prostate PC3 ML cells.
Clin Exp Metastasis 1998, 16(4):332339. PubMed Abstract

Ingwall JS: Energy metabolism in heart failure and remodelling.
Cardiovasc Res 2009, 81(3):412419. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lehman JJ, Kelly DP: Gene regulatory mechanisms governing energy metabolism during cardiac hypertrophic growth.
Heart Fail Rev 2002, 7:175185.
[10.1023/A:1015332726303]
PubMed Abstract  Publisher Full Text 
Lehman J, Kelly D: Transcriptional activation of energy metabolic switches in the developing and hypertrophied heart.
Clin Exp Pharmacol Physiol 2002, 29(4):339345. PubMed Abstract  Publisher Full Text

Caubet C, Jonca N, Brattsand M, Guerrin M, Bernard D, Schmidt R, Egelrud T, Simon M, Serre G: Degradation of corneodesmosome proteins by two serine proteases of the kallikrein family, SCTE/KLK5/hK5 and SCCE/KLK7/hK7.
J Invest Dermatol 2004, 122(5):12351244. PubMed Abstract  Publisher Full Text

Smith E, Fuchs E: Defining the interactions between intermediate filaments and desmosomes.
J Cell Biol 1998, 141(5):12291241. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kouklis P, Hutton E, Fuchs E: Making a connection: direct binding between keratin intermediate filaments and desmosomal proteins.
J Cell Biol 1994, 127(4):10491060. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Meng J, Bornslaeger E, Green K, Steinert P, Ip W: Twohybrid analysis reveals fundamental differences in direct interactions between desmoplakin and cell typespecific intermediate filaments.
J Biol Chem 1997, 272(34):2149521503. PubMed Abstract  Publisher Full Text

Hofmann I, Mertens C, Brettel M, Nimmrich V, Schnolzer M, Herrmann H: Interaction of plakophilins with desmoplakin and intermediate filament proteins: an in vitro analysis.
J Cell Sci 2000, 113(13):24712483. PubMed Abstract  Publisher Full Text

Xiong G, Wang C, Evers BM, Zhou BP, Xu R: ROR alpha suppresses breast tumor invasion by inducing SEMA3F expression.
Cancer Res 2012, 72(7):17281739. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Taylor JM, Street TL, Hao L, Copley R, Taylor MS, Hayden PJ, Stolper G, Mott R, Hein J, Moffatt MF, Cookson WOCM: Dynamic and physical clustering of gene expression during epidermal barrier formation in differentiating keratinocytes.
PLOS ONE 2009, 4(10):e7651. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Mueller T: Identifying functional modules in proteinprotein interaction networks: an integrated exact approach.
Bioinformatics 2008, 24(13):I223I231. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

HaibeKains B, Olsen C, Djebbari A, Bontempi G, Correll M, Bouton C, Quackenbush J: Predictive networks: a flexible, open source, web application for integration and analysis of human gene networks.
Nucleic Acid Res 2012, 40(D1):D866D875. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome.

Giardine B, Riemer C, Hardison R, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent W, Nekrutenko A: Galaxy: a platform for interactive largescale genome analysis.
Genome Res 2005, 15(10):14511455. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sandve GK, Gundersen S, Rydbeck H, Glad IK, Holden L, Holden M, Liestol K, Clancy T, Ferkingstad E, Johansen M, Nygaard V, Tostesen E, Frigessi A, Hovig E: The Genomic HyperBrowser: inferential genomics at the sequence level.
Genome Biol 2010, 11:R121. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text