Computer Laboratory, University of Cambridge, Cambridge, CB3 0FD, UK

Abstract

Background

Grouping genes into clusters on the basis of similarity between their expression profiles has been the main approach to predicting functional modules, from which important inferences or decisions about further investigation can be made. While an unambiguous choice of similarity metric is important, current practice normally relies on Euclidean distance and Pearson correlation, whose assumptions are unlikely to hold for high-throughput microarray data.

Results

We advocate the use of a novel metric - BayesGen - to measure similarity between gene expression profiles, and demonstrate its performance on two important applications: constructing genome-wide co-expression networks, and clustering human cancer tissues into subtypes. BayesGen is formulated as the evidence ratio between two alternative hypotheses about the generating mechanism of a given pair of genes, and incorporates the global characteristics of the whole dataset as prior knowledge. Through the joint modelling of expected intensity levels and noise variances, it addresses the inherent nonlinearity and the variation of noise levels across different microarray value ranges. The full Bayesian formulation also makes meta-analysis possible.

Conclusion

BayesGen allows more effective extraction of similarity information between genes from microarray expression data, which has a significant effect on various inference tasks. It also provides a robust choice for other object-feature data, as illustrated by the results of the test on synthetic data.

Background

With the development of high-throughput experimental techniques, biological research has been transformed into a data-rich discipline. The DNA microarray, which allows users to measure the expression levels of thousands of genes simultaneously in a single experiment, has emerged as one of the most widely used technologies. The analysis of microarray data is normally based on the reasoning that variations in gene expression patterns under different experimental conditions are the result of underlying changes in cellular pathways.

Before a clustering procedure is performed, it is natural to ask whether the metric of similarity between expression profiles has been chosen unambiguously. While normal practice relies largely on Euclidean distance and Pearson correlation, these metrics assume either a clean experimental space or a linear relationship between similar genes, neither of which is likely to hold for high-throughput expression data. We expect a metric that can handle the dependency on the responsiveness and determination accuracy of different concentrations, and the nonlinearities that are likely to result from the production of mRNAs and the deficiencies of measuring devices. Moreover, as commonly observed, the measurement average and the noise level may be linked: high intensity values are likely to be affected by larger errors.

In this paper we apply Bayesian model selection to construct a principled framework for similarity/distance definition. One useful feature of the Bayesian approach is that it requires an explicit statement of the underlying assumptions, making it easier for users to evaluate the suitability of a given metric. We then propose a novel distance metric that addresses the nonlinearity and the variation of noise levels across different microarray value ranges, through the joint modelling of the data points' intensity levels and noise variances. Another important aspect is that deriving a full Bayesian model also facilitates meta-analysis through the estimation of the hyper-parameters.

Bayesian model selection

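Bayesian model selection uses the rules of probability and Bayes' theorem to choose among alternative hypotheses. To evaluate the plausibility of a given model $H$ in the light of data $D$, one computes the Bayesian evidence, obtained by integrating the likelihood over the prior of the model parameters $\boldsymbol{\theta}$:

$$P(D \mid H) = \int P(D \mid \boldsymbol{\theta}, H)\, P(\boldsymbol{\theta} \mid H)\, d\boldsymbol{\theta} \qquad (1)$$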

This quantity automatically encodes the Occam factor, that is, the preference for models with a more constrained generating mechanism. Since complex models can explain a wider range of data, their evidence distributions are spread more widely over the data space. Hence, if $H_2$ is a more complex model than $H_1$, then for a dataset $D$ that is compatible with both $H_1$ and $H_2$, the evidence $P(D \mid H_1)$ will be larger than $P(D \mid H_2)$. A more detailed discussion of Bayesian model selection can be found elsewhere.

Bayesian evidence serves as the basis for the Bayes factor (BF), which is defined as the evidence ratio $P(D \mid H_1)/P(D \mid H_2)$ times the hypothesis prior ratio $P(H_1)/P(H_2)$. When no prior bias exists between the two hypotheses, the prior ratio can safely be ignored. Owing to the ability of the Bayesian evidence to select the appropriate model automatically, the BF is known to be more robust with sparse data than the popular likelihood ratio test (LRT). However, this advantage comes at the computational cost of integrating over the parameter space, which is normally done with Monte Carlo integration and importance sampling.
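As a toy illustration of the Occam effect and the Bayes factor (not taken from the paper), consider a sequence of n coin flips containing k heads. The hypothetical model H1 fixes the head probability at 0.5, while the more flexible H2 places a uniform prior on it. Both evidences have closed forms here, so no Monte Carlo integration is needed; all function names are illustrative assumptions.

```python
from math import factorial


def evidence_fixed(k: int, n: int) -> float:
    """Evidence of one specific sequence with k heads in n flips under H1: p = 0.5."""
    return 0.5 ** n


def evidence_uniform(k: int, n: int) -> float:
    """Evidence of the same sequence under H2: p ~ Uniform(0, 1).

    Integrating p^k (1 - p)^(n - k) over [0, 1] gives the Beta function
    B(k + 1, n - k + 1) = k! (n - k)! / (n + 1)!.
    """
    return factorial(k) * factorial(n - k) / factorial(n + 1)


def bayes_factor(k: int, n: int, prior_ratio: float = 1.0) -> float:
    """Evidence ratio P(D|H1) / P(D|H2) times the prior ratio P(H1) / P(H2)."""
    return evidence_fixed(k, n) / evidence_uniform(k, n) * prior_ratio
```

With balanced data (5 heads in 10 flips) the constrained model H1 obtains the larger evidence, although H2 can also explain the data; with 10 heads in 10 flips the ratio falls below 1 and the flexible model is preferred.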

Bayes Factor and its approximated versions have recently attracted more interest as a tool for selecting alternative hypotheses in bioinformatics

In this paper, we apply the model averaging principle of Bayesian evidence, which takes into account all possible models rather than relying on the best one. We start by constructing the general Bayesian formulation for pairwise similarity/distance measurement, from which the new distance metric BayesGen is derived. We then compare the performance of BayesGen with that of Euclidean distance and Pearson correlation on three test sets. The first test, on simulated data, suggests that the full Bayesian approach is better at differentiating homologous and heterogeneous pairs. The second test on two genome-wide

Results and discussion

Bayesian pairwise distance

Suppose we are interested in a set of $n$ objects $\{\mathbf{o}_1, \ldots, \mathbf{o}_n\}$, for which observations of their behaviour are available as the dataset $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. We assume that an object's behaviour is the result of an underlying generative process that takes into account its properties. We formalise such a generative process as a probability distribution $P(\mathbf{x} \mid \boldsymbol{\theta})$ over the experimental space, from which all observation vectors are generated.

The similarity between two objects $\mathbf{o}_i$ and $\mathbf{o}_j$ is best specified as the similarity between their inherent properties. Although such information is not directly available to us, it is encoded in the generative processes that produced our observations. The similarity between $\mathbf{o}_i$ and $\mathbf{o}_j$ can then be defined to be proportional to the probability that the two samples $\mathbf{x}_i$ and $\mathbf{x}_j$ were generated from the same process. Denoting by $H_{same}$ the hypothesis that $\mathbf{x}_i$ and $\mathbf{x}_j$ are from a single process, and by $H_{diff}$ its complement (the two samples were generated from two different processes), we have:

where $\mathbf{o}_i$ and $\mathbf{o}_j$, and $P(H_{same})$ and $P(H_{diff})$ are the prior beliefs. Since similarity/distance measurements are invariant to monotonic transformations, we can define a distance measurement between two objects $\mathbf{o}_i$ and $\mathbf{o}_j$ as the Bayes factor between the two hypotheses, employing the evidence expansion from (1):
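The idea of a Bayes-factor distance can be sketched with a deliberately simplified model. This is not the BayesGen model itself: it assumes independent features, a known noise variance, and a Gaussian prior on each process mean, so that both evidences are available in closed form. The parameter names `m0`, `prior_var` and `noise_var` are illustrative assumptions.

```python
import math


def log_evidence_same(xi, xj, m0, prior_var, noise_var):
    """Log evidence that xi and xj share one mean mu ~ N(m0, prior_var).

    Marginally, (xi, xj) is bivariate normal with variance
    a = prior_var + noise_var and covariance b = prior_var.
    """
    a, b = prior_var + noise_var, prior_var
    det = a * a - b * b
    u, v = xi - m0, xj - m0
    quad = (a * u * u - 2.0 * b * u * v + a * v * v) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def log_evidence_diff(xi, xj, m0, prior_var, noise_var):
    """Log evidence that xi and xj come from two independently drawn means."""
    a = prior_var + noise_var
    u, v = xi - m0, xj - m0
    return -math.log(2.0 * math.pi) - math.log(a) - 0.5 * (u * u + v * v) / a


def bayes_distance(x_i, x_j, m0=0.0, prior_var=1.0, noise_var=0.25):
    """Log Bayes factor log P(x_i, x_j | H_diff) - log P(x_i, x_j | H_same).

    Features are treated as independent, so per-feature terms are summed.
    """
    return sum(
        log_evidence_diff(p, q, m0, prior_var, noise_var)
        - log_evidence_same(p, q, m0, prior_var, noise_var)
        for p, q in zip(x_i, x_j)
    )
```

Under these assumptions a pair of nearby values yields a negative log Bayes factor (the single-process hypothesis is favoured), while a pair on opposite sides of the prior mean yields a positive one. Taking the logarithm is a monotonic transformation, so the ordering of pairs is the same as for the raw Bayes factor.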

BayesGen distance for gene expression data

Given a dataset

As cellular processes are carried out through the coordination of gene modules, where the expression levels of co-regulated genes within each module are similar under a given condition, we assume that each sample **x**_{i}, ** θ **= {

where _{0}, Σ_{0}, _{0}, and _{0 }are hyperparameters indicating the prior mean, prior variance, and their belief levels respectively. Note that ** μ **and Σ are not independent, reflecting the dependency between variance and intensity levels observed in expression data. The generative process has two stages: firstly, different processes are generated by mutating the global mean

Assuming no prior knowledge, the expected decomposition of process-generated and sample-generated variance is equiprobable and equals **V**/2. Plugging the model of (6-8) into (5), and assuming that Σ is a diagonal matrix, we obtain the closed-form formula for the BayesGen distance between two given genes

where

and **m**^{k}, **v**^{k},

Experiment 1: Synthetic data

The first experiment was designed to compare the capability of the three metrics in differentiating between sample pairs that are generated from a single process, and those generated from two different processes. In order to explore the strengths and weaknesses of the metrics in a reasonably exhaustive way, we use synthetic data with different generating assumptions, which are not necessarily the valid assumptions for real microarray expression datasets.

We conducted the test over three cases, distinguished by the way samples within a process are linked: (1) Samples are independently generated from a Gaussian distribution, with different expected noise levels for different conditions; (2) Samples are independently generated from a Gaussian distribution, with fixed noise levels over all conditions; (3) Samples are generated as linear transformations from a common mean vector, with random noises added.

A dataset is composed of 200 samples coming from two different processes (100 samples each). The distances between all pairs in the dataset were calculated, ranked, and scaled so that they are evenly distributed over the range [0, 1]. We then grouped the distance values into two classes by the origin of their objects: homologous distances (both samples come from the same process) and heterogeneous distances (the two samples come from two different processes).

Figure

**Distance distributions of the homologous and heterogeneous groups**. Comparison of the capability of the three distance metrics in differentiating between homologous and heterogeneous sample pairs over three generating cases. Red lines: densities of homologous distances (two samples are from the same process); blue lines: densities of heterogeneous distances (two samples are from two different processes). Case 1: Samples are independently generated from a Gaussian distribution with varying noise levels (favours BayesGen); Case 2: Samples are independently generated from a Gaussian distribution with fixed noise (favours Euclidean distance); Case 3: Samples are generated as noisy linear transformations from a common mean vector (favours Pearson correlation).
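The rank-and-scale step of Experiment 1 can be sketched as follows, using the fixed-noise Gaussian case and Euclidean distance as an example. The number of features, the spread of the process means and the noise level are assumptions chosen for illustration; the paper does not state the values it used.

```python
import math
import random
from itertools import combinations


def make_dataset(n_per_process=100, n_features=10, noise_sd=1.0, seed=0):
    """Two processes, each a mean vector plus independent fixed-variance noise."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label in (0, 1):
        mean = [rng.gauss(0.0, 2.0) for _ in range(n_features)]
        for _ in range(n_per_process):
            samples.append([rng.gauss(m, noise_sd) for m in mean])
            labels.append(label)
    return samples, labels


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rank_scale(values):
    """Replace each value by its rank, scaled to lie evenly in [0, 1]."""
    order = sorted(range(len(values)), key=values.__getitem__)
    scaled = [0.0] * len(values)
    for rank, idx in enumerate(order):
        scaled[idx] = rank / (len(values) - 1)
    return scaled


def split_by_origin(samples, labels, metric=euclidean):
    """Rank-scaled distances, split into homologous and heterogeneous pairs."""
    pairs = list(combinations(range(len(samples)), 2))
    scaled = rank_scale([metric(samples[i], samples[j]) for i, j in pairs])
    homologous = [d for d, (i, j) in zip(scaled, pairs) if labels[i] == labels[j]]
    heterogeneous = [d for d, (i, j) in zip(scaled, pairs) if labels[i] != labels[j]]
    return homologous, heterogeneous
```

Because only ranks are kept, metrics with very different scales become directly comparable: a good metric concentrates homologous pairs near 0 and heterogeneous pairs near 1. Any other pairwise function can be passed as `metric`.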

Experiment 2: Functional association discovery

In the second experiment, we examined the direct application of the proposed measurement approach in predicting protein pairs that participate in the same cellular processes from high throughput microarray expression data. Our application was based on the guilt-by-association heuristic

Datasets

We used two public datasets measured genome-wide gene expressions of

The first dataset was extracted from the gene expressions of wild-type and Mec1 defective yeasts in response to two different DNA-damaging agents: methylmethane sulfonate and ionising radiation

The second dataset contains the gene expressions from triple replicates of 14 yeast samples differentiated by their sucrose gradients

Since the purpose of our experiment was to evaluate the proposed measurement directly, without intervention from any other algorithms, we did not apply any imputation method here. All the rows that contain missing values were ignored, leaving a total of 2,222 genes for ^{k}, the following transformation was applied:

where ^{k }are the mean and variance of feature

Experiment and results

For each dataset, 5 pairwise distance matrices were computed using: Euclidean distance on original data, Euclidean distance on normalised data, Pearson correlation on original data, Pearson correlation on normalised data, and BayesGen on original data (BayesGen has the inherent column-wise normalisation in its formula).
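The four baseline matrices can be sketched in plain Python as below; BayesGen is omitted because its closed-form expression is not reproduced here. Two details are assumptions rather than statements from the paper: normalisation is taken to be a column-wise z-score with the population variance, and Pearson correlation is turned into a distance as 1 - r.

```python
import math


def zscore_columns(data):
    """Column-wise z-score: every feature (column) gets mean 0 and variance 1.

    Assumes no column is constant (otherwise the standard deviation is 0).
    """
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation, so perfectly correlated profiles have distance 0."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(vx * vy)


def distance_matrix(data, metric):
    return [[metric(x, y) for y in data] for x in data]


def baseline_matrices(data):
    """Distance matrices for the four non-Bayesian settings."""
    normalised = zscore_columns(data)
    return {
        "euclid": distance_matrix(data, euclidean),
        "euclidNorm": distance_matrix(normalised, euclidean),
        "corr": distance_matrix(data, pearson_distance),
        "corrNorm": distance_matrix(normalised, pearson_distance),
    }
```

Rows are genes and columns are experimental conditions, so each matrix entry compares two expression profiles.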

Given a distance matrix, the smallest

To evaluate the quality of our prediction, we compared the predicted pairs against the positive pairs derived from the combination of Gene Ontology (GO)

**Protein functional association discovery**. Comparison of the capability of the three distance metrics in predicting interacting yeast protein pairs from genome-wide microarray expression data. The standard positive pairs are derived from the annotations of GO terms that received 5/6 votes in an expert survey. (A) Results from Gasch et al.

Experiment 3: Hierarchical clustering application

The aim of the third experiment was to quantify the advantage of the proposed approach in application to a distance-based clustering method. We chose agglomerative hierarchical clustering due to its popularity in the area of gene expression analysis. Starting from a set of

Datasets

We used four public datasets of gene expression profiles measured on cancer patients during the diagnosis stage

The first dataset contained bone marrow samples obtained from acute leukemia patients, measured on the Human Genome HU6800 Affymetrix microarray

The second dataset consisted of leukemia bone marrow samples from ALL-type pediatric patients, measured on the Human Genome U95 Affymetrix microarray, with the focus on the patients' risk of relapse

The third dataset contained 103 cancer samples from 4 distinct tissues (26 breast, 26 prostate, 28 lung, and 23 colon), measured on the Human Genome U95 Affymetrix microarray

The last dataset consisted of diagnostic samples from diffuse large B-cell lymphoma patients, measured on the Human Genome U133A and U133B Affymetrix microarrays

Since it is possible that the datasets contained multiple signatures other than the known phenotypes, they had been preprocessed by applying a signal-to-noise ratio test and selecting the most up-regulated genes for each class

Experiment and results

For each of the described datasets, we calculated the distance matrix using the 5 approaches: Euclidean distance, Euclidean distance with z-score normalisation, Pearson correlation, Pearson correlation with z-score normalisation, and the newly proposed BayesGen. These distance matrices were then fed as inputs to agglomerative hierarchical clustering to obtain one linkage tree for each metric. We used average linkage, which defines the distance between two clusters as the average of all between-cluster distances. Formally, given 2 clusters $C_1$ and $C_2$ of $n_1$ and $n_2$ objects respectively, the distance between $C_1$ and $C_2$ is:

$$d(C_1, C_2) = \frac{1}{n_1 n_2} \sum_{\mathbf{o}_i \in C_1} \sum_{\mathbf{o}_j \in C_2} d(\mathbf{o}_i, \mathbf{o}_j)$$
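Average-linkage agglomeration from a precomputed distance matrix can be sketched in a few lines. This naive version recomputes cluster distances at every merge and is meant only to make the definition concrete; the function names are illustrative, and a library implementation would normally be used in practice.

```python
def average_linkage(dist, c1, c2):
    """Mean of all between-cluster distances for index lists c1 and c2."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def agglomerate(dist, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        candidates = (
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        a, b = min(
            candidates,
            key=lambda ab: average_linkage(dist, clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

For four points at positions 0, 1, 10 and 11 on a line, stopping at two clusters groups the first two and the last two points, and the average-linkage distance between the two groups is 10.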

Hierarchical clustering does not require users to specify the number of clusters beforehand. One can later decide on the number of partitions by looking at the tree structure. However, this process is normally biased and based on one's prior expectations about the data. In an attempt to achieve a reasonable level of fairness for all approaches, we estimated the appropriate number of clusters for each tree using gap statistics

Table

Clustering expression profiles into cancer subtypes

| Dataset | **euclid** | **euclidNorm** | **corr** | **corrNorm** | **bayesGen** |
|---|---|---|---|---|---|
| General leukemia | 0.5447 | 0.1175 | 0.7491 | 0.1817 | 0.8076 |
| Pediatric leukemia | 0.1982 | 0.4789 | 0.2014 | 0.9129 | 0.9413 |
| Multiple tissues | 0.5304 | 0.9082 | 0.6416 | 0.783 | 0.9726 |
| B-cell lymphoma | 0.0016 | 0.0008 | 0.4407 | 0.1745 | 0.9053 |
| Average | 0.3187 | 0.3764 | 0.5082 | 0.5130 | 0.9067 |

The numbers of clusters estimated from gap statistics are shown in table

Predicting number of clusters using gap statistics

| Dataset | **true number** | **euclid** | **euclidNorm** | **corr** | **corrNorm** | **bayesGen** |
|---|---|---|---|---|---|---|
| General leukemia | 3 | 3 | 3 | 3 | 3 | 4 |
| Pediatric leukemia | 6 | 3 | 13 | 2 | 15 | 7 |
| Multiple tissues | 4 | 6 | 7 | 6 | 9 | 4 |
| B-cell lymphoma | 3 | 2 | 2 | 15 | 15 | 6 |
| Average difference | | 1.2 | 2.2 | 3.6 | 5.2 | 1.0 |

**Cluster structures resulting from the use of different metrics in hierarchical clustering**. Comparison of the cluster structures obtained with different distance metrics in hierarchical clustering over 4 cancer datasets. Top row: the true structure derived from known phenotypes; middle row: the structure obtained with BayesGen (which gave the highest Rand indices); bottom row: the structure obtained with the metric that gave the second-best Rand indices.

Conclusion

We suggested the use of BayesGen - a new metric for measuring similarity/distance between gene expression profiles. Based on the observation that both data points' intensity levels and their relative variance jointly contribute to the identification of the underlying cellular processes, the metric was derived using a full Bayesian approach, which incorporates as prior knowledge the global characteristics of the whole dataset.

In comparison to Euclidean distance and Pearson correlation, BayesGen was shown to be superior in predicting interacting protein pairs through the construction of pairwise relevance networks. The profound effect of metric selection on clustering results was confirmed in the last experiment, which showed that BayesGen brings a significant improvement to hierarchical clustering in terms of both partition accuracy and cluster structure. Although it encodes more information, BayesGen shares the computational simplicity of the other two metrics, and we expect it to integrate seamlessly into any downstream distance-based approach.

Although inspired by gene expression data, BayesGen was designed with a general purpose in mind, and can equally be applied to other object-feature data. The test on synthetic data under different generating assumptions showed that BayesGen is robust enough to be considered a safe choice in most cases. Work is in progress to extend the same Bayesian framework to other data types, including relational and structured data.

Methods

Euclidean distance

The Euclidean distance between two expression profiles $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as follows:

$$d_E(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_k \left(x_i^k - x_j^k\right)^2}$$

which measures the absolute distance between expression profiles in the ** θ **= {

where _{0 }is of the form

where

which is the squared version of Euclidean distance.

Pearson correlation

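The Pearson correlation between two expression profiles $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as follows:

$$r(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_k \left(x_i^k - \bar{x}_i\right)\left(x_j^k - \bar{x}_j\right)}{\sqrt{\sum_k \left(x_i^k - \bar{x}_i\right)^2}\,\sqrt{\sum_k \left(x_j^k - \bar{x}_j\right)^2}}$$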

where **x**_{i }= **x**_{j }where

Note that while Euclidean and Bayesian distance treat each expression profile **x**_{i }as a vector of

Gap statistic

Gap statistic

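Given a partition with $k$ clusters $C_1, \ldots, C_k$ of size $n_1, \ldots, n_k$, the dispersion index at $k$ is defined as:

$$W_k = \sum_{r=1}^{k} \frac{D_r}{2 n_r}$$

where $D_r$ is the sum of within-cluster distances of cluster $C_r$.

At each point of $k$, the gap statistic compares $\log W_k$ with its expectation under a null reference distribution:

$$\mathrm{Gap}_k = E^{*}[\log W_k] - \log W_k$$

where $E^{*}$ denotes the expectation under the reference distribution.

Tibshirani et al. proposed choosing the number of clusters as the smallest $k$ such that:

$$\mathrm{Gap}_k \geq \mathrm{Gap}_{k+1} - s_{k+1}$$

where $s_{k+1}$ is the standard deviation of $\log W_{k+1}$ over the reference datasets.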

1. Choose candidate numbers $\{k^c\}$ as all $k$ such that $\mathrm{Gap}_k \geq \mathrm{Gap}_{k+1}$.

2. Choose the number of clusters as the smallest $k^c$ such that
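The building blocks of the gap statistic can be sketched as follows. The definitions used are the standard ones from Tibshirani et al. (dispersion index summed over ordered within-cluster pairs, and the "smallest k within one standard deviation" rule); they are assumptions here, and the sketch does not implement the two-step variant listed above. Generating the reference datasets is also left out.

```python
import math


def dispersion_index(dist, clusters):
    """W_k = sum over clusters of D_r / (2 n_r).

    D_r sums the distances over all ordered pairs inside cluster r,
    so each unordered pair is counted twice.
    """
    total = 0.0
    for members in clusters:
        d_r = sum(dist[i][j] for i in members for j in members)
        total += d_r / (2.0 * len(members))
    return total


def gap_value(reference_dispersions, observed_dispersion):
    """Mean of log W_k over reference datasets minus log W_k of the data."""
    mean_log_ref = sum(math.log(w) for w in reference_dispersions) / len(
        reference_dispersions
    )
    return mean_log_ref - math.log(observed_dispersion)


def tibshirani_choice(gaps, sds):
    """Smallest k with gaps[k] >= gaps[k + 1] - sds[k + 1].

    gaps and sds are dicts keyed by the number of clusters k.
    Returns None if no k satisfies the rule.
    """
    for k in sorted(gaps):
        if k + 1 in gaps and gaps[k] >= gaps[k + 1] - sds[k + 1]:
            return k
    return None
```

Given a linkage tree, one would cut it at each candidate k, compute `dispersion_index` for the data and for each reference dataset, and pass the resulting gap values to the selection rule.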

Adjusted Rand index

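The Rand index measures the agreement between two partitions $P_1$ and $P_2$ of the same $n$ objects, with $k_1$ and $k_2$ groups respectively ($k_1$ and $k_2$ are not necessarily equal). The matching of the two partitions is summarised by a confusion matrix $\mathbf{C}$ of size $k_1 \times k_2$, where $c_{ij}$ is the number of objects in group $i$ of $P_1$ that are also in group $j$ of $P_2$. The Rand index computes the probability that any 2 out of the $n$ objects are treated consistently by the two partitions, that is, either grouped together in both or separated in both.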

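The adjusted Rand index corrects the Rand index for chance agreement, so that its expected value is 0 for random partitions and its maximum is 1:

$$\mathrm{ARI} = \frac{\sum_{i,j} \binom{c_{ij}}{2} - t_3}{\frac{1}{2}(t_1 + t_2) - t_3}$$

where

$$t_1 = \sum_i \binom{c_{i\cdot}}{2}, \qquad t_2 = \sum_j \binom{c_{\cdot j}}{2}$$

and

$$t_3 = \frac{2\, t_1 t_2}{n(n-1)}$$

with $c_{ij}$ the entries of the confusion matrix $\mathbf{C}$, $c_{i\cdot}$ and $c_{\cdot j}$ its row and column sums, and $n$ the total number of objects.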
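The adjusted Rand index can be computed directly from the confusion matrix. The sketch below uses the standard Hubert-Arabie form of the index; the function names are illustrative, and the degenerate case where both partitions consist of a single group (zero denominator) is not handled.

```python
from math import comb


def confusion_matrix(labels_a, labels_b):
    """Counts of objects falling in group i of partition A and group j of B."""
    index_a = {g: i for i, g in enumerate(sorted(set(labels_a)))}
    index_b = {g: j for j, g in enumerate(sorted(set(labels_b)))}
    matrix = [[0] * len(index_b) for _ in index_a]
    for a, b in zip(labels_a, labels_b):
        matrix[index_a[a]][index_b[b]] += 1
    return matrix


def adjusted_rand_index(matrix):
    """Chance-corrected pair-counting agreement between two partitions."""
    n = sum(sum(row) for row in matrix)
    pairs_cells = sum(comb(v, 2) for row in matrix for v in row)
    pairs_rows = sum(comb(sum(row), 2) for row in matrix)
    pairs_cols = sum(comb(sum(col), 2) for col in zip(*matrix))
    expected = pairs_rows * pairs_cols / comb(n, 2)
    maximum = 0.5 * (pairs_rows + pairs_cols)
    return (pairs_cells - expected) / (maximum - expected)
```

Relabelling the groups does not change the index: the partitions `[0, 0, 1, 1]` and `[1, 1, 0, 0]` give a value of 1, whereas `[0, 0, 1, 1]` against `[0, 1, 0, 1]` gives -0.5, which is worse than chance.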

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

VAN conceived and designed the study, carried out experiments, and drafted the manuscript. PL supervised the work, discussed the results and critically revised the paper. Both authors read and approved the final manuscript.

Note

Other papers from the meeting have been published as part of

Acknowledgements

VAN was supported by the Computer Laboratory Premium Studentship, Cambridge Overseas Research Studentship, Cambridge Overseas Trust, and King's College Studentship. This work was supported by the WADA (World Anti-Doping Agency) grant to PL, and the Ferris Fund from King's College to VAN.

This article has been published as part of