Department of Computer Science, University of Minnesota - Twin Cities, Minneapolis, MN 55455, USA

Abstract

Background

An important analysis performed on microarray gene-expression data is the discovery of biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations, such as an inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and the inability of some of the approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which limits its applicability to real-life data sets where the biclusters may be fragmented due to random noise/errors. Moreover, as it only works with binary or boolean attributes, its application to gene-expression data requires transforming real-valued attributes to binary attributes, which often results in loss of information. Many past approaches have tried to address the issues of noise and of handling real-valued attributes independently, but there is no systematic approach that addresses both of these issues together.

Results

In this paper, we first propose a novel error-tolerant biclustering model, ‘

Conclusions

The results obtained for both problems, functional module discovery and biomarker discovery, clearly signify the usefulness of the proposed

Background

Recent technical advancements in DNA microarray technologies have led to the availability of large-scale gene expression data. These data sets can be represented as a matrix _{ij}

Association pattern mining can naturally address some of the issues faced by biclustering algorithms, i.e., finding overlapping biclusters and performing an exhaustive search. However, there are two major drawbacks of traditional association mining algorithms. First, these algorithms use a strict definition of support that requires every item (gene) in a pattern (bicluster) to occur in each supporting transaction (experimental condition). This limits the recovery of patterns from noisy real-life data sets, as patterns are fragmented due to random noise and other errors in the data. Second, since traditional association mining was originally developed for market basket data, it only works with binary or boolean attributes. Hence its application to data sets with continuous or categorical attributes requires transforming them into binary attributes, which can be performed by using discretization
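To make the first drawback concrete, the following minimal sketch counts support under the strict definition on a toy binary data set. The function name `strict_support` and the toy gene names are illustrative assumptions, not part of the original work; each transaction is modelled as the set of genes that are "on" in one experimental condition.

```python
from typing import List, Set


def strict_support(transactions: List[Set[str]], pattern: Set[str]) -> int:
    """Count transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


# Toy data (hypothetical): three conditions express g1, g2 and g3 together.
clean = [{"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g4"}]
# The same data with a single entry (g2 in the second condition) lost to noise.
noisy = [{"g1", "g2", "g3"}, {"g1", "g3"}, {"g1", "g2", "g3"}, {"g4"}]

pattern = {"g1", "g2", "g3"}
print(strict_support(clean, pattern))  # 3
print(strict_support(noisy, pattern))  # 2
```

A single missing entry removes an entire transaction from the support of the pattern, which is how a true pattern can fall below a support threshold and appear fragmented.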

Efforts have been made to address the two issues mentioned above independently, but to the best of our knowledge, no prior work has addressed both issues together. For example, various methods

Another recent approach

As it has been independently shown that both issues, handling real-valued attributes and noise, are critical and affect the results of the mining process, it is important to address them together. In this paper, we propose a novel extension of association pattern mining to discover error-tolerant biclusters (or patterns) directly from real-valued gene-expression data. We refer to this approach as ‘

To demonstrate the efficacy of our proposed

For the first problem of functional module discovery, we used real-valued

Contributions

• We proposed a novel association pattern mining based approach to discover error-tolerant biclusters from noisy real-valued gene-expression data.

• Our work highlights the importance of tolerating error(s) in the biclusters in order to capture the true underlying structure in the data. This is demonstrated using two case studies: functional module discovery and biomarker discovery. Using various real-valued gene expression data sets, we illustrated that our proposed algorithm

• We used two randomization techniques to compute the empirical p-value of all the discovered error-tolerant biclusters and demonstrated that they are statistically significant, i.e., highly unlikely to have been obtained by random chance.

**Organization:** The rest of the paper is organized as follows. In Section 2, we discuss our proposed algorithm

Experimental results and discussion

We implemented our proposed association pattern mining approach ‘

**Selecting top biclusters:** As association mining based approaches generally produce a large number of biclusters that often overlap substantially with each other, this redundancy may bias the evaluation. Hence, we used a commonly adopted selection methodology similar to the one proposed by

Case study 1 - discovery of functional modules

We used the following two real-valued

• Hughes et al’s data set _{10} ratio of expression values observed for experimental condition and control condition) in the range [-2,2].

• Mega Yeast data set

**Functional enrichment analysis:** Since the discovered biclusters represent groups of genes that are expected to co-express with each other, we evaluated all the biclusters discovered in terms of their functional coherence using the biological processes annotation hierarchy of Gene Ontology

To compare error-tolerant biclusters and

Quantitative analysis of biclusters

Table

This table shows various statistics of all the biclusters obtained using

| Run ID | Parameter settings | # total biclusters | # genes covered ^{1} | # top biclusters | # genes covered ^{2} | Size distribution ^{2} | Time taken (seconds) |
|---|---|---|---|---|---|---|---|
| **Error-tolerant biclusters on Mega Yeast data set** | | | | | | | |
| _{M}_{1} | [120 150), ε = 0.25 for RS ≥ 150 | 153,960 | 361 | 500 | 295 | 2:128, 3:235, 4:8, 5:76, 6:39, 7:7, 8:2, 9:1, 10:2, 11:1, 13:1 | 10,560 |
| _{M}_{2} | α = 0.3, | 271,101 | 792 | 500 | 233 | 3:203, 4:28, 5:177, 6:80, 7:5, 8:3, 9:3, 10:1 | 33,000 |
| **RAP biclusters on Mega Yeast data set** | | | | | | | |
| _{M1} | α = 0.5, | 33,330 | 361 | 500 | 247 | 2:68, 3:379, 4:33, 5:16, 6:4 | 642 |
| _{M}_{2} | | 94,806 | 792 | 500 | 241 | 3:384, 4:68, 5:43, 6:5 | 7,580 |
| **Error-tolerant biclusters on Hughes et al's data set** | | | | | | | |
| _{H}_{1} | | 150,372 | 506 | 496 | 437 | 2:210, 3:187, 4:12, 5:66, 6:14, 7:3, 8:1, 10:1, 11:1, 13:1 | 8,360 |
| _{H}_{2} | | 234,761 | 1135 | 500 | 443 | 2:115, 3:258, 4:22, 5:69, 6:24, 7:6, 8:1, 9:2, 11:1, 13:1, 14:1 | 21,745 |
| **RAP biclusters on Hughes et al's data set** | | | | | | | |
| _{h}_{1} | | 56,009 | 506 | 495 | 438 | 2:212, 3:207, 4:25, 5:40, 6:5, 7:3, 8:2, 11:1 | 2,835 |
| _{h}_{2} | | 80,335 | 1135 | 500 | 405 | 2:96, 3:303, 4:18, 5:75, 6:2, 7:2, 8:3, 12:1 | 1,505 |

Statistics of biclusters obtained using '^{1 }all biclusters, ^{2} top biclusters).

Parameter controlling error-tolerance (ε) was set to 0.25 in all the runs for

**Number of biclusters:** It can be clearly seen from Table

**Size of biclusters:** Another important observation one can make from the results shown in Table

**Coverage of genes and relationships among them:** As can be noted from Table

As mentioned above and shown in Table

Functional enrichment using GO biological processes

As mentioned earlier, a p-value for each of the (bicluster, GO term) pair is computed for the selected top 500 biclusters using the 2652 biological processes GO terms considered in this study. To demonstrate how well error-tolerant and _{10}(_{10}(
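As an illustration of how a p-value for a single (bicluster, GO term) pair might be obtained, the sketch below uses the upper tail of the hypergeometric distribution, a common choice for enrichment analysis. The choice of test, the function name `hypergeom_enrichment_pvalue`, and the parameter names are assumptions made for illustration; the text here does not specify the exact test.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_background: int, n_annotated: int,
                                n_bicluster: int, n_overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    X is the number of annotated genes seen when n_bicluster genes are drawn
    without replacement from n_background genes, of which n_annotated carry
    the GO term.
    """
    total = comb(n_background, n_bicluster)
    upper = min(n_annotated, n_bicluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_bicluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 background genes, 3 annotated with the term,
# and a 3-gene bicluster in which all 3 genes carry the annotation.
p = hypergeom_enrichment_pvalue(10, 3, 3, 3)
print(p)
```

Smaller values indicate that the overlap between the bicluster and the GO term is less likely to arise from a random draw of the same size.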

This figure shows the relationship between the size of biclusters and their enrichment scores as computed using GO biological processes for both Mega Yeast and Hughes et al’s data sets.


Consider mega yeast data for example, while

Further, considering various p-value thresholds (from loose, 5 × 10^{–2}, to strict, 1 × 10^{–5}), we collected two more statistics: first, the fraction of biclusters that are enriched by at least one GO term, and second, the fraction of GO terms that enriched at least one bicluster. To illustrate the efficacy of _{M}_{2}) were enriched, only 76% of the top 500 RAP biclusters (corresponding to Run ID _{M}_{2}) were enriched by at least one GO term at a reasonable p-value threshold of 1 × 10^{–3}, a gain of 7%. At an even stricter p-value threshold of 1 × 10^{–5}, the gain is 11%. Similarly, for Hughes et al’s data set, though the gain is not significant, biclusters obtained from _{10}(
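The two statistics described above can be sketched as follows, assuming the p-values are held in a matrix with one row per bicluster and one column per GO term. The function name `enrichment_fractions`, the toy matrix, and the use of `<=` for the threshold comparison are illustrative assumptions.

```python
from typing import List, Tuple


def enrichment_fractions(pvals: List[List[float]],
                         threshold: float) -> Tuple[float, float]:
    """pvals[i][j] is the p-value of bicluster i against GO term j.

    Returns (fraction of biclusters enriched by at least one term,
             fraction of terms that enrich at least one bicluster).
    """
    n_biclusters = len(pvals)
    n_terms = len(pvals[0])
    enriched_biclusters = sum(
        1 for row in pvals if any(p <= threshold for p in row)
    )
    enriching_terms = sum(
        1 for j in range(n_terms) if any(row[j] <= threshold for row in pvals)
    )
    return enriched_biclusters / n_biclusters, enriching_terms / n_terms


# Toy matrix (hypothetical): 4 biclusters x 3 GO terms.
toy = [
    [1e-6, 0.2, 0.5],
    [0.04, 1e-4, 0.9],
    [0.3, 0.6, 0.7],
    [2e-3, 0.8, 0.4],
]
for thr in (5e-2, 1e-3, 1e-5):
    print(thr, enrichment_fractions(toy, thr))
```

Sweeping the threshold from loose to strict, as in the loop above, shows how both fractions shrink as the significance requirement tightens.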

Statistical significance of error-tolerant biclusters using randomization tests

Motivated by the discussion of randomization tests and their importance in validating the results from any data mining approach

In the first randomization test, conserving the size of the top 500 error-tolerant biclusters, we generated 1000 random sets of 500 biclusters each and evaluated them by the same functional enrichment analysis using GO biological processes. So effectively, for each actual error-tolerant bicluster, we generated 1000 random biclusters of the same size (in terms of number of genes). The empirical p-value for each actual error-tolerant bicluster is then computed as the fraction of random biclusters (out of total 1000) whose enrichment score (_{10}(^{–3}.
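A minimal sketch of this first test is given below. It assumes that a higher enrichment score is better and that a random bicluster counts against the actual one when its score is at least as high; both the comparison direction and the helper names (`random_gene_sets`, `empirical_pvalue`) are assumptions for illustration. The scoring of each random gene set is left out, since it would reuse the same enrichment analysis as for the actual biclusters.

```python
import random
from typing import List, Sequence


def random_gene_sets(all_genes: Sequence[str], size: int, n_sets: int,
                     seed: int = 0) -> List[List[str]]:
    """Draw n_sets random gene sets, each with the same size as the actual bicluster."""
    rng = random.Random(seed)
    return [rng.sample(list(all_genes), size) for _ in range(n_sets)]


def empirical_pvalue(actual_score: float,
                     random_scores: Sequence[float]) -> float:
    """Fraction of random biclusters scoring at least as high as the actual one."""
    hits = sum(1 for s in random_scores if s >= actual_score)
    return hits / len(random_scores)


# Toy usage (hypothetical gene names and scores).
genes = [f"g{i}" for i in range(50)]
null_sets = random_gene_sets(genes, size=5, n_sets=1000)

# Stand-in null scores: the values 0..9, each occurring 100 times.
toy_scores = [i % 10 for i in range(1000)]
print(empirical_pvalue(9, toy_scores))   # 0.1
print(empirical_pvalue(12, toy_scores))  # 0.0
```

With 1000 random sets, the smallest non-zero empirical p-value that can be resolved is 1/1000.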

Figure _{10}(^{–5} to ensure that they stand out from the rest. Therefore, all the biclusters showing (_{10}(_{10}(

This figure shows the biological and empirical p-values (using 1000 random runs) of the biclusters obtained using our proposed


We also showed in Table _{M2}_{10}(

This table shows the statistical significance of biclusters obtained from our proposed

| Run ID | pval ≤ 0.05 | pval ≤ 0.01 | pval ≤ 0.005 | pval ≤ 0.001 | pval ≤ 0.00001 |
|---|---|---|---|---|---|
| _{M}_{1} | 660 | 33 | 0 | 0 | 0 |
| _{M}_{2} | 660 | 76 | 4 | 0 | 0 |
| _{H}_{1} | 797 | 0 | 0 | 0 | 0 |
| _{H}_{2} | 886 | 0 | 0 | 0 | 0 |

Statistical significance of biclusters obtained from

In the second randomization test, we randomized the data itself by shuffling the data values among the conditions for each gene. By doing this, we conserved the distribution of each gene profile but broke the correlation among them. We ran our proposed

Both of the above randomization tests suggest that the error-tolerant biclusters obtained from the real-valued gene-expression data sets are indeed biologically meaningful: they are neither obtained by random chance nor do they capture random structures in the data.
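The data randomization used in the second test can be sketched as below, assuming a matrix layout with genes as rows and conditions as columns. The function name `shuffle_within_genes` and the toy values are illustrative assumptions.

```python
import random
from typing import List


def shuffle_within_genes(matrix: List[List[float]],
                         seed: int = 0) -> List[List[float]]:
    """Permute each gene's values across conditions, independently per gene.

    Rows are genes and columns are conditions (an assumed layout). Each row
    keeps its own multiset of values, so the per-gene distribution is
    conserved while the alignment between genes is broken.
    """
    rng = random.Random(seed)
    shuffled = []
    for row in matrix:
        new_row = list(row)  # copy so the input matrix is left untouched
        rng.shuffle(new_row)
        shuffled.append(new_row)
    return shuffled


# Toy matrix (hypothetical): 3 genes x 5 conditions.
data = [
    [0.9, 1.1, -0.2, 0.0, 1.0],
    [0.8, 1.2, -0.1, 0.1, 0.9],
    [-1.0, -1.1, 0.3, 0.2, -0.9],
]
randomized = shuffle_within_genes(data)
print(randomized)
```

Because every row is permuted independently, any bicluster found in the randomized matrix reflects chance alignment rather than genuine co-expression.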

Case study 2 - discovery of biomarkers

We used four real-valued

We discovered biclusters on combined Breast Cancer gene-expression data set using

**Selecting discriminative biclusters:** First, we select top biclusters using the approach defined earlier; then, amongst the top biclusters, only those that discriminate between the two groups of patients, cases and controls, are selected as biomarkers. To measure the discriminative power, we used two measures, odds ratio and p-value. While the odds ratio quantifies how different cases and controls are for a specific bicluster, the p-value quantifies the significance of the difference reflected by the odds ratio. Only those biclusters are selected that have a p-value of less than 0.05 and an odds ratio of more than 2.0 (biclusters more represented in cases) or less than 0.5 (biclusters more represented in controls).
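The selection rule above can be sketched as follows. The sketch assumes the odds ratio is computed from a 2×2 table of patients who do or do not support the bicluster in each group, takes the p-value as a precomputed input (the significance test is not specified here), and adds 0.5 to every cell when any cell is zero; the `Candidate` type, the zero-cell correction, and all names are illustrative assumptions.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    cases_with: int       # cases supporting the bicluster
    cases_total: int
    controls_with: int    # controls supporting the bicluster
    controls_total: int
    p_value: float        # precomputed by whichever significance test is used


def odds_ratio(c: Candidate, correction: float = 0.5) -> float:
    """Odds of supporting the bicluster in cases divided by the odds in controls."""
    a = c.cases_with
    b = c.cases_total - c.cases_with
    cw = c.controls_with
    d = c.controls_total - c.controls_with
    if 0 in (a, b, cw, d):
        # Assumed continuity correction to avoid division by zero.
        a, b, cw, d = (x + correction for x in (a, b, cw, d))
    return (a * d) / (b * cw)


def is_discriminative(c: Candidate, max_p: float = 0.05,
                      high_or: float = 2.0, low_or: float = 0.5) -> bool:
    """Apply the thresholds stated in the text: p < 0.05 and OR > 2.0 or OR < 0.5."""
    ratio = odds_ratio(c)
    return c.p_value < max_p and (ratio > high_or or ratio < low_or)


# Toy candidates (hypothetical counts and p-values).
candidates = [
    Candidate("A", 30, 100, 10, 100, 0.001),   # enriched in cases
    Candidate("B", 20, 100, 18, 100, 0.6),     # no real difference
    Candidate("C", 5, 100, 25, 100, 0.0004),   # enriched in controls
    Candidate("D", 30, 100, 10, 100, 0.2),     # large ratio, not significant
]
selected = [c.name for c in candidates if is_discriminative(c)]
print(selected)  # ['A', 'C']
```

Requiring both conditions filters out biclusters with a large but unreliable odds ratio (candidate D) as well as those with no meaningful difference between groups (candidate B).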

**Functional enrichment analysis:** We evaluated all the identified biomarkers in terms of their enrichment scores using the MSigDB gene sets _{10}(_{min}

Enrichment analysis using MSigDB gene sets

Considering various p-value thresholds (from 10^{–6} to 10^{–14}), Figure ^{–8} (corresponding to _{10}(

(a) This figure shows the fraction of biomarkers enriched by at least one MSigDB gene set. (b) This figure shows the fraction of MSigDB gene sets enriched by at least one biomarker.


Now refer to Figure ^{–6} (corresponding to _{10}(^{–8} are 1.01% (55 gene sets) and 0.26% (14 gene sets), respectively.

After observing these global statistics for biomarkers obtained using

This figure shows the relationship among enrichment score computed using MSigDB gene sets, support (number of samples supporting the biomarker) and size (number of genes) of biomarkers obtained using


It is clear from the above analysis that the biomarkers obtained from

Biological relevance - example

We also observed the network based enrichment for an example biomarker obtained by each of the algorithms,

This figure shows the top network enriched based on an example biomarker (8 genes) obtained using our proposed


During metastasis, tumor cells can interact with the ECM through adhesion molecules such as integrins. In fact,

This figure shows the top network enriched based on an example biomarker (corresponding to the one obtained using ET-bicluster algorithm and considered in Figure


Thus the network obtained by the bigger

Conclusions

We proposed a novel error-tolerant biclustering model and presented a heuristic-based algorithm ‘

We presented two biological case studies, functional module discovery and biomarker discovery, to demonstrate the importance of incorporating noise and errors in the data for discovering coherent groups of genes. In both the case studies, we found that the biclusters discovered using our proposed

The work presented in this study can be extended in various ways. Below we discuss some of the limitations of the

• Since the

• The current implementation of

We only presented comparison of

It is also important to note that gene-expression data provides a useful but limited view of the genome, and therefore biclusters obtained from gene-expression data alone may not elucidate the complete underlying biological mechanism. Therefore to further illustrate the utility of

Methods

Error-tolerant bicluster model for real-valued data

As shown in

(a) **Bicluster composition:** Unlike the case of binary data, where a collection of 1s was defined as a bicluster, in the case of real-valued data, similar values across a set of rows constitute a bicluster. These values can be any values in the set ℝ and, although similar across rows, they can be different for different rows. The errors in the biclusters defined on real-valued attributes are introduced in a way similar to the binary case. However, whereas in the binary case all non-error entries are the same (1s), imposing such a requirement in the real-valued case would be too strict. Therefore, a measure is needed to check the coherence among the gene-expression values. For this purpose, we use the _{val} –_{val}_{val}

(b) **Positive/negative values:** Unlike binary data, real-valued microarray data has both positive and negative values. In this case, it is important to consider the sign of the value to discover meaningful biclusters. Similar to

(c) **Error/non-error values:** In the binary case, 1 is always a non-error value and 0 an error value. This notion is no longer valid for real-valued data. For example, consider an error-tolerant bicluster shown in Figure

With an understanding of these specific challenges and potential ways to address them, we now give the formal definition of error-tolerant biclusters for real-valued data.

Definition of error-tolerant biclusters

Intuitively, a bicluster

•

• All supporting transactions of bicluster

∀_{t}_{,}_{G}

Thus, according to the definition, the fraction of errors in each supporting transaction of the bicluster should not exceed

Algorithm to discover error-tolerant biclusters from real-valued data

Starting with singletons, the

Checking the range criterion to ensure the coherence of values depends on the number of permissible errors at a particular bicluster-level (

Again, if any of the cases satisfies the

An example

Considering a sample real-valued data with 5 genes (a, b, c, d, and e) and 8 experimental conditions (1 through 8) as shown in Figure

A sample matrix showing an example of error-tolerant bicluster.


**Step 1: **

**Step 2: **

**Step 3: **

**Step 4: **_{r}

(((2^{nd}max

**Step 5: **

It is important to note that since

Authors' contributions

RG and VK conceived and designed the study. RG, NR and VK developed the proposed approach and the evaluation methodologies. RG and NR prepared the implementation and experimental results. All the authors participated in the preparation of the manuscript and approved the final version.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work was supported by NSF grants IIS-0916439, CRI-0551551 and a University of Minnesota Rochester Biomedical Informatics and Computational Biology (BICB) Program Traineeship Award (Rohit Gupta). Access to computing facilities was provided by the Minnesota Supercomputing Institute.

This article has been published as part of