LERIA, Université d'Angers, 2 Boulevard Lavoisier, 49045 Angers Cedex 01, France

LaTICE, Higher School of Sciences and Technologies of Tunis, 5 Avenue Taha Hussein, B. P. : 56, Bab Menara, 1008 Tunis, University of Tunis, Tunisia

Abstract

Background

Biclustering aims at finding subgroups of genes that show highly correlated behaviors across a subgroup of conditions. Biclustering is a very useful tool for mining microarray data and has various practical applications. From a computational point of view, biclustering is a highly combinatorial search problem and can be solved with optimization methods.

Results

We describe a stochastic pattern-driven neighborhood search algorithm for the biclustering problem. Starting from an initial bicluster, the proposed method progressively improves the quality of the bicluster by adjusting some of its genes and conditions. The adjustments are based on the quality of each gene and condition with respect to the bicluster and the initial data matrix. The performance of the method was evaluated on two well-known microarray datasets (

Conclusions

The proposed method is computationally fast and can be applied to discover significant biclusters. It can also be used to effectively improve the quality of existing biclusters provided by other biclustering methods.

Background

DNA microarray technology makes it possible to monitor and measure the expression levels of tens of thousands of genes simultaneously in a cell mixture, in a single experiment, under diverse experimental conditions. DNA microarray data are typically represented by a large matrix in which each row contains the expression levels of one gene under specific conditions (columns). Since its invention, this technology has found many applications in biological and medical research. For instance, it is being used in cancer studies to better understand the biological mechanisms underlying oncogenesis, to discover new targets and new drugs, and to develop predictors for tailoring individualized treatments

Microarray data analysis is a critical step in practical applications and is often carried out with the help of data mining techniques

Another general approach to microarray data analysis relies on unsupervised classification (or clustering) methods. These cluster analysis methods try to identify groups of genes and/or groups of conditions (samples) that exhibit similar expression patterns

From a computational point of view, the biclustering problem is a highly combinatorial search problem and is known to be NP-hard

1. Greedy iterative search approach: Greedy biclustering algorithms build a solution by starting from the initial data matrix (or a transformed matrix) and iteratively removing bad genes/conditions according to a quality criterion. For instance, the algorithm presented in

2. Bicluster enumeration approach: This approach tries to enumerate (implicitly) all the biclusters. The enumeration process is often represented by a search tree. During the construction of the search tree, some nodes are closed as soon as some pruning conditions are fulfilled. For instance, in

3. Stochastic search approach: This approach can be further divided into neighborhood search and evolutionary search. In neighborhood search, one begins with an initial candidate solution (bicluster) and iteratively improves its quality by replacing the bicluster with a neighboring bicluster. The neighboring bicluster is typically obtained by replacing a gene/condition with a better one. Cheng and Church

In this paper we introduce a stochastic neighborhood search algorithm called

Method

Preprocessing of gene expression matrix

Prior to the search by PDNS, our method first applies a preprocessing step to transform the input data matrix

Formally, the behavior matrix

with

Figure

**Construction of bicluster pattern**.

Pattern-driven neighborhood search for biclustering - general procedure

Our proposed PDNS method can be considered as an Iterated Local Search procedure

The key originality of PDNS concerns the use of the bicluster pattern both in its search space and in its neighborhood definition. The bicluster pattern is a characteristic representation of a bicluster. It is used to evaluate the genes/conditions of a bicluster. This representation is defined by the behavior matrix of the bicluster, i.e., the trajectory patterns of the genes under all combined conditions of the bicluster. This representation is important because it is well recognized that in microarray data, genes are considered to belong to the same cluster if they have similar trajectory patterns of expression levels
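As a rough illustration of what "trajectory patterns under all combined conditions" could look like in practice, the sketch below builds a sign matrix with one column per pair of conditions. The exact encoding is an assumption made here for illustration (+1 when expression rises from the first condition of the pair to the second, -1 when it falls, 0 when it is unchanged); the function name `behavior_matrix` and the toy data are hypothetical and do not come from the paper.

```python
from itertools import combinations


def sign(x):
    """Return +1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def behavior_matrix(matrix):
    """Encode each gene's trajectory over every pair of conditions.

    Assumed encoding (illustrative only): for the pair (k, q) with k < q,
    the entry is the sign of matrix[i][q] - matrix[i][k].
    Columns are ordered as itertools.combinations yields them.
    """
    n_conditions = len(matrix[0])
    pairs = list(combinations(range(n_conditions), 2))
    rows = [[sign(row[q] - row[k]) for k, q in pairs] for row in matrix]
    return rows, pairs


# Toy expression matrix: 3 genes (rows) under 3 conditions (columns).
expression = [
    [1.0, 3.0, 2.0],
    [0.5, 0.9, 0.7],
    [4.0, 1.0, 4.0],
]
behavior, pairs = behavior_matrix(expression)
for gene, row in enumerate(behavior):
    print(f"gene {gene}: {row}")
```

In this toy example the first two genes have different expression magnitudes but identical trajectories, which is the kind of similarity a pattern-based representation is meant to expose.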

Starting from an initial bicluster (call it the current solution s), PDNS uses a descent strategy to explore the pattern-based neighborhood and moves to an improving neighboring solution at each iteration. Using the bicluster pattern, we define a set of rules that allow us to qualify the goodness (or badness) of a gene and of a condition. Using these rules (explained in a later section "Neighborhood and its exploration"), PDNS iteratively replaces bad genes/conditions within the current bicluster by good ones, thus progressively improving the quality of the bicluster under consideration. This iterative improvement procedure stops when the last bicluster attains a fixed quality threshold according to the ASR evaluation function (see next section) or when a fixed number

The whole PDNS algorithm stops when the best bicluster is not updated for a fixed number

**General PDNS procedure**.
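The alternation between descent and perturbation described above can be sketched as a generic iterated local search loop. This is a minimal skeleton under stated assumptions, not the authors' implementation: the parameter names (`threshold`, `max_descent_steps`, `max_stale_rounds`) are hypothetical, the stopping rule counts consecutive rounds without improvement, and the toy objective on integers only stands in for a bicluster quality function.

```python
import random


def iterated_local_search(initial, score, improve, perturb,
                          threshold, max_descent_steps, max_stale_rounds, rng):
    """Generic descent + perturbation loop (illustrative skeleton only)."""
    best = current = initial
    stale = 0
    while stale < max_stale_rounds:
        # Descent phase: move to an improving neighbor until the quality
        # threshold is met, a local optimum is hit, or the step budget ends.
        for _ in range(max_descent_steps):
            if score(current) >= threshold:
                break
            candidate = improve(current)
            if score(candidate) <= score(current):
                break  # no improving neighbor: local optimum
            current = candidate
        # Record the best solution and count rounds without improvement.
        if score(current) > score(best):
            best, stale = current, 0
        else:
            stale += 1
        # Perturbation phase: restart the descent from a modified best solution.
        current = perturb(best, rng)
    return best


# Toy stand-in problem: maximize -(x - 7)^2 over the integers.
def score(x):
    return -(x - 7) ** 2


def improve(x):
    return max((x - 1, x + 1), key=score)


def perturb(x, rng):
    return x + rng.randint(-3, 3)


result = iterated_local_search(0, score, improve, perturb,
                               threshold=0, max_descent_steps=50,
                               max_stale_rounds=5, rng=random.Random(0))
print(result)
```

For biclustering, `improve` would correspond to applying the move operators and `perturb` to randomly changing part of the rows and columns of the best recorded bicluster.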

The ASR evaluation function

Many functions exist for bicluster evaluation. One of the most popular evaluation functions is the

In this paper, we use the Average Spearman's Rho (ASR) function which avoids the drawback of MSR

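where ρ_{ij} is the Spearman's rank correlation between rows i and j of the bicluster and ρ_{kl} is the Spearman's rank correlation between columns k and l.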

A high (resp. low) ASR value, close to 1 (resp. close to -1), indicates that the genes/conditions of the bicluster are strongly (resp. weakly) correlated.
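To make the evaluation concrete, here is a small sketch of an ASR-style score. It assumes the definition commonly given for ASR in the biclustering literature, namely the larger of the mean pairwise Spearman correlation over the rows and the mean pairwise Spearman correlation over the columns of the bicluster; that definition, the helper names, and the choice to score a constant vector as 0 are assumptions of this example rather than statements taken from the text.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = mean_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: rho is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def mean_pairwise_rho(vectors):
    """Average Spearman's rho over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


def asr(bicluster):
    """Assumed ASR: max of the mean row-wise and mean column-wise rho."""
    rows = [list(r) for r in bicluster]
    cols = [list(c) for c in zip(*rows)]
    return max(mean_pairwise_rho(rows), mean_pairwise_rho(cols))


coherent = [[1, 2, 3], [2, 4, 6], [10, 20, 30]]
opposed = [[1, 2, 3], [3, 2, 1]]
print(asr(coherent), asr(opposed))
```

The first toy bicluster contains genes with different magnitudes but the same ordering of conditions, so it receives the maximal score, whereas the second contains two genes with opposite trends and scores below zero.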

Let us notice that the existing evaluation functions can roughly be classified into two families:

Configuration representation

PDNS uses a solution representation based on the behavior matrix

Initial solution

Our algorithm needs an initial bicluster to start its search. The initial bicluster can be provided by any means. For instance, it can be generated randomly, at the risk of starting with an initial solution of poor quality. A more interesting strategy is to employ a fast greedy algorithm to rapidly obtain a bicluster of reasonable quality. We use this strategy in this work and adopt two well-known algorithms: one is presented by Cheng and Church

Neighborhood and its exploration

The neighborhood is one of the most critical elements of any local search algorithm. The neighborhood can be defined by a move operator. Given a solution

In this paper, we devise two specially designed move operators operating respectively on rows (genes) and columns (combinations of pairwise conditions) of a given solution. Both operators are based on the general drop/add operation which removes some elements and adds new elements in the given solution. The critical issue here is the criterion that is employed to determine the elements to be removed and added. In our case, this decision is based on the "behavior pattern".
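The drop/add idea can be illustrated for rows with the sketch below. The concrete rules are assumptions made for this example only: the pattern is taken as the most frequent behavior value in each column of the current bicluster, the quality of a gene is the fraction of columns in which it agrees with that pattern, and a gene whose quality falls below a threshold `alpha` is swapped for the best-agreeing gene outside the bicluster. The function names and the data are hypothetical.

```python
from collections import Counter


def pattern_of(behavior, genes, columns):
    """Most frequent behavior value per selected column (assumed pattern)."""
    return [Counter(behavior[g][c] for g in genes).most_common(1)[0][0]
            for c in columns]


def agreement(behavior, gene, columns, pattern):
    """Fraction of selected columns where the gene matches the pattern."""
    hits = sum(behavior[gene][c] == p for c, p in zip(columns, pattern))
    return hits / len(columns)


def row_move(behavior, genes, columns, alpha):
    """Drop the worst gene below alpha and add the best outside gene.

    Returns the (sorted) new gene list, or the old one if no swap helps.
    """
    pattern = pattern_of(behavior, genes, columns)
    quality = {g: agreement(behavior, g, columns, pattern)
               for g in range(len(behavior))}
    bad = [g for g in genes if quality[g] < alpha]
    outside = [g for g in range(len(behavior)) if g not in genes]
    if not bad or not outside:
        return sorted(genes)
    worst = min(bad, key=quality.get)
    best = max(outside, key=quality.get)
    if quality[best] <= quality[worst]:
        return sorted(genes)
    return sorted((set(genes) - {worst}) | {best})


# Toy behavior matrix: 4 genes, 3 combined-condition columns.
behavior = [
    [1, 1, -1],   # gene 0
    [1, 1, -1],   # gene 1
    [-1, 0, 1],   # gene 2: disagrees with the dominant pattern
    [1, 1, -1],   # gene 3: outside the bicluster, matches the pattern
]
new_genes = row_move(behavior, genes=[0, 1, 2], columns=[0, 1, 2], alpha=0.7)
print(new_genes)
```

A column operator could follow the same drop/add scheme with the roles of rows and columns exchanged.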

Our first move operator, denoted by mv_{g}

**Row move operator mv_{g}**. A bad gene (g

Now for each gene _{i}_{i }_{g }

Figure _{4}_{10}_{4 }_{10 }

Our second move operator, denoted by mv_{c}

Then, when our second move operator mv_{c}

**Columns move operator mv_{c}**. Column c

For a given solution, our PDNS algorithm applies these two move operators to reach a local optimum.

Results and discussion

Experimental protocol

We perform statistical and biological validations of the obtained biclusters and we evaluate our PDNS algorithm against the results of some prominent biclustering algorithms used by the community, namely, CC

For the experiments, we empirically fix α,

Datasets and results

Saccharomyces Cerevisiae dataset

The Saccharomyces Cerevisiae dataset (available at

The results of PDNS are compared against the reported scores of RMSBE, Bimax, OPSM, ISA, Samba and CC from

Figure

**Proportions of biclusters significantly enriched by GO on Saccharomyces Cerevisiae dataset**.

Yeast Cell-Cycle dataset

The Yeast Cell-Cycle dataset (available at

For this dataset, two criteria are used. First, we evaluate the statistical relevance of the extracted biclusters by computing the adjusted

Statistical relevance

To evaluate the statistical relevance of PDNS, we use again the

**Proportions of biclusters significantly enriched by GO on Yeast Cell-Cycle dataset**.

Analysis of biological annotation enrichment of biclusters

To evaluate the biological significance of the obtained biclusters in terms of the associated biological processes, molecular functions and cellular components respectively, we use the Gene Ontology (GO) term finder

Table _{CC}_{CC}_{OPSM}_{OPSM}_{PDNS}_{PDNS}_{PDNS}_{PDNS}

Most significant shared GO terms (process, function, component) of CC and PDNS for biclusters on Yeast Cell-Cycle dataset

| **Bic**. | **Algo**. | **Biological process** | **Molecular function** | **Cellular component** |
|---|---|---|---|---|
| _{CC} | CC | unknown | unknown | unknown |
| _{PDNS} | PDNS | glutamate biosynthetic process (10.2%, 8.62e-08) | isocitrate dehydrogenase (NAD+) activity (18.6%, 0.00300) | mitochondrion part (48.3%, 5.19e-07) |
| _{CC} | CC | translation (46.6%, 1.72e-22) | structural constituent of ribosome (38.8%, 1.05e-36) | cytosolic ribosome (38.8%, 1.10e-41) |
| _{PDNS} | PDNS | translation (58.1%, 8.71e-37) | structural constituent of ribosome (51.3%, 4.48e-59) | cytosolic ribosome (53.00%, 5.97e-70) |

Most significant shared GO terms (process, function, component) of OPSM and PDNS for biclusters on Yeast Cell-Cycle dataset

| **Bic**. | **Algo**. | **Biological process** | **Molecular function** | **Cellular component** |
|---|---|---|---|---|
| _{OPSM} | OPSM | unknown | unknown | unknown |
| _{PDNS} | PDNS | ribosome biogenesis (32.1%, 2.02e-07) | snoRNA binding (5.3%, 5.84e-06) | nucleolus (32.1%, 6.22e-10) |
| _{OPSM} | OPSM | sister chromatid segregation (24.7%, 0.00337) | unknown | spindle (14.1%, 0.00196) |
| _{PDNS} | PDNS | nucleic acid metabolic process (34.0%, 2.45e-11) | phosphatase regulator activity (1.7%, 0.00041) | nucleus (44.8%, 3.46e-15) |

For the bicluster labeled _{PDNS }

For the worst (resp. the best) biclusters obtained from CC, i.e, _{CC }_{CC}_{OPSM }_{OPSM}_{CC }_{OPSM}_{PDNS }_{PDNS}_{CC }_{OPSM}_{PDNS }_{PDNS}

Conclusions

We have presented the pattern-driven neighborhood search (PDNS) for the biclustering problem of microarray data. PDNS alternates between a descent-based intensification phase and a perturbation phase. By using a behavior matrix representation of solutions, the descent search procedure is guided by a pattern-based neighborhood which is defined by two move operators. These operators change the rows and the columns of the current solution, respectively, according to the pattern information related to each row and each column of the current solution as well as the initial matrix. Perturbation is realized by randomly changing a percentage of the rows and columns of the best recorded solution (an option would be to constrain the changes to some critical rows and columns).

The proposed algorithm has been assessed using two well-known microarray datasets (Yeast Cell-Cycle and Saccharomyces Cerevisiae). The experimental study showed competitive results of PDNS in comparison with other popular biclustering algorithms by providing statistically and biologically significant biclusters. PDNS is a computationally effective method and can also be used to improve biclusters obtained by other methods.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

WA carried out the implementation of the proposed idea, performed the statistical and biological experiments using

Acknowledgements

This work was partially supported by the projects 'Bioinformatique Ligérienne - BIL' (2009-2011, Pays de La Loire, France) and Radapop (2009-2013, Pays de La Loire, France). We thank the reviewers of the paper for their comments and suggestions.

This article has been published as part of