School of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha 410004, China

State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China

Beijing Key Laboratory of Network Technology, Beihang University, Beijing 100191, China

Department of Computer Science, Central China Normal University, Wuhan 430079, China

College of Information Science, Drexel University, Philadelphia, PA 19104, USA

School of Information Science and Technology, Hunan Agricultural University, Changsha 410128, China

Library, Central South University of Forestry and Technology, Changsha 410004, China

Abstract

Background

Multi-objective optimization (MOO) deals with optimization problems that have multiple objectives. Generally, those objectives are used to estimate very different aspects of the solutions, and these aspects are often in conflict with each other. MOO first obtains a Pareto set, and then looks for both commonality and systematic variations across the set. For large-scale datasets, heuristic search algorithms such as evolutionary algorithms (EA) combined with MOO techniques are well suited. DNA microarray technology can measure the transcriptional response of a complete genome to different experimental conditions and yields many large-scale datasets. Biclustering can simultaneously cluster rows and columns of a dataset, and helps to extract more accurate information from those datasets. Biclustering needs to optimize several conflicting objectives, and can be solved with MOO methods. As a heuristics-based optimization approach, particle swarm optimization (PSO) simulates the movements of a bird flock searching for food. The shuffled frog-leaping algorithm (SFL) is a population-based cooperative search metaphor combining the benefits of the local search of PSO and the global shuffling of information of the complex evolution technique. SFL is used to solve optimization problems on large-scale datasets.

Results

This paper integrates a dynamic population strategy and the shuffled frog-leaping algorithm into biclustering of microarray data, and proposes a novel multi-objective dynamic population shuffled frog-leaping biclustering (MODPSFLB) algorithm to mine maximum biclusters from microarray data. Experimental results show that the proposed MODPSFLB algorithm can effectively find significant biological structures in terms of related biological processes, components and molecular functions.

Conclusions

The proposed MODPSFLB algorithm shows good diversity and fast convergence of Pareto solutions, and may become a powerful tool for systematic functional analysis in genome research.

Background

With the rapid development of DNA microarray technology, simultaneously measuring the expression levels of thousands of genes in a single experiment yields large-scale datasets. The analysis of microarray data mainly covers the study of gene expression under different environmental stress conditions and the comparison of gene expression profiles for tumors from cancer patients. A subset of genes showing correlated co-expression patterns across a subset of conditions is expected to be functionally related. By comparing gene expression in normal and diseased cells, microarray datasets may be used to identify disease genes and targets for therapeutic drugs. Therefore, mining patterns from microarray datasets becomes more and more important. These patterns relate to disease diagnosis, drug discovery, protein network analysis, gene regulation, and function prediction.

For microarray data analysis, clustering is a popular technique for mining significant biological models. Clustering can identify sets of genes with similar profiles. However, traditional clustering approaches such as k-means

In the recent three decades, inspired by biology, heuristic optimization has become a very popular research topic. In order to escape from local minima, many evolutionary algorithms (EA) are used to find global optimal solutions from gene expression data

However, when mining biclusters from microarray data, we must simultaneously optimize several objectives that conflict with each other, for example, the size and the homogeneity of the clusters. In this case, multi-objective evolutionary algorithms (MOEA) are proposed to efficiently discover global optimal solutions. Among the many MOEA proposed, relaxed forms of Pareto dominance have become a popular mechanism to regulate the convergence of an MOEA, to encourage more exploration and to provide more diversity. Among these mechanisms, ϵ-dominance has become increasingly popular

Recently particle swarm optimization (PSO) proposed by Kennedy and Eberhart

The most attractive feature of PSO is that there are very few parameters to adjust, so it has been successfully used for both continuous nonlinear and discrete binary single-objective optimization.

The rapid convergence and relative simplicity of PSO make it well suited to multi-objective optimization, an approach named multi-objective PSO (MOPSO). In recent years many MOPSO approaches

Most MOPs use a fixed population size to find non-dominated solutions for obtaining the Pareto front. Population size influences these population-based meta-heuristic algorithms mainly through computational cost. Hence dynamically adjusting the population size needs to balance computational cost against algorithm performance. Some methods using a dynamic population size have been proposed. Tan

In recent years, Eusuff

To the best of our knowledge, there is no published work dealing with the biclustering of microarray data by using SFLA. Thus, in this paper we present an effective SFLA biclustering algorithm for mining the maximum biclusters with allowable dissimilarity within the biclusters, and with a greater row variance. Computational experiments and comparisons show that the proposed SFLA outperforms three best performing algorithms proposed recently for solving the biclustering problem with the biclustering criterion.

Methods

Based on shuffled frog-leaping algorithm, crowding distance and ε-dominance strategy

Biclusters

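Given a gene expression data matrix D = G×C = {d_{ij}} (here i ∈ [1, n], j ∈ [1, m]), G is a set of n genes {g_{1}, g_{2},..., g_{n}}, and C is a set of m biological conditions {c_{1}, c_{2},..., c_{m}}. Entry d_{ij} is the expression level of gene g_{i} under condition c_{j}.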

**Definition 1 Bicluster**. Given a gene expression dataset D = G×C, if there is a submatrix B = g×c, where g⊂G, c⊂C, to satisfy certain homogeneity and minimal size of the cluster, we say that B is a bicluster.

**Definition 2 Maximal bicluster**. A bicluster B = g×c is maximal if there exists no other bicluster B' = g'×c' such that g ⊆ g' and c ⊆ c'.

**Definition 3 Dimension mean**. Given a bicluster B = g×c, with subset of genes g⊂G and subset of conditions c⊂C, d_{ij} is the value of gene g_{i} under condition c_{j} in B. We denote by d_{ic} the mean of the ith gene in B, and by d_{gj} the mean of the jth condition in B. We also denote by d_{gc} the mean of all entries in B. These values are defined as follows, where Size(g, c) = |g||c| represents the size of bicluster B.
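The equation bodies for the dimension means are not reproduced in this text. Assuming the usual row, column and overall averages over the entries d_{ij} of B (a reconstruction from the verbal definition above, presumably corresponding to Eqs.(1)-(3), and not a quotation), they can be written as:

```latex
d_{ic} = \frac{1}{|c|} \sum_{j \in c} d_{ij} \qquad (1)

d_{gj} = \frac{1}{|g|} \sum_{i \in g} d_{ij} \qquad (2)

d_{gc} = \frac{1}{\mathrm{Size}(g, c)} \sum_{i \in g,\; j \in c} d_{ij} \qquad (3)
```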

**Definition 4 Residue and mean square residue**. Given a bicluster B = g×c, to assess the difference between the actual value of an element d_{ij} and its expected value, we define r(d_{ij}) as the residue of d_{ij} in bicluster B in Eq.(4). The mean squared residue (MSR) of B is then defined as the sum of the squared residues divided by the size of B, which assesses the overall quality of bicluster B, in Eq.(5).
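The bodies of Eq.(4) and Eq.(5) are missing from this text. Assuming the widely used residue of Cheng and Church, in which the expected value of d_{ij} is built from the gene mean d_{ic}, the condition mean d_{gj} and the overall mean d_{gc}, a hedged reconstruction is:

```latex
r(d_{ij}) = d_{ij} - d_{ic} - d_{gj} + d_{gc} \qquad (4)

\mathrm{MSR}(g, c) = \frac{1}{|g|\,|c|} \sum_{i \in g,\; j \in c} r(d_{ij})^{2} \qquad (5)
```

Under this form a perfectly additive bicluster (each entry equal to a gene effect plus a condition effect) has MSR = 0.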

**Definition 5 Row variance**. Given a bicluster B = g×c, the variance of the ith gene in B is denoted by RVAR(i, c), and the overall gene-dimensional variance is defined as the sum of the variances of all genes, as follows.
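To make Definitions 3-5 concrete, the following is a minimal sketch in plain Python. The function names are hypothetical, the residue is assumed to take the standard form d_{ij} - d_{ic} - d_{gj} + d_{gc}, and normalising the overall row variance by |g||c| is an assumption (drop the final division to obtain a plain sum).

```python
def submatrix(data, genes, conds):
    """Extract the bicluster B = g x c from a list-of-lists matrix."""
    return [[data[i][j] for j in conds] for i in genes]


def mean_squared_residue(sub):
    """Average of squared residues d_ij - d_ic - d_gj + d_gc over B."""
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(row[j] for row in sub) / n_rows for j in range(n_cols)]
    all_mean = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(sub):
        for j, value in enumerate(row):
            residue = value - row_means[i] - col_means[j] + all_mean
            total += residue * residue
    return total / (n_rows * n_cols)


def row_variance(sub):
    """Squared deviation of each entry from its gene mean, averaged over B.

    Assumption: normalised by |g||c|; the text only says the per-gene
    variances are combined.
    """
    n_rows, n_cols = len(sub), len(sub[0])
    total = 0.0
    for row in sub:
        mean = sum(row) / n_cols
        total += sum((value - mean) ** 2 for value in row)
    return total / (n_rows * n_cols)
```

For example, the additive submatrix `[[1, 2, 3], [2, 3, 4]]` has zero mean squared residue but a non-zero row variance, which is the kind of bicluster the objectives favour.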

Our target is to mine good-quality biclusters of maximum size, with a mean squared residue (MSR) smaller than a user-defined threshold **δ **> 0, which represents the maximum allowable dissimilarity within the biclusters, and with a greater row variance. The problem is NP-complete, so the large majority of algorithms use heuristic approaches to attain near-optimal solutions.

Bicluster encoding

Each bicluster is encoded as an individual of the population. Each individual is represented by a binary string of fixed length


**An individual encoding a bicluster**. Figure 1 presents the individual encoding a bicluster with 2 genes and 3 conditions, and its size is 2 × 3 = 6.
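The encoding can be illustrated with a short sketch. The layout is an assumption, since the text only states that an individual is a fixed-length binary string: here the first `n_genes` bits flag the selected genes and the remaining bits flag the selected conditions. The function names are hypothetical.

```python
def decode(bits, n_genes):
    """Split a binary individual into selected gene and condition indices.

    Assumed layout: bits[:n_genes] mark genes, bits[n_genes:] mark conditions.
    """
    genes = [i for i, bit in enumerate(bits[:n_genes]) if bit == 1]
    conds = [j for j, bit in enumerate(bits[n_genes:]) if bit == 1]
    return genes, conds


def bicluster_size(bits, n_genes):
    """Size of the encoded bicluster: number of genes times conditions."""
    genes, conds = decode(bits, n_genes)
    return len(genes) * len(conds)


# An individual with 2 selected genes and 3 selected conditions.
individual = [1, 0, 1, 0, 0, 1, 1, 1, 0]
print(decode(individual, 4), bicluster_size(individual, 4))
```

As in Figure 1, an individual that selects 2 genes and 3 conditions encodes a bicluster of size 2 × 3 = 6.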

Fitness function

We aim to mine biclusters with low mean squared residue, high volume and high gene-dimensional variance; these three objectives conflict with each other and are used to model the multi-objective optimization problem. In this paper, we use the following three fitness functions

Here G and C are the total number of genes and conditions of the microarray dataset, respectively. Size(x), MSR(x) and RVAR(x) denote the size, mean squared residue and row variance of the bicluster encoded by the frog x, respectively. δ is the user-defined threshold for the maximum acceptable mean squared residue. Our algorithm minimizes these three fitness functions.
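The three fitness equations are not reproduced in this text. The sketch below assumes one form that is consistent with the description (all three values are minimized, and they are normalised by G, C and δ): f1 = G·C / Size(x), f2 = MSR(x) / δ and f3 = 1 / RVAR(x). These formulas and the function name are assumptions for illustration, not the authors' confirmed definitions.

```python
def fitness(size, msr, rvar, n_genes, n_conds, delta):
    """Return an assumed (f1, f2, f3) triple; smaller is better for each.

    f1 shrinks as the bicluster grows, f2 shrinks as homogeneity improves,
    f3 shrinks as the row variance grows.
    """
    if size <= 0 or rvar <= 0:
        # Degenerate bicluster: treat as the worst possible individual.
        return (float("inf"), float("inf"), float("inf"))
    f1 = (n_genes * n_conds) / size
    f2 = msr / delta
    f3 = 1.0 / rvar
    return (f1, f2, f3)


# Toy values: a 50-entry bicluster in a 10 x 10 matrix with delta = 200.
print(fitness(50, 100.0, 4.0, 10, 10, 200.0))
```

With this form, a bicluster satisfies the MSR threshold exactly when f2 ≤ 1.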

ϵ-dominance

Among the many MOEA proposed, the non-dominated solutions of each generation are kept in an external population that must be updated in each generation. The time needed for updating the population depends on the size of the external population, the population size and the number of objectives, and increases sharply as these three factors grow

**Definition 6 Dominance relation**. Let f, g ∈R^{m}. Then f is said to dominate g (denoted as f ≻ g), iff

**(i) **∀i ∈**{1,...., m}**: f_{i }≤ g_{i}

**(ii) **∃j ∈**{1,...., m}**: f_{j }< g_{j}
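Definition 6 translates directly into code. The sketch below treats objective vectors as tuples to be minimized, and also filters a set down to its non-dominated members; the function names are hypothetical.

```python
def dominates(f, g):
    """True if f dominates g: no worse in every objective, better in one."""
    no_worse = all(a <= b for a, b in zip(f, g))
    strictly_better = any(a < b for a, b in zip(f, g))
    return no_worse and strictly_better


def non_dominated(vectors):
    """Keep the vectors that no other vector in the set dominates."""
    return [g for g in vectors if not any(dominates(f, g) for f in vectors)]


points = [(1, 2), (2, 1), (2, 2), (3, 3)]
print(non_dominated(points))
```

The filter is quadratic in the number of vectors, which is one reason the cost of maintaining an external population grows quickly with its size.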

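**Definition 7 Pareto set**. Let F ⊆ R^{m} be a set of vectors. The Pareto set F* of F contains all vectors g ∈ F that are not dominated by any vector f ∈ F.

Vectors in F* are called Pareto vectors of F.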

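**Definition 8 ϵ-dominance**. Let f, g ∈ R^{m}. Then f is said to ϵ-dominate g for some ϵ > 0 (denoted as f ≻_{ϵ} g), iff for all i∈{1,...., m}

**Definition 9 ϵ-approximate Pareto set**. Let F ⊆ R^{m} be a set of vectors and ϵ > 0. Then a set F_{ϵ} is called an ϵ-approximate Pareto set of F if any vector g ∈ F is ϵ-dominated by at least one vector f ∈ F_{ϵ}.

The set of all ϵ-approximate Pareto sets of F is denoted as P_{ϵ}(F).

**Definition 10 ϵ-Pareto set**. Let F ⊆ R^{m} be a set of vectors and ϵ > 0. Then a set F*_{ϵ} ⊆ F is called an ϵ-Pareto set of F if

(i) F*_{ϵ} is an ϵ-approximate Pareto set of F, that is, F*_{ϵ} ∈ P_{ϵ}(F), and

(ii) F*_{ϵ} contains Pareto vectors of F only, that is, F*_{ϵ} ⊆ F*.

The set of all ϵ-Pareto sets of F is denoted as P*_{ϵ}(F).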
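The ϵ-dominance inequality itself is not recoverable from this text. The sketch below therefore assumes one common multiplicative form for minimization with positive objectives, (1 - ϵ)·f_i ≤ g_i for every i, and uses it to check whether a candidate archive ϵ-approximates a set, meaning every vector of the set is ϵ-dominated by some archive member. Both the inequality and the function names are assumptions.

```python
def eps_dominates(f, g, eps):
    """Assumed form: f eps-dominates g if (1 - eps) * f_i <= g_i for all i."""
    return all((1.0 - eps) * a <= b for a, b in zip(f, g))


def is_eps_approximate(archive, vectors, eps):
    """True if every vector is eps-dominated by at least one archive member."""
    return all(
        any(eps_dominates(f, g, eps) for f in archive)
        for g in vectors
    )


vectors = [(1.0, 1.0), (2.0, 2.0), (1.0, 1.05)]
archive = [(1.0, 1.05)]
print(is_eps_approximate(archive, vectors, 0.1))
```

With ϵ = 0.1 a single archive member can stand in for nearby vectors, which is how ϵ-dominance bounds the archive size; with ϵ = 0 the relaxation disappears and the same archive is no longer sufficient.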

Update of ϵ-Pareto set of the frog population

In order to guarantee the convergence and maintain diversity in the population at the same time, we implement updating of ϵ-Pareto set of the frog population during selection operation

Finding the global best solution

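In order to find the global best solution x_{g}, we use the Sigma method. A value σ_{j} is assigned to each frog x_{j} in the ϵ-Pareto set, and a value σ_{i} is calculated for each frog x_{i} in the population. The distance between σ_{i} and each σ_{j} is then computed, and the frog x_{k} in the ϵ-Pareto set whose σ_{k} is closest to σ_{i} is selected as the best local guide for the frog x_{i}. Therefore x_{g} = x_{k} is the best local guide for frog x_{i}.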
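The guide selection can be sketched as follows. The sigma formula is not given in this text; the code assumes the three-objective sigma vector commonly associated with the Sigma method, (f1² - f2², f2² - f3², f3² - f1²) / (f1² + f2² + f3²), and picks the archive member whose sigma vector is closest in Euclidean distance. Function names are hypothetical.

```python
import math


def sigma(f):
    """Assumed sigma vector for a three-objective fitness tuple."""
    s1, s2, s3 = (value * value for value in f)
    total = s1 + s2 + s3
    if total == 0:
        return (0.0, 0.0, 0.0)
    return ((s1 - s2) / total, (s2 - s3) / total, (s3 - s1) / total)


def best_guide(frog_fitness, archive_fitness):
    """Index of the archive member whose sigma is nearest to the frog's."""
    target = sigma(frog_fitness)
    return min(
        range(len(archive_fitness)),
        key=lambda k: math.dist(target, sigma(archive_fitness[k])),
    )


archive = [(1.0, 0.0, 0.0), (3.0, 3.0, 0.2), (0.0, 0.0, 1.0)]
print(best_guide((2.0, 2.0, 0.1), archive))
```

Because the sigma vector depends only on the ratios between objectives, a frog is guided toward the archive member lying in roughly the same direction from the origin of the objective space.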

Shuffled frog-leaping algorithm

SFL is a population-based cooperative search metaphor combining the benefits of the genetic-based memetic algorithm and the social behavior based on particle swarm optimization. The shuffled frog leaping algorithm is a meta-heuristic proposed by Eusuff. The population consists of P frogs {x_{1}, x_{2},..., x_{P}}, and the ith frog x_{i} is represented as x_{i} = (x_{i1}, x_{i2},..., x_{iN}).

SFL starts with the whole population partitioned into a number of parallel subsets referred to as memeplexes. Then each memeplex is considered as a different culture of frogs and permitted to evolve independently to search the space. Within each memeplex, the individual frogs hold their own ideas, which can be affected by the ideas of other frogs, and experience a memetic evolution. During the evolution, the frogs may change their memes by using the information from the memeplex best

**Step1**. For _{bit }_{i}:

where _{i }_{i }_{1}, c_{2 }and _{3 }_{1 }_{2 }_{3 }_{1}, r_{2 }_{3 }_{1 }_{2 }_{1 }reflects the influence of the global best position on the worst frog and _{2 }_{1 }_{2 }_{1 }_{2 }_{1 }_{2 }

The position of the frog is determined using Eq.(15):

where

If this process produces a better solution, it replaces the worst frog; otherwise go to the next step.

**Step2**. A mutation operator is applied on the position of the worst frog. In the case of improvement, the resulted position is accepted; otherwise go to the next step.

**Step3**. A crossover operator is applied between the worst frog of the memeplex and the globally best position. The worst frog is replaced if its fitness is improved; otherwise go to the next step.

**Step4**. The worst frog is replaced randomly.

That is, if none of the preceding steps yields an improvement, x(w) is replaced by a randomly generated solution within the entire feasible space.
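The fallback order of Steps 1-4 can be sketched as a cascade that stops at the first operator producing a better frog. This is a simplified illustration under stated assumptions: frogs are binary lists with at least two bits, fitness is a single number to minimize (the paper uses three objectives), the Step 1 move is passed in as an optional hypothetical `move` function because its update equation is not reproduced here, and the mutation and crossover operators are simple stand-ins.

```python
import random


def mutate(bits, rng):
    """Flip one randomly chosen bit (stand-in mutation operator)."""
    pos = rng.randrange(len(bits))
    return bits[:pos] + [1 - bits[pos]] + bits[pos + 1:]


def crossover(worst, best, rng):
    """One-point crossover: head of the worst frog, tail of the best."""
    cut = rng.randrange(1, len(worst))
    return worst[:cut] + best[cut:]


def improve_worst(worst, best, fitness, rng, move=None):
    """Apply Steps 1-4 in order and return the replacement for `worst`."""
    base = fitness(worst)
    steps = []
    if move is not None:
        steps.append(lambda: move(worst, best))        # Step 1: guided move
    steps.append(lambda: mutate(worst, rng))           # Step 2: mutation
    steps.append(lambda: crossover(worst, best, rng))  # Step 3: crossover
    for step in steps:
        candidate = step()
        if fitness(candidate) < base:
            return candidate
    # Step 4: nothing helped, so draw a random frog from the feasible space.
    return [rng.randint(0, 1) for _ in worst]


rng = random.Random(0)
print(improve_worst([1, 1, 1, 1], [0, 0, 0, 0], sum, rng))
```

In a multi-objective setting the comparison `fitness(candidate) < base` would be replaced by a dominance test.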

After a predefined number of memetic evolution steps, the frogs in memeplexes are submitted to a shuffling process, where all the memeplexes are combined into a whole population and then the population is again divided into several new memeplexes. The memetic local search and shuffling process are repeated until a given termination condition is reached.
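The partition and shuffle cycle can be sketched as below. The dealing rule is an assumption taken from the standard shuffled frog-leaping scheme (frogs are ranked by fitness and dealt round-robin, so each memeplex receives a mix of good and poor frogs); the text here does not state the rule. Fitness is a single number to minimize for simplicity, and the function names are hypothetical.

```python
def partition(frogs, fitness, n_memeplexes):
    """Rank frogs by fitness and deal them round-robin into memeplexes."""
    ranked = sorted(frogs, key=fitness)
    return [ranked[k::n_memeplexes] for k in range(n_memeplexes)]


def shuffle(memeplexes):
    """Merge the evolved memeplexes back into a single population."""
    return [frog for memeplex in memeplexes for frog in memeplex]


frogs = [5, 3, 1, 4, 0, 2]          # toy frogs whose value is their fitness
groups = partition(frogs, lambda x: x, 2)
print(groups)
print(shuffle(groups))
```

With 60 frogs and 6 memeplexes, this rule yields 6 memeplexes of 10 frogs each; after shuffling, the next call to `partition` re-ranks the merged population, so information spreads between memeplexes.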

Once the predefined number of improvement cycles is reached, the memeplexes are shuffled; if the stopping criteria are not met, the algorithm is repeated.

Accordingly, the main parameters of DSFL are: number of frogs P, number of memeplexes m, number of processing cycles on each memeplex before shuffling, number of shuffling iterations (or function evaluations), number of bits for any variable, mutation rate, crossover type, the constriction factor, acceleration coefficients and influence factors.

Based on preliminary experimental results, suitable values were found as follows: the number of frogs and the number of bits for each variable are 60 and 10, respectively, the number of processing cycles on each memeplex before shuffling is 10, and the number of memeplexes is 6. The values of the other parameters have been mentioned before. This paper additionally incorporates a dynamic population size.

Dynamic population strategy

Generally, multi-objective optimization focuses on two competing objectives: (1) to quickly converge to the true Pareto front and (2) to maintain the diversity of the solutions along the resulting Pareto front. Because maintaining diversity slows down convergence and may degrade the quality of the resulting Pareto front, these two objectives are in conflict with each other. In this paper, we dynamically adjust the population size to explore the search space while balancing these two competing objectives.

Initializing the population

The initial population is obtained by running a state-of-the-art MOEA (NSGA-II

Adding population size

The population adding strategy mainly consists in increasing the population size to ensure a sufficient number of individuals to contribute to the search process, and in placing those new individuals in unexplored areas to discover new possible solutions. Based on the strategies of dynamic population size

Decreasing population size

To prevent the excessive growth in population, a population decreasing strategy

MODPSFLB biclustering algorithm

We incorporate the dynamic population strategy into multi-objective shuffled frog leaping biclustering (MOSFLB)

**Algorithm 1**: MODPSFLB Algorithm

**Input**: microarray data, minimal

**Output**: the best solutions, that is, the found biclusters

**Begin**

Initialize the frog population A according to the population initializing strategy

**While **not terminated **do**

Calculate fitness for each frog

Increase the size of population A according to the population adding strategy

Divide the population into several memeplexes

**For **each memeplex

Determine the best and worst frogs

Improve the worst frog position x(w) using Eq.(15)

**If **no improvement in this case **then**

x(w) is replaced by a randomly generated frog within the entire feasible space

**End if**

**End for**

Combine the evolved memeplexes

Select the best frogs using Sigma method and ϵ-dominance

Decrease the size of population A according to the population decreasing strategy

**End while**

**Return **

**End**

The MODPSFLB algorithm iteratively updates the frog population until the maximum number of generations is reached, converging towards the optimal solution set.

Results

Mitra and Banka applied MOEA to solve biclustering problem and proposed MOE Biclustering (MOEB)

Datasets and data preprocessing

The first dataset is the yeast Saccharomyces cerevisiae cell cycle expression data

The yeast dataset collects the expression levels of 2,884 genes under 17 conditions. All entries are integers lying in the range of 0-600. The yeast dataset contains 34 missing values, which are replaced by random numbers between 0 and 800

The human B-cells expression dataset is a collection of 4,026 genes and 96 conditions, with 12.3% missing values and integer entries lying in the range of -750 to 650. The missing values are replaced by random numbers between -800 and 800 [5]. However, those random values affect the discovery of biclusters

Experiments

The MODPSFLB algorithm is implemented in Java and run on a 1.7 GHz Pentium 4 PC with 512 MB of RAM running Windows XP. To evaluate its performance, the proposed algorithm is compared to MOEB, MOPSOB

Yeast dataset

In Table

Information of biclusters found on yeast dataset

| Bicluster | Genes | Conditions | Residue | Row variance |
|---|---|---|---|---|
| 1 | 101 | 15 | 215.62 | 749.17 |
| 6 | 514 | 10 | 289.65 | 955.25 |
| 14 | 858 | 10 | 322.58 | 702.36 |
| 22 | 478 | 11 | 298.68 | 885.64 |
| 31 | 123 | 12 | 201.88 | 699.87 |
| 36 | 801 | 8 | 221.88 | 687.18 |
| 44 | 1125 | 13 | 236.47 | 598.68 |
| 56 | 847 | 11 | 208.48 | 748.54 |
| 75 | 546 | 9 | 250.14 | 664.13 |
| 89 | 89 | 17 | 210.88 | 666.57 |

Table 1 shows the number of genes and conditions, the mean squared residue and the row variance of ten biclusters out of the one hundred biclusters found on the yeast dataset.

Figure


**Small biclusters of size 24 × 17 on the yeast dataset**. Figure 2 shows the expression value of 24 genes under 17 conditions from the small biclusters (bicluster 63).

Human B-cells expression dataset

Table

Biclusters found on human dataset

| Bicluster | Genes | Conditions | Residue | Row variance |
|---|---|---|---|---|
| 1 | 882 | 34 | 987.54 | 3587.26 |
| 4 | 666 | 54 | 1087.25 | 4201.36 |
| 11 | 1024 | 36 | 773.69 | 2930.64 |
| 17 | 1102 | 39 | 1204.65 | 3698.84 |
| 24 | 968 | 37 | 1110.25 | 3548.45 |
| 35 | 805 | 41 | 844.44 | 2987.01 |
| 39 | 871 | 48 | 2874.17 | 2140.36 |
| 44 | 1208 | 29 | 885.74 | 3587.45 |
| 59 | 258 | 86 | 777.58 | 2874.94 |
| 88 | 1508 | 59 | 1405 | 6658.45 |

Table 2 shows the number of genes and conditions, the mean squared residue and the row variance of ten biclusters out of the one hundred biclusters found on the human dataset.

Comparative analysis

We compare the proposed MODPSFLB algorithm with the MOPSOB, MOSFLB and DMOPSOB algorithms on the yeast dataset and the human dataset, and the results are shown in Table

Comparative study of the four algorithms

| | MOPSOB | MOPSOB | MOSFLB | MOSFLB | DMOPSOB | DMOPSOB | MODPSFLB | MODPSFLB |
|---|---|---|---|---|---|---|---|---|
| **Dataset** | **Yeast** | **Human** | **Yeast** | **Human** | **Yeast** | **Human** | **Yeast** | **Human** |
| Avg. MSR | 218.54 | 927.47 | 215.98 | 913.53 | 216.13 | 905.23 | 212.8 | 904.9 |
| Avg. size | 10510.8 | 34012.24 | 1109.23 | 35507.22 | 11213.5 | 35442.98 | 11220.7 | 35601.8 |
| Avg. genes | 1102.84 | 902.41 | 1148.21 | 928.12 | 1151.25 | 932.57 | 1154.21 | 933.9 |
| Avg. conditions | 9.31 | 40.12 | 9.78 | 43.11 | 9.59 | 42.78 | 9.81 | 43029 |
| Max size | 15613 | 37666 | 15709 | 37871 | 14770 | 37231 | 14827 | 37486 |
| Avg. time | 120.78 | 328.56 | 111.41 | 319.88 | 100.47 | 310.34 | 88.24 | 287.98 |

Table 3 compares the performance of the four algorithms. It gives the average mean squared residue and the average size of the found biclusters, together with the computation cost of each algorithm.

From Table

As for the computation cost, Table

Overall, the above results indicate that the proposed MODPSFLB algorithm performs best in maintaining diversity and achieving convergence.

Biological analysis of biclusters

We determine the biological relevance of the biclusters found by MODPSFLB on the yeast dataset in terms of the statistically significant GO annotation database. The gene ontology (GO) project (

The degree of enrichment is measured by p-values, which use a cumulative hypergeometric distribution to compute the probability of observing the number of genes from a particular GO category (function, process and component) within each bicluster. For example, the probability

Where m is the total number of genes within a category and g is the total number of genes within the genome. The p-values are calculated for each functional category in each bicluster to denote how well those genes match with the corresponding GO category.
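The p-value formula itself is not reproduced in this text. A minimal sketch, assuming the usual upper tail of the hypergeometric distribution (the probability of seeing at least the observed number of category genes in a bicluster of the given size), is shown below. The symbols m and g follow the text; `hits` (category genes found in the bicluster, the n reported per GO term in Table 4) and `cluster_size` are hypothetical parameter names.

```python
from math import comb


def enrichment_p_value(hits, cluster_size, m, g):
    """P(X >= hits) when drawing cluster_size genes without replacement.

    m: total number of genes in the GO category.
    g: total number of genes in the genome.
    """
    upper = min(cluster_size, m)
    tail = sum(
        comb(m, i) * comb(g - m, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(g, cluster_size)


# Toy example: 10 genes in total, 4 in the category, bicluster of 3 genes,
# of which 2 belong to the category.
print(enrichment_p_value(2, 3, 4, 10))
```

A small p-value indicates that the overlap between the bicluster and the GO category is unlikely to arise by chance.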

Table 4 lists the significant shared GO terms for three biclusters. For cluster C_{1}, we find that the genes are mainly involved in oxidoreductase activity. The tuple (n = 13, p = 0.00051) means that out of 101 genes in cluster C_{1}, 13 genes belong to the oxidoreductase activity function, and the statistical significance is given by the p-value of 0.00051. These results indicate that the proposed MODPSFLB biclustering approach can find biologically meaningful clusters.

Significant GO terms of genes in three biclusters

| Cluster No. | No. of genes | Process | Function | Component |
|---|---|---|---|---|
| 1 | 101 | Lipid transport (n = 21, p = 0.00389) | Oxidoreductase activity (n = 13, p = 0.00051) | Membrane (n = 12, p = 0.0023) |
| 12 | 71 | Physiological process (n = 43, p = 0.0043) | MAP kinase activity (n = 7, p = 0.00126) | Cell (n = 32, p = 0.00194) |
| 33 | 58 | Protein biosynthesis (n = 27, p = 0.00216) | Structural constituent of ribosome (n = 17, p = 0.00132) | Cytosolic ribosome (n = 11, p = 0.00219) |

Table 4 lists the significant shared GO terms which are used to describe genes in each bicluster for the process, function and component ontology.

Conclusions

This paper proposes a novel multi-objective dynamic population shuffled frog-leaping biclustering framework for mining biclusters from microarray datasets. We focus on finding maximum biclusters with lower mean squared residue and higher row variance. These three objectives are incorporated into the framework as three fitness functions. We apply the following techniques: an SFL method to balance and control the search process, a population adding method to dynamically grow new individuals with enhanced exploration and exploitation capabilities, and a population decreasing strategy to balance and control the dynamic population size and, finally, to quicken convergence of the algorithm.

The comparative study of MODPSFLB and three state-of-the-art biclustering algorithms on the yeast microarray dataset and the human B-cells expression dataset indicates that MODPSFLB can effectively find significant localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns present a significant biological relevance in terms of related biological processes, components and molecular functions in a species-independent manner.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JL was primarily responsible for the design of MODPSFLB to mine biclusters from gene expression data and drafted the manuscript. ZL and XH were involved in study design and coordination and revised the manuscript. YC and FL conducted the algorithm design.

Acknowledgements

This article has been published as part of

This work was supported by the National Natural Science Foundation of China (60973105, 90718017, 61170189), the Research Fund for the Doctoral Program of Higher Education (20111102130003), the Fund of the State Key Laboratory of Software Development Environment (SKLSDE-2011ZX-03), the Scientific Research Fund of Hunan Provincial Education Department (09A105), the Talents Import Fund of Central South University of Forestry and Technology (104-0177) and the Fund of Hunan Provincial University Library and Information Commission (2011L058).