School of Computer and Information Engineering, Central South University of Forestry and Technology, Changsha 410004, China

State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China

Beijing Key Laboratory of Network Technology, Beihang University, Beijing 100191, China

College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA

School of Information Science and Technology, Hunan Agricultural University, Changsha 410128, China

CSI-CUNY in Staten Island, NY, USA

Abstract

Background

New microarray technologies yield large-scale datasets. These datasets are usually presented as 2D matrices, where rows represent genes and columns represent experimental conditions. Systematic analysis of those datasets provides an increasing amount of information, which is urgently needed in the post-genomic era. Biclustering, a technique developed to cluster the rows and columns of a dataset simultaneously, might be useful for extracting more accurate information from those datasets. Biclustering requires the optimization of two conflicting objectives (residue and volume), which calls for a multi-objective artificial immune system capable of performing a multi-population search. As a heuristic search technique, artificial immune systems (AISs) can be considered a new computational paradigm inspired by the immunological system of vertebrates and designed to solve a wide range of optimization problems. Because several conflicting objectives have to be optimized simultaneously during biclustering, a multi-objective optimization model is suitable for solving the biclustering problem.

Results

Based on a dynamic population, this paper proposes a novel dynamic multi-objective immune optimization biclustering (DMOIOB) algorithm to mine coherent patterns from microarray data. Experimental results on two common, public datasets of gene expression profiles show that our approach can effectively find significant localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns show significant biological relevance in terms of related biological processes, components and molecular functions in a species-independent manner.

Conclusions

The proposed DMOIOB algorithm is an efficient tool to analyze large microarray datasets. It achieves a good diversity and rapid convergence.

Background

Rapid development of DNA microarray technology makes it possible to study the transcriptional response of a complete genome to different experimental conditions. The rapid growth of microarray datasets provides unique opportunities to perform systematic functional analysis in genome research. A subset of genes showing correlated co-expression patterns across a subset of conditions is expected to be functionally related. One important research area in bioinformatics and clinical research is finding patterns that relate to disease diagnosis, drug discovery and function prediction.

Biclustering has been proposed to simultaneously group a gene set and a condition set over which the gene subset exhibits similar expression patterns. Cheng and Church

During the last three decades, inspired by biology, some heuristic approaches such as evolutionary algorithms

Recently, artificial immune systems have been introduced to deal with multi-objective optimization (MOO) problems. Jiao

Most MOPs use a fixed population size to find non-dominated solutions for obtaining the Pareto front. The computational cost is the greatest influence of population size on these population-based meta-heuristic algorithms. Hence, dynamically adjusting the population size needs to balance computational cost against algorithm performance. Some methods using a dynamic population size have been proposed. Tan

Methods

Based on the immune response principle and ε-dominance strategy

Biclusters

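Given a gene expression data matrix D=G×C={d_{ij}} (where i ∈ [1, n] and j ∈ [1, m]), G is a set of n genes {g_{1}, g_{2}, ⋯, g_{n}} and C is a set of m biological conditions {c_{1}, c_{2}, ⋯, c_{m}}. Entry d_{ij} is the expression level of gene g_{i} under condition c_{j}.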

Bicluster encoding

Each bicluster is encoded as an individual of the population. Each individual is represented by a binary string of fixed length
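The encoding can be illustrated with a minimal Python sketch. It assumes the common layout in which the first n bits flag genes and the remaining m bits flag conditions, so the string has length n+m; this layout and the helper name `decode_bicluster` are assumptions made for illustration, since the text above does not spell them out.

```python
def decode_bicluster(bits, n_genes, n_conditions):
    """Split a fixed-length binary string into gene and condition index lists.

    Assumption: bits[0:n_genes] flag genes, bits[n_genes:] flag conditions.
    """
    if len(bits) != n_genes + n_conditions:
        raise ValueError("binary string must have length n_genes + n_conditions")
    genes = [i for i in range(n_genes) if bits[i] == 1]
    conditions = [j for j in range(n_conditions) if bits[n_genes + j] == 1]
    return genes, conditions


# Toy individual for a 4-gene x 3-condition matrix.
individual = [1, 0, 1, 1, 0, 1, 1]
genes, conditions = decode_bicluster(individual, n_genes=4, n_conditions=3)
```

Here the individual selects genes 0, 2 and 3 and conditions 1 and 2 (zero-based indices).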

Fitness function

We aim to mine biclusters with low mean squared residue, high volume and high gene-dimensional variance. These three objectives conflict with each other and are therefore well suited to multi-objective modelling. To achieve these aims, this paper uses the same fitness functions as
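The three quantities behind these objectives can be computed as in the following sketch. It assumes the standard Cheng and Church definitions of mean squared residue and row variance and takes volume to be the number of cells in the bicluster; it does not reproduce the exact fitness functions of the cited work, and the function name `bicluster_scores` is hypothetical.

```python
def bicluster_scores(data, genes, conditions):
    """Return (mean squared residue, volume, row variance) of a bicluster.

    Assumption: Cheng-and-Church-style definitions, volume = rows * columns.
    """
    if not genes or not conditions:
        raise ValueError("a bicluster needs at least one gene and one condition")
    sub = [[data[i][j] for j in conditions] for i in genes]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    volume = n_rows * n_cols
    # Residue of a cell: value - row mean - column mean + overall mean.
    msr = sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
              for i in range(n_rows) for j in range(n_cols)) / volume
    # Row variance: mean squared deviation of each cell from its row mean.
    row_var = sum((sub[i][j] - row_means[i]) ** 2
                  for i in range(n_rows) for j in range(n_cols)) / volume
    return msr, volume, row_var


# A perfectly additive toy matrix: every row is a shifted copy of the first.
toy = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
msr, volume, row_var = bicluster_scores(toy, [0, 1, 2], [0, 1, 2])
```

On the additive toy matrix the residue vanishes while the row variance stays positive, which is the kind of non-trivial coherent pattern the objectives favour.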

Update of ϵ-Pareto set of the population

In order to guarantee convergence and maintain diversity in the population at the same time, we update the ϵ-Pareto set of the population during the clonal selection operation. A general scheme of the updating algorithm is given in
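One possible form of such an update is sketched below. It is not the scheme referenced above; it assumes minimisation, a single additive ϵ shared by all objectives, and a box-based archive in which at most one solution is kept per ϵ-box. The names `box`, `dominates` and `update_archive` are hypothetical.

```python
import math


def box(objs, eps):
    """Index of the epsilon-box that contains an objective vector."""
    return tuple(math.floor(f / eps) for f in objs)


def dominates(a, b):
    """True if a Pareto-dominates b (minimisation)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def update_archive(archive, candidate, eps):
    """Return a new epsilon-Pareto archive after offering `candidate`."""
    cbox = box(candidate, eps)
    kept = []
    for member in archive:
        mbox = box(member, eps)
        if dominates(mbox, cbox):
            return list(archive)  # candidate's box is dominated: reject it
        if mbox == cbox:
            # Same box: simplification, replace only if candidate dominates.
            if not dominates(candidate, member):
                return list(archive)
            continue  # drop the member; the candidate takes its place
        if dominates(cbox, mbox):
            continue  # member's box is dominated by the candidate's box
        kept.append(member)
    kept.append(candidate)
    return kept


archive = []
for point in [(1.0, 5.0), (5.0, 1.0), (1.2, 5.3), (0.5, 0.5), (3.0, 3.0)]:
    archive = update_archive(archive, point, eps=1.0)
```

Keeping one solution per box bounds the archive size and spreads the retained solutions, which is how ϵ-dominance supports both convergence and diversity.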

Immune response principle

An immune system is a collection of biological processes within an organism that protects against disease by identifying and killing pathogens and tumour cells. It can detect a wide variety of viruses and parasitic worms, and distinguish them from the organism's own healthy cells and tissues to protect the organism. It is highly distributed, highly adaptive and self-organizing in nature

The immune selection principle

DMOIO biclustering algorithm

Multi-objective optimization aims at the following two competing objectives: 1) to quickly obtain a non-dominated front that is close to the true Pareto front and 2) to maintain the diversity of the solutions along the resulting Pareto front. These two objectives conflict with each other because maintaining diversity slows down convergence and may degrade the quality of the resulting Pareto front. On one hand, MOIO algorithms tend toward the optimal regions. On the other hand, the clonal selection behaviour may lead to premature convergence in the search space and produce a uniformly distributed Pareto front. The main influence of population size on the performance of MOIO is the computational cost. It is difficult to resolve this conflict for a MOIO with a fixed population size because a predetermined computation resource has to be allocated and properly distributed between the two competing objectives. Hence, inspired by

Initial population

In most multi-objective optimization methods the initial archive is set to empty. The first archive contains the non-dominated solutions of the initial population. Each antigen selects its best local guide from the archive members using the Sigma method

Finding the global best solution

In order to find the global best solutions, this paper uses the basic idea of the Sigma method _{g} among the archive members for the antigen i of the population as follows. In the first step, we assign the value σ
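The guide selection can be sketched as follows for the two-objective case, where the Sigma value of a point (f1, f2) is (f1² − f2²)/(f1² + f2²) and each antigen is guided by the archive member whose Sigma value is closest to its own. Restricting the example to two objectives is an assumption made for brevity, and the names `sigma` and `select_guide` are hypothetical.

```python
def sigma(f1, f2):
    """Sigma value of a point in a two-objective space."""
    denom = f1 * f1 + f2 * f2
    if denom == 0:
        return 0.0  # convention for the origin, where sigma is undefined
    return (f1 * f1 - f2 * f2) / denom


def select_guide(antigen_objs, archive_objs):
    """Index of the archive member whose sigma is closest to the antigen's."""
    target = sigma(*antigen_objs)
    return min(range(len(archive_objs)),
               key=lambda j: abs(sigma(*archive_objs[j]) - target))


archive_objs = [(1.0, 4.0), (3.0, 3.0), (4.0, 1.0)]
guide_a = select_guide((5.0, 2.0), archive_objs)
guide_b = select_guide((2.0, 2.0), archive_objs)
```

Points with equal Sigma values lie on the same line through the origin, so each antigen is steered toward the archive member in roughly its own direction of the objective space.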

Population adding method

The population adding strategy mainly consists of increasing the population size to ensure that a sufficient number of individuals contribute to the search process, and placing those new individuals in unexplored areas to discover new possible solutions. Based on the strategies of dynamic population size

**Step 1:** Selecting candidate antibodies to be added

The non-dominated set, considered as the set of candidate antibodies, must have the highest probability of generating new antibodies that will improve the convergence toward the Pareto front. Therefore, ns potential antibodies are randomly selected from the non-dominated set, where ns = INT(r_{1} × (total number of antibodies in the non-dominated set)). Here r_{1}
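A minimal sketch of this selection step is given below. It assumes that r_{1} is a uniform random number in [0, 1) and that INT truncates toward zero; both are assumptions, as is the helper name `select_candidates`.

```python
import random


def select_candidates(non_dominated, rng):
    """Randomly pick ns = INT(r1 * |non-dominated set|) candidate antibodies.

    Assumption: r1 is drawn uniformly from [0, 1).
    """
    r1 = rng.random()
    ns = int(r1 * len(non_dominated))
    # sample() draws without replacement, so no antibody is picked twice.
    return rng.sample(non_dominated, ns)


rng = random.Random(0)  # fixed seed only to make the example repeatable
non_dominated = list(range(10))  # stand-ins for non-dominated antibodies
picked = select_candidates(non_dominated, rng)
```

Because ns depends on r_{1}, the number of candidates varies between iterations and can be zero when r_{1} is small.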

**Step 2:** Defining the number of mutations

The number of mutations of the selected antibody is determined adaptively at every iteration. Each selected antibody is responsible for generating a certain number of new antibodies. A probability value is used to determine the number of perturbations adaptively, with the number of mutations (the number of new antibodies to be generated) bounded by a minimum and a maximum number of mutations.

**Step 3:** Limiting the range of new antibodies

In the proposed algorithm, to balance the exploitation and exploration capabilities and to avoid generating too many new antibodies too far away from the selected antibodies, it is necessary to generate more new antibodies within the neighbourhood than outside it, which is similar to

Population decreasing method

To prevent excessive growth of the population, a population decreasing strategy which is similar to

DMOIOB algorithm

We propose a dynamic MOIO biclustering algorithm (DMOIOB) to mine biclusters from microarray datasets and attain the global optimum solutions. We incorporate the following four strategies: 1) ϵ-dominance to quicken convergence; 2) the Sigma method to find good local guides; 3) a population-growing strategy that increases the population size to promote exploration capability; and 4) a population-declining strategy to prevent the population size from growing excessively.

The pseudo-code of the proposed DMOIOB algorithm is given in Algorithm 2.

The DMOIOB algorithm iteratively updates the antigen population until a user-defined number of generations has been reached, and finally converges to the optimal solution.

Results

This paper applies the proposed DMOIOB algorithm to mine biclusters from two well-known datasets and compares its diversity and convergence with those of the MOIB algorithm. Lastly, the biological relevance of the biclusters found by DMOIOB is given.

Datasets and data preprocessing

The first dataset is the yeast Saccharomyces cerevisiae cell cycle expression data

The yeast dataset contains the expression levels of 2,884 genes under 17 conditions. All entries are integers in the range 0 to 600. The yeast dataset has 34 missing values, which are replaced by random numbers between 0 and 800

The human B-cells expression dataset is a collection of 4,026 genes and 96 conditions, with 12.3% missing values and integer entries in the range -750 to 650. The missing values are replaced by random numbers between -800 and 800
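The replacement of missing values can be sketched as follows. The sketch assumes that missing entries are encoded as `None` and that replacements are uniformly distributed integers with inclusive bounds; the text only states the ranges, so both points and the helper name `fill_missing` are assumptions.

```python
import random


def fill_missing(matrix, low, high, rng=None):
    """Return a copy of `matrix` with None entries replaced by random integers.

    Assumption: replacements are uniform integers in [low, high], inclusive.
    """
    if rng is None:
        rng = random.Random()
    return [[rng.randint(low, high) if value is None else value
             for value in row]
            for row in matrix]


# Toy matrix with two missing entries, filled with the yeast range.
toy = [[120, None, 310], [None, 45, 600]]
filled = fill_missing(toy, 0, 800, random.Random(1))
```

For the human B-cells data the same call would use the bounds -800 and 800.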

Testing

The DMOIOB algorithm is implemented in the Java programming language and run on a 1.7 GHz Pentium 4 PC with 512 MB of RAM running Windows XP. To evaluate its performance, the proposed algorithm is compared to MOIB

Yeast dataset

Table

Information of biclusters found on yeast dataset

| Bicluster | Genes | Conditions | Residue | Row Variance |
|---|---|---|---|---|
| 1 | 1 | 16 | 238.54 | 789.25 |
| 22 | 91 | 17 | 210.58 | 685.36 |
| 24 | 563 | 12 | 201.55 | 875.65 |
| 29 | 1233 | 9 | 275.69 | 896.35 |
| 78 | 145 | 13 | 225.11 | 745.65 |
| 98 | 874 | 11 | 207.98 | 874.01 |

Table

Figure: Small biclusters of size 26×15 on the yeast dataset

Human B-cells expression dataset

Table

Biclusters found on human dataset

| Bicluster | Genes | Conditions | Residue | Row Variance |
|---|---|---|---|---|
| 1 | 597 | 49 | 855.69 | 3584.54 |
| 3 | 611 | 45 | 911.58 | 2875.12 |
| 8 | 1024 | 31 | 887.54 | 3012.25 |
| 10 | 478 | 39 | 812.88 | 6854.54 |
| 22 | 874 | 29 | 874.96 | 8740.24 |
| 31 | 698 | 37 | 800.74 | 4870.91 |

Table

Comparative analysis

In this section, this paper compares the proposed algorithm with the MOIB algorithm on the yeast dataset and the human dataset, and the results are shown in Table

Comparative study of the two algorithms

| Algorithm | Dataset | Avg. MSR | Avg. size | Avg. time (s) |
|---|---|---|---|---|
| DMOIOB | Yeast | 201.86 | 2841.08 | 88.02 |
| DMOIOB | Human | 832.79 | 7106.51 | 258.48 |
| MOIB | Yeast | 202.32 | 2638.74 | 108.12 |
| MOIB | Human | 839.74 | 6918.29 | 280.76 |

Table

From Table

For computation cost, the computation time of DMOIOB is 88.02 s on the yeast dataset and 258.48 s on the human dataset, both lower than those of MOIB (108.12 s and 280.76 s).

Overall, the above results indicate that the proposed DMOIOB algorithm performs better than MOIB in maintaining diversity and achieving convergence.

Biological analysis of biclusters

We assess the biological relevance of the biclusters found by DMOIOB on the yeast dataset using statistically significant annotations from the GO database. The gene ontology (GO) project (

The degree of enrichment is measured by p-values, which use a cumulative hypergeometric distribution to compute the probability of observing the number of genes from a particular GO category (function, process and component) within each bicluster. The p-values are calculated for each functional category in each bicluster to denote how well those genes match the corresponding GO category given in Table
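The p-value computation can be illustrated with a short sketch. It assumes the usual upper-tail form of the cumulative hypergeometric test, that is, the probability of seeing at least the observed number of category genes in a bicluster of the given size; the function name `enrichment_p_value` and its parameter names are hypothetical, and the numbers in the example are toy values, not data from this study.

```python
from math import comb


def enrichment_p_value(total_genes, category_genes, cluster_size, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    total_genes:    number of genes in the whole dataset
    category_genes: genes in the dataset annotated with the GO category
    cluster_size:   number of genes in the bicluster
    hits:           genes in the bicluster annotated with the GO category
    """
    upper = min(category_genes, cluster_size)
    tail = sum(comb(category_genes, i)
               * comb(total_genes - category_genes, cluster_size - i)
               for i in range(hits, upper + 1))
    return tail / comb(total_genes, cluster_size)


# Toy example: 10 genes overall, 4 in the category, bicluster of 3 with 2 hits.
p = enrichment_p_value(total_genes=10, category_genes=4, cluster_size=3, hits=2)
```

A small p-value means that the overlap between the bicluster and the GO category is unlikely to arise by chance.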

Significant GO terms of genes in three biclusters

| Cluster No. | No. of genes | Process | Function | Component |
|---|---|---|---|---|
| 1 | 99 | Response to DNA damage stimulus (n=21, p=0.0016) | RNA polymerase II transcription factor activity (n=11, p=0.0064) | Intracellular membrane-bound organelle (n=16, p=0.0025) |
| 22 | 91 | Physiological process (n=23, p=0.0014) | MAP kinase activity (n=6, p=0.0023) | Cytosolic ribosome (n=17, p=0.0042) |
| 78 | 145 | Protein biosynthesis (n=52, p=0.0024) | Protein transporter activity (n=9, p=0.0021) | Cytosolic ribosome (n=12, p=0.0032) |

Table

Conclusions

This paper has provided a novel dynamic multi-objective immune optimization biclustering framework for mining biclusters from microarray datasets. We focus on finding maximum biclusters with lower mean squared residue and higher row variance. Those three objectives are incorporated into the framework with three fitness functions. We apply the immune clonal selection principle and the Sigma method to find better local guides in objective space, and combine ε-dominance and a crowding distance strategy to improve the diversity of the solutions and to quicken convergence of the algorithm. We also use a population adding method that dynamically grows new individuals with enhanced exploration and exploitation capabilities, and a population decreasing strategy to balance and control the dynamic population size. The results on the yeast microarray dataset and the human B-cells expression dataset verify the good quality of the found biclusters, and comparative analysis shows that the proposed DMOIOB is superior to the MOIB algorithm in terms of the diversity of solutions and the convergence of the algorithm.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JL designed DMOIOB to mine biclusters from gene expression data and drafted the manuscript. ZL, XH and EP were involved in study design and coordination and revised the manuscript. YC conducted the algorithm design.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (60973105), the fund of the State Key Laboratory of Software Development Environment, Scientific Research Fund of Hunan Provincial Education Department (09A105), Talents Import Fund of Central South University of Forestry and Technology (104-0177), NSF CCF 0905291 and NSF CCF 1049864

This article has been published as part of