College of Management Science, Chengdu University of Technology, Chengdu 610059, China

Group of Gene Computation, College of Mathematics and Software Science, Sichuan Normal University, Chengdu 610066, China

Department of Computer Science, Sam Houston State University, Huntsville, TX 7734, USA

Department of Epidemiology and Biostatistics, School of Public Health, Indiana University Bloomington, 1025 E. 7th Street, Bloomington, IN 47405-7109, USA

Harvard NeuroDiscovery Center and Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, USA

Neurochemistry Laboratory, Department of Psychiatry, Massachusetts General Hospital and Harvard Medical School, Charlestown, MA 02129, USA

Cancer Bioinformatics, Rush University Cancer Center, and Department of Internal Medicine, Rush University Medical Center, Chicago, IL 60612, USA

Abstract

Background

Computational genomics of Alzheimer's disease (AD), the most common form of senile dementia, is a nascent field in AD research. The field includes AD gene clustering by computing gene order, which generates higher-quality gene clustering patterns than most other clustering methods. However, few gene order computing methods are available; they include the Genetic Algorithm (GA) and Ant Colony Optimization (ACO). Further, their performance in gene order computation using AD microarray data is not known. We thus set out to evaluate the performance of current gene order computing methods with different distance formulas, and to identify additional features associated with gene order computation.

Methods

Using different distance formulas (Pearson distance, Euclidean distance, and squared Euclidean distance) and other conditions, gene orders were calculated by the ACO and GA (including standard GA and improved GA) methods. The quality of the resulting gene orders was compared, and new features of the calculated gene orders were identified.

Results

Compared to the GA methods tested in this study, ACO fits the AD microarray data best when calculating gene order. In addition, the following features were revealed: different distance formulas generated gene orders of different quality, and the commonly used Pearson distance was not the best distance formula for AD microarray data with either the GA or the ACO methods.

Conclusion

Compared with Pearson distance and Euclidean distance, the squared Euclidean distance generated the best quality gene order computed by GA and ACO methods.

Background

A brief introduction of Alzheimer's disease

As the most common form of age-related dementia, Alzheimer's disease (AD) affects 5.4 million Americans, and at least $183 billion will be spent in 2011 on the care of patients with AD and other dementias. The problem is worsening as life expectancy continues to increase. By 2050, the projected number of AD patients could range from 11 to 16 million people in the United States alone if no cure or preventive measure for AD is found. Hence, AD has quickly become a pandemic and exacted a huge socioeconomic toll.

AD is named after Dr Alois Alzheimer, who first investigated the disease.

Frangione et al. reported on the sequencing of exons 16 and 17 of amyloid precursor protein (APP) to reveal the first pathogenic mutation in APP.

Currently, the main proposed therapeutic intervention for AD is the anti-amyloid approach, which ranges from interdicting amyloidogenic processing of the β-amyloid precursor protein (APP) to removing amyloid plaques in the brain.

Introduction of gene clustering and gene order

Having been applied to many biological domains, such as drug discovery, molecular diagnosis, and toxicological research, DNA microarray technology is used most importantly to generate gene data, which hold a great deal of biological information. A microarray data set is commonly presented as a matrix, in which the element $a_{ij}$ is the expression level of the $i$-th gene under the $j$-th condition.

One important task in biology is to cluster similar genes together. Since the line vectors of the matrix contain the information of genes, clustering similar vectors together is equivalent to clustering similar genes together. A number of algorithms have been proposed to cluster gene expression profiles. Eisen

To achieve a much better quality of clustering, the computing concept of gene order has been proposed. Each gene is associated with a line vector of the matrix. A gene order is a permutation of all line vectors in which the vectors are arranged one by one in a sequence so that similar vectors are placed next to each other. The optimal gene order is the permutation in which the total distance between consecutive vectors is minimal. Equivalently, computing the optimal gene order amounts to finding a route of the traveling salesman problem (TSP) in which every vector, and hence every gene, is abstracted as a virtual city.

Since TSP is an NP-hard problem, the computation of the optimal gene order is NP-hard and only the approximation of the optimal gene order can be calculated. To obtain the approximation of the optimal gene order, Tsai

Introduction of ant colony optimization (ACO)

First introduced in 1992, ant colony optimization (ACO) is a novel nature-inspired method based on the foraging behavior of real ants to solve TSP. (Dorigo, 1992; Dorigo

Introduction of genetic algorithm

Genetic algorithm (GA) can be understood as an intelligent probabilistic search algorithm that works on Darwin's principle of natural selection and that can be applied to a variety of combinatorial optimization problems

To understand the outline of GA as in

A GA simulates these processes by taking an initial population of individuals and applying genetic operators to their reproduction. In optimization terms, each individual in the population is encoded into a string or chromosome that represents a possible solution to a given problem. The fitness of an individual is evaluated with respect to a given objective function. Highly fit individuals or solutions have opportunities to reproduce by exchanging pieces of their genetic information, in a crossover procedure, with other highly fit individuals. This produces new "offspring" solutions (i.e., children), which share some characteristics taken from both parents.

To date, there are few types of tools to calculate gene order. To our knowledge, GA

Methods

This study intends to answer the question of which algorithm, ACO or GA, generates the better AD gene order. The distance formula, which measures the degree of similarity of two genes, is the key parameter that affects the quality of gene order. In this section, gene orders are calculated with ACO and GA under different distance formulas (see Formulae 1-3 below). The quality of each gene order is then measured both by the fitness function and by a heat map.

Traveling salesman problem (TSP)

TSP is introduced below:

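Assume that there are $n$ cities and that $d_{ij}$ is the distance between city $i$ and city $j$. TSP asks for the shortest route that visits every city exactly once.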
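To make the problem statement concrete, the following minimal sketch finds the shortest route over a handful of cities by exhaustive search. It assumes an open route (the ends are not joined), in line with a gene order being a sequence of genes; the function names and the four example positions are hypothetical and are not taken from the paper. Exhaustive search examines $n!$ permutations, which is why only approximate methods are practical for realistic numbers of genes.

```python
from itertools import permutations


def route_length(route, dist):
    """Sum of the distances between consecutive cities along an open route."""
    return sum(dist[a][b] for a, b in zip(route, route[1:]))


def brute_force_route(dist):
    """Try every permutation of the cities; feasible only for very small n."""
    n = len(dist)
    return min(permutations(range(n)), key=lambda r: route_length(r, dist))


# Four hypothetical cities placed on a line at positions 0, 2, 1 and 10.
pos = [0.0, 2.0, 1.0, 10.0]
dist = [[abs(p - q) for q in pos] for p in pos]

best = brute_force_route(dist)
print(best, route_length(best, dist))
```

The shortest route visits the cities in the order of their positions on the line, so neighbouring cities end up adjacent in the route, which is the property that gene ordering relies on.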

Measurement of gene similarity

As mentioned above, a gene is associated with a vector, and the similarity of two genes can be estimated by the distance between the two vectors.

For the same two genes, different distance measures yield different degrees of similarity. That is, the estimation of gene similarity is sensitive to the distance formula.

Many distance formulas of vectors to measure the similarity of genes are presented, such as Pearson correlation, absolute correlation, Spearman rank correlation

The first distance measure is the Pearson correlation:

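Let the $k$-dimensional vectors $X = (x_1, x_2, \ldots, x_k)$ and $Y = (y_1, y_2, \ldots, y_k)$ be the expression levels of two genes. The Pearson correlation is

$$r_{X,Y} = \frac{1}{k}\sum_{i=1}^{k}\left(\frac{x_i - \mu_X}{\sigma_X}\right)\left(\frac{y_i - \mu_Y}{\sigma_Y}\right)$$

, where $\mu_X$ and $\sigma_X$ are the mean and the standard deviation of $X$, and $\mu_Y$ and $\sigma_Y$ are those of $Y$.

Pearson distance is defined as

$$d_P(X, Y) = 1 - r_{X,Y} \tag{1}$$

The second distance is the Euclidean distance:

$$d_E(X, Y) = \sqrt{\sum_{i=1}^{k}(x_i - y_i)^2} \tag{2}$$

The third distance measure is the squared Euclidean distance:

$$d_{SE}(X, Y) = \sum_{i=1}^{k}(x_i - y_i)^2 \tag{3}$$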
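The three distance measures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: Pearson distance is taken here as one minus the Pearson correlation, the standard deviation is the population form (divided by $k$), and all function names are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation of two equal-length vectors (population form).

    Raises ZeroDivisionError if either vector is constant.
    """
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / k)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / k)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (k * sx * sy)


def pearson_distance(x, y):
    """Assumed definition: 1 - r, so perfectly correlated genes are at distance 0."""
    return 1.0 - pearson_correlation(x, y)


def squared_euclidean_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean_distance(x, y):
    return math.sqrt(squared_euclidean_distance(x, y))


# Two hypothetical expression profiles with the same shape but different scale.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
print(pearson_distance(x, y))            # close to 0: the profiles are perfectly correlated
print(euclidean_distance(x, y))          # sqrt(14)
print(squared_euclidean_distance(x, y))  # 14.0
```

The example shows why the choice of formula matters: the two profiles are identical under Pearson distance but clearly separated under both Euclidean measures.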

Gene order

As introduced above, a gene is associated with a vector derived from microarray data. In this way, a gene can be regarded as a virtual city whose coordinates are given by the vector. The shorter the distance between two virtual cities, the more similar the two associated genes are. As introduced in the Background section, an optimal (shortest) TSP route for a given set of virtual cities is the optimal gene order, which is a permutation of all genes. In an optimal TSP route, cities that are close to each other are ordered together and the length of the route is the shortest possible. In an optimal gene order, similar genes cluster together, and the quality of clustering is globally optimal. This is in contrast to many clustering methods, which are only locally optimal.

Currently, the optimal gene order cannot be calculated exactly in practice because the problem is NP-hard; only an approximation can be achieved. Therefore, we need a function to measure the quality of the approximation. The following function $F$ is used as the fitness function:

$$F = \sum_{i=1}^{n-1} d(g_i, g_{i+1}) \tag{4}$$

where $n$ is the number of genes, $g_i$ is the $i$-th gene in the gene order, $d(g_i, g_{i+1})$ is the distance between gene $g_i$ and gene $g_{i+1}$, and $d(g_i, g_{i+1})$ can be chosen from Pearson distance, Euclidean distance, squared Euclidean distance, Spearman distance, and other measurements.
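As a small illustration of the fitness function, the sketch below sums the distances between consecutive genes of a given order. The names `fitness` and `euclidean` and the three example vectors are hypothetical; any of the distance measures discussed above could be passed in place of the Euclidean one.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def fitness(order, vectors, distance=euclidean):
    """Sum of distances between consecutive genes in `order` (smaller is better)."""
    return sum(distance(vectors[g], vectors[h]) for g, h in zip(order, order[1:]))


# Three hypothetical genes with 2-dimensional expression vectors.
vectors = {"g1": [0.0, 0.0], "g2": [3.0, 4.0], "g3": [6.0, 8.0]}

print(fitness(["g1", "g2", "g3"], vectors))  # 10.0
print(fitness(["g1", "g3", "g2"], vectors))  # 15.0
```

The first order places similar genes next to each other and therefore has the smaller fitness value.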

Function

However, the measurement of function

Apply ACO to calculate optimal gene order

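To generate the optimal gene order, ACO is applied as follows:

**Step 1**: Use the distance formula to compute the distance $d_{ij}$ between every pair of genes.

**Step 2**: Initialize the pheromone trails $\tau_{ij}$ for all edges between genes (or virtual cities), put the ants on the virtual cities, set the maximum iteration number $N_{max}$, and let $t = 1$.

**Step 3**: While ($t \le N_{max}$)

{

**Step 3.1**: Each ant selects its next city according to the transition probability.

The transition probability of the $k$-th ant from city $i$ to city $j$ is

$$p_{ij}^{k} = \begin{cases} \dfrac{[\tau_{ij}]^{\alpha}[\eta_{ij}]^{\beta}}{\sum_{l \in allowed_k}[\tau_{il}]^{\alpha}[\eta_{il}]^{\beta}}, & j \in allowed_k \\ 0, & \text{otherwise} \end{cases}$$

, where $allowed_k$ is the set of cities that ant $k$ has not yet visited, $\tau_{ij}$ is the pheromone value on the edge between city $i$ and city $j$, $\eta_{ij}$ is the visibility of that edge with $\eta_{ij} = 1/d_{ij}$, $d_{ij}$ is the distance between city $i$ and city $j$, and $\alpha$ and $\beta$ weight the pheromone and the visibility, respectively.

**Step 3.2**: After all ants finish their travels, all pheromone values $\tau_{ij}$ are updated:

$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} + \sum_{k}\Delta\tau_{ij}^{k}, \qquad \Delta\tau_{ij}^{k} = \begin{cases} Q / L_k, & \text{if ant } k \text{ traveled the edge between } i \text{ and } j \\ 0, & \text{otherwise} \end{cases}$$

, where $L_k$ is the length of the route of ant $k$, $\rho$ is the evaporation rate, and $Q$ is a constant.

**Step 3.3**: $t = t + 1$.

}

**Step 4**: End the procedure and select the TSP route that has the minimum length as the output.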
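The steps above can be sketched as a basic Ant System. This is a minimal illustration, not the authors' program: it assumes an open route, a uniform initial pheromone value, a standard evaporate-then-deposit update, and hypothetical parameter values (`alpha`, `beta`, `rho`, `q`, number of ants) that are not taken from the paper.

```python
import random


def route_length(route, dist):
    return sum(dist[a][b] for a, b in zip(route, route[1:]))


def aco_order(dist, n_ants=10, n_iter=50, alpha=1.0, beta=2.0, rho=0.5, q=1.0, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    tau = [[1.0] * n for _ in range(n)]           # Step 2: initial pheromone
    best_route, best_len = None, float("inf")
    for _ in range(n_iter):                       # Step 3: main loop
        routes = []
        for _ in range(n_ants):
            current = rng.randrange(n)
            route, unvisited = [current], set(range(n)) - {current}
            while unvisited:                      # Step 3.1: pick the next city
                cand = sorted(unvisited)
                weights = [
                    tau[current][j] ** alpha
                    * (1.0 / max(dist[current][j], 1e-12)) ** beta
                    for j in cand
                ]
                current = rng.choices(cand, weights=weights)[0]
                route.append(current)
                unvisited.remove(current)
            routes.append(route)
        for i in range(n):                        # Step 3.2: evaporation
            for j in range(n):
                tau[i][j] *= 1.0 - rho
        for route in routes:                      # Step 3.2: deposit
            length = route_length(route, dist)
            if length < best_len:
                best_route, best_len = route, length
            for a, b in zip(route, route[1:]):
                tau[a][b] += q / max(length, 1e-12)
                tau[b][a] += q / max(length, 1e-12)
    return best_route, best_len                   # Step 4: shortest route found


# Six hypothetical genes reduced to positions on a line.
pos = [0.0, 5.0, 1.0, 4.0, 2.0, 3.0]
dist = [[abs(p - q) for q in pos] for p in pos]
order, length = aco_order(dist)
print(order, length)
```

`rng.choices` draws the next city with probability proportional to the weights, which is the transition rule of Step 3.1 without the explicit normalisation.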

Apply GAs to calculate optimal gene order

As mentioned before, the calculation of gene order can be converted to TSP. To make GA fit to process TSP and gene order, the commonly used GA is modified a little. The modifications are listed below:

First, the roulette rule

Second, the crossover probability is set to 1.0 in this paper. That is, crossover always occurs.

Third, mutation always occurs. Between the parent and the mutated offspring, the one with the better fitness value is selected as the genuine offspring, and the other is discarded.

The modified GA is described below:

**Step 1**: Initialization: Set the maximum iteration number to $N_{max}$. The initial population is denoted by $P_{old}$.

**Step 2**: The next generation is denoted by $P_{new}$.

**Step 3**: Selection

1. Calculate each chromosome's fitness value according to formula (4).

2. Calculate the proportion (ratio) of the fitness value of each chromosome.

3. A ratio is chosen by the roulette rule, and its associated chromosome is chosen with it. In this way, two chromosomes are chosen, which are denoted by $C_1$ and $C_2$.

**Step 4**: Crossover

1. Generate two random integers between 1 and the chromosome length $n$, denoted by $I_{point1}$ and $I_{point2}$ ($I_{point1} < I_{point2}$), where $I_{point1}$ and $I_{point2}$ indicate the positions of the two crossover points on chromosomes $C_1$ and $C_2$.

2. Denote the part of $C_2$ from $I_{point1}$ to $I_{point2}$ as $C_{t2}$, and copy it to the head of $C_1$. The increased chromosome $C_1$ is denoted by $C_1'$.

Denote the part of $C_1$ from $I_{point1}$ to $I_{point2}$ as $C_{t1}$, and copy it to the head of $C_2$. The increased chromosome $C_2$ is denoted by $C_2'$.

3. Find every gene that lies in both $C_{t2}$ and $C_1$, which is denoted by $g$ ($g \in C_{t2} \cap C_1$). Delete every such $g$ from $C_1$, and add $C_{t2}$ to the head of the updated $C_1$; the result is the temporary offspring of $C_1$, denoted as $C_{offspring1}$. Using the same method, the temporary offspring of $C_2$ is generated, which is denoted as $C_{offspring2}$.

**Step 5**: Mutation

Select a point on $C_{offspring1}$ randomly as a mutation point, which is denoted by $I_{point1}$. Suppose the value at mutation point $I_{point1}$ is $v_{old}$. Generate a random value $v_{new}$ and assign $v_{new}$ to $I_{point1}$.

Find the other point whose value is equal to $v_{new}$, and update its value to $v_{old}$.

The chromosome $C_{offspring1}$ is updated, and it is a true offspring.

Using the above method, chromosome $C_{offspring2}$ can also be updated, and it is a true offspring.

**Step 6**: Add the two true offspring into the set $P_{new}$.

**Step 7**: Join the populations $P_{old}$ and $P_{new}$, and select the next $P_{old}$ from $P_{old} \cup P_{new}$.

**Step 8**: Increase the iteration step: $t = t + 1$. If $t \le N_{max}$, go to Step 2; otherwise go to Step 9.

**Step 9**: End the algorithm and choose the chromosome that has the smallest fitness value from the last population $P_{old}$.
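A compact sketch of the modified GA follows. Several details are assumptions made for illustration, since they are not fully specified in the text: roulette weights are taken as the inverse of the fitness value (smaller fitness is better), the joined population keeps its `pop_size` best chromosomes, positions are 0-based, and the population size and iteration count are hypothetical.

```python
import random


def route_length(route, dist):
    return sum(dist[a][b] for a, b in zip(route, route[1:]))


def crossover(c1, c2, p1, p2):
    """Put the segment c2[p1..p2] at the head of c1 and drop its duplicates from c1."""
    seg = c2[p1:p2 + 1]
    members = set(seg)
    return seg + [g for g in c1 if g not in members]


def swap_mutation(chrom, rng):
    """Exchange the values at two random positions (keeps a valid permutation)."""
    i, j = rng.sample(range(len(chrom)), 2)
    out = chrom[:]
    out[i], out[j] = out[j], out[i]
    return out


def ga_order(dist, pop_size=20, n_iter=200, seed=0):
    rng = random.Random(seed)
    n = len(dist)

    def fit(c):
        return route_length(c, dist)

    old = [rng.sample(range(n), n) for _ in range(pop_size)]       # Step 1
    for _ in range(n_iter):
        new = []                                                   # Step 2
        weights = [1.0 / (fit(c) + 1e-12) for c in old]            # Step 3 (assumed weights)
        while len(new) < pop_size:
            c1, c2 = rng.choices(old, weights=weights, k=2)
            p1, p2 = sorted(rng.sample(range(n), 2))
            for a, b in ((c1, c2), (c2, c1)):                      # Step 4
                child = crossover(a, b, p1, p2)
                mutant = swap_mutation(child, rng)                 # Step 5
                new.append(min(child, mutant, key=fit))            # Step 6: keep the better
        old = sorted(old + new, key=fit)[:pop_size]                # Step 7 (assumed rule)
    return old[0], fit(old[0])                                     # Step 9


# Six hypothetical genes reduced to positions on a line.
pos = [0.0, 5.0, 1.0, 4.0, 2.0, 3.0]
dist = [[abs(p - q) for q in pos] for p in pos]
order, length = ga_order(dist)
print(order, length)
```

For example, `crossover([0, 1, 2, 3, 4], [4, 3, 2, 1, 0], 1, 2)` moves the segment `[3, 2]` to the head and returns `[3, 2, 0, 1, 4]`, which is still a permutation of all genes.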

Kirk presented an improved GA (IGA) program

Part I (operation of mutation)

Suppose there is a chromosome $\{g_1, g_2, g_3, g_4, g_5, g_6\}$, which is a permutation of genes $g_1$, $g_2$, $g_3$, $g_4$, $g_5$ and $g_6$. First, cut a sub-sequence from the chromosome at random, and suppose it is $\{g_2, g_3, g_4, g_5\}$. Three types of mutations are listed below:

Flip operation $M_f$

Flip the gene positions of the sub-sequence. For example, $\{g_1, g_2, g_3, g_4, g_5, g_6\}$ becomes $\{g_1, g_5, g_4, g_3, g_2, g_6\}$.

Swap operation $M_s$

Swap the positions of the two terminal genes of the sub-sequence, so that $\{g_1, g_2, g_3, g_4, g_5, g_6\}$ becomes $\{g_1, g_5, g_3, g_4, g_2, g_6\}$.

Slide operation $M_l$

Shift each gene of the sub-sequence to the next position by a rotation.
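The three mutation operations can be written as short list manipulations. This is a sketch with hypothetical function names; positions `i` and `j` are the 0-based, inclusive ends of the sub-sequence, and the slide direction (each gene moves one position forward and the last gene wraps to the front) is an assumption based on the wording above.

```python
def flip(chrom, i, j):
    """Reverse the sub-sequence chrom[i..j]."""
    return chrom[:i] + chrom[i:j + 1][::-1] + chrom[j + 1:]


def swap(chrom, i, j):
    """Exchange the two terminal genes of the sub-sequence."""
    out = chrom[:]
    out[i], out[j] = out[j], out[i]
    return out


def slide(chrom, i, j):
    """Rotate the sub-sequence by one position (assumed direction: forward)."""
    return chrom[:i] + [chrom[j]] + chrom[i:j] + chrom[j + 1:]


chrom = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(flip(chrom, 1, 4))   # ['g1', 'g5', 'g4', 'g3', 'g2', 'g6']
print(swap(chrom, 1, 4))   # ['g1', 'g5', 'g3', 'g4', 'g2', 'g6']
print(slide(chrom, 1, 4))  # ['g1', 'g5', 'g2', 'g3', 'g4', 'g6']
```

All three operations return a new list and leave genes outside the sub-sequence untouched, so the result is always a valid permutation.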

Part II (group)

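Suppose $C_1$, $C_2$, $C_3$, ..., and $C_N$ are the chromosomes of the population, which are divided into teams of four chromosomes each. For each team:

Firstly, select the chromosome with the minimal fitness value as the seed, and discard the other three chromosomes.

Secondly, apply the mutation operations $M_f$, $M_s$ and $M_l$ to the seed, which generates three mutated chromosomes.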

Thirdly, all chromosomes in this team are updated as the seed and the three mutated chromosomes, which updates table

Part III (iteration computation)

An operation of a group is called an iteration computation. Within every iteration, an optimal chromosome will be generated for which the fitness value is minimal compared to the other _{t }_{max}. The solution is selected from
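The team-based iteration of Parts I to III can be sketched as follows. This is an illustration under assumptions rather than Kirk's program: the population is shuffled before it is split into teams of four, one random sub-sequence is drawn per team, the slide direction is assumed, and the population size and iteration count are hypothetical.

```python
import random


def route_length(route, dist):
    return sum(dist[a][b] for a, b in zip(route, route[1:]))


def iga_order(dist, pop_size=8, n_iter=300, seed=0):
    assert pop_size % 4 == 0, "the population must split into teams of four"
    rng = random.Random(seed)
    n = len(dist)

    def fit(c):
        return route_length(c, dist)

    pop = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(n_iter):
        rng.shuffle(pop)                       # assumed: random team membership
        nxt = []
        for t in range(0, pop_size, 4):
            s = min(pop[t:t + 4], key=fit)     # seed: best chromosome of the team
            i, j = sorted(rng.sample(range(n), 2))
            flipped = s[:i] + s[i:j + 1][::-1] + s[j + 1:]
            swapped = s[:]
            swapped[i], swapped[j] = swapped[j], swapped[i]
            slid = s[:i] + [s[j]] + s[i:j] + s[j + 1:]
            nxt += [s, flipped, swapped, slid]  # seed plus three mutants
        pop = nxt
    best = min(pop, key=fit)
    return best, fit(best)


# Six hypothetical genes reduced to positions on a line.
pos = [0.0, 5.0, 1.0, 4.0, 2.0, 3.0]
dist = [[abs(p - q) for q in pos] for p in pos]
order, length = iga_order(dist)
print(order, length)
```

Because each seed is carried over unchanged, the best fitness value in the population never gets worse from one iteration to the next.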

Source data

In this paper, the AD microarray data was downloaded from GEO Datasets, NCBI

The illustration of organization of AD microarray data

| AFFX-NAME | GSM21215 | GSM2127 | GSM2128 | GSM21219 | GSM21220 | GSM21221 | GSM21226 | GSM21231 | GSM21232 |
|---|---|---|---|---|---|---|---|---|---|
| BioB-5_at | 8.937 | 9.941 | 8.986 | 9.305 | 9.366 | 8.781 | 9.236 | 9.35 | 9.386 |
| BioB-M_at | 9.278 | 10.56 | 9.55 | 10.08 | 10.23 | 9.355 | 9.915 | 10.27 | 10.37 |
| BioB-3_at | 7.92 | 9.033 | 8.71 | 8.993 | 9.353 | 8.381 | 8.716 | 9.481 | 9.299 |
| BioC-5_at | 10.18 | 11.46 | 10.49 | 10.76 | 10.88 | 10.25 | 10.52 | 10.87 | 10.91 |

*Each column of the data represents the result of one microarray test. Each line of the data represents the expression levels of the same gene under different conditions. All data was log-transformed.

Seven incipient AD samples for each gene are selected to form a 7-dimensional vector, and the resulting 22283 vectors form one data set; eight moderate AD samples for each gene are selected to form an 8-dimensional vector, and these vectors form a second data set; and seven severe AD samples for each gene are selected to form a third data set.

In addition, following the usual practice, all AD gene data are log-transformed for smoothing.

Computing parameters and environment

All tests with the GAs and ACO were run on a personal computer with two CPUs (2.99 GHz and 3.0 GHz) and 1.0 GB of memory.

The parameters of ACO are set below:

_{ij}_{max }= 100.

The parameters of GA are set below:

_{max }= 500,

, where _{max }and

The parameters for the improved genetic algorithm are set as below:

_{max }= 2000,

In addition, in GA, parameter values of _{max }and

Results and discussion

The results are shown in Figure

The comparison of the quality of gene order generated by ACO and GA using Euclidean distance

*Ancillary information for figures:* 1. All microarray data are downloaded from 48, and lines 1 to 300 of the data are used for the experiments behind this and the other figures and tables. 2. Every heat map shows the best gene order found, that is, the one with the smallest value of the fitness function, selected from over 40 test runs. In addition, the distance formula used in the fitness function (see formula 4) is the Euclidean distance. 3. All of the figures listed in this paper are generated by TreeView, which was developed by Dr Eisen and was downloaded from its website.

The comparison of the quality of gene order generated by ACO and GA using squared Euclidean distance formula


The comparison of the quality of gene order generated by ACO and GA using Pearson distance formula


The statistical comparison of the quality of gene order

| Algorithm | Distance | Control subject | Incipient patient | Moderate patient | Severe patient |
|---|---|---|---|---|---|
| ACO | ED | 507.9163 | 442.7255 | 459.7381 | 504.0716 |
| GA | ED | 1800.9287 | 1582.5394 | 1689.2580 | 1604.3304 |
| IGA | ED | 566.0912 | 508.6311 | 516.0917 | 579.3226 |
| ACO | SED | 484.8221 | 419.8804 | 437.9346 | 479.5701 |
| GA | SED | 1916.9891 | 1679.9281 | 1789.6030 | 1682.0008 |
| IGA | SED | 576.9810 | 521.2992 | 529.8852 | 593.4252 |
| ACO | PD | 2737.5938 | 2233.1848 | 2518.7568 | 2167.4011 |
| GA | PD | 2882.9409 | 2532.2205 | 2708.5082 | 2515.8520 |
| IGA | PD | 2712.5501 | 2319.1112 | 2513.9173 | 2218.1910 |

**Notation:** ED: Euclidean Distance; PD: Pearson Distance; SED: Squared Euclidean Distance

**Ancillary information:** every value in this table is a value of the fitness function, averaged over 40 tests. In addition, the distance formula used to calculate the fitness value is ED. Every value in Table 5 corresponds to an average runtime.
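The comparison in this table can be checked with a few lines of arithmetic. The sketch below copies the average fitness values from the table into a dictionary (the variable names and cohort labels are hypothetical) and reports, for each cohort, the combination of algorithm and distance formula with the smallest value.

```python
# Average fitness values from the table above, in the column order
# control, incipient, moderate, severe.
cohorts = ["control", "incipient", "moderate", "severe"]
fitness_values = {
    ("ACO", "ED"): [507.9163, 442.7255, 459.7381, 504.0716],
    ("GA", "ED"): [1800.9287, 1582.5394, 1689.2580, 1604.3304],
    ("IGA", "ED"): [566.0912, 508.6311, 516.0917, 579.3226],
    ("ACO", "SED"): [484.8221, 419.8804, 437.9346, 479.5701],
    ("GA", "SED"): [1916.9891, 1679.9281, 1789.6030, 1682.0008],
    ("IGA", "SED"): [576.9810, 521.2992, 529.8852, 593.4252],
    ("ACO", "PD"): [2737.5938, 2233.1848, 2518.7568, 2167.4011],
    ("GA", "PD"): [2882.9409, 2532.2205, 2708.5082, 2515.8520],
    ("IGA", "PD"): [2712.5501, 2319.1112, 2513.9173, 2218.1910],
}

# For each cohort, pick the (algorithm, distance) pair with the smallest fitness.
best = {
    cohort: min(fitness_values, key=lambda pair: fitness_values[pair][i])
    for i, cohort in enumerate(cohorts)
}
for cohort, pair in best.items():
    print(cohort, pair)
```

In every column the smallest value belongs to ACO with the squared Euclidean distance.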

The statistical comparison of the runtime of ACO, GA and IGA

| Algorithm | Distance | Control subject | Incipient patient | Moderate patient | Severe patient |
|---|---|---|---|---|---|
| ACO | ED | 122.0545 | 121.8582 | 121.8611 | 121.8653 |
| GA | ED | 580.8345 | 586.7355 | 588.9012 | 586.7427 |
| IGA | ED | 133.0079 | 131.2152 | 140.4218 | 139.1710 |
| ACO | SED | 109.8382 | 110.0110 | 109.7321 | 110.2532 |
| GA | SED | 186.4143 | 184.5551 | 185.1629 | 185.7899 |
| IGA | SED | 126.8957 | 126.9276 | 126.9757 | 127.0232 |
| ACO | PD | 123.0438 | 122.8454 | 122.6719 | 122.6450 |
| GA | PD | 186.9550 | 187.5644 | 187.0732 | 188.4089 |
| IGA | PD | 129.8745 | 127.7448 | 127.0051 | 126.4476 |

**Notation:** ED: Euclidean Distance; PD: Pearson Distance; SED: Squared Euclidean Distance

**Ancillary information:** every runtime in this table is the average of 40 tests. In addition, every runtime corresponds to a fitness value listed at Figure 3.

(1) ACO was better suited than GA to calculate the gene order of the AD genes tested in this paper.

(2) Both for ACO and GAs, the use of different distance formulas generated a different quality of gene order. The squared Euclidean distance generated the best quality overall compared with the Pearson distance and Euclidean distance.

Pearson distance is a popular distance formula that is commonly used to calculate gene order. However, we found that Pearson distance is not the optimal distance formula for the calculation of gene order associated with AD genes. In this paper, the original data is not normalized, the reason for which is explained below:

Suppose two genes and their associated vectors are $X = (x_1, x_2, \ldots, x_k)$ and $Y = (y_1, y_2, \ldots, y_k)$. If all components of the vectors are normalized, they become small real values that are less than 1.0. Value

Conclusion

With AD being the most common form of senile dementia, the study of AD-associated genes is an imperative research subject. One important branch of AD gene studies is to cluster AD genes with the highest quality; in general, gene order generates a better quality of clustering than other methods. Our experimental results support the following conclusion: ACO is better than GA in AD gene order computation. Further, the following computational features were revealed in our study: for both ACO and GA, different distance formulas generated gene orders of different quality. Compared to Pearson distance and Euclidean distance, the squared Euclidean distance generated the best quality of AD gene order. Although Pearson distance is a commonly used tool, it is less suitable for AD gene order computation with either the ACO or the GA methods.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BH formulated the computing framework in this paper together with CP. GJ performed the computing and drafted this manuscript together with CP. SW wrote part of the background section of this manuscript. QL, ZC, CRV, JTR, and YD assisted the study and provided suggestions. XH initiated the project, provided guidance for the study, and performed the final editing of the manuscript. All authors have read and approved the final manuscript.

Acknowledgements

The work was supported by the BWH Radiology and MGH Psychiatry research funds (to X. Huang) and the Technology Innovation fund (No. 09zz028) of Key Developing Program from Education Department of Sichuan Province, China. The authors appreciate the help from the other members of gene computation group: W. Hu, C.-B. Wang, X. Li, H. Liu, L.-J. Ye, J.-L. Zhou, P. Shuai, and S.-Q. Liu. The authors appreciate the help from Prof. J. Zhang and Prof. J. Zhou. The authors would also like to thank Ms. Kimberly Larson of BWH Radiology and Mr. Conan Huang of Brown University and MGH Psychiatry for editing the manuscript.

This article has been published as part of