Cognitive Science Department & Fujian Key Laboratory of the Brain-like Intelligent Systems, Xiamen University, Xiamen, China

Shenzhen Key Lab for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Department of Computer Sciences, University of Sherbrooke, Sherbrooke, QC, Canada

Abstract

Background

Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into the feature vectors which contain the occurrence, location and order relation of

Results

The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. This method is also compared with BlastClust, CD-HIT-EST and some others. The experimental results show our method is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationship among the sequences.

Conclusions

We introduced a novel clustering algorithm which is based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationship among the sequences.

Background

With the development of advanced biotechnology, more and more biological sequence information is being generated, and the amount of genetic data is growing faster than it can be analyzed. Clustering techniques provide a viable solution for handling and analyzing such rapidly growing genetic data. Clustering algorithms partition sequences into biologically meaningful groups, thereby facilitating the prediction of gene functions

Clustering of gene sequences requires calculation of similarity between sequences. There are two clustering approaches according to the similarity measure used in a clustering method. One is based on sequence alignment. The similarity between two gene sequences is measured by the scores obtained from an alignment algorithm such as BLAST

The other approach for similarity measure is to use alignment-free methods

Major algorithms used in gene sequence clustering can be divided into two categories according to the result format: hierarchical clustering algorithms and partitional clustering algorithms

Partitioning algorithms have also been used. Partitional clustering obtains a partition of data objects by optimizing some clustering criterion. Partitional clustering algorithms are simple and well-suited for clustering large datasets

Hierarchical clustering produces a nested series of partitions, where the results are usually depicted as a dendrogram while partitional clustering produces a flat partition. BlastClust

Recent studies also reveal that BlastClust is less effective for clustering divergent sequences

The approach presented in this paper involves a new alignment-free distance measure based on

Methods

A gene is a stretch of DNA that codes for a single polypeptide chain

The traditional approach for clustering DNA sequences requires all-by-all comparisons from alignment _{1} = AGCACACA and _{2} = ACACAGTA, _{1}^{P} and _{2}^{P} are used to represent the ^{th} characters in _{1} and _{2}, respectively. The alignment score _{1}_{2}) is given by

where

In the following, we first present DMk and then describe the mBKM algorithm.

A new similarity measure: DMk

In this section, we introduce a new similarity measure which takes into account the occurrence, location and order relation of

Sequences are numerically transformed to feature vectors that can be processed by data mining algorithms. Let Σ be the alphabet set of nucleotides (Σ = {A, C, G, T}). A sequence of length ^{k} possible _{k}_{w}, is counted by moving a sliding window of length
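The sliding-window counting described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `ktuple_counts` and `ktuple_positions` are hypothetical, and the sketch assumes the window advances one base at a time and that positions are reported 1-based.

```python
from collections import Counter
from itertools import product


def ktuple_counts(seq, k):
    """Count every length-k word in seq with a window that moves one base at a time.

    Assumption: step size of one base. Words containing characters
    outside {A, C, G, T} are ignored.
    """
    seq = seq.upper()
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Report all 4^k possible words in lexicographic order, including absent ones.
    words = ("".join(t) for t in product("ACGT", repeat=k))
    return {w: seen.get(w, 0) for w in words}


def ktuple_positions(seq, k):
    """Map each observed length-k word to the 1-based start positions of its occurrences."""
    seq = seq.upper()
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i + 1)
    return positions
```

For the example sequence AGCACACA and k = 2, the word CA is counted three times, at start positions 3, 5 and 7. The position lists are what a location-aware measure needs in addition to the raw counts.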

To explore the correlation properties of DNA, Nair et al. _{r} is the location of the ^{th} occurrence of _{0} = 0. And _{r} is given as,

in which _{r} reflects the density of _{1} position, and {_{1,}_{2}_{m}} for repetition of ^{th} element indicates the relative position of two neighboring

To characterize the order of _{r}, we define _{j} as a partial sum of {_{r}}. _{j} is calculated by the following formula:

{_{r}} is a list of non-negative real numbers, and _{j} is totally ordered by ≤, so _{1}, _{2}, …, _{m} is also an ordered set. {_{1}, _{2},…, _{m}} and {_{1}, _{2},…, _{m}} determine each other uniquely. _{j} is only dependent of the number and positions of _{1}, _{2},…, _{m}}, one can obtain where

_{1}_{2},…, _{m}} to calculate the probabilities, the Shannon entropy reflects the degree of importance of position in a sequence. We construct a discrete probability distribution

For each
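The partial-sum and entropy construction described above can be sketched as below. This is an illustration under stated assumptions, not the paper's exact equations: the per-occurrence weights are taken as given input, the partial sums are assumed to be normalised by their total to form the probability distribution, and the logarithm is assumed to be base 2. The function names are hypothetical.

```python
import math
from itertools import accumulate


def partial_sums(weights):
    """Running sums of a list of non-negative per-occurrence weights."""
    return list(accumulate(weights))


def entropy_of_partial_sums(weights):
    """Shannon entropy of the distribution obtained from the partial sums.

    Assumptions (not confirmed by the text): each partial sum is divided
    by the total of all partial sums, and the entropy uses log base 2.
    """
    sums = partial_sums(weights)
    total = sum(sums)
    if total == 0:
        return 0.0
    probabilities = [s / total for s in sums]
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

Because the weights are non-negative, the partial sums are non-decreasing, so the original weights can be recovered by differencing consecutive sums. This is the sense in which the two lists determine each other uniquely.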

For a fixed ^{k} distinct ^{k}-dimension feature vector are denoted by _{i} means the feature representation of the

Cluster analysis algorithms partition objects into groups based on the distances between objects. Euclidean distance is the square root of the sum of the squared differences between corresponding components of two feature vectors. The

where ^{th}
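As a minimal sketch of the Euclidean distance described above, the code below computes the distance between two feature vectors and fills a symmetric all-pairs matrix. Whether Equation (4) is exactly this plain Euclidean form cannot be confirmed from the text, so treat it as an assumption; the names `euclidean` and `distance_matrix` are hypothetical.

```python
import math


def euclidean(u, v):
    """Square root of the summed squared differences between corresponding components."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Symmetric N x N matrix of pairwise distances with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = euclidean(vectors[i], vectors[j])
    return matrix
```

Only the upper triangle is computed and then mirrored, which halves the number of distance evaluations for N sequences.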

Algorithm Name: DMk for similarity measure

Input: sequences {_{1}, _{2},…, _{N}}.

Output: similarity matrix, (_{N*N}.

Steps:

1. For each sequence, search and locate each

1.1 For each

1.2 For each

1.3 For each

2. For each sequence, construct 4^{k} -component vector by

3. For any two sequences, use Equation (4) to calculate the distance between the two sequences.

4. Return {

A new clustering algorithm: mBKM

K-means (KM) can be used to obtain a hierarchical clustering solution using a repeated bisecting approach

BKM has a linear time complexity in each bisecting step. Recent study

BKM initially regards the whole data set as a cluster, and splits one cluster into two subclusters at each bisecting step using KM until singleton clusters are obtained at the leaves or until

1) Choosing the cluster with largest size;

2) Selecting the cluster with the overall similarity

The overall similarity is either minimized or maximized, depending on the definition of ^{’}).

3) Using a criterion based on both size and overall similarity.

Because these methods differ little in the final clustering result, splitting the largest remaining cluster is recommended

The BKM algorithm has two problems:

1. Choosing the initial centroids at random may select elements that lie too close together. If the initial centroids are too close, the algorithm will converge to a local optimum. Moreover, different sets of initial cluster centroids can lead to different final clustering results.

2. At each bisecting step, the cluster chosen for splitting is usually the largest one. Although this leads to reasonably good and balanced clustering solutions, it does not work well for datasets whose natural clusters differ in size, because it tends to partition the larger clusters first. In real biological data, clusters do not always contain similar numbers of elements.

To address these two problems and obtain more natural hierarchical solutions, we develop a modified bisecting K-means, mBKM, which chooses the initial centroids by the maximum and minimum principle and selects the cluster to split based on the compactness of the clusters.

1) Selecting Initial Cluster Centroids

In order to achieve stable and reliable clustering results, we use the maximum distance, which can avoid obtaining adjacent elements, to select the initial centroids. For a set of sequences, {_{1}, _{2}, …, _{N}}, let _{i}, _{j})(

2) Selecting the Cluster to Split

The BKM algorithm usually partitions the largest cluster into two smaller ones and yields clusters of similar size. However, a cluster with many members is not always a loose one. We therefore select for splitting a loose cluster, that is, one whose members are not closely related to each other.

Variance measures how far a set of values is spread out, so it can measure the compactness of a cluster. We therefore select the cluster to split on the basis of compactness as measured by variance. The variance of cluster _{j} is defined as follows:

where _{j} is the centroid of sequences in _{j}, _{i}, _{j}) is the distance between _{i} and _{j}, and _{j} is the number of sequences in the cluster.

A small variance of a cluster indicates that the members in the cluster tend to be closely related to the mean. In other words, the smaller the variance is, the more compact the cluster is, and vice versa.
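The contrast between choosing the largest cluster and choosing the loosest one can be illustrated with a small sketch. It assumes that the variance is the mean squared Euclidean distance of the members to the cluster centroid; the exact normalisation in Equation (7) is not recoverable from the text, and the function names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def cluster_variance(points):
    """Mean squared distance of the members to their centroid (assumed normalisation)."""
    center = centroid(points)
    return sum(math.dist(p, center) ** 2 for p in points) / len(points)


# A larger but compact cluster and a smaller but loose one.
compact = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
loose = [(10, 0), (20, 0)]

largest = max([compact, loose], key=len)                # size-based choice
loosest = max([compact, loose], key=cluster_variance)   # variance-based choice
```

Here the size-based rule would split `compact`, whereas the variance-based rule selects `loose`, which is the behaviour the compactness criterion is meant to produce.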

Based on the above idea, we outline the mBKM algorithm as follows.

Algorithm Name: mBKM for clustering sequences

Input: sequences {_{1}, _{2}, …, _{N}}, a distance function

Output: Set of

Steps:

1. Initialization: Regard the whole dataset {_{1}, _{2}, …, _{N}} as a single cluster.

2. Pick a cluster to split.

3. Find two sub-clusters:

3.1 Select two initial centroids using Equation (6);

3.2 Assign the sequences to the closest centroid;

3.3 Recalculate two centroids based on the sequences assigned to the cluster;

3.4 Repeat steps 3.2 and 3.3 until no change in cluster centroid calculation.

4. Calculate the variance of each cluster according to Equation (7) and take the split that produces the clustering result with the highest variance.

5. Repeat steps 2, 3 and 4 until the desired number

This algorithm outputs a binary tree of sequences, where each leaf represents a sequence and each internal node represents a collection of sequences.
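The steps above can be sketched in a few dozen lines. This is one possible reading, not the authors' code: it assumes Euclidean distance on feature vectors, takes the two mutually farthest members as initial centroids, and interprets steps 2 and 4 as "split the cluster with the highest variance". It returns the flat list of clusters rather than the binary tree, and all function names are hypothetical.

```python
import math


def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]


def variance(points):
    # Assumed form: mean squared distance of the members to the centroid.
    center = centroid(points)
    return sum(math.dist(p, center) ** 2 for p in points) / len(points)


def farthest_pair(points):
    # Initial centroids: the two members separated by the largest distance.
    pairs = ((math.dist(p, q), i, j)
             for i, p in enumerate(points)
             for j, q in enumerate(points) if i < j)
    _, i, j = max(pairs)
    return points[i], points[j]


def bisect(points, max_iter=100):
    """Two-means split of one cluster, seeded with the farthest pair."""
    c1, c2 = farthest_pair(points)
    left, right = [], []
    for _ in range(max_iter):
        new_left = [p for p in points if math.dist(p, c1) <= math.dist(p, c2)]
        new_right = [p for p in points if math.dist(p, c1) > math.dist(p, c2)]
        if not new_left or not new_right:
            break  # degenerate split, keep the previous assignment
        left, right = new_left, new_right
        n1, n2 = centroid(left), centroid(right)
        if n1 == c1 and n2 == c2:
            break  # centroids no longer change
        c1, c2 = n1, n2
    return left, right


def mbkm(points, k):
    """Repeatedly bisect the loosest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not candidates:
            break
        target = max(candidates, key=lambda i: variance(clusters[i]))
        left, right = bisect(clusters[target])
        if not left or not right:
            break
        clusters.pop(target)
        clusters.extend([left, right])
    return clusters
```

Finding the farthest pair compares all members of the cluster pairwise, so that step is quadratic in the cluster size, while each reassignment pass is linear.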

Results and discussion

The proposed method is evaluated by clustering functionally related gene sequences and by phylogenetic analysis. We present our evaluation results in two parts. The first one aims at testing the efficiency of our similarity measure, DMk. The second one is to illustrate the efficiency of the proposed clustering method, mBKM.

To measure the quality of the clustering results, our experiments adopt F-measure

where _{ij}/_{j}_{ij}/_{i}_{ij} is the number of the sequences of class _{i} is the number of the sequences of class _{j} is the number of the sequences of cluster

The F-measure of the whole clustering result is defined as:

where
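A minimal sketch of the F-measure computation follows. Precision and recall for a class and a cluster are taken as the overlap divided by the cluster size and by the class size, respectively, as in the definitions above. The overall score is assumed to be the commonly used class-size-weighted maximum over clusters, since the full formula is not recoverable from the text; `f_measure` is a hypothetical name.

```python
from collections import Counter


def f_measure(classes, clusters):
    """Overall F-measure of a clustering.

    classes[s] is the true class of sequence s and clusters[s] the cluster
    it was assigned to. Assumption: the overall score weights the best
    per-class F value by the class size.
    """
    n = len(classes)
    class_size = Counter(classes)
    cluster_size = Counter(clusters)
    overlap = Counter(zip(classes, clusters))
    total = 0.0
    for i, n_i in class_size.items():
        best = 0.0
        for j, n_j in cluster_size.items():
            n_ij = overlap[(i, j)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_i / n * best
    return total
```

A perfect clustering scores 1, and the score decreases as classes are merged into one cluster or scattered over several.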

Evaluation of similarity measure

To evaluate the proposed similarity measure, we test DMk on gene sequence data sets and compare it with the

Gene sequences clustering

Genes of the same family usually share similar sequences, functional domains, and even interacting partners. When a new gene is assigned to a cluster, the biological function of this cluster can be attributed to this gene with high confidence.

Four data sets are extracted from different gene repositories as shown in Table

| Data | Name | Number | Average length (bp) | Description |
|---|---|---|---|---|
| DS1 | beta-globin | 176 | 1531 | Cytochrome P450 |
| | beta-Hemoglobin | 89 | 448 | Hemoglobin subunit |
| | integrin_alpha | 142 | 3360 | Integrin, alpha |
| | ketoacyl-synt1 | 43 | 754 | Estradiol 17-beta-dehydrogenase 8 |
| | myoglobin | 55 | 478 | Cytoglobin Myoglobin |
| | RWD | 93 | 825 | RWD domain-containing protein |
| | VCL | 92 | 2746 | Vinculin |
| | Histone | 81 | 668 | Histone |
| DS2 | HBG106679 | 22 | 446 | Copper uptake protein 2 |
| | HBG108349 | 49 | 718 | Prolactin |
| | HBG079775 | 26 | 3152 | Transcription elongation factor SPT5 |
| | HBG058842 | 34 | 1351 | TNFR superfamily member 1A |
| | HBG002834 | 92 | 951 | Calumenin/Reticulocalbin |
| | HBG050441 | 58 | 1899 | ATP-binding cassette sub-family G member |
| DS3 | HBG093787 | 32 | 1769 | Hypothetical membrane proteins |
| | HBG099893 | 34 | 430 | Putative membrane protein precursor |
| | HBG415481 | 65 | 557 | Phasin like/family protein |
| | HBG423057 | 32 | 236 | Hypothetical proteins |
| | HBG050644 | 99 | 3129 | Beta galactosidase, beta glucuronidase, Evolved beta-D-galactosidase alpha subunit |
| | HBG364776 | 48 | 1069 | Formate dehydrogenase gamma subunit precursor |
| DS4 | HBG000080 | 29 | 674 | BWK-1, CG6617-PA, Zgc:73100 C20orf11 homolog, RH01588p |
| | HBG060165 | 28 | 163 | ATP synthase, H+ transporting mitochondrial F1 complex/epsilon subunit |
| | HBG010471 | 48 | 1802 | Hypothetical Glycosyl transferase, family 25/Endoplasmic reticulum targeting sequence containing protein |
| | HBG000013 | 70 | 318 | 60 S ribosomal protein L36a-like, 60 S ribosomal protein L42, L44, IP15820p, RPL |
| | HBG000026 | 18 | 3157 | Eukaryotic translation initiation factor 2-alpha kinase 3 precursor, Eukaryotic translation initiati |
| | HBG065748 | 48 | 1238 | AT20832p, AT27361p, CG10513-PA, CG10514-PA, CG10550-PA, isoform A, CG10553-PA, CG10559-PA, CG10560-P |

Four widely used clustering algorithms, including KM, single-linkage clustering (SL), complete-linkage clustering (CL) and average-linkage clustering (AL), have been chosen in the experiments. For comparison, we perform the clustering tests on all data sets using the

Because its initialization is random, the KM algorithm can yield different results across runs. We therefore run KM ten times and report the average performance. The AL, CL and SL hierarchical algorithms are deterministic and each generates a single solution. We obtain the result of each hierarchical clustering algorithm by cutting the hierarchical tree at the expected number of clusters.

According to Table

| Method | DS1 | DS2 | DS3 | DS4 |
|---|---|---|---|---|
| KM with | 0.5738 | 0.7828 | 0.5543 | 0.6532 |
| SL with | 0.3544 | 0.4148 | 0.3307 | 0.3244 |
| CL with | 0.5153 | 0.7253 | 0.5588 | 0.516 |
| AL with | 0.5113 | 0.6956 | 0.5578 | 0.3185 |
| BKM with | 0.5725 | 0.7876 | 0.5498 | 0.6551 |
| mBKM with | 0.5882 | 0.7913 | 0.5691 | 0.6722 |
| KM with DMk | 0.7 | 0.8261 | 0.7716 | 0.8284 |
| SL with DMk | 0.601 | 0.7948 | 0.8188 | 0.6535 |
| CL with DMk | 0.7172 | 0.9295 | 0.6868 | 0.7468 |
| AL with DMk | 0.7898 | 0.9365 | 0.6963 | 0.8498 |
| BKM with DMk | 0.7346 | 0.8511 | 0.8044 | 0.8813 |
| mBKM with DMk | 0.808 | 0.9645 | 0.9143 | 0.9587 |

Phylogenetic analysis

In this experiment, the proposed similarity measure DMk is further tested by phylogenetic analysis. To evaluate the similarity measures, we use UPGMA, a clustering algorithm widely used in phylogenetic analysis, as implemented in the PHYLIP package. The tree is drawn with the TREEVIEW program

The selected data set includes the full β-globin gene sequences of 10 species reported by Feng et al.

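| Species | Human | Goat | Opossum | Gallus | Lemur | Mouse | Rat | Gorilla | Bovine | Chimpanzee |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 0 | 22.95 | 37.65 | 111.47 | 14.02 | 35.21 | 20.68 | 3.42 | 25.07 | 3.54 |
| Goat | | 0 | 41.22 | 65.70 | 18.80 | 35.05 | 33.93 | 32.36 | 6.04 | 33.05 |
| Opossum | | | 0 | 42.54 | 33.29 | 64.03 | 51.64 | 46.35 | 40.41 | 49.73 |
| Gallus | | | | 0 | 90.93 | 80.07 | 95.26 | 121.09 | 61.69 | 122.65 |
| Lemur | | | | | 0 | 21.39 | 18.50 | 17.19 | 18.12 | 18.74 |
| Mouse | | | | | | 0 | 16.04 | 33.64 | 27.60 | 37.59 |
| Rat | | | | | | | 0 | 17.69 | 30.53 | 20.58 |
| Gorilla | | | | | | | | 0 | 33.66 | 0.80 |
| Bovine | | | | | | | | | 0 | 35.46 |
| Chimpanzee | | | | | | | | | | 0 |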

In Table

The quality of the constructed tree reflects the quality of both the distance matrix and the method used to extract information from DNA sequences. In Figure

The phylogenetic trees for 10 species using the full DNA sequences of β-globin

The tree in Figure

DMk measures the similarity between DNA sequences more effectively than the
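To make the tree-building step concrete, the following is a minimal sketch of UPGMA (average-linkage agglomeration) applied to a distance matrix. It is a stand-alone illustration and not the PHYLIP implementation used in the experiments: branch lengths are omitted, the tree is returned as nested tuples, and the function name `upgma` is hypothetical. The example distances are the Human, Gorilla, Chimpanzee and Gallus entries of the distance table above.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build a UPGMA topology as nested tuples.

    dist maps a pair of labels (a, b) to their distance. When two nodes
    are merged, the distance from the new node to every other node is the
    size-weighted average of the two old distances.
    """
    size = {label: 1 for label in labels}          # node -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


species = ["Human", "Gorilla", "Chimpanzee", "Gallus"]
distances = {
    ("Human", "Gorilla"): 3.42,
    ("Human", "Chimpanzee"): 3.54,
    ("Human", "Gallus"): 111.47,
    ("Gorilla", "Chimpanzee"): 0.80,
    ("Gorilla", "Gallus"): 121.09,
    ("Chimpanzee", "Gallus"): 122.65,
}
tree = upgma(species, distances)
```

On this subset the sketch first joins Gorilla and Chimpanzee, then adds Human, and attaches Gallus last, giving `("Gallus", ("Human", ("Gorilla", "Chimpanzee")))`.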

Evaluation of clustering methods

To evaluate the effectiveness of the proposed clustering algorithm, mBKM, we apply mBKM in clustering gene sequences and compare it with several clustering algorithms. Moreover, we use our method, mBKM with similarity measure DMk, in phylogenetic analysis to show how well the genes are grouped together and how well the resulting trees agree with existing phylogenies.

Performance comparison of clustering methods

In order to illustrate the efficiency of mBKM in gene sequence clustering, we ran mBKM with the

The clustering performance of different clustering methods is the result of a combination of factors, including the types of sequence distances used for clustering and the choice of clustering algorithms. Table

From Table

Because the clustering methods listed in Table

The distribution of F-measure as a function of the number of clusters based on the

The distribution of F-measure as a function of the number of clusters based on DMk (the true numbers of clusters in DS1, DS2, DS3 and DS4 are 8, 6, 6 and 6, respectively)

Figure

Figure

From Figures

With regard to clustering algorithms, SL performs poorly in many cases. A possible reason is that SL merges clusters according to their nearest pair of sequences, which may lead to bad splits of a cluster when two or more clusters show different pattern densities. For KM and BKM, the results of many runs are lower than those of mBKM. On the whole, mBKM achieves better results than the other clustering algorithms, and mBKM combined with DMk achieves the best results among the clustering methods in our experiments.

The task of sequence clustering is to group given sequences into clusters. The similarity measure, DMk, measures the similarity between DNA sequences based solely on the

In order to further illustrate the efficiency of our method, combining mBKM and DMk, we compare mBKM with DMk to two other clustering programs: BlastClust

We perform tests using BlastClust and CD-HIT-EST on the data sets listed in Table

| Data | mBKM with DMk: F-measure | mBKM with DMk: Time (s) | BlastClust: F-measure | BlastClust: Time (s) | CD-HIT-EST: F-measure | CD-HIT-EST: Time (s) |
|---|---|---|---|---|---|---|
| DS1 | 0.8080 | 6.875 | 0.4525 | 48 | 0.2713 | 39.8 |
| DS2 | 0.9645 | 1.844 | 0.7515 | 13.6 | 0.5924 | 6.4 |
| DS3 | 0.9143 | 2.375 | 0.3693 | 12.7 | 0.3157 | 17.1 |
| DS4 | 0.9587 | 1.328 | 0.5224 | 9.3 | 0.4007 | 6.8 |

(Time includes both similarity measurement and clustering.)

Table

When the real number of clusters is unknown, the performance of our algorithm will be affected. To compare with BlastClust and CD-HIT-EST on relatively fair ground, we can vary the number of clusters and average the F-measure values over the different numbers of clusters. For instance, when we run mBKM with DMk with the number of clusters ranging from 3 to 20, the average F-measure values are 0.7065, 0.8533, 0.8205 and 0.8429 for DS1, DS2, DS3 and DS4, respectively. As shown in Additional file

Phylogenetic analysis

In this experiment, we used mBKM with DMk to construct phylogenetic trees.

1) The clustering result of 10 species

We apply mBKM with DMk to the 10 DNA sequences of β-globin gene in Table

The phylogenetic trees for 10 species using the full DNA sequences of β-globin

In Figure

2) The clustering result of 60 H1N1 viruses

H1N1 is a subtype of the influenza A virus that can cause illness in humans and many other animal species. Analysis of H1N1 is critical for preparing strategies to prevent and control influenza epidemics and pandemics. The H1N1 avian influenza virus is characterized by continuous antigenic variation, which is mainly caused by the HA and NA proteins, of which HA has the highest mutation rate. The HA protein plays a critical role in recognizing and binding to the host cell receptor during infection, and it is the decisive factor in host specificity. We use our method to verify the phylogenetic relationships of H1N1, and the result is included in Additional file 1. The clustering result using mBKM with DMk is shown in Figure

The phylogenetic trees for 60 H1N1 viruses

As is seen from Figure

Our method analyzed the 60 H1N1 viruses within 1 second, whereas on the same data set UPGMA with CLUSTALW and MUSCLE took 460 and 60.1 seconds, respectively, to build the tree, and ML with CLUSTALW and MUSCLE took 571 and 188.1 seconds, respectively.

Our method, mBKM with DMk, performs well when clustering 10 species and 60 H1N1 viruses. It obtains similar results to the alignment-based method. Furthermore, our method is much faster than the alignment-based methods.

To compare the speed of our method with the multiple sequence alignment based methods, CLUSTALW and MUSCLE, we performed the test on two sets of sequences. The first set consists of six datasets of 100 sequences each, with sequence lengths of around 1000, 2000, 3000, 4000, 5000 and 6000, respectively. The second set also consists of six datasets, containing 20, 40, 60, 80, 100 and 120 sequences, respectively, with sequence lengths of around 3000. Because the ML method is slower than UPGMA, we use UPGMA to build the phylogenetic trees from the CLUSTALW and MUSCLE results and record the time used by each method. The results in Figure

The time comparison of three methods

Scalability test

For DMk, the time complexity of transforming the gene sequence _{1}⋯_{l} to a vector is ^{K}), thus the time complexity of generating the vectors for the whole sequence database is

The time consumed for mBKM calculation is primarily determined by choosing the initial cluster centroids. For ^{2}). The time complexity of clustering step in mBKM is

Figure ^{2}). The scalability with respect to the length of sequences was tested on five datasets with five different sequence lengths: 10000, 20000, 30000, 40000, 50000 and each set consists of 4 clusters and 100 sequences. The sensitivity with respect to the length of the sequence is illustrated in Figure

The relationship between the runtime and different numbers of sequences and length of sequences

Conclusions

In this paper, we presented a novel approach for DNA sequence clustering, mBKM, based on a new sequence similarity measure, DMk, which is extracted from DNA sequences based on the position and composition of oligonucleotide patterns. The experimental results show that combining mBKM with DMk is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationships among the sequences. In addition, DMk can achieve comparable or better accuracy than the frequency-based distance measure. Our proposed method can be applied to study gene families, and it can also help with the prediction of novel genes. Furthermore, mBKM with DMk can generate cluster trees that are useful for understanding the processes governing gene evolution. Our method may also be extended to protein sequence analysis and to metagenomics, for identifying the source organisms of metagenomic data. Our method has limitations too. For example, it does not consider edge length, and it has not addressed problems with long repeated sequences or long insertions. In the future we will try to address these problems.

Competing interests

The authors declare that there are no competing interests.

Authors’ contributions

DW designed the algorithm, conducted the experiments, and wrote the manuscript. QJ supervised the project and proposed data mining algorithm. YW guided the experiments, wrote the manuscript and analyzed the results. SW guided the experiment analysis, and proposed ideas for sequence clustering algorithm. All authors read and approved the final manuscript.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grant No. 61175123 and No. 10771176, and the Shenzhen New Industry Development Fund under Grant No. CXB201005250021A.