Department of General Biology, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Av. Antônio Carlos, 6627, MG, 31.270-901, Brazil

Computer Science Department, Instituto de Ciências Exatas, Universidade Federal de Minas Gerais, Belo Horizonte, Av. Antônio Carlos, 6627, 31.270-901, MG, Brazil

Max Planck Institute for Informatics, Campus E2 1, Saarbrücken, Germany

CEBio and Laboratory of Cellular and Molecular Parasitology, Instituto René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Av. Augusto de Lima 1715, 30190-002, MG, Brazil

Genome and Proteome Network of the State of Pará, Universidade Federal do Pará, Belém, R. Augusto Corrêa, 66.075-110, PA, Brazil

Abstract

Background

Singular value decomposition (SVD) is a powerful technique for information retrieval; it helps uncover relationships between elements that are not otherwise apparent.

Results

We found that SVD applied to amino acid sequences reveals relationships and provides a basis for producing clusters and cladograms that demonstrate evolutionary relatedness of species, correlating well with Linnaean taxonomy. The choice of a reasonable number of singular values is crucial for SVD-based studies. We found that fewer singular values are needed to produce biologically significant clusters when SVD is employed. Subsequently, we developed a method to determine the lowest number of singular values and fewest clusters needed to guarantee biological significance; this system was validated by comparison with Linnaean taxonomic classification.

Conclusions

By using SVD, we can reduce uncertainty concerning the appropriate rank value necessary to perform accurate information retrieval analyses. In tests, clusters that we developed with SVD perfectly matched what was expected based on Linnaean taxonomy.

Background

We developed a methodology, based on singular value decomposition (SVD), for improved inference of evolutionary relationships between amino acid sequences of different species.

We chose this methodology because of SVD's proven capacity to establish non-obvious but relevant relationships among clustered elements.

The rationale behind SVD is that an m × n matrix A can be represented by a product of derived matrices,

A = U Σ V^T,

where U is an orthonormal m × m matrix, Σ is an m × n matrix, known as the diagonal matrix, whose entries (the singular values) are real and non-negative, and V^T is an orthonormal n × n matrix. Retaining only the k largest singular values yields the truncated matrix

A_k = U_k Σ_k V_k^T,

where U_k and V_k hold the first k columns of U and V, and Σ_k holds the k largest singular values.

Thus, data approximation depends on how many singular values are used: A_k is the rank-k matrix that is closest to A in the least-squares sense.

The justification for using only k singular values, with k smaller than the rank of A, is that the smallest singular values mostly encode noise; discarding them lets A_k retain the essential relationships in the data while acting as a filter against this noise.
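As an illustration of the truncated decomposition, the following sketch computes A_k with NumPy; the matrix values are arbitrary placeholders, not data from this study.

```python
import numpy as np

# Toy frequency matrix A (rows could be trigrams, columns species); placeholder values.
A = np.array([[4.0, 2.0, 0.0, 1.0],
              [2.0, 3.0, 1.0, 0.0],
              [0.0, 1.0, 5.0, 4.0],
              [1.0, 0.0, 4.0, 6.0]])

# Full decomposition A = U Sigma V^T; singular values in s come sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of singular values retained
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k approximation A_k = U_k Sigma_k V_k^T

# Frobenius-norm error of the approximation.
err = np.linalg.norm(A - A_k)
```

By the Eckart–Young theorem, `err` equals `np.sqrt(np.sum(s[k:] ** 2))`, the norm of the discarded singular values, so enlarging k monotonically improves the approximation.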

The main data set that we used was obtained from a previous study involving SVD.


**Dataset2 schema.** Construction scheme for a set of species that were used as a negative control for the partitioning techniques.

The quality of the generated clusters was measured by the number of Linnaean taxonomy levels that each species within a cluster shares with the other species; this was calculated as a function of an increasing rank value. Beyond a certain rank value, larger values do not improve cluster quality, because there is no increase in the number of taxonomic levels that the species have in common; in some cases a decrease is observed. From that rank value onward, cluster quality keeps the number of shared Linnaean taxonomy levels constant. This is evidence of an intrinsic relationship between these species that is mirrored in the distance matrix derived from these clusters, and this quality helps build relevant cladograms.

Results and discussion

Singular value decomposition and the number of clusters matter

In this study we give support to the hypothesis that the choice of an appropriate data representation and a fixed number of clusters, combined with a good algorithm for categorizing the data, is sufficient for the production of biologically significant clusters. An A matrix has rank n, where n indicates the number of distinct species. The rank value (k) defines the degree of resolution of the matrix A_k. We compared a trigram frequency matrix (20^3 possible amino acid trigrams, also called N-grams with N=3) with 60 singular values (the original matrix, since all possible singular values were used) and another matrix derived from the former with only nine singular values; these quantities of singular values and clusters gave good SVD results in final clustering. The first column shows the cluster identification. The columns that follow are in groups of four, showing the results of K-Means and ASAP, using a trigram frequency matrix created by SVD with 60 or nine singular values. The four columns under the label 'Number of species joined by clusters' show the number of species obtained in each cluster. The four columns under the label 'Linnaean Taxonomy levels in common by clusters' show the number of Linnaean levels in common for each cluster, and the four columns under the label 'common Linnaean taxonomy levels frequency (cLtlf) by cluster' show the results of the metric that we suggest; these come from multiplying the column 'Number of species joined by clusters' by the column 'Linnaean Taxonomy levels in common by clusters'. There were no significant differences (Student's t test) in the quality of clusters generated by the algorithms, based on a comparison of the mean number of Linnaean levels in common and cLtlf, even though they used different singular values, as shown in the Additional file.
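The cLtlf statistic can be reproduced with a one-line computation; the cluster sizes and shared-level counts below are taken from the K-means-with-rank-60 columns of the table in this section.

```python
# cLtlf for a cluster = (number of species in the cluster)
#                     x (Linnaean taxonomy levels its species share).
def cltlf(n_species, shared_levels):
    return [n * lv for n, lv in zip(n_species, shared_levels)]

# K-means with rank 60: per-cluster sizes and shared-level counts from the table.
sizes = [10, 14, 4, 7, 2, 6, 5, 11]
levels = [10, 10, 12, 8, 13, 10, 9, 9]

scores = cltlf(sizes, levels)  # per-cluster cLtlf values
total = sum(scores)            # the cluster sum, ΣcLtlf
```

Summing the per-cluster values gives the ΣcLtlf statistic used in the quality tables that follow.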

Using the distance matrix that correctly separated

**Number of species joined by clusters**

| **Cluster** | **K-means with rank 60** | **SNJ with rank 60** | **K-means with rank 09** | **SNJ with rank 09** |
| --- | --- | --- | --- | --- |
| **1** | 10 | 10 | 10 | 10 |
| **2** | 14 | 27 | 14 | 25 |
| **3** | 4 | 1 | 4 | 7 |
| **4** | 7 | 17 | 4 | 7 |
| **5** | 2 | 2 | 9 | 2 |
| **6** | 6 | 1 | 4 | 4 |
| **7** | 5 | 1 | 6 | 4 |
| **8** | 11 | 1 | 8 | 1 |

**Linnaean Taxonomy levels in common by clusters**

| **Cluster** | **K-means with rank 60** | **SNJ with rank 60** | **K-means with rank 09** | **SNJ with rank 09** |
| --- | --- | --- | --- | --- |
| **1** | 10 | 10 | 10 | 10 |
| **2** | 10 | 9 | 10 | 9 |
| **3** | 12 | 13 | 12 | 8 |
| **4** | 8 | 8 | 10 | 8 |
| **5** | 13 | 11 | 9 | 12 |
| **6** | 10 | 13 | 10 | 10 |
| **7** | 9 | 13 | 10 | 12 |
| **8** | 9 | 13 | 8 | 13 |

**common Linnaean taxonomy levels frequency (cLtlf) by cluster**

| **Cluster** | **K-means with rank 60** | **SNJ with rank 60** | **K-means with rank 09** | **SNJ with rank 09** |
| --- | --- | --- | --- | --- |
| **1** | 100 | 100 | 100 | 100 |
| **2** | 140 | 243 | 140 | 225 |
| **3** | 48 | 13 | 48 | 56 |
| **4** | 56 | 136 | 40 | 56 |
| **5** | 26 | 22 | 81 | 24 |
| **6** | 60 | 13 | 40 | 40 |
| **7** | 45 | 13 | 60 | 48 |
| **8** | 99 | 13 | 64 | 13 |

This table displays the results of K-Means and ASAP on a cluster of 60 species obtained in the first ASAP clustering round, when 76 species were separated into clusters.

**Qualitative cluster measures.** In this document, we elaborate on aspects of the qualitative cluster measures that are not discussed in this paper, such as the need for specific metrics for clusters based on Linnaean taxonomic classification, how sequence size influences kdcSearch, a proof that amino acid trigrams do not occur by chance, how to make a graphic cluster approximation by cladograms, how the evaluated algorithms were executed, and the kdcSearch algorithm pseudo-code.


Inferring quality from clustering methods

| **Algorithm/ software** | **Rank** | **N** | **Min cLtlf** | **Max cLtlf** | **Mean cLtlf** | **cLtlf clusters sum (∑cLtlf)** | **cLtlf standard deviation (σ)** | **Linnaean clusters quality (∑cLtlf/σ)** | **Linnaean clusters quality gain (K09/K60)%** | **cLtlf median** | **Median clusters quality gain (K09/K60)%** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AQBC-javaml | K09 | 8 | 32 | 180 | 71.25 | 570 | 52.27 | 10.90 | 49.58% | 42.50 | 26.87% |
| | K60 | 8 | 0 | 220 | 64.38 | 515 | 70.64 | 7.29 | | 33.50 | |
| EM-weka | K09 | 8 | 40 | 120 | 70.12 | 561 | 31.53 | 17.79 | 48.99% | 57.00 | 1.79% |
| | K60 | 8 | 16 | 160 | 70.25 | 562 | 47.06 | 11.94 | | 56.00 | |
| Kmeans-weka | K09 | 8 | 30 | 180 | 69.38 | 555 | 46.70 | 11.88 | 9.26% | 61.50 | -2.38% |
| | K60 | 8 | 16 | 180 | 69.88 | 559 | 51.39 | 10.88 | | 63.00 | |
| Kmeans-R | K09 | 8 | 40 | 140 | 71.62 | 573 | 34.48 | 16.62 | 9.21% | 62.00 | 6.90% |
| | K60 | 8 | 26 | 140 | 71.75 | 574 | 37.72 | 15.22 | | 58.00 | |
| K-Medoids-R | K09 | 8 | 24 | 160 | 70.12 | 561 | 44.37 | 12.64 | 15.92% | 60.00 | 13.21% |
| | K60 | 8 | 26 | 180 | 68.50 | 548 | 50.24 | 10.91 | | 53.00 | |
| MDBC-weka | K09 | 8 | 30 | 180 | 69.38 | 555 | 46.70 | 11.88 | 9.26% | 61.50 | -2.38% |
| | K60 | 8 | 16 | 180 | 69.88 | 559 | 51.39 | 10.88 | | 63.00 | |
| ASAP-in house | K09 | 8 | 13 | 225 | 70.25 | 562 | 67.68 | 8.30 | 27.51% | 52.00 | 197.14% |
| | K60 | 8 | 13 | 243 | 69.12 | 553 | 84.92 | 6.51 | | 17.50 | |

All evaluated partitioning algorithms showed improved performance, as measured by the Linnaean clusters quality, when they used the optimized distance matrix created with the best kdc parameters tested.

In the remainder of this paper, we show preliminary findings and methods that helped us reach our final conclusions, including how we arrived at an adequate number of singular values that allowed us to separate a set of species into groups with biological significance. To this end, we found that using arrays of trigram frequencies of amino acids to determine statistical properties was as good as using 4-gram frequencies.
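Counting trigram frequencies over an amino acid sequence can be sketched as follows; the sequences here are invented fragments for illustration, not entries from the dataset.

```python
from collections import Counter

def trigram_counts(seq):
    """Count overlapping amino acid trigrams (N-grams with N = 3)."""
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

# Hypothetical fragments; real input would be full mitochondrial protein sequences.
seqs = {"speciesA": "MKVLAAGMKV", "speciesB": "MKVLSAGLKV"}

profiles = {name: trigram_counts(seq) for name, seq in seqs.items()}
```

Stacking these count vectors over a common trigram vocabulary produces the frequency matrix A that is then decomposed with SVD.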

Algorithm kdcSearch: parameterizing rank and number of partitions

The objective of the algorithm kdcSearch (Figure


**kdcSearch algorithm schema.** Main procedures, datasets and products. Multiple rectangles mean recurring calls.

When one of the recursions of the algorithm kdcSearch finds one or more groups of variables k, d and c that correctly separate the positive control group, the recursion is finalized. At that point there is no reason to continue recursing, since the desired level of cohesion for the elements of the partitions has reached its limit, as measured by the positioning of the positive control. For the data analyzed here, this situation occurs at the end of the first recursion of kdcSearch, culminating in the plotting of the final graphs and the call to the function 'Finalize'. The code for the function 'Finalize' was left open because, at this stage of execution, the algorithm has found several groups of variables k, d and c (kdc) that promote correct separation of the positive control group into a partition separate from the other species. The question then is which kdc group constitutes a good result. What differentiates one kdc group from another is the quality of the partitioning of the remaining species compared with the Linnaean taxonomic classification. We think it would not be useful to hard-code a rule declaring one particular kdc group better than the others, because different groups give different levels of separation of the species. A researcher may be trying to separate a group of species at the level of 'Classis' with nine Linnaean levels in common (Table
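The search over (k, d, c) described above can be sketched as a grid search over the three parameters; `cluster_with_kdc` and `separates_positive_control` below are hypothetical stand-ins for the actual clustering step and positive-control test, which are not reproduced here.

```python
def kdc_search(ranks, distances, cluster_counts,
               cluster_with_kdc, separates_positive_control):
    """Collect every (k, d, c) triple whose partitioning places the
    positive control group in its own partition."""
    hits = []
    for k in ranks:                   # k: number of singular values (rank)
        for d in distances:           # d: ASAP clustering distance
            for c in cluster_counts:  # c: number of imposed clusters
                partition = cluster_with_kdc(k, d, c)
                if separates_positive_control(partition):
                    hits.append((k, d, c))
    return hits  # 'Finalize' then compares these candidate triples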

Linnaean taxonomy levels

**Linnaean Taxonomy levels**

| **Number** | **Name** | **Value** |
| --- | --- | --- |
| 14 | | |
| 13 | | |
| 12 | | |
| 11 | | |
| 10 | | |
| 9 | | |
| 8 | | |
| 7 | | |
| 6 | | |
| 5 | | |
| 4 | | |
| 3 | | |
| 2 | | |
| 1 | | |

Linnaean taxonomy levels used to classify the species in this paper. The numbers denote an increasing degree of nomenclature specialization.

Function Finalize: sample data

| **06clusters k03** | **06clusters k06** | 08clusters k06 | **08clusters k09** | 08clusters k45 | 12clusters k12 | 14clusters k18 | 14clusters k21 | 14clusters k60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **100** | **100** | 100 | **100** | 100 | 100 | 100 | 100 | 100 |
| **243** | **243** | 200 | **225** | 243 | 216 | 240 | 250 | 220 |
| **45** | **64** | 56 | **56** | 13 | 13 | 13 | 13 | 13 |
| **96** | **100** | 13 | **56** | 136 | 64 | 90 | 88 | 112 |
| **13** | **13** | 100 | **24** | 22 | 22 | 22 | 22 | 22 |
| **40** | **40** | 45 | **40** | 13 | 13 | 13 | 13 | 20 |
| | | 24 | **48** | 13 | 16 | 16 | 13 | 13 |
| | | 40 | **13** | 13 | 24 | 24 | 13 | 24 |
| | | | | | 40 | 40 | 30 | 13 |
| | | | | | 48 | 13 | 13 | 13 |
| | | | | | 13 | 13 | 13 | 13 |
| | | | | | 13 | 13 | 13 | 13 |
| | | | | | | 13 | 13 | 13 |
| | | | | | | 13 | 13 | 13 |

The statistic cLtlf for all of the partitionings of species obtained with nine kdc values that separate the positive control group in the function Finalize of the algorithm kdcSearch, along with three kdc values as a negative control (-).

Function Finalize: sample statistics

| **ASAP/ Clusters** | **Rank** | **N** | **Min cLtlf** | **Max cLtlf** | **Mean cLtlf** | **cLtlf clusters sum (ΣcLtlf)** | **cLtlf standard deviation (σ)** | **Linnaean clusters quality (ΣcLtlf/σ)** | **cLtlf median** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **06clusters** | **K03** | **6** | **13** | **243** | **89.50** | **537** | **82.46** | **6.51** | **70.50** |
| **06clusters** | **K06** | **6** | **13** | **243** | **93.33** | **560** | **80.81** | **6.93** | **82.00** |
| 08clusters | K06 | 8 | 13 | 200 | 72.25 | 578 | 60.65 | 9.53 | 50.50 |
| **08clusters** | **K09** | **8** | **13** | **225** | **70.25** | **562** | **67.68** | **8.30** | **52.00** |
| 08clusters | K45 | 8 | 13 | 243 | 69.12 | 553 | 84.92 | 6.51 | 17.50 |
| 12clusters | K12 | 12 | 13 | 216 | 48.50 | 582 | 59.10 | 9.85 | 23.00 |
| 14clusters | K18 | 14 | 13 | 240 | 44.50 | 623 | 63.29 | 9.84 | 14.50 |
| 14clusters | K21 | 14 | 13 | 250 | 43.36 | 607 | 66.12 | 9.18 | 13.00 |
| 14clusters | K60 | 14 | 13 | 220 | 43.00 | 602 | 60.68 | 9.92 | 13.00 |

Comparison of the Lcq values and the ΣcLtlf medians for partitionings of the species obtained in Table

From 76 to 60 species and eight clusters

We decided to use a 76 species data set (dataset2), incorporating 12 species that were less related to the original group, in order to develop relationship trees that included clusters with distantly related species. The 64 species data set (dataset1) from the study by Stuart contains closely related species, as all of them share 8 of the 13 Linnaean taxonomy levels used in our study to differentiate species.


**Exploring the number of species in the Aves cluster.** The number of species grouped into the Aves cluster as a function of rank value and number of clusters. Ordinates are multiplied by the respective maximum Linnaean taxonomy levels shared by species in Figure


**Exploring Aves cluster with maximum shared linnaean taxonomy levels.** The number of Linnaean levels shared by all species is plotted against rank value and number of imposed clusters. Ordinates are multiplied by the respective number of species that produced Figure


**Determining the best algorithm parameters.** Aves cluster quality as a function of rank value and different numbers of clusters. The number of clustered elements multiplied by maximum common Linnaean taxonomy levels shared between species gives the quality measure.

Analyses were then carried out on only the 60 species from the data set that were joined as a single cluster; the ASAP algorithm was run with 15 clusters and a rank value of 39. When the ASAP algorithm was run with the original 64 species data set, some elements were separated into isolated clusters despite actually sharing several Linnaean taxonomy levels in common with all of the other species.

This could be due to the fact that mitochondrial protein sequences for some species within the data set used in this study were not available. Since our algorithm only uses the frequency of occurrence of amino acid triplets, a lower frequency can affect the quality of the clusters that are generated, as does the presence or absence of a triplet sequence. The presence or absence of amino acid triplets is also responsible for the early cluster separation of the 12 additional, taxonomically distant species incorporated into the original 64 species data set. Consequently, we worked with this 60 species data subset. To do so, we included a recurrence step in our algorithm in order to derive a species subset. We worked with the concept that a good separation of species into clusters distributes the elements in groups of more than one element, whereas a group with only one element gives no information about species ancestry. When we correctly separated the


**Determining the best algorithm parameters at the first algorithm recurrence step.** Aves cluster quality measured with fewer species than in dataset2. It is now possible to cluster the Aves species separately, and the algorithm adjustment that best fits this cluster is preferred. Higher curves do not represent better quality.


**60 species from the Stuart data set.** An unrooted tree for the 60 species data set, generated from a distance matrix created with the ASAP algorithm developed in this paper. Blue labels denote clusters.

The results of the first execution of our recurrence algorithm based on the 60 species data set can be seen in Table

Eight clusters from the 60 species data set

| **Cluster** | **Number of species joined** | **Linnaean taxonomy levels in common** | **Deepest Linnaean taxonomy level** |
| --- | --- | --- | --- |
| 1 | 10 | 10 | |
| 2 | 25 | 9 | |
| 3 | 7 | 8 | |
| 4 | 7 | 8 | |
| 5 | 2 | 12 | |
| 6 | 4 | 10 | |
| 7 | 4 | 12 | |
| 8 | 1 | 13 | |

Eight clusters created from the first recurrence algorithm execution calibrated with a rank value of nine. Species were grouped according to their deepest evolutionary relatedness based on Linnaean taxonomy levels. Clusters 2, 5 and 8 belong to the mammalian class.

Conclusions

Clusters and cladistic trees drawn from distance matrices generated with SVD showed a good correlation with Linnaean taxonomy. When a difference was found, it does not necessarily mean strong divergence from taxonomic methods; it may instead give a more accurate picture of the relationship between the species that clustered together. This was demonstrated by clusters that were separated from mammalian clusters due to their greater protein sequence relatedness, a separation also supported by Linnaean taxonomy information.

The similarity between the clusters generated by our distance matrix and Linnaean taxonomy indicates that distance matrices generated with SVD can demonstrate evolutionary relationships between species and support construction of better quality clusters and phylogenetic trees. These clusters and phylogenetic trees would benefit from amino acid trigrams and from the property of Euclidean distance of being proportional to the number of edits needed to perform a global sequence alignment, within polynomial execution time.

Methods

Datasets

The set of species used in this work is not original

Positive control group and statistics

In order to show how rank values and the number of imposed clusters affect SVD, we ran the ASAP algorithm with different rank values and numbers of clusters. Figure

Figure

When we evaluate cluster quality measured by cLtlf (Figure

Table

Euclidean distance

We can produce a distance matrix that contains a measure of how closely each species is related to every other species. To construct this matrix, each species' set of rank values is treated as a vector in a k-dimensional space. One can choose the best measure for calculating the distance among vectors, depending on the particular characteristics of a data set. We decided to use Euclidean distance instead of the cosine distance used by Stuart
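Building the pairwise Euclidean distance matrix from the k-dimensional species vectors can be sketched as follows; the vectors are toy values, not real SVD projections.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two k-dimensional vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

# Each row is one species projected onto k = 3 singular dimensions (toy values).
V = [[1.0, 0.0, 2.0], [1.0, 1.0, 2.0], [4.0, 0.0, 2.0]]
D = distance_matrix(V)
```

The resulting matrix D is symmetric with a zero diagonal, which is exactly the input expected by the clustering step.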

ASAP algorithm: in house agglomerative clustering

We implemented a clustering algorithm called ASAP (As Simple As Possible) and showed that even a naive algorithm can benefit from data adequately treated by SVD. Thus, it is not our intention to advocate this particular clustering algorithm; rather, the message is that, regardless of the algorithm, it is worth using SVD combined with positive controls in information retrieval, as an initial filter against noise.

ASAP is an algorithm designed to facilitate the work of measuring the impact of using SVD in clustering algorithms. It somewhat resembles single-linkage clustering; the differences are that clustering does not start from the two elements with the lowest Euclidean distance but from a randomly selected element, and that no new entry is inserted into the matrix of Euclidean distances for each cluster created between algorithm iterations.

The idea is quite simple: randomly select a species from the distance matrix, cluster it together with the other species that lie within a fixed distance 'd', and remove the clustered species from the distance matrix. Then repeat, randomly selecting another species, and so on.

(1) Repeat as long as the number of columns in the distance matrix is greater than one:

1.1. Fix the first column as the pivotal element;

1.2. Create a cluster of all elements whose Euclidean distance to the pivotal element is smaller than a value 'd';

1.3. Remove the elements of the new cluster (rows and columns) from the distance matrix;

1.4. End repeat.
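A minimal Python rendering of these steps (the published implementation is in Scilab; this sketch assumes the distance matrix is given as a dict of dicts keyed by species name, and uses the first remaining species as the pivot instead of a random one):

```python
def asap(dist, d):
    """Group species within Euclidean distance d of a pivot, remove the
    group from the matrix, and repeat until no species remain."""
    remaining = sorted(dist)  # deterministic pivot order stands in for random picks
    clusters = []
    while remaining:
        pivot = remaining[0]                                   # step 1.1
        group = [s for s in remaining if dist[pivot][s] < d]   # step 1.2
        clusters.append(group)
        remaining = [s for s in remaining if s not in group]   # step 1.3
    return clusters

# Toy symmetric distance matrix for three species (hypothetical values).
dist = {
    "gallus": {"gallus": 0.0, "anas": 1.2, "mus": 5.0},
    "anas":   {"gallus": 1.2, "anas": 0.0, "mus": 4.8},
    "mus":    {"gallus": 5.0, "anas": 4.8, "mus": 0.0},
}
clusters = asap(dist, 2.0)
```

Note that the pivot is always included in its own group (its distance to itself is 0), so every species ends up in exactly one cluster.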

This algorithm was implemented using Scilab 5.2.1 running on GNU/Linux Ubuntu, kernel 2.6.22-16. This implementation is available in the Additional file

**Scilab algorithms and raw data.** In this file, we elaborate on aspects of the algorithms and data used in this research. Algorithms were written in Scilab 5.2.1 (build 5.2.0.1266391513).


Clustering algorithms evaluated

K-Means-R

The K-Means algorithm implemented

K-Means-WEKA

The K-Means algorithm implemented in the WEKA software is denominated SimpleKMeans. This implementation can use either the Euclidean distance or the Manhattan distance. If the Manhattan distance is used, then centroids are computed as the component-wise median rather than mean

Expectation Maximization (EM)

The EM algorithm

Adaptive Quality-based Clustering Algorithm (AQBC)

It is a heuristic, iterative two-step algorithm with approximately linear computational complexity. The first step consists of finding a sphere in the high-dimensional representation of the data where the density of expression profiles is locally maximal. In the second step, an optimal radius of the cluster is calculated based only on the significantly coexpressed items that are included in the cluster. By inferring the radius from the data itself, there is no need to find an optimal value for this radius manually by trial and error

K-Medoids

It is an exact algorithm based on a binary linear programming formulation of the optimization problem

MakeDensityBasedClusterer (MDBC)

It is an algorithm that wraps SimpleKMeans (and possibly other clusterers), making the wrapped clusterer return a distribution and density. It fits normal distributions and discrete distributions within each cluster produced by the wrapped clusterer. For SimpleKMeans, the desired number of clusters can be requested.

Cladograms

The clustering operations were performed by calculating the Euclidean distance from the first alphabetically ordered species, defined as the pivotal species, to all the other species. Therefore, when ASAP created the clusters, it already had a symmetric distance matrix containing a data set with all the species. All we needed to do was create a phylogenetic tree expressed in Newick format. We generated an unrooted tree with the software NEIGHBOR from the PHYLIP package. We drew the unrooted tree in Figure

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MAS encouraged the research and writing, BMC application, provided references and applied mathematical knowledge and gave final approval of the version to be published. ARS downloaded all the data and conducted all the tests, decided to use Linnaean taxonomy as a measure of cluster quality, developed the algorithm and wrote the paper.

JB made substantial contributions to conception and design, analysis and interpretation of data. VAA encouraged submission to BMC and gave final approval of the version to be published. JAM, GCO, AM and AS have given final approval of the version to be published.

Acknowledgements

This article has been published as part of