Abstract
Background
The key idea of the DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. DNA barcodes represent a valuable type of information that can be integrated with ecological, genetic, and morphological data in order to obtain a more consistent taxonomy. Recent studies have shown that, for the animal kingdom, the mitochondrial gene cytochrome c oxidase I (COI), about 650 bp long, can be used as a barcode sequence for identification and taxonomic purposes. In the present work we aim to introduce an alignment-free approach to the taxonomic analysis of barcode sequences. Our approach is based on two compression-based versions of the non-computable Universal Similarity Metric (USM) class of distances. Our purpose is to justify the use of USM also for the analysis of short DNA barcode sequences, showing that USM is able to correctly extract taxonomic information from this kind of sequence.
Results
We downloaded from the Barcode of Life Data System (BOLD) database 30 datasets of barcode sequences belonging to different animal species. We built phylogenetic trees of every dataset, according to compression-based and classic evolutionary methods, and compared them in terms of topology preservation. In the experimental tests, we obtained similarity scores between evolutionary and compression-based trees ranging from 80% to 100% for most of the datasets (94%). Moreover, we carried out experimental tests using simulated barcode datasets composed of 100, 150, 200 and 500 sequences, with each simulation replicated 25-fold. In this case, mean similarity scores between evolutionary and compression-based trees range from 83% to 99% for all simulated datasets.
Conclusions
In the present work we aim to introduce an alignment-free approach to the taxonomic analysis of barcode sequences. Our approach is based on two compression-based versions of the non-computable Universal Similarity Metric (USM) class of distances. In this way we demonstrate the reliability of compression-based methods even for the analysis of short barcode sequences. Compression-based methods, with their strong theoretical assumptions, may then represent a valid alignment-free and parameter-free approach for barcode studies.
Background
The use of DNA sequences to integrate ecological, morphological and genetic information and so improve taxonomic studies of biological species [1] has been pursued since 2003, when it was proposed by Hebert et al. [2]. The authors introduced and discussed the need for DNA sequences acting as taxon "barcodes". The main purpose was to identify, for each kingdom of life (animals, plants, fungi, and so on), a short DNA fragment that could capture the biodiversity among different species. In this way taxonomists can focus above all on discovering new species and on describing and fixing existing taxa, leaving identification issues to barcode-based tools [3].
A 648-bp region of the cytochrome c oxidase I (COI) gene has been identified as a DNA barcode sequence for the animal kingdom [4]. The DNA barcode approach has proven to be useful for the study of the biodiversity of very different species, including fishes [5,6], birds [7], and bugs [8-10].
The analysis of DNA barcode sequences is usually done by means of clustering methods, such as the Neighbor Joining (NJ) method [11], which produce phylogenetic trees (dendrograms) of the input sequences. Taxonomic studies with DNA barcoding data rely on traditional approaches, which consist of evaluating genetic distances among species in order to perform distance-based clustering analysis [12]. Moreover, the computation of genetic distances needs a preprocessing step, namely sequence alignment, in order to compare corresponding loci. Genetic distances, also called evolutionary distances, are stochastic estimates and they do not define a distance metric [13].
In this work we propose a novel alignment-free approach for the analysis of DNA barcode data, based on information theory concepts. Our aim is to employ the Universal Similarity Metric (USM) [14] to compute genetic distances among biological species described by DNA barcode sequences. USM represents a class of distance measures based on Kolmogorov complexity [15] that defines, under some assumptions, a distance metric.
USM is said to be universal because it can be applied to the analysis of data belonging to very different domains: it has in fact been used in the fields of text and language analysis and of image and sound processing [16]. As said earlier, USM is based on Kolmogorov complexity which is, unfortunately, not computable. For this reason, USM needs to be approximated. One of USM's approximations, called Normalized Compression Distance (NCD), was adopted for the first time for the analysis of biological sequences in [16], where a coherent phylogenetic tree of 24 species belonging to Eutherian orders was built from complete mammalian mtDNA sequences. Another compression-based approximation, the Information-Based Distance (IBD) [17], was applied to the study of whole mitochondrial genome phylogeny. USM and its compression-based approximations have also been used for the analysis of different biological datasets in [18], including protein and genomic (complete mitochondrial genome) sequences. The authors compared phylogenetic trees obtained through USM with gold standard trees using the F-measure [19] and the Robinson metric [20], obtaining encouraging results about the use of USM in bioinformatics. NCD has also been adopted for the clustering of bacteria, considering 16S rRNA gene sequences and topographic representations obtained by means of the Self-Organizing Map algorithm [21,22].
Our proposed approach, then, aims to demonstrate that it is possible to apply information theory techniques to the study of short biological sequences for taxonomic and phylogenetic purposes. Genetic distances, obtained through USM's approximations, are used to compute phylogenetic trees of 30 barcode sequence datasets, and those trees are then compared with the ones obtained using traditional bioinformatics approaches that depend on sequence alignment and the computation of evolutionary distances. The presented results, showing a tree similarity between 80% and 100%, demonstrate that our approach can be adopted for the aforementioned analysis. In order to further validate our results, we also ran experimental tests with simulated barcode datasets, composed of 100, 150, 200 and 500 sequences. For each dataset size, we considered 25 different barcode datasets, for a total of 100 experiments. The presented results, showing a tree similarity between 83% and 99% for all simulations, strengthen our findings with real barcode datasets.
In this work, we use USM's compression-based approximations for an in-depth study and analysis of short DNA barcode sequences. Preliminary results on this topic were presented in [23].
Methods
The study of the application of USM's compression-based approximations to barcode sequence data has been carried out considering both the Normalized Compression Distance (NCD) and the Information-Based Distance (IBD). Those two distances have been used to compute dissimilarities among species belonging to different kingdoms of life. DNA barcode datasets have been downloaded from the Barcode of Life Data System (BOLD) [24], which represents the best source and repository for barcode sequences. In our work we considered 30 datasets of different size and species composition. Using the NCD and IBD dissimilarity matrices, we built phylogenetic trees of each of the thirty datasets through two state-of-the-art phylogenetic algorithms, Neighbor Joining and Unweighted Pair Group Method with Arithmetic Mean. Those trees were compared with the ones obtained from five different kinds of evolutionary distances (see next Sections). Figure 1 shows the flowchart of the experimental setup.
Figure 1. General flowchart of the proposed comparison approach for real barcode datasets. Global flowchart of the proposed approach showing all the phases of our experimental setup with real barcode datasets.
In the following subsections a brief explanation of all the employed techniques and algorithms will be provided.
USM and compression-based distances
The Universal Similarity Metric is a class of distance measures defined in terms of Kolmogorov complexity. The Kolmogorov complexity K(x) of a string x is the length of the shortest binary program x* that computes x on a universal Turing machine [14,15]. K(x|y) represents the conditional Kolmogorov complexity of two strings, x and y, and it is defined as the length of the shortest binary program that produces x as output, given the input y [14,15]. In other terms, K(x|y) is the minimal amount of information needed to generate x when y is given as input.
The Information Distance (ID) [25] between two objects is then defined as:
$$\mathrm{ID}(x, y) = \max\{K(x \mid y),\; K(y \mid x)\} \tag{1}$$
It has been shown [25] that ID is a metric, that is, it satisfies the following conditions:
1. ID(x, y) ≥ 0 (separation axiom);
2. ID(x, y) = 0 if and only if x = y (identity axiom);
3. ID(x, y) = ID(y, x) (symmetry);
4. ID(x, z) ≤ ID(x, y) + ID(y, z) (triangle inequality).
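USM has been presented in [14] and defined as:
$$\mathrm{USM}(x, y) = \frac{\max\{K(x \mid y^{*}),\; K(y \mid x^{*})\}}{\max\{K(x),\; K(y)\}} \tag{2}$$
It has been demonstrated [14] that USM is a metric, is normalized (it ranges between 0 and 1) and is universal.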
In order to adopt USM as a distance measure, it needs to be approximated, since Kolmogorov complexity is not computable. In our work we considered two USM approximations based on data compression: the Normalized Compression Distance (NCD) and the Information-Based Distance (IBD) defined in [17]. We chose NCD and IBD because they have been successfully used for the analysis of biological data [16-18,21,22].
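NCD and IBD are respectively defined as:
$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\; C(y)\}}{\max\{C(x),\; C(y)\}} \tag{3}$$
$$\mathrm{IBD}(x, y) = 1 - \frac{C(x) - C(x \mid y)}{C(xy)} \tag{4}$$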
In Eq. (3) and (4), C(x) is the size, in bytes, of the compressed version of string x; C(xy) is the size of the compressed version of the concatenation of strings x and y; C(x|y) is the size of the conditional compression of string x given string y. The basic idea of a string compression algorithm is to find portions of the input string that are repeated and to substitute them with a shorter reference. The set of repeated string portions is called the "dictionary". Compressing a string x given a string y means that the compression algorithm builds the dictionary using the string y and encodes the references in string x using that dictionary. This gives a measure of the similarity between the two strings. Both NCD and IBD give better USM approximations if the strings are compressed with optimized compression ratios.
In our experiments, the GenCompress [26] compressor was used to compute both NCD and IBD. GenCompress is a Lempel-Ziv dictionary-based compressor [27] optimized to work with DNA sequences. If GenCompress is given generic text strings as input, it works as a generic ASCII text compressor, without any DNA-specific optimization.
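As an illustration of how the two compression-based distances can be computed, the following is a minimal Python sketch. It rests on two assumptions that are not part of our experimental setup: the general-purpose `zlib` compressor is used as a stand-in for GenCompress, and the conditional compression C(x|y) is approximated as C(yx) − C(y). The function names and the example sequences are hypothetical.

```python
import random
import zlib


def csize(data: str) -> int:
    """Compressed size in bytes. zlib is only a stand-in for GenCompress (assumption)."""
    return len(zlib.compress(data.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ibd(x: str, y: str) -> float:
    """Information-Based Distance: 1 - (C(x) - C(x|y)) / C(xy).

    C(x|y) is approximated as C(yx) - C(y), which is an assumption of this sketch.
    """
    c_x_given_y = csize(y + x) - csize(y)
    return 1.0 - (csize(x) - c_x_given_y) / csize(x + y)


# Illustrative data: two unrelated random 650-bp sequences and a mutated copy.
rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(650))
seq_b = "".join(rng.choice("ACGT") for _ in range(650))
mutated = list(seq_a)
for i in range(0, 650, 65):  # substitute one base every 65 positions
    mutated[i] = "ACGT"[("ACGT".index(mutated[i]) + 1) % 4]
seq_c = "".join(mutated)

print("NCD(a, c) =", ncd(seq_a, seq_c), " NCD(a, b) =", ncd(seq_a, seq_b))
print("IBD(a, c) =", ibd(seq_a, seq_c), " IBD(a, b) =", ibd(seq_a, seq_b))
```

With a general-purpose compressor the absolute values are only indicative; a DNA-specific compressor is expected to give a closer approximation of USM on short sequences.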
Evolutionary distances and phylogenetic trees
Evolutionary distances are distance measures used to compute the dissimilarity among genetic sequences [13]. Evolutionary distances are estimates obtained through stochastic methods that take into account many biological phenomena such as convergent substitutions, multiple substitutions per site or retromutations. There exist several kinds of evolutionary distance, according to the prior assumptions of the adopted stochastic model and their related complexity. The more complex the model, the more accurate and computationally expensive the resulting evolutionary distance. In our work, we used five different evolutionary distances, sorted by complexity level, in order to compute phylogenetic trees: Kimura 2-parameter [28], Tajima-Nei [29], Tamura 3-parameter [30], Tamura-Nei [31] and Maximum Composite Likelihood (MCL) [32]. The Kimura 2-parameter distance model corrects for multiple hits, taking into account transitional and transversional substitution rates, while assuming that the four nucleotide frequencies are the same and that rates of substitution do not vary among sites. The Tajima-Nei distance model derives from the simpler Jukes-Cantor distance [33] and gives a better estimate of the number of nucleotide substitutions. The Tajima-Nei model assumes equal substitution rates among sites and between transitional and transversional substitutions. The Tamura 3-parameter model corrects for multiple hits, taking into account the differences in transitional and transversional rates and the G+C-content bias. The Tamura-Nei distance with the gamma model corrects for multiple hits, taking into account the different rates of substitution between nucleotides and the inequality of nucleotide frequencies. As for the MCL model, a composite likelihood is defined as a sum of log-likelihoods for related estimates. In [32] it is shown that pairwise evolutionary distances and the related parameters are accurately estimated by maximizing the composite likelihood. It is also stated that a complex model has virtually no disadvantage in the composite likelihood method for phylogenetic analyses. In our case, the maximum composite likelihood method is used for describing the sum of log-likelihoods for all pairwise distances estimated by using the Tamura-Nei model. Evolutionary distances were computed using the MEGA 5 software [34].
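To make the simplest of these models concrete, the sketch below computes the Kimura 2-parameter distance between two already aligned sequences, using the standard closed form d = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitional and transversional differences. It is an illustration only and not the MEGA 5 implementation; the function name and the choice of skipping sites with gaps or undefined bases are assumptions of this sketch.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and undefined bases (assumption of this sketch)
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too divergent (saturation).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Unlike the compression-based distances, this computation presupposes that the two sequences have been aligned so that corresponding sites are compared.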
Phylogenetic relationships among biological species are usually inferred by means of phylogenetic trees [35]. In our work we considered the two most popular distance-based algorithms to build phylogenetic trees: Neighbor Joining (NJ) [11] and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [36]. NJ and UPGMA are called "distance-based" because they take as input a dissimilarity (distance) matrix among elements. Our goal is not to compare the two tree construction methods, but to build and compare two trees, one from an evolutionary distance and the other from a compression distance, first using NJ and then using UPGMA.
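For readers unfamiliar with distance-based tree construction, the following minimal Python sketch shows the UPGMA agglomeration step on a small distance matrix. It is a simplified illustration, not the software used in our experiments: it returns only the tree topology as nested tuples, without branch lengths, and the input format (a dictionary keyed by `frozenset` pairs) is a hypothetical choice made for brevity.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the UPGMA topology as nested tuples of leaf names.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    clusters = {name: 1 for name in names}  # active cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        na, nb = clusters.pop(a), clusters.pop(b)
        # The size-weighted average equals the arithmetic mean over all leaf pairs.
        for c in clusters:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))


# Usage with a hypothetical four-taxon distance matrix.
taxa = ["A", "B", "C", "D"]
example = {frozenset(pair): 10.0 for pair in combinations(taxa, 2)}
example[frozenset(("A", "B"))] = 2.0
example[frozenset(("C", "D"))] = 4.0
print(upgma(taxa, example))
```

The same function can be fed either an evolutionary or a compression-based distance matrix, which is what makes the two kinds of trees directly comparable.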
Phylogenetic tree comparison algorithms
Different phylogenetic trees can be obtained for the same input dataset, depending on the adopted distance measure and/or the algorithm used. For this reason there are methods to compute the similarity between trees, so that the information content they share can be understood. One of the most popular similarity measures between phylogenetic trees is the symmetric distance introduced by Robinson and Foulds [20]. Robinson's metric takes as tree distance the number of "shifts", i.e. edit operations, required to obtain the second tree from the first one (and vice versa). This approach makes the symmetric distance a "local" similarity algorithm, because it penalizes all the mispairings in the same way, without considering the global clustering results and the tree topology representing the actual phylogenetic relationships.
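The symmetric distance can be illustrated with a short Python sketch. It uses the common equivalent formulation in which the distance is the number of leaf bipartitions (splits) induced by the internal edges of one tree but not of the other. Trees are represented as nested tuples of leaf names, which is a hypothetical format chosen for this example, and the trees are compared as unrooted.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the leaves induced by the internal edges."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Edges leading to a single leaf carry no topological information.
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found


def robinson_foulds(tree1, tree2):
    """Symmetric distance: number of splits present in exactly one of the trees."""
    return len(splits(tree1) ^ splits(tree2))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2))
```

Every mismatching split contributes equally to the count, regardless of how similar the two partitions are, which is the "local" behavior discussed above.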
For this reason, in our work we adopted a more recent tree comparison algorithm: the PhyloCore algorithm developed by Nye et al. [37], which takes a different approach from Robinson's. PhyloCore builds an alignment between trees by matching corresponding branches that share the same leaf elements. Each edge (branch) in a phylogenetic tree divides the tree into two subtrees, thus creating a partition of the leaf nodes into two subsets. Each pair of edges from the two trees is given a score by comparing the two corresponding partitions of leaf elements. Tree partitions with the same leaf nodes represent corresponding clusters and therefore a similarity in terms of topology and phylogenetic preservation. PhyloCore gives the percentage of topology similarity between trees.
Results and discussion
In order to extensively test the proposed compression-based approach we used both real and synthetic datasets and compared the results with the ones obtained using the evolutionary distances. In the following subsections we describe the proposed methodologies and discuss the comparison between the two approaches.
Barcode datasets
We performed our experiments on real barcode datasets, all taken from the Barcode of Life Data System (BOLD). Since our purpose was to test the reliability of compression-based distance models, we considered a subset of the whole database. We selected 30 datasets that differ from each other in the type of species (birds, fish, and so on), the number of species, the number of barcode sequences per species (specimens), the sequence length and the sequence quality, expressed as the percentage of sequences with undefined nucleotides, marked with the "N" character. We did not consider the whole BOLD database because we had no interest in obtaining a phylogenetic tree for all available datasets. It is very important to consider the percentage of sequences containing undefined bases because, as highlighted in Section "Methods", GenCompress works as an optimized compressor for DNA sequences only when dealing with strings made of the four letters A, C, G, T. In all other situations, GenCompress works as a generic ASCII text compressor. That means GenCompress will give poor compression ratios for those sequences, and as a consequence the NCD and IBD distances (see Eq. (3) and (4)) will not properly approximate USM. Since the typical length of the COI barcode region is about 650 bp [4], longer sequences carry information related to other genes, whereas shorter sequences carry incomplete information. In our study, we therefore considered as "good" those datasets having a low percentage of sequences with undefined bases and sequences of about the same length (the 650 bp length of the typical COI barcode sequence).
The complete list of the barcode datasets of our experiments is summarized in Table 1 and Table 2.
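The two quality criteria described above (percentage of sequences with undefined bases and spread of sequence lengths) can be checked with a few lines of Python. This is a minimal sketch: the function name, the returned field names and the length tolerance of 10 bp are hypothetical choices, not values used in our study.

```python
def dataset_quality(sequences, expected_length=650, tolerance=10):
    """Summarize the quality of a barcode dataset given as a list of DNA strings.

    expected_length and tolerance are illustrative parameters (assumptions).
    """
    total = len(sequences)
    lengths = [len(s) for s in sequences]
    # A sequence counts as undefined if it contains at least one "N".
    with_undefined = sum(1 for s in sequences if "N" in s.upper())
    off_length = sum(1 for n in lengths if abs(n - expected_length) > tolerance)
    return {
        "percent_with_N": 100.0 * with_undefined / total,
        "min_length": min(lengths),
        "max_length": max(lengths),
        "percent_off_length": 100.0 * off_length / total,
    }


# Usage with a toy dataset of four sequences.
toy = ["ACGT" * 10, "ACNT" * 10, "A" * 658, "C" * 899]
print(dataset_quality(toy))
```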
Data simulation
In order to test our approach on synthetic data as well, we simulated some barcode datasets using a generation strategy similar to the one reported in [38,39]. First of all, we simulated a random ultrametric species tree with the Mesquite software (version 2.75, build 564) [40] using the Yule model [41]. We generated four different simulated species trees with, respectively, 10, 15, 20 and 50 species, and a total tree depth of 1 million generations. Gene trees were then simulated on the species trees, using the Coalescent package of Mesquite, considering 10 individuals (specimens) per species, thus obtaining gene trees with, respectively, 100, 150, 200 and 500 individuals. Gene trees were simulated using an effective population size of 10000 elements. We finally added noise to the gene trees in order to produce non-ultrametric trees. We considered normally distributed noise with a variance of 0.7 times the original branch length, as done in [38].
Barcode sequence datasets were simulated from the gene trees using the Seq-Gen software (version 1.3.3) [42]. We adopted the HKY model of evolution [43], with a transition/transversion ratio of 3, nucleotide frequencies of 0.3 (A), 0.2 (C), 0.2 (G), 0.3 (T), and a sequence length of 650 bp, representing the typical COI gene length. For each gene tree, we obtained 25 barcode datasets, resulting in a total of 100 simulated datasets.
Experimental results
The purpose of the proposed experimental tests is to demonstrate that compression-based distances represent a valid alignment-free approach for the analysis of phylogenetic relationships among short barcode sequences. Tables 3, 4, 5, 6 and 7 summarize the similarity scores, obtained using PhyloCore, between evolutionary-based trees and compression-based trees of real barcode datasets. In more detail, for every combination of compression-based distance (NCD or IBD) and phylogenetic tree inference algorithm (NJ or UPGMA), each table gives the similarity scores with respect to a reference evolutionary distance model (Kimura 2-parameter, Tamura-Nei and so on).
Table 3. Tree similarity score among compression-based trees and evolutionary trees obtained with Kimura 2-parameter distance.
Table 4. Tree similarity score among compression-based trees and evolutionary trees obtained with Tajima-Nei distance.
Table 5. Tree similarity score among compression-based trees and evolutionary trees obtained with Tamura 3-parameter distance.
Table 6. Tree similarity score among compression-based trees and evolutionary trees obtained with Tamura-Nei distance.
Table 7. Tree similarity score among compression-based trees and evolutionary trees obtained with MCL distance.
Since, in our experiments, we use two kinds of compression-based distances, NCD and IBD, and two different phylogenetic tree inference algorithms, NJ and UPGMA, we are interested in the specific behavior of each distance measure and algorithm. In Figure 2(a) we show the trend curves, for the NCD and IBD methods, of the mean PhyloCore similarity scores over all evolutionary distance models, for the input datasets. The two curves have a similar trend, that is, NCD and IBD give very close similarity scores, except for the AGWEB, CLNVA, DSFCH and RDMYS datasets. That chart does not give enough information about which compression-based distance produces the most regular results in terms of topology similarity. Our next step was then to check, separately, the similarity scores obtained using the NJ and UPGMA algorithms. In Figure 2(b) and 2(c) we show the trend curves of, respectively, the mean PhyloCore similarity scores over all evolutionary distance models using only the NJ algorithm, and the mean PhyloCore similarity scores over all evolutionary distance models using only the UPGMA algorithm. From those charts we can state that the NCD and IBD distance models give almost identical similarity scores in tree comparison when the UPGMA algorithm is used for tree inference. Using the NJ algorithm, on the other hand, we obtain a very unstable trend, with similarity scores generally below the corresponding scores obtained through the UPGMA algorithm. Moreover, in Figure 3 we show a histogram of the highest similarity values, over all the evolutionary distance models and input datasets, obtained using the NJ and UPGMA algorithms. From that chart, we can see that in 90% (27/30) of cases, the best similarity scores from the comparison between evolutionary-based trees and compression-based trees are obtained using UPGMA. That means that, in our experiments, UPGMA is the best tree inference algorithm when adopting a compression-based distance model.
Looking again at Figure 2(c), the lowest scores, below 80% similarity, are obtained for the AGWEB, JTB, and RDMYS datasets. According to Table 2, AGWEB and RDMYS are the datasets with the highest percentage of sequences with undefined bases, respectively 87% and 32%. These low similarity results are therefore explained by the low quality of the input datasets, which gave poor compression ratios with GenCompress, which in turn produced a poor estimate of NCD and IBD and consequently a wrong phylogenetic tree. As for JTB, its low similarity score is explained by the different lengths of its sequences, ranging from 658 to 899 bp. As said earlier in Section "Barcode Datasets", longer sequences contain additional information not related to the COI barcode gene, and furthermore the spread of sequence lengths influences the NCD and IBD computation (Eq. (3) and (4)).
Figure 2. Mean PhyloCore similarity scores of 30 input datasets. Mean PhyloCore similarity scores resulting from the comparison of NCD- and IBD-based trees with the trees obtained from all five evolutionary distance models. We considered separately the results obtained using both the NJ and UPGMA algorithms (a), only the NJ algorithm (b), and only the UPGMA algorithm (c). The trend curves show that the NCD and IBD distance models give almost identical similarity scores in tree comparison when the UPGMA algorithm is used for tree inference.
Figure 3. Histogram of the best similarity scores, for all the evolutionary distance models and input datasets, using the NJ and UPGMA algorithms. In 90% (27/30) of cases, the best similarity scores from the comparison between evolutionary-based trees and compression-based trees are obtained using UPGMA.
In order to determine which compression-based and evolutionary-based trees are the most similar, with regard to the evolutionary distance model adopted, we drew the histogram of Figure 4. The histogram is obtained from the highest similarity values in Tables 3, 4, 5, 6 and 7, that is, considering both the NJ and UPGMA algorithms and both the NCD and IBD distance models. The chart in Figure 4 shows that the highest similarity scores are reached in the comparison between compression-based trees and evolutionary-based trees obtained through the MCL distance model. Moreover, in Figure 5 we show the boxplot of the similarity scores obtained comparing MCL-based trees and compression-based (NCD and IBD) trees using both the NJ and UPGMA algorithms. This chart confirms that the best similarity scores, in terms of minimum, maximum and mean values, are reached in the comparison between MCL-based trees and compression-based trees using the UPGMA algorithm. Finally, in the pie chart of Figure 6, we summarize the mean similarity scores for the 30 datasets resulting from the comparison between both compression-based trees and MCL-based trees using the UPGMA algorithm. The pie chart shows that in 7% of cases (2/30) we obtain a similarity score below 80% (corresponding to the AGWEB and JTB datasets); in 57% of cases (17/30) we have similarity scores ranging from 80% to 90%; in 33% of the considered datasets (10/30) we obtain a similarity score over 90%; and in 3% of cases (1/30) we reach 100% tree similarity. It is interesting to note that the perfect similarity score (100%) is obtained for the BPRP dataset which, as reported in Table 2, represents an ideal barcode dataset, with a 658-bp sequence length and 0% of sequences with undefined bases. As explained in Section "Evolutionary Distances and Phylogenetic Trees", the MCL method gives better estimates of evolutionary distance than the other four distance models, and consequently more accurate phylogenetic trees.
From our experimental study we found that the NCD and IBD compression-based distances, using the UPGMA algorithm, build phylogenetic trees that have the best similarity scores with MCL-based trees, which, in turn, give the most accurate phylogenetic relationships.
Figure 4. Histogram of the best PhyloCore similarity scores for all input datasets. For each dataset, the chart shows the best similarity score resulting from the pairwise comparison of compression-based trees and the five trees derived from the five evolutionary distance models. The chart shows that the highest similarity scores are reached in the comparison between compression-based trees and evolutionary-based trees obtained through the MCL distance model.
Figure 5. Boxplot of similarity scores obtained comparing MCL-based trees and compression-based trees using both the NJ and UPGMA algorithms. The best similarity scores, in terms of minimum, maximum and mean values, are reached in the comparison between MCL-based trees and compression-based trees, using both NCD and IBD distances, with the UPGMA algorithm.
Figure 6. Pie chart summarizing the mean similarity scores among compression-based trees and MCL-based trees obtained using the UPGMA algorithm. The chart shows that in 7% of cases (2/30) we obtain a similarity score below 80% (corresponding to the AGWEB and JTB datasets); in 57% of cases (17/30) we have similarity scores ranging from 80% to 90%; in 33% of the considered datasets (10/30) we obtain a similarity score over 90%; and in 3% of cases (1/30) we reach 100% tree similarity.
In order to strengthen our experimental results, we carried out further tests using simulated data, as described in Section "Data Simulation". Results obtained with simulated datasets are summarized in Tables 8 and 9. Since we obtained analogous results using both the NCD and IBD distance measures, we report only the similarity scores obtained using NCD for the sake of simplicity. For each number of input sequences (100, 150, 200, 500), we replicated the simulation 25-fold, for a total of 100 new experiments. Considering all five evolutionary models and the NJ algorithm, we evaluated the comparison between compression-based and evolutionary trees, obtaining a very high mean similarity score (83%, with a variance between 10^{-3} and 10^{-4}). Using the UPGMA algorithm the similarity score was even higher, with a mean of 99% and a variance between 10^{-3} and 10^{-6}.
Table 8. Tree similarity score (mean and variance) among compression-based trees and evolutionary trees, obtained with NJ, of simulated datasets.
Table 9. Tree similarity score (mean and variance) among compression-based trees and evolutionary trees, obtained with UPGMA, of simulated datasets.
We can state, then, that our proposed approach is very reliable using simulated data and robust enough to be applied with real barcode datasets.
Speed evaluation
In order to compare the processing time of the proposed algorithm with the speed of evolutionary distance methods, we performed additional experiments. The compression-based distance can be calculated separately for each sequence versus all the others, so that, in principle, all the distances can be calculated by running all the programs at the same time (one program per sequence, each running on one processor core). This makes the compression-based method intrinsically parallel. To compare the performance of the proposed method with the one using the alignment distance, we have to take into account a parallel version of the alignment algorithm. We used the algorithm described in [44], which exploits multi-core processors and becomes faster each time a processor core is added. In this algorithm the speed gain decreases in a non-linear way each time the number of cores is doubled. On the other hand, as said above, in the compression-based distance method the speed gain is constant: each time the number of cores is doubled, the speed doubles. For this reason, if we compare the running times of the two methods as a function of the number of cores, we find a trade-off point. Experiments for the evaluation of running times were carried out on a multi-core system with up to 16 cores. We measured the execution times of both compression and alignment for a barcode dataset of 500 sequences versus the number of cores. Running times are summarized in Figure 7, which shows real (solid line) and estimated (dashed line) times on a log_{2} scale. The compression-based approach overtakes the alignment approach on a multi-core system beyond 32 cores.
Figure 7. Execution times, on a log_{2} scale, of compression and alignment for a dataset of 500 sequences versus the number of processing cores. The chart, on a log_{2} scale, shows real (solid line) and estimated (dashed line) execution times of both compression and alignment for a barcode dataset of 500 sequences. The compression-based approach overtakes the alignment approach on a multi-core system beyond 32 cores.
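The row-wise independence that makes the compression-based method parallel can be sketched in Python as follows. This is an illustration under several assumptions and not the code used for the timings in Figure 7: `zlib` stands in for GenCompress, only NCD is computed, and a thread pool from the standard library is used for simplicity (a process pool or separate programs, one per core, could be substituted). All function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is a stand-in for GenCompress (assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)


def distance_row(i, seqs):
    """Distances from sequence i to all the others: one independent task."""
    return [0.0 if i == j else ncd(seqs[i], s) for j, s in enumerate(seqs)]


def distance_matrix(seqs, workers=4):
    """Compute every row of the distance matrix as a separate parallel task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order, so row i is at index i.
        return list(pool.map(lambda i: distance_row(i, seqs), range(len(seqs))))


# Usage with three toy sequences.
toy_seqs = [b"ACGT" * 50, b"TTGACA" * 40, b"GATTACA" * 30]
matrix = distance_matrix(toy_seqs, workers=2)
print(matrix)
```

Because no row depends on any other, the work divides evenly among the available cores, which is the reason for the constant speed gain discussed above.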
Conclusions
In this paper we presented a novel alignment-free approach for the study of barcode genetic sequences. We used two compression-based approximations of USM, namely NCD and IBD, for reconstructing phylogenetic trees of short barcode sequences. In previous works, in fact, compression-based distances were used only for the analysis of whole mitochondrial genomes. We tested our approach on 30 barcode datasets, of different size and belonging to different species, and on 100 simulated datasets composed of different numbers of sequences (100, 150, 200, 500). Compression-based trees, obtained from the NCD and IBD distances, were compared with evolutionary-based trees derived using five evolutionary distance models: Kimura 2-parameter, Tajima-Nei, Tamura 3-parameter, Tamura-Nei and MCL. Trees were obtained using the NJ and UPGMA algorithms. Our experimental tests demonstrated that using the NCD and IBD compression-based distances we were able to obtain phylogenetic trees quite similar to evolutionary-based trees, with similarity scores ranging from 80% to 100%. In more detail, the highest similarity scores were reached comparing compression-based trees with MCL-based trees using the UPGMA algorithm, with no substantial differences between NCD and IBD. MCL provides better estimates of evolutionary distance, and as a consequence more accurate phylogenetic trees, than the other methods considered. As for simulated data, our experimental trials show very stable results with regard to the number of input sequences and the evolutionary model considered, with mean similarity scores ranging from 83%, using the NJ algorithm, to 99%, using the UPGMA algorithm. The NCD and IBD compression distance models represent a sound alignment-free and parameter-independent approach, based on strong theoretical assumptions. Using these models it is possible to obtain very reliable phylogenetic trees, and they are a valid tool for the analysis of barcode sequences.
Competing interests
The authors declare that there are no competing interests.
Authors' contributions
MLR: project conception, implementation, experimental tests, writing, assessment, discussions. AF: project conception, writing, assessment, discussions. RR: project conception, discussions, writing. AU: project conception, discussions, writing, funding. All authors read and approved the final manuscript.
Declarations
The publication costs for this article were funded by the CNR Interomics Flagship Project "Development of an integrated platform for the application of "omic" sciences to biomarker definition and theranostic, predictive and diagnostic profiles".
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 7, 2013: Italian Society of Bioinformatics (BITS): Annual Meeting 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S7