Laboratoire de Recherche en Sciences Végétales, UMR CNRS-Université Paul Sabatier 5546, Chemin de Borde Rouge - Auzeville 31326, Castanet Tolosan, France

Department of Plant Biology, Carnegie Institution for Science, Stanford, California 94305, USA

USDA ARS, Plant Pathology Department, UC Davis, Davis, CA, 95616, USA

Department of Plant and Microbial Biology, University of California Berkeley, Berkeley CA 94720, USA

Abstract

Background

Comparative Genomic Hybridization (CGH) with DNA microarrays has many biological applications including surveys of copy number changes in tumorogenesis, species detection and identification, and functional genomics studies among related organisms. Array CGH has also been used to infer phylogenetic relatedness among species or strains. Although the use of the entire genome can be seen as a considerable advantage for use in phylogenetic analysis, few such studies have questioned the reliability of array CGH to correctly determine evolutionary relationships. A potential flaw in this application lies in the fact that all comparisons are made to a single reference species. This situation differs from traditional DNA sequence, distance-based phylogenetic analyses where all possible pairwise comparisons are made for the isolates in question. By simulating array data based on the

Results

Our simulation data indicates that having a single reference can, in some cases, be a serious limitation when using this technique. Additionally, the tree building process with a single reference is sensitive to many factors including tree topology, choice of tree reconstruction method, and the distance metric used.

Conclusions

Without prior knowledge of the topology and placement of the reference taxon in the topology, the outcome is likely to be wrong and the error undetected. Given these limitations, using CGH to reveal phylogeny based on sequence divergence does not offer a robust alternative to traditional phylogenetic analysis.

Background

The field of comparative genomics, particularly in microbes, has benefited greatly by the proliferation of whole genome sequences. The advantages of having sequences from multiple, related organisms range from improving annotation to characterizing the genetic basis of major phenotypic differences between strains or species

Array CGH (aCGH) for two color array platforms uses DNA samples from a reference individual and a test individual, each labelled with a different fluorescent dye, and competitively hybridizes them to an array composed of immobilized DNA fragments from the reference individual

The relative ease with which aCGH provides large amounts of discriminating information between individuals makes it a very attractive technique to determine relatedness. Comparisons between aCGH derived trees and trees based on DNA sequence of one or a few loci have supported this assumption

Practitioners of aCGH have inferred evolutionary relationships using distance methods implemented by programs such as Cluster, Genespring, and Acuity

A complicating factor not addressed in these studies is that typically a single individual is represented on an array. Having a single reference species or strain is at odds with traditional sequence based phylogenetic distance methods, where all pairwise comparisons among the taxa are used in constructing a species tree. This difference is illustrated in Figure

The single reference design

**The single reference design**. 1A shows an example of a pairwise distance matrix for the tree below the matrix. In 1B the distances of other taxa to W, typical of a single reference design. Below it is the star topology constructed from these distances when no other information is available to differentiate distances among X, Y, and Z.

In this study we ask, what effect does having a single reference taxon have on recovering phylogenetic relationships? What effects do varying amounts of sequence divergence among taxa have on recovering the "true" tree? What impact do different tree topologies have? Does the position of the reference taxon in that topology have any importance or do all taxa serve equally well as the reference? Are distance and parsimony tree construction methods equivalent when applied to this type of data?

In order to address these questions, we chose to simulate CGH array data to avoid the added complexity of experimental noise associated with microarray technology. We chose the

Methods

CGH Data Simulation

We chose 70 mer sequences to represent the length of probes in a standard long oligo array platform because they have been a popular choice for groups investing in arrays for their chosen organism

Because not all genes in a genome evolve at the same rate, empirical data were analyzed to determine an appropriate distribution of sequence divergence to evolve our sequences. Based on the whole genome comparisons for different yeast species

To simulate variation in evolutionary rate among genes, for each 70 mer, the branch lengths of the input tree were multiplied by a scaling factor, randomly drawn from the distribution of evolutionary rates. By varying the mean and standard deviation of the distribution of evolutionary rates, we varied the amount of evolutionary distance between the taxa of our chosen tree topology. These parameters were used to evolve sequences for each of 10,500 genes for each taxon using the Jukes-Cantor model of evolutionary change, allowing nucleotide substitution but not indels. This simple model of evolutionary change models a scenario where species are close relatives and the problem of multiple substitutions at single nucleotide positions is minimal, i.e. where the signal to noise ratio is highest. The genetic distances between all pairs of taxa were calculated for each locus using dnadist from the PHYLIP package v3.6

To simulate the use of a microarray based on a single reference taxon, one species from each tree was chosen as the reference and only those pairwise comparisons involving the reference individual were saved, corresponding to the genetic distance from the reference species to all other taxa. These distances were combined into one supermatrix with 10,500 genes for

Program Flow

**Program Flow**. The diagram indicates the simulation program flow used to construct the supermatrix, our simulated CGH data matrix. Variables that can be modified include the number of sequences input, the topology of the tree used, and the amount of divergence among taxa in the simulated alignment, and the choice of evolutionary model to evolve and calculate genetic distance among taxa (see methods). Neighbor-Joining (NJ) and Parsimony Majority-Rule (PMR) trees were constructed from the supermatrix.

To evaluate the effect of using much longer oligomers to construct an array, as with cDNA arrays, a limited number of simulations were conducted for probes of 500 bases. The results using the longer arrays were not substantially different from those of the 70 base trials (see supplemental data).

Tree Construction

To construct a distance tree, a Pearson correlation-based distance matrix and a Euclidean distance matrix were derived from the original supermatrix using the statistical package R (

Tree to tree distance metric quantification

To assess the outcome of a simulation, we compared the phylogenetic tree used to produce the data sets to the phylogenetic tree recovered from the CGH simulation. Initially, we measured the differences between the input tree and that produced by the simulation with two tree-to-tree distance metrics - symmetric distance and agreement subtree - implemented in PAUP

Cophenetic correlation distributions

Neither the symmetric nor the agreement subtree metric compares branch lengths between the input and output trees. To investigate differences in branch length, we calculated the cophenetic correlation coefficient (CCC, from the cophenetic.phylo and cor functions implemented in the R stats and ape packages

In this work, we used the CCC to compare the tree topology used to initiate the simulated data set (input tree) with the distance matrices resulting from the various CGH analyses. The R functions extrapolate distances from the branch lengths from the starting topology to compare to the distance matrix estimated by the simulation. The converse comparisons, where tree topologies resulting from simulations are compared to a distance matrix calculated from the starting topology, were very similar to the first comparisons and are not presented here.

Results

The efficacy of CGH as a phylogenetic method as judged by two metrics (SymD or D1) varied considerably with topology (balanced, pectinate or empirical

**Additional Table S1**. Tree to Tree distances for all Neighbor-Joining Results.

Click here for file

**Additional Table S2**. Tree to Tree distances for all Parsimony Results.

Click here for file

The parameter with the least effect was substitution rate (5%, 7.5% or 10%). The simplest topology, the balanced tree, was recovered most successfully. This special case is not likely to be found in nature, and it should be noted that permuting the branches individually may lead to different results. For the other two topologies, the mixture of close and more distantly related taxa presented a complication for the different tree-building algorithms. For both the pectinate topology and the

Generally, the parsimony trees were more sensitive to the changes in average nucleotide substitutions than those made by the Neighbor-Joining method. When sequence divergence was high (10%), parsimony phylogenies were poorly resolved. At 10% divergence, we checked the effect of changing the threshold for considering a hybridization difference to be diverged, that is, from 1 (present) to 0 (absent). Altering the threshold from 75% (where the most diverged 25% of the genes were given a score of 0) to 20%, 50%, 80% or 90%, however, failed to significantly improve the performance of parsimony analysis (see Additional File

To report the details of our simulation we have organized the results to feature the input tree topology and tree-building algorithm. The four major variables tested in this simulation: tree topology, placement of reference taxon, tree building algorithm, and substitution rate, are discussed separately as much as possible given their inherent interrelationships.

Balanced topology (Figure

Balanced Topology Results

**Balanced Topology Results**. Figure 3A is a cladogram of an eight taxa balanced tree topology. In our simulation a sequence alignment was created using this input tree as a guide, where the branch lengths govern the number and pattern of substitutions. A number randomly drawn from a Gaussian distribution, empirically determined from published yeast genomic data, was used to multiply these branch lengths to produce a set of genes evolved for a range of substitution rates. The average of each range of substitution rates is indicated under each column. Figure 3B shows the symmetric distances for 200 simulation runs of the correlation-based NJ analysis evolved at the average sequence divergence indicated. Columns 1-3 use taxon A as the reference, Columns 4-6 use taxon H. Zero steps indicate perfect agreement between the input topology and the trees output by the simulation (red portions). One step away indicates a single collapsed branch (orange portions). A mispairing of two taxa is given a score of two (yellow portions). Scores in subsequent figures are cumulative and color-coded as indicated in the chart legends. Figure 3C shows the stacked histogram for the symmetric distances from the input topology for 100 replicate NJ trees calculated with the Euclidean distance metric. The order for reference taxa is the same as above. The stacked histogram in Figure 3D is of the symmetric distances from the input tree for 100 replicate 50% Majority-Rule Parsimony (PMR) trees. An example of this topology, two steps away from the reference, is given in Figure 4B.

Figure

NJ with the Pearson and Euclidean distance metric

When NJ was used with the Pearson correlation-based distance metric to make phylogenetic trees, comparison of input and output tree topologies by either symmetric distance or distance D1 showed no difference between the two topologies, regardless of nucleotide substitution rate (Figure

Parsimony Analysis

With parsimony analysis (Figure

The expected and recovered balanced topologies

**The expected and recovered balanced topologies**. Figure 4A illustrates the eight taxa balanced topology as an unrooted cladogram. Figure 4B is the Euclidean NJ tree, when taxon A is the reference, as an unrooted cladogram and is two steps away according to the symmetric distance from the desired topology in Figure 4A. It was evolved at 7.5% average nucleotide substitutions.

Pectinate topology (Figure

Pectinate Topology Results

**Pectinate Topology Results**. Figure 5A shows the twelve taxa pectinate topology input into the simulation pipeline. Figure 5B shows the stacked histograms for the symmetric distances for 200 replicate correlation-based NJ. Columns 1-3 show S1 as the reference taxa, 4-6 show S6, and 7-9 represent taxon S12. Figure 5C give the stacked histogram for the symmetric distance for the Euclidean distance based NJ trees, 100 replicates. The order for reference taxa is the same as above. Figure 5D gives the corresponding stacked histogram for 100 replicate PMR trees. Two examples of this topology, six and eight steps away from the reference, are given in Figure 6B and 6C.

This 12-taxon tree (Figure

NJ with the Pearson and Euclidean distance metric

The simulations show that the pectinate topology is not recovered as frequently as the balanced topology. NJ using the Pearson correlation-based distance recovered the input tree most frequently, with almost 2/3 of the trees 0 steps away. Choice of the reference taxon had a strong effect with the pectinate tree. For example, with NJ using Pearson's correlation distances, the input tree was recovered at least 95% of the time when the most basal (S12) or the most derived taxon (S1) were used as reference, but never when an interior taxon (S6) was designated as reference (Figure

With NJ using Euclidean distances, the input topology was recovered in 100% of simulations when the most derived taxon was the reference taxon, but almost never when either an internal (S6) or basal (S12) taxon was designated as reference (Figure

Parsimony Analysis

With parsimony analysis, the input topology was recovered in all simulations when the most derived taxon was the reference, but never when an internal or basal taxon was so designated (Figure

The expected and recovered pectinate topologies

**The expected and recovered pectinate topologies**. Figure 6A shows the twelve taxon pectinate topology drawn as an unrooted cladogram. 6B show an example where S12 acts as the reference for the NJ Euclidean method, it is 6 steps away according to the symmetric distance from the reference tree in A. C shows another example tree where S12 is the reference but for the parsimony algorithm, and is eight steps away. Both 6B and 6C were evolved at 7.5% average nucleotide substitutions.

The

Neurospora Topology Results

**Neurospora Topology Results**. Figure 7A gives the topology of the natural phylogeny for the eight conidiating species of

As with the other topologies, we evaluated the effects of choosing different reference taxa,

NJ with the Pearson and Euclidean distance metric

With NJ using Pearson's correlation (Figure

With NJ using Euclidean distance, the input topology was never recovered and the majority of output trees were 4 or more steps different from the input topology (Figure

The expected and recovered

**The expected and recovered Neurospora topologies**. Figure 8A is the

Parsimony Analysis

With parsimony, again the input tree was recovered only when the most basal taxon,

**Additional Figure S1**. Example topologies of Neurospora Parsimony simulation results in pdf format.

Click here for file

Probe Length

Increasing the length of the probe sequence in the microarray from 70 nt to 500 nt might be expected to improve the recovery of input topologies by providing more polymorphisms for phylogenetic analysis. To test this hypothesis, 25 replicate runs at 10% average sequence divergence were completed using 500 bases as the length of the probe for the Neurospora topology. There was no significant improvement in recovery of the input topology regardless of which taxon was designated as the reference taxon with the longer alignments (Additional File

Cophenetic Correlations

To evaluate differences in the branch lengths, irrespective of topology, of the input and output trees obtained by NJ using Pearson's correlation or Euclidean distances, we determined the cophenetic correlation coefficient (CCC) for the

The cophenetic correlation of the correlation-based NJ for the

**The cophenetic correlation of the correlation-based NJ for the Neurospora topology**. The distribution of the cophenetic correlations for Tree 3 using the correlation-based NJ trees. Figures 9A, 9B, and 9C give the distribution of correlations using

The cophenetic correlation of the Euclidean-based NJ for the

**The cophenetic correlation of the Euclidean-based NJ for the Neurospora topology**. The distribution of the cophenetic correlations for Tree 3 for 100 replicate Euclidean distance based NJ Trees. Histograms 10A, 10B, and 10C are

**Additional Figure S2 Correlation-Based CCC distributions**. Balanced topology Cophenetic Correlations Coefficient Distributions for the correlation-based Neighbor-Joining Analysis in pdf format.

Click here for file

**Additional Figure S3 Euclidean-Based CCC distributions**. Balanced topology Cophenetic Correlations Coefficient Distributions for the Euclidean-based Neighbor-Joining Analysis in pdf format.

Click here for file

**Additional Figure S4 Correlation-Based CCC distributions**. Pectinate topology Cophenetic Correlations Coefficient Distributions for the correlation-based Neighbor-Joining Analysis in pdf format.

Click here for file

**Additional Figure S5 Euclidean-Based CCC distributions**. Pectinate topology Cophenetic Correlations Coefficient Distributions for the Euclidean-based Neighbor-Joining Analysis in pdf format.

Click here for file

Discussion

Array Comparative Genomic Hybridization (array CGH) is a technique in which the genomic DNAs of a reference individual and a close relative are hybridized to a microarray made of probes designed to match the genome of the reference individual. To assess the appropriateness of using microarray data for phylogenetic analysis, we asked if having a single reference species, to which all other taxa must be compared, limits the ability to accurately recover evolutionary relationships. In our simulations, to make phylogenies from CGH data, we used both distance and parsimony phylogenetic methods. We assessed the effects of tree topology, the position of the reference taxon in the topology, the rate of nucleotide substitution, and the length of microarray probes. We assessed the difference between the input and output phylogenies using two tree-to-tree distance metrics and also assessed the difference in rates of evolution for two topologies as measured from branch length differences.

Our results show that under specific conditions, using CGH to quantify inter-genomic sequence variation can yield data that support the input topology. However, it is just as common to get a tree with little resemblance to the true tree. The user would be unaware of this, as studies often have no prior knowledge of two key parameters, the topology and the position of the reference taxon. Unfortunately, these are the goals that phylogenetic inference is designed to determine. CGH, however, may have a role to play in assessing sequence divergence, by identifying genes that are evolving rapidly, but that are still present across taxa, that would be good candidates for multi-locus sequence analysis.

Summary of Simulation Results

Tree topology and the position of the reference taxon had the greatest effects on successful recovery of the input tree by CGH. The balanced topology was recovered more successfully than the pectinate or

In general, the NJ distance method was more successful at recovering the input topology than was parsimony, and distance data matrices made using Pearson's correlation coefficient performed better than those made by Euclidean distance. For the correlation, in the tree-building process it was found to be more expedient to exclude the reference as the correlation method was not able to cope with the reference taxon's lack of variability in the evolved data. However, branch lengths were better recovered with NJ using a Euclidean distance matrix than with the Pearson's coefficient. While the Pearson's correlation is more robust to missing data and is less sensitive to small variations than Euclidean distance, the Euclidean potentially preserves more information about the genetic distance

For parsimony analysis all "species" hybridization values were binned using a single cutoff, a simple approach we found adequate for distribution of sequence polymorphisms simulated here. Modifying the threshold for assigning shared character states (1,1) from the 75^{th }percentile to the 80^{th }percentile or higher occasionally improved the results, but only for some locations of the reference taxon and topologies.

Varying the rate of nucleotide substitution had little effect on recovery of the input topology by CGH, although at high substitution rates parsimony analysis performed better when the threshold was high for scoring a hybridization spot as absent. Increasing the length of the probe from 70 bases to 500 bases had no substantial effect on recovery of the input topology.

Of particular importance to a correct CGH phylogeny is the choice of taxa included in the analysis. Taxa with reasonable distance to each other, avoiding mixtures of close and more distantly related taxa, are more easily captured. It is also best to have a reference taxon in the most basal position of the tree, or second-best the most derived position.

Comparison to other studies using CGH to infer phylogenetic relationships

Initial studies of CGH for phylogeny include research with

Other researchers utilizing

It is possible that a CGH phylogeny from a multi-species array would better reflect evolutionary distances among taxa. While answering this question is outside the scope of this study at least one other, Wan et al

In a more recent study aCGH data from bovine and human data as well as simulated data were examined by the developers of a wavelet based de-noising algorhythm aimed at quantifying structural variation at the genome level

Conclusions

Our results show that CGH cannot be counted on to reveal genomic sequence divergence reliably enough to recover a known phylogeny. Consequently, relying on this technique to recover an unknown phylogeny is problematic, particularly when there is no traditional phylogeny to compare to or when an existing phylogeny is based only a few genes or obsolete techniques like MLEE.

The results from our

Given the drawbacks of CGH for phylogeny, resulting in high uncertainty in the correct topology, it would be perhaps be more prudent to utilize the cross-taxa CGH data to identify a set of genes suitable for MLST, those still present but with a high to moderate amount of variation, then pursue a traditional phylogenetic analysis.

Authors' contributions

LBG participated in the simulation design and coordination and drafted the manuscript. LC, TK, and JWT participated in the design of the study. LC and LBG scripted code for the completion of the simulation. LBG completed the statistical analysis. All authors read and approved the final manuscript.

Acknowledgements

Special thanks to Michael B. Eisen for conceiving the study. Thanks to Charles Yong for his work scripting essential parts of the simulation pipeline. Thanks to Sandrine Dudoit and Jeffrey P. Townsend for statistical advice. Thanks to Tom Sharpton and Jason Staich for scripting advice. Special thanks to Elizabeth Turner and Tracy K. Powell for editorial assistance. Funding for LBG provided by the Ford Pre-Doctoral and Dissertation Year Fellowships. Funding for TK was provided by NIH Program Project Grant GM068087 to N. Louise Glass, PMB, UC Berkeley.