Laboratoire de Recherche en Sciences Végétales, UMR CNRS-Université Paul Sabatier 5546, Chemin de Borde Rouge - Auzeville 31326, Castanet Tolosan, France

USDA ARS, Plant Pathology Department, UC Davis, Davis, CA, 95616, USA

Department of Plant and Microbial Biology, UC Berkeley, Berkeley, CA 94720, USA

Abstract

Background

Array-based Comparative Genomic Hybridization (CGH) data have been used to infer phylogenetic relationships. However, the reliability of array CGH analysis to determine evolutionary relationships has not been well established. In most CGH work, all species and strains are compared to a single reference species, whose genome was used to design the array. In the accompanying work, we critically evaluated CGH-based phylogeny using simulated competitive hybridization data. This work showed that a limited number of conditions, principally the tree topology and placement of the reference taxon in the tree, had a strong effect on the ability to recover the correct tree topology. Here, we add to our simulation study by testing the use of CGH as a phylogenetic tool with experimental CGH data from competitive hybridizations between

Results

Array ratio data for

Conclusion

Our results indicate that CGH data can be problematic for phylogenetic analysis. Success fluctuates based on the methods utilized to construct the tree and the taxa included. Selective pruning of the taxa improves the results - an impractical approach for normal phylogenetic analysis. From the more successful methods we make suggestions on the normalization and post-normalization methods that work best in estimating genetic distance between taxa.

Background

Microarray-based Comparative Genomic Hybridization (Array CGH) for two-color array platforms uses DNA samples from a reference individual and a test individual, each labelled with a different fluorescent dye, and competitively hybridizes them to an array composed of immobilized DNA fragments based on genomic sequence of the reference individual

A complication that has not been addressed in these studies involves the use of one species to design the array, which requires that all competitive hybridizations have as one partner the same reference species. This situation has been termed "unbalanced gene content"

This approach has been applied to bacterial species with largely clonal reproduction, for example studies by Dagerhamn, Wan, and Guidot

In the accompanying study we examined this question using an

To challenge our

Based on published accounts, we used eight general methods of phylogenetic analysis of aCGH data. These approaches, which are detailed in the methods section, encompass different methods of data normalization and post-normalization, two basic tree construction methods (Neighbor-Joining and Parsimony), and three more sophisticated methods of processing CGH ratio data (BAGEL, MPP, and GACK - see Additional File

**Additional table S1** - Method matrix listing data treatments that are represented in each figure.


We show that CGH cannot be counted on to recover the established multilocus phylogeny for

Methods

DNA Preparation and Hybridization

Fungal strains used in this study are listed in Table

Strains used in this study

| Strain | Isolate | Origin | Mating type | Notes |
|---|---|---|---|---|
| FGSC 2489 | | Louisiana | A | conidiating taxa |
| FGSC 8781 | D21 | Florida | A | conidiating taxa |
| FGSC 8858 | D98 | Tamil Nadu, India | A | conidiating taxa |
| FGSC 8813 | D53 | Thailand | A | conidiating taxa |
| FGSC 8775 | D15 | Hawaii | a | conidiating taxa |
| FGSC 8906 | D146 | New Mexico | a | conidiating taxa |
| FGSC 1889 | | | homothallic | homothallic non conidiating |
| S48977 | | | homothallic | K strain, courtesy of the Kuck lab, Ruhr-Universität Bochum |
| S strain | | | + | courtesy of Philippe Silar of the Institut de Génétique et Microbiologie, Université de Paris-Sud |

Genomic DNA was isolated with the DNeasy Plant Tissue kit (Qiagen, Valencia, CA) with the following modifications: tissue was lyophilized, ground, and then incubated at 65 °C for an hour in a solution of 50 mM Tris-HCl, 50 mM EDTA, and 3% SDS with 100 μl of Proteinase K (20 mg/ml). This was followed by chloroform:isoamyl alcohol extraction. The aqueous phase was added to the Qiagen extraction buffer and extraction proceeded according to the manufacturer's instructions.

Genomic DNA was sheared mechanically with a target range of 1 kb using a Hydroshear® (GeneMachines™, San Carlos, CA). Test species and reference species were labelled using the BioPrime® Plus Array CGH Indirect Genomic Labeling kit (Invitrogen, Carlsbad, CA). A full-genome 70-mer oligonucleotide microarray representing 10,918 individual elements was constructed as described for a partial

Images were scanned with an Axon GenePix 4000 B scanner (Molecular Devices Corporation, Sunnyvale, CA). GenePix Pro 6 software was used to quantify hybridization signals. Bad spots were flagged automatically by GenePix software and each slide was manually inspected. Each species comparison was done in at least quadruplicate with dye swaps.
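The replicate design with dye swaps can be illustrated with a minimal sketch. It assumes that each replicate provides two background-corrected channel intensities and a flag saying which channel carried the test sample; the function name `combine_replicates`, the tuple layout, and the use of a simple mean of log2 ratios are illustrative assumptions, not the procedure implemented in GenePix or Acuity.

```python
import math
from statistics import mean

def combine_replicates(replicates):
    """Average log2(test/reference) over replicate spots with dye swaps.

    replicates: list of (cy5, cy3, test_in_cy5) tuples (hypothetical layout).
    When the test sample was labelled with Cy3 (a dye-swap slide), the
    log2(Cy5/Cy3) ratio is negated so every value reads test/reference.
    """
    values = []
    for cy5, cy3, test_in_cy5 in replicates:
        m = math.log2(cy5 / cy3)
        values.append(m if test_in_cy5 else -m)
    return mean(values)

# Four replicates of one spot, two of them dye-swapped (made-up intensities).
spot = [
    (2000.0, 1000.0, True),
    (1000.0, 2000.0, False),
    (4000.0, 1000.0, True),
    (500.0, 2000.0, False),
]
combined = combine_replicates(spot)
```

Negating the dye-swapped values before averaging means that dye-specific bias tends to cancel rather than accumulate across replicates.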

Data Filtration and Normalization

Data were normalized for non-biological variation related to printing and hybridization of spotted microarrays by four methods. The first normalization, linear and based on the ratio of means (Acuity 4.0, Molecular Devices Corporation, Sunnyvale, CA), used a set of twelve control spots with no known sequence polymorphisms between

In addition to the four kinds of normalization, filtered according to the criteria discussed in the methods, a further filtering criterion was applied to a duplicate subset of the linear and lowess normalizations: a spot had to be scored as present in at least 40% of the slides to be included. These additionally filtered data sets are referred to as the "40% present" set, while the rest are referred to as the standard set. All data sets were imported into R to calculate correlation and Euclidean distance matrices, from which Neighbor-Joining trees were constructed in PAUP.
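As a rough illustration of the "40% present" filter followed by the two distance calculations, the sketch below uses only the Python standard library rather than R. The data layout, the function names, and the conversion of a Pearson correlation r into a distance as 1 - r are assumptions made for the example; the text does not state how the correlation matrix was turned into distances.

```python
import math

def present_fraction_filter(flags_by_spot, min_fraction=0.40):
    """Indices of spots scored present in at least min_fraction of slides."""
    return [i for i, flags in enumerate(flags_by_spot)
            if sum(flags) / len(flags) >= min_fraction]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson r (assumed conversion from correlation to distance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def distance_matrix(profiles, metric):
    """Pairwise distances between taxa, keyed by (taxon, taxon)."""
    names = sorted(profiles)
    return {(a, b): metric(profiles[a], profiles[b])
            for a in names for b in names}

# Made-up presence calls: 5 spots x 5 slides; the last spot is present on 1 slide.
flags = [[1, 1, 1, 1, 1], [1, 1, 0, 1, 1], [1, 0, 0, 1, 0],
         [1, 1, 1, 0, 1], [0, 0, 1, 0, 0]]
# Made-up mean log ratios per taxon, one value per spot.
profiles = {"A": [0.0, -1.0, -2.0, 0.5, -3.0],
            "B": [0.0, -1.0, -2.0, 0.5, 1.0],
            "C": [0.0, 2.0, -2.0, 0.5, 0.0]}

keep = present_fraction_filter(flags)
filtered = {t: [v[i] for i in keep] for t, v in profiles.items()}
euc = distance_matrix(filtered, euclidean)
cor = distance_matrix(filtered, correlation_distance)
```

In this toy data, taxa A and B differ only at the spot that fails the filter, so their filtered distance is zero. This shows how an extra filter can remove signal as well as noise.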

Distance and Binary Matrix Calculation

Using the functions cor() and dist() from the R stats package

To convert continuous distances to binary characters (1,0), we used the program GACK, Genome Composition analysis by Charles Kim, to implement a "genomotyping" method. This program employs a dynamic cutoff based on the signal ratio distribution to classify genes as present or absent
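The idea of a cutoff that adapts to the ratio distribution can be sketched as follows. This is a deliberately simplified stand-in and not the GACK algorithm: it assumes the main peak of log ratios corresponds to present genes, estimates its centre with the median and its spread from the upper half of the distribution only (so that the tail of divergent or absent genes does not inflate the estimate), and places the cutoff `k` spreads below the centre. The names `dynamic_cutoff` and `to_binary` and the default `k=3.0` are hypothetical.

```python
from statistics import median

def dynamic_cutoff(log_ratios, k=3.0):
    """Cutoff placed k robust standard deviations below the main peak.

    The spread is estimated from values at or above the median, scaled by
    1.4826 so that it approximates a standard deviation for normal data.
    """
    center = median(log_ratios)
    upper_deviations = [r - center for r in log_ratios if r >= center]
    spread = 1.4826 * median(upper_deviations)
    return center - k * spread

def to_binary(log_ratios, cutoff):
    """Code each gene as present (1) or absent/divergent (0)."""
    return [1 if r >= cutoff else 0 for r in log_ratios]

# Made-up log ratios: seven genes near zero, two strongly negative.
ratios = [0.1, -0.1, 0.0, 0.2, -0.2, 0.05, -0.05, -2.5, -3.0]
cutoff = dynamic_cutoff(ratios)
calls = to_binary(ratios, cutoff)
```

Because the cutoff is derived from each hybridization's own distribution, a slide with a broader main peak receives a more permissive threshold than a tight one.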

Phylogenetic Analysis

Phylogenies were created from distance data by the Neighbor-Joining (NJ) algorithm from PAUP version 4.0b10

BAGEL

We used Bayesian analysis of gene expression levels (BAGEL) because it estimates hybridization levels for the reference species differently than all of the other methods that we employed, except MPP. Typically, hybridization values for the reference species are taken from hybridizations between the reference species and itself. However, with BAGEL, for each gene, the ratio data from all the hybridizations involving the reference and test species are used to estimate a relative hybridization level for each species, including an extrapolated value for the reference species

MPP

In addition to genomotyping with GACK, we used a second genomotyping program, the Microarray to Phylogeny Pipeline (MPP)

GPR files from the GenePix program were input into MPP and replicate spots were averaged according to species. The ratio data were filtered and then log-transformed or transformed with the inverse hyperbolic sine (arsinh) function

Tree to tree distance metric quantification

To quantitatively assess the differences between CGH-derived and MLSA trees, we compared the MLSA topology for the nine Sordariomycete fungi

**Additional table S2 - Figure 2: Neighbor-Joining and GACK Parsimony Analysis when the dataset is only normalized**. Figure 2 in table form, with actual tree scores. Zeros are bolded. In Excel format.


**Additional table S3 - Figure 3: NJ and Parsimony Analysis after Bayesian estimation of a relative hybridization level**. Figure 3 in table form, with actual tree scores. Zeros are bolded. In Excel format.


**Additional table S4 - Figure 4: MPP-Based Tree Construction**. Figure 4 in table form, with actual tree scores. Zeros are bolded. In Excel format.


Results

Almost all of the steps needed to process array CGH ratio data for phylogenetic analysis can influence the result. These include the filtering, normalization and tree-building procedures applied to the data. Using empirical data, we tested the effect of different analytical approaches on distance and parsimony analysis. Different methods of normalization combined with mean or median hybridization values for each species were used. For Neighbor-Joining analysis, different metrics for converting normalized intensity ratios to genetic distances were used, as well as different thresholds for converting ratios to discrete character data for parsimony analysis. One such method was a probabilistic method for converting ratios to discrete character data (GACK) and the other was a self-contained work-flow (MPP), which converts raw CGH data to data suitable for phylogenetic analysis. An alternative to average or median CGH ratio values, BAGEL (a Bayesian approach to estimate a representative hybridization value for each species), was also tested.

To consider the effect of evolutionary distance among taxa on the analyses, we grouped taxa into three datasets that varied in combined genetic distance. The CON set consisted of the six most closely related outbreeding individuals of


**Desired topology as cladograms**. The figure shows six different trees, different permutations of the

Summaries of our analyses given below are supported by analyses in the additional material as follows: Analysis of filtered normalized data for NJ (Additional File

Neighbor-Joining Analysis Normalized data, (Figure

In no case did Neighbor-Joining analysis of the ALL taxa set, using CGH data that had been simply normalized, produce the same tree as the MLSA tree (Figure


**Results for NJ and Parsimony analysis of normalized ratio data**. The stacked histograms in this figure represent the SymD measures (symmetric distance away from the desired topology) for the Neighbor-Joining (Figures 2A and 2B) and parsimony (Figures 2C and 2D) CGH trees constructed from the ACUITY and Limma-based normalizations. Each stack represents the twelve iterations of the four different normalization procedures, detailed in Additional File

With the NEU taxa set, the MLSA tree was recovered perfectly by NJ using Euclidean distance, but only with linear normalization based on the ratio of means. There was no effect of including or excluding the reference taxon, using the mean or median value for each gene, or adding the additional 40% filter. Clearly, exclusion of the distant taxa,

With the CON taxa set, the MLSA tree was recovered less frequently. Here, again, the most robust result (0 steps away) was by NJ analysis using Euclidean distance of data linearly normalized by the linear ratio of means. However, results were better when the reference taxon was included and the additional 40% filter was omitted on mean values for each gene. Again, inclusion or exclusion of the more divergent taxa had the largest effect on recovery of the MLSA tree and trees with topologies close to the MLSA tree were found only with a narrow combination of methods.

Parsimony Analysis Normalized data, (Figure

For the ALL dataset, with or without the reference, no trees with topologies identical to the MLSA tree were produced for any normalization. With the reference taxon included, the averaged loess and median of the spline normalization gave trees two to four steps distant for most values of the %EPP cutoff. The lowess trees were one to ten steps longer than the MLSA tree, depending on the values of the %EPP. Trees that included the reference taxon were substantially worse (see Additional File

With the NEU taxon set, several of the thresholds based on %EPP resulted in trees with topologies identical to the MLSA tree when the reference taxon was excluded. These trees identical to MLSA trees included those made using the averaged values of the loess and the median values of the spline normalizations, and many of the percent EPP values using averaged linear normalization. Adding the additional 40% filter had a negative effect such that only the 50% EPP threshold gave the MLSA tree topology. With the reference taxon included in the analysis, the MLSA tree topology was not recovered as judged by the SymD metric (no closer than six steps) or the D1 metric (as close as one step, see Additional File

For the CON dataset, excluding the reference, the averaged values of the linear and loess normalizations recovered the MLSA tree topology in four and five of the eleven percent EPP thresholds respectively. The median and average values of the robust spline normalization were also successful in capturing the MLSA tree. Again, the additional 40% filter resulted in poorer trees overall. The remaining iterations of the data were two to four steps away. Including the reference species gave trees that were no closer to the MLSA tree than four steps and then only for the spline and loess normalizations.

NJ after Bayesian estimation of a relative hybridization level (Figure

For the ALL taxa set, with and without the reference species, the MLSA tree topology was not recovered with either the Euclidean or correlation-based metric. The closest approximation of the MLSA tree was achieved by the robust spline and loess normalizations (two steps longer by both the Euclidean and correlation distance metrics). For the correlation metric, including the reference species had no effect on results. However, for the Euclidean metric, including the reference species in some cases increased the length as compared to the MLSA tree by from two to six steps.

For the NEU dataset, the Euclidean distance metric captured the MLSA tree for both the spline and loess normalizations, while the correlation metric did so solely with the robust spline normalization. This result was found regardless of whether the reference taxon was included. For the CON dataset, the correlation metric outperformed the Euclidean metric by capturing the MLSA tree with almost all approaches (except the lowess normalization), regardless of whether the reference taxon was included or excluded. The Euclidean metric recovered the MLSA tree topology with fewer combinations of approaches, and performed worse when the reference taxon was included (Limma spline normalization moved from zero to two steps distant).

One noteworthy result with BAGEL NJ trees is that, unlike the other methods tested, the results are far less sensitive to inclusion or exclusion of the reference taxon. This insensitivity is presumably due to BAGEL's extrapolation of the reference value, which appears to be a more robust way of including the reference taxon than including self-self controls for tree construction.

BAGEL Parsimony (Figure


**Results for NJ and Parsimony analysis of Relative Bayesian Estimated Hybridization Levels**. Figure 3 shows the stacked histograms of the SymD measures for Neighbor-Joining tree construction of the ACUITY and Limma-based normalizations processed with the BAGEL program. Five trees are represented in each stack, constructed from each of the four normalizations done and the additional linear normalization based on the ratio of the medians (see Additional File

For parsimony analysis, the BAGEL estimates of hybridization levels described above were binned at the first, second, and the third quartile. For the ALL dataset, no method of analysis recovered the MLSA tree topology. The loess, spline, and lowess normalizations were four steps distant irrespective of inclusion of the reference taxon. The linear normalization was worse (six steps distant) and the worst result was obtained when the reference taxon was included (eight steps distant).
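Binning at a quartile can be sketched as below, under the assumption that each quartile serves as a separate presence/absence threshold (one binary coding per quartile) and that quartiles are computed over whatever set of estimates is passed in. Whether the original analysis computed quartiles per gene, per species, or over the whole data set is not stated here, and the function name `bin_at_quartile` is hypothetical.

```python
from statistics import quantiles

def bin_at_quartile(values, which):
    """Binary-code values against the chosen quartile (which = 1, 2, or 3).

    Values at or above the quartile are coded 1, the rest 0.
    """
    q1_q2_q3 = quantiles(values, n=4, method="inclusive")
    cutoff = q1_q2_q3[which - 1]
    return [1 if v >= cutoff else 0 for v in values]

# Made-up relative hybridization estimates for nine taxa at one gene.
estimates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
coded_q1 = bin_at_quartile(estimates, 1)
coded_q3 = bin_at_quartile(estimates, 3)
```

A first-quartile threshold codes most values as present, while a third-quartile threshold codes most as absent, which is one reason the choice of quartile can change the resulting parsimony tree.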

For the NEU set, no method of analysis recovered the MLSA tree topology, although the loess, spline, and lowess normalizations again performed best (two steps distant when binning data at the first quartile). The CGH trees constructed from the linear normalization were at best four steps longer than the MLSA tree when the reference taxon was excluded, and six steps longer when it was included. Binning at the second or third quartile resulted in trees that were typically four steps longer.

For the CON taxon set, the MLSA tree topology was recovered only when the reference taxon was included, and then only when binning at the first quartile for the spline, loess or lowess normalizations, or at the third quartile for spline and loess normalizations. Other approaches gave trees two to four steps longer than the MLSA tree.

Unlike the BAGEL NJ analyses, which were insensitive to inclusion or exclusion of the reference taxon, the BAGEL parsimony analysis improved when the reference taxon was included in the CGH phylogeny. However, in the BAGEL parsimony analyses, the MLSA tree was recovered only for the CON taxa dataset, and only for a narrow set of approaches, as noted above.

Treebuilding with MPP, the Microarray to Phylogeny Pipeline (Figure


**Results for MPP based NJ and Parsimony Analysis**. The stacked histograms show the SymD measures of trees constructed using the MPP method for the sixteen different iterations detailed in Additional File

As described in the methods, we used the MPP pipeline to construct both Neighbor-Joining and Parsimony trees for the three groups of taxa. The MPP method begins by using CGH data to score hybridization probes as present or absent. These data can be exported for parsimony analysis or they can be used to make pairwise distance matrices by a likelihood approach that is designed to compensate for the single reference design. These distance data are then used for phylogenetic analysis by Neighbor-Joining analysis. MPP allows the user to control various options: the CGH data can be transformed using either a log or an inverse hyperbolic sine function (arsinh), the presence or absence of a probe can be estimated by EPP or by BPP, and the binwidth for assigning probe presence or absence can be set at either 0.05 (norm) or determined experimentally (exp). Applying these options in all combinations gave us eight basic combinations of options for both parsimony and NJ phylogenetic analyses.
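The eight option combinations and the two transformations can be made concrete with a short sketch. It only enumerates the option grid and contrasts the two functions; it does not call MPP, whose interface is not described here. The dictionary and tuple names are illustrative, and the base-2 logarithm is an assumption because the text does not state the base.

```python
import math
from itertools import product

# Option names follow the text; log base 2 is an assumption.
TRANSFORMS = {"log": math.log2, "arsinh": math.asinh}
PRESENCE_METHODS = ("EPP", "BPP")
BINWIDTHS = ("norm", "exp")  # norm = fixed 0.05, exp = determined experimentally

# 2 transforms x 2 presence methods x 2 binwidth settings = 8 combinations.
combinations = list(product(TRANSFORMS, PRESENCE_METHODS, BINWIDTHS))

# arsinh(x) = ln(x + sqrt(x^2 + 1)) is defined at zero and for negative
# values, where the logarithm is not, and approaches ln(2x) for large x.
arsinh_at_zero = math.asinh(0.0)
large_x_gap = math.asinh(1000.0) - math.log(2000.0)

for transform, presence, binwidth in combinations:
    print(transform, presence, binwidth)
```

For large ratios the two transformations differ by roughly a constant offset, so they mainly diverge in how they treat ratios near zero.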

For the ALL dataset, MPP using NJ did not recover the MLSA tree for any of the eight options. A tree two steps longer than the MLSA tree was recovered using arsinh, BPP, with a binwidth set at norm and excluding the reference taxon. Use of EPP or inclusion of the reference taxon gave trees at least twice as distant.

For the NEU taxon set using MPP and NJ, with the reference taxon excluded, trees concordant with the MLSA tree were recovered for five of the eight options. These five included all of the log-transformed data and the arsinh-transformed data option with BPP and norm binwidth. When the reference was included, the same iterations gave CGH trees four to six steps longer than the MLSA tree.

For the CON dataset using MPP with NJ, the results were similar to the NEU set in that all log-transformed data options recovered the MLSA tree topology when the reference taxon was excluded and no options recovered the MLSA tree topology when the reference taxon was included.

Parsimony MPP tree construction (Figure

For parsimony analysis, the same eight combinations were used to convert CGH data to presence/absence data sets and trees were made using 50% Majority-Rule consensus.

For MPP with parsimony analysis of the ALL dataset with the reference taxon excluded, the best trees using any option were one step longer than the MLSA tree (data log-transformed with either EPP or BPP followed by exp binning, or data arsinh-transformed with BPP followed by norm binning). When the reference taxon was included, trees from the four BPP and EPP arsinh-transformed sets were eight steps longer and those from the log-transformed data were even more distant.

For MPP with parsimony analysis of the NEU dataset with the reference taxon excluded, the five iterations that recovered the MLSA tree topology for the NJ analysis did the same for parsimony analysis, i.e., all of the log-transformed data and the arsinh-transformed data option with BPP and norm binwidth. When the reference taxon was included in the CGH phylogeny, no option returned the MLSA topology.

For MPP with parsimony analysis of the CON dataset, when the reference was excluded from the CGH phylogeny, the results were identical to those for the NEU taxon set. When the reference was included, the results were nearly identical to those for the NEU data set, i.e., no option returned the MLSA tree topology.

Discussion

In order to assess if CGH data can be used to infer phylogeny, we applied a variety of approaches to CGH data from nine species of filamentous fungi in the Sordariales derived from a microarray constructed for one of the species, the reference taxon

The diversity of approaches that we used to apply CGH data to phylogenetics included the following. To filter the data and normalize them to estimate hybridization levels we used four methods, two from the Acuity package (linear ratio of means, print-tip lowess) and two from the Limma package in R (loess and robust spline). We filtered for pixel saturation and for consistency among replicates. To complement these four approaches, we also used a Bayesian approach (BAGEL) to estimate hybridization levels. Euclidean and correlation methods were used to determine genetic distances from the hybridization levels. The distance method, Neighbor-Joining, was used for phylogenetic analyses of the genetic distances. To allow the use of parsimony phylogenetic methods, genetic distances were converted to binary data using GACK. We also investigated the microarray-to-phylogenetics pipeline (MPP), which transforms the data with either of two methods (log or arsinh) and converts the hybridization levels to binary data by either of two methods (EPP or BPP) for use in parsimony analysis. The binary data are then converted to genetic distances using a likelihood method intended to compensate for the shortcomings of using a single reference taxon for CGH. To assess the utility of the many permutations of these methods, we compared phylogenetic trees made from CGH data to the MLSA tree for the sordariaceous fungi using both symmetric distance (SymD) and taxon pruning (D1).
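The symmetric distance used to score trees can be sketched as a count of the bipartitions (splits) found in one tree but not the other. The version below is a minimal illustration, assuming trees are written as nested tuples of taxon names and compared as unrooted topologies. The function names are hypothetical, this is not the software used in the study, and the D1 taxon-pruning metric is not shown.

```python
def leaves(tree):
    """Set of taxon names under a node (a name or a tuple of subtrees)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions, each stored as the side lacking an anchor taxon."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_taxa - clade if anchor in clade else clade
        # Skip trivial splits (a single taxon against all the others).
        if 2 <= len(side) <= len(all_taxa) - 2:
            found.add(side)
        for child in node:
            walk(child)

    walk(tree)
    return found

def symmetric_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))

# Two made-up five-taxon trees that differ in one grouping.
reference_tree = ((("A", "B"), "C"), ("D", "E"))
test_tree = ((("A", "C"), "B"), ("D", "E"))
distance = symmetric_distance(reference_tree, test_tree)
```

A single misplaced grouping removes one split and adds another, so fully resolved trees differ by even numbers of steps under this measure.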

We found no single method that consistently produced a CGH phylogeny equivalent to an MLSA phylogeny. Instead, all the methods had different degrees of success depending on the combination of treatments applied to the data. Two trends stood out: the greater the genetic distance among taxa, the lower the success; and distance-based phylogenetic analysis (Neighbor-Joining) performed better than parsimony analysis. However, even with distance methods and data sets with restricted genetic distance, success was low: with NJ trees, the NEU and CON topologies were recovered 20.6% and 25% of the time, respectively. It should be noted that the greatest distance among taxa was only 10.5%, roughly at the acknowledged limit of utility for long oligomer arrays

There was considerable variation between the normalization methods. For the distance-based trees, the most successful recovery was with a basic linear normalization (25% overall) and the worst was lowess normalization (6.25%). For the parsimony trees, the linear normalization was the worst, with a 6.3% recovery rate and the best was the robust spline, with a recovery rate of 12%.

The Neighbor-Joining method was better than the parsimony method (15% vs. 6.93% recovery). Of the two distance metrics used to construct distance trees, the Euclidean method performed better than the correlation-based metric (21% vs. 9.8%). This result was in contrast to our

When hybridization levels for the CON and NEU datasets were estimated using BAGEL, distance phylogenetic analysis recovered the MLSA tree more often, but this advantage was not seen with parsimony analysis.

Tree construction with Bayesian estimates of relative hybridization levels for each species was slightly more robust than a simple average or median of ratio values. For the distance trees of the CON and NEU datasets, more BAGEL-treated normalizations recovered the MLSA tree than normalization alone. The advantage of BAGEL-treated datasets was not seen with parsimony analysis.

MPP NJ and Parsimony

The MPP platform, an all-inclusive pipeline designed specifically for CGH phylogeny construction, performed similarly to the traditional filtering and normalization methods. MPP was sensitive to inclusion of the reference taxon: no NJ tree equalled the MLSA tree when the reference taxon was included, but 37.5% of NJ trees and 41.7% of parsimony trees did equal the MLSA tree when the reference taxon was excluded. MPP was also sensitive to the genetic distance among taxa: no analysis of the ALL data set found the MLSA tree, while the MLSA tree was recovered from the CON and NEU datasets in between 25% and 32% of NJ or parsimony analyses. In MPP, the log transformation, with a fixed binwidth, performed better than the arsinh transformation, indicating that the former is better suited to our empirical CGH data.

Our analysis of CGH for eukaryotic microbes may be compared to a similar study of yeast species

As a final point of discussion, our results with empirical CGH data can be compared to our previous analyses of simulated CGH data, which allowed for comparison of three different topologies. In both cases, distance analysis was superior to parsimony analysis, probably due to the loss of information when genetic distances are converted to discrete data. Similarly, using distance analysis, highly filtered data sets produced less well-resolved phylogenies than data sets that included more ratio data. Finally, it was more difficult to recover MLSA phylogenies using empirical CGH data than using simulated CGH data, likely due to the additional noise in empirical data.

Conclusions

Our results with empirical CGH data and those of the accompanying

Our analysis does suggest that some normalization and post-processing methods may best reflect the underlying genetic distance between taxa and these methods might be best for other analyses of CGH data. Of the normalization methods implemented, the linear and robust spline methods worked better than the lowess/loess methods. The BAGEL estimation of hybridization levels also performed well. Unlike most other methods, it allowed for inclusion of the reference without a penalty. If a quick approximation of a topology is sufficient for the user's needs, the MPP pipeline offers a simple and easy way to construct a tree from CGH data. However, even the MPP approach recovered the MLSA topology less than half the time. If phylogeny is the aim, it would be better to invest in a modest MLSA approach.

Authors' contributions

LBG participated in the design and coordination of the study and drafted the manuscript. LG, TK, and JWT also participated in the design of the study. LBG modified existing code to automate programs used in this work. LBG completed the statistical analysis. LBG and JWT wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

Thanks to Michael B. Eisen and Sandrine Dudoit for statistical advice. Thanks to Tom Sharpton and Jason Staich for scripting advice. Thanks to H. Matthew Fourcade for technical assistance. Special thanks to Tracy K. Powell for editorial assistance. Funding for LBG was provided by the Ford Pre-Doctoral and Dissertation Year Fellowships. Funding for TK was provided by NIH Program Project Grant GM068087 to N. Louise Glass, PMB, UC Berkeley. The research was supported by NSF DEB-0516511 and NIH R01 AI070891 to JWT.