Department of Statistics, University of Kentucky, 725 Rose Street, Lexington, KY 40536-0082, USA

Department of Biology, University of Kentucky, 101 TH Morgan Building, Lexington, KY 40506, USA

Robotics Institute, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA

Abstract

Background

The increased use of multi-locus data sets for phylogenetic reconstruction has increased the need to determine whether a set of gene trees significantly deviates from the phylogenetic patterns of other genes. Such unusual gene trees may have been influenced by other evolutionary processes such as selection, gene duplication, or horizontal gene transfer.

Results

Motivated by this problem we propose a nonparametric goodness-of-fit test for two empirical distributions of gene trees, and we developed the software

Conclusions

The non-parametric nature of our statistical test provides fast and efficient analyses, and makes it an applicable test for any scenario where evolutionary or other factors can lead to trees with different multi-dimensional distributions. The software

Background

Systematists often wish to compare gene trees, or sets of trees, to each other in a statistical framework and ask whether or not they are significantly different. These efforts have been more traditionally applied to the evaluation of competing phylogenetic hypotheses

Overall, this is not meant to be an exhaustive list of situations where trees, or sets of trees, need to be compared with each other, but it highlights a general need in phylogenetics for tools to assess congruence, particularly from a statistical perspective. A non-parametric test is a preferable tool to use for these purposes in light of the growing availability of phylogenomic data sets because of the simplicity in its implementation and efficiency in providing results.

Projecting and visualizing trees in a multi-dimensional framework provides a useful mechanism for comparing large numbers of phylogenetic trees

Here we propose a non-parametric test combined with a permutation test and the use of support vector machines (SVMs) as a quantitative tool of a statistical test to determine if sets of vectorized gene trees have significantly different multi-dimensional distributions. SVMs can be applied to any two collections of trees which may or may not have been sampled from the same underlying distribution (e.g., reconstructed gene trees for host and parasite species), or two posterior sets of trees independently generated from Bayesian analysis of a single dataset. From a practical perspective, a major reason for the popularity of SVMs in machine learning is their efficiency and accuracy at classifying data in a high dimensional vector space (see

In our approach, trees can be incorporated into a statistical framework by converting them into a numerical vector format based on a distance matrix or map, see Figure

**Schematic of how trees are converted to vectors.** Numbers on branches in the unrooted tree are branch lengths. In this example, the tree is first converted to either a branch length-based dissimilarity map (matrix of distances between tips) or topological dissimilarity maps (matrix of number of edges between tips). Moving from left to right across rows in one half of a matrix, values are placed into a single column to yield a vector of distances between tips in the tree.
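The conversion described in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `tip_distances`, the edge-list representation, and the four-taxon example tree with its branch lengths are all assumptions made for this example.

```python
from collections import deque
from itertools import combinations

def tip_distances(edges, tips, topological=False):
    """Return the vector of pairwise tip-to-tip distances for an unrooted tree.

    edges: list of (node_u, node_v, branch_length) tuples.
    tips:  ordered list of leaf names; the order fixes the vector layout.
    topological: if True, every edge counts as 1 (number of edges between tips).
    """
    adjacency = {}
    for u, v, length in edges:
        w = 1.0 if topological else float(length)
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))

    def distances_from(source):
        # A tree has exactly one path between any two nodes, so a plain
        # breadth-first traversal accumulates the correct path length.
        dist = {source: 0.0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour, w in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + w
                    queue.append(neighbour)
        return dist

    all_dist = {tip: distances_from(tip) for tip in tips}
    # Read one half of the matrix row by row: (t0,t1), (t0,t2), ..., (t1,t2), ...
    return [all_dist[a][b] for a, b in combinations(tips, 2)]

# Hypothetical unrooted tree ((A,B),(C,D)) with internal nodes x and y.
edges = [("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 0.5),
         ("C", "y", 1.5), ("D", "y", 3.0)]
tips = ["A", "B", "C", "D"]

branch_vector = tip_distances(edges, tips)                   # dissimilarity map
topo_vector = tip_distances(edges, tips, topological=True)   # topological map
print(branch_vector)  # [3.0, 3.0, 4.5, 4.0, 5.5, 4.5]
print(topo_vector)    # [2.0, 3.0, 3.0, 3.0, 3.0, 2.0]
```

A tree with n tips yields a vector with n(n-1)/2 entries, so every tree on the same taxon set maps to a point in the same space provided the tip order is held fixed.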

SVMs are supervised learning algorithms that can be used to compute the separation between two sets of points, or point-clouds, in a multi-dimensional space
_{ + } and _{−} in high dimensional space, an SVM finds a hyperplane _{ + } and _{−} (see Figure
^{ + } and ^{−} . The separation percentage _{ + }in ^{ + } , plus half the percentage of points of _{−}in ^{−} . For data sets _{ + } and _{−} which are not entirely separable, the separation percentage produced by the SVM hyperplane is a quantitative and intuitive measure of separation. Overall, the classification of data with SVMs is a two-step process. In the first step (i.e. training), the SVM algorithm uses a set of pre-classified examples each belonging to one of two categories to learn a hyperplane that maximizes an objective that balances between separating the two categories while avoiding overfitting. In the second step (i.e. testing), new examples are mapped into the same space and predicted to belong to a category based on which side of the established hyperplane they fall.

**A two dimensional example of a support vector machine (SVM) training to find the best hyperplane (dashed line) separating two pre-defined groups of points (labeled ’x’ and ’o’).** In this example, the hyperplane correctly classifies 9 of the 12 o’s and 16 of the 20 x’s, and thus the data has a separation percentage of (9 + 16)/32 = 0.78125, or 78.125%.
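The arithmetic in this caption can be checked with a short Python sketch. The caption pools all points, whereas the main text describes the separation percentage in terms of "half the percentage" of each group, which reads as a class-balanced average. That reading is an assumption, because the symbols in that sentence did not survive. The function names below are hypothetical.

```python
def pooled_separation(correct_a, total_a, correct_b, total_b):
    """Fraction of all points lying on their own side of the hyperplane."""
    return (correct_a + correct_b) / (total_a + total_b)

def balanced_separation(correct_a, total_a, correct_b, total_b):
    """Half the correct fraction of one group plus half that of the other."""
    return 0.5 * correct_a / total_a + 0.5 * correct_b / total_b

# Counts from the caption: 9 of 12 o's and 16 of 20 x's are correctly classified.
pooled = pooled_separation(9, 12, 16, 20)
balanced = balanced_separation(9, 12, 16, 20)
print(pooled)  # 0.78125, the value quoted in the caption
```

The two quantities coincide whenever both groups contain the same number of points, as in the balanced training and testing samples used later in the simulation study. For the unequal groups in this figure the balanced version is approximately 0.775.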

To implement SVMs in the statistical testing of tree distributions, we developed a permutation test, augmented by bootstrapping for application to DNA sequence alignments, that assesses the significance of SVM separation percentages between two predefined sets of vectorized trees in multidimensional space. We emphasize that the SVM separation percentage alone is not an indication that the two sets of trees are incongruent. That is, the SVM separation percentage is only meaningful when compared to the SVM separation percentages obtained when permuting the data. For example, suppose 100 gene trees were sampled under the coalescent. Most likely the trees will not be identical, but the SVM separation percentages will be indistinguishable across all possible tests with 1 tree in one set and the other 99 trees in the other set, implying that no single tree will appear as an outlier. We also note that an observed SVM separation percentage above 50% does not present a problem, because the SVM separation percentages obtained when permuting the data will be similar.

To demonstrate the utility of our statistical test in discriminating between different sets of trees, we apply it in a simulation study that compares gene trees sampled from two different eight-taxa species trees. By varying the total depth of the species trees, this framework serves as a general proxy for generating sets of trees with varying levels of overlap in multidimensional space. In addition to exploring the sensitivity of our statistical test in detecting differences among gene tree distributions, we also explore its performance using different mapping techniques (dissimilarity maps vs. topological dissimilarity maps) and tree reconstruction methods (Bayesian, Maximum Likelihood, and Neighbor Joining). Finally, we assess the scalability of our statistical test to trees with larger numbers of taxa.

Methods

Representing trees as vectors

To apply SVMs, we represent gene trees as vectors as follows. Given a tree

Testing for incongruence between sets of reconstructed gene trees using SVM

We present a goodness-of-fit test, which takes two sets of sequence alignments as input and tests the null hypothesis that the underlying distributions of phylogenetic trees are the same. We require some terminology in order to state our formal hypothesis. Suppose gene trees have been mapped into m-dimensional real space (
^{ + } . Here the notation ^{ + }) denotes the total probability (under ^{ + } , and similarly for ^{ + }) . That is, any half-space ^{ + } will contain a subset (or all) of all possible vectorized trees in ^{
m
} . Then ^{ + }) is the total probability of the trees contained in the half-space ^{ + }, i.e.
^{ + }) .

Our statistical hypotheses are

In a model where trees are generated according to a distribution

Our statistical test includes a novel non-parametric statistical procedure that estimates a p-value for the statistical hypotheses described above, from input DNA sequences. At the core of our statistical test is the sub-process of using an SVM to compute a separation percentage between vectorized gene trees inferred from two sets of DNA sequences. This sub-process is outlined in Figure
_{1},…,_{
m
}
_{1}} and _{1},…,_{
m
}
_{2}} as input, shown in the left of Figure
_{
A
} and _{
B
} respectively, are inferred. These are labeled “training set” in Figure
_{
A
} and _{
B
} (training set) are vectorized and an SVM is used to compute a separating hyperplane, as depicted in the center oval of Figure

**A flowchart describing how our statistical test calculates the separation percentage of the evolutionary histories of two sets of DNA sequence alignments, A and B.** First, gene trees, labeled “training set”, are inferred from alignments A and B. The training set gene trees are vectorized and an SVM is trained to find a hyperplane separating the vectorized gene trees. Next, a new set of gene trees, labeled “testing set”, are inferred from alignments A and B. The testing set gene trees are vectorized and the hyperplane previously computed by the SVM is used to calculate a separation percentage.
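The train-then-test flow in this caption can be illustrated with a small, self-contained Python sketch. Several things here are assumptions and not the authors' method: the linear SVM is fitted with a Pegasos-style sub-gradient update (a stand-in for whatever SVM solver the software uses), the bias is handled by appending a constant feature, the parameter values are arbitrary, and the inputs are synthetic two-dimensional points standing in for vectorized gene trees.

```python
import random

def train_linear_svm(points_a, points_b, lam=0.01, epochs=100, seed=0):
    """Fit a linear soft-margin SVM by Pegasos-style sub-gradient descent.

    Group A is labelled +1 and group B is labelled -1. A constant feature
    of 1.0 is appended to each point so the last weight acts as the bias.
    """
    rng = random.Random(seed)
    data = [(list(x) + [1.0], 1.0) for x in points_a]
    data += [(list(x) + [1.0], -1.0) for x in points_b]
    w = [0.0] * len(data[0][0])
    step = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            step += 1
            eta = 1.0 / (lam * step)  # decaying step size
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # The regularisation term shrinks the weights at every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move towards classifying x correctly.
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def separation_percentage(w, points_a, points_b):
    """Pooled fraction of held-out points on their own side of the hyperplane."""
    def score(x):
        return sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    correct = sum(score(x) > 0 for x in points_a)
    correct += sum(score(x) < 0 for x in points_b)
    return correct / (len(points_a) + len(points_b))

def cloud(center, n, rng):
    """Synthetic stand-in for a set of vectorized trees."""
    return [[rng.gauss(c, 1.0) for c in center] for _ in range(n)]

rng = random.Random(1)
train_a, train_b = cloud((4, 4), 50, rng), cloud((-4, -4), 50, rng)
test_a, test_b = cloud((4, 4), 100, rng), cloud((-4, -4), 100, rng)

hyperplane = train_linear_svm(train_a, train_b)                  # training set
separation = separation_percentage(hyperplane, test_a, test_b)   # testing set
print(separation)
```

Scoring the hyperplane on a separate testing set, as the flowchart describes, keeps the separation percentage from being inflated by overfitting to the training trees.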

In order to estimate the null distribution ^{∗} and ^{∗} generated by a permutation procedure as follows. First alignment labels are permuted to create hypothetical sets of alignments ^{∗},^{∗} . Then each alignment in ^{∗}is replaced by a bootstrap replicate with the same number of columns as the corresponding alignment in ^{∗}). See the appendix for pseudo-code of the ^{∗},^{∗} is identical to ^{∗}and ^{∗}follows the same marginal empirical distribution derived from

In the

Our use of the SVM separation percentage is motivated by the observation that systematic differences between sets of trees may manifest as a separating direction in feature space (e.g. if tree space is defined by using splits as features, then a separating direction indicates which splits tend to occur in one set of trees and not the other). The SVM tries to find a maximal separating direction by deep analysis of the data, without making Gaussian assumptions like Fisher’s linear discriminant. Furthermore, for two sets of points with high variance and a small but reliable separation (e.g. two parallel discs with only a small separation between), the separation statistic gives a more representative indicator of how likely the two point sets come from different distributions, versus distance-only statistics such as comparing within group to between group variance

Results and discussion

To obtain simulated trees with different distributions, we used coalescent-modeled gene trees simulated within different species tree histories. We first simulated pairs of species trees (_{1} , _{2} ) with _{
e
}) of 100,000 haploid individuals, and various tree depths ranging from 0.1_{
e
} to 10_{
e
} . We then simulated sets of 10,000 gene trees (denoted _{1} and _{2}) under the respective species tree histories using a neutral coalescent model. In addition, for the purpose of assessing false positive rates (see below), we generated an additional set of 10,000 gene trees (_{3} ) within _{2} using the same process and model parameters used for _{2}. These simulation conditions were chosen to represent a broad range of coalescent gene trees within each species tree. For example, at low species tree depth we expect considerable variation among gene trees within a species tree, causing overlap in multidimensional space among gene trees from different species trees. All species tree and gene tree simulations were performed in

To independently assess the variation between sets of gene trees simulated under different species trees at different species tree depths, we used principal component analysis (PCA) and Fisher’s linear discriminant (FLD)
_{1} and _{2} onto a line which maximizes the distance between the means of _{1} and _{2} while minimizing the variance within _{1} and _{2} . Larger values of FLD indicate greater separation between different sets of gene trees. Because these data are in high dimensions we used PCA to reduce the dimensionality of the data. To visualize separation between _{1} and _{2} , we graphed the first two principal components for each gene tree at each species tree depth. Both FLD and PCA were applied to gene trees vectorized using the dissimilarity map.

To simulate DNA sequence data, we used the simulated gene trees described above. For each gene tree we simulated sequences of 1,000 nucleotides under a Hasegawa-Kishino-Yano (HKY)+_{
A
}
_{
C
}
_{
G
}
_{
T
}) = (0.3,0.2,0.2,0.3) : and maintained an ^{−8} was used. These parameters were similar to those used in other recent studies of gene tree evolution within species trees

| Species tree depth (in N_{e} generations) | Average pairwise sequence divergence | Average minimum sequence divergence |
| --- | --- | --- |
| 0.1 | 0.9371 (0.3631) | 0.08 (0.0356) |
| 0.2 | 1.0410 (0.3589) | 0.1 (0.0570) |
| 0.3 | 1.0910 (0.3832) | 0.1 (0.06378) |
| 0.4 | 1.2010 (0.3763) | 0.1 (0.0790) |
| 0.6 | 1.0510 (0.3645) | 0.14 (0.0948) |
| 0.8 | 1.2590 (0.3757) | 0.18 (0.1066) |
| 1.0 | 1.3630 (0.3860) | 0.24 (0.1219) |
| 2.0 | 1.9040 (0.4014) | 0.42 (0.2092) |
| 4.0 | 2.6340 (0.5113) | 0.62 (0.2092) |
| 6.0 | 3.437 (0.5556) | 0.82 (0.4014) |
| 8.0 | 4.409 (0.5312) | 0.54 (0.3151) |
| 8.5 | 3.787 (0.6200) | 0.7 (0.3406) |
| 9.0 | 4.281 (0.7800) | 0.62 (0.2801) |
| 9.5 | 4.311 (0.5041) | 0.52 (0.3124) |
| 10.0 | 4.426 (0.5165) | 0.8 (0.4001) |

Divergences were calculated using all 3000 simulated data sets for a species tree depth (1000 from the first replicate species tree and 2000 from the second replicate species tree). Standard deviations are given in parentheses. All species trees were simulated using an N_{e} of 100,000.

For gene tree reconstruction we used NJ under the Felsenstein 84 (F84) model

**MrBayes**
**parameters.** All Bayesian analyses were run using **Figure S1**. Fifteen data sets, with 100 gene trees (blue diamonds) generated under a coalescent model under a species tree S1, and 100 gene trees (red circles) generated via coalescence under a different species tree S2. All fifteen data sets had a fixed effective population size of 1 Ne individuals. The first two PCA components were used to plot gene trees in two-dimensional space. PCA projections were computed using R
**Figure S2**. Fisher’s linear discriminant for 20,000 gene trees generated under either the same species tree (blue) or two different species trees (red). Gene trees were vectorized using the dissimilarity map. The dashed line at FLD = 1 indicates where the variance between gene trees is equal to the variance within gene trees. Values of FLD that are greater than 1 suggest clear separation between sets of gene trees. **Figure S3**. Graphs depicting the performance of the SVM-based test in detecting differences between gene trees reconstructed from simulated data using NJ, BI, and ML. Trees were reconstructed using PHYLIP, MrBayes and PhyML. One gene tree from species 1 vs. 10 gene trees from species 2. In all graphs, both topological dissimilarity maps (red crosses) and standard dissimilarity maps (blue circles) of trees are considered. Top panels: ROC curves on the simulated data where gene trees are taken from different species trees. See the section Simulation Study of GeneOut for a description of the ROC curve. Bottom: false positive rates were plotted where gene trees are taken from the same species trees. The X-axis is the α-level and the Y-axis gives the corresponding false positive rate.

Simulation study using simulated gene trees

In reality, we estimate phylogenetic trees from observed data, so these trees are subject to some level of uncertainty. Thus, in order to determine our statistical test’s inherent ability to detect separation of the underlying distributions of trees, we first performed a series of experiments where we assumed all trees were the true trees. To assess the true positive and false negative rates of our statistical test, we conducted our statistical hypothesis test with two samples of gene trees generated under the distributions of different species trees. Similarly, to assess the true negative and false positive rates, we conducted our statistical hypothesis test with two samples of gene trees generated under the distribution of the same species tree.

For the first type of tests (assessing true positive and false negative rates) we ran our statistical test using, as input, 10,000 gene trees _{1} and 10,000 gene trees _{2} . We calculated a separation percentage by training and testing an SVM with 168 and 336 (respectively) gene trees sampled from _{1} and _{2} . That is, we sampled 168 gene trees from _{1} , and 168 gene trees from _{2} , and trained an SVM. Next, we sampled 336 gene trees from _{1}, and 336 gene trees from _{2}, and we used the previously trained SVM to compute the separation percentage. We calculated the separation percentage 100 times and took its average. We approximated the null distribution by repeating the following 100 times: we trained and tested an SVM with 168 and 336 gene trees sampled just from _{2} . We estimated a p-value using the separation percentage and the null distribution approximation. We performed this statistical test for all fifteen species tree depths and using either the dissimilarity or topological dissimilarity map vectors.

For the second type of tests (assessing true negative and false positive rates) we ran our statistical test using, as input, 10,000 gene trees _{2} and 10,000 gene trees _{3} . We calculated a separation percentage by training and testing an SVM with 168 and 336 (respectively) gene trees sampled from _{2} and _{3} . We calculated the separation percentage 100 times and we took its average. We approximated the null distribution by repeating the following 100 times: we trained and tested an SVM with 168 and 336 gene trees sampled just from _{3} . We estimated a p-value using the separation percentage and the null distribution approximation. We performed this test for all fifteen species tree depths and using either the dissimilarity or topological dissimilarity map vectors.

Simulation study using simulated DNA sequences

We explored a range of options when evaluating our statistical test in order to assess the effects of balanced vs. unbalanced sets, species tree depth, tree reconstruction method, and tree vectorization method. To test our statistical test’s ability to detect separation when the underlying tree distributions were not the same, we performed statistical tests with alignments generated from gene trees within different species trees. To assess false positive error, we also performed tests where the alignments were generated from gene trees within the same species tree. We fixed four conditions for all tests: we computed the separation percentage 100 times and took its average, we repeated the permutation sub-process 100 times in order to estimate the null distribution, and we used the SVM training and testing phases with sample sizes of 168 and 336, respectively. Our statistical test takes, as input, two sets of DNA sequence alignments _{1},_{2},_{3} defined above. The experiments we performed fall into three categories determined by the number of alignments in

**1 vs. 10**: We selected the first ten alignments generated from y and the first ten alignments generated from _{2} . We denoted the two sets of ten alignments

We selected the first eleven alignments generated from _{2}. We called the set of eleven alignments

**1 vs. 50**: We selected the first 50 alignments generated from _{1} and the first 50 alignments generated from _{2}. We denoted the two sets of fifty alignments

We selected the first 51 alignments generated from _{2} and called the set of alignments

**10 vs. 10**: We selected the first 100 alignments generated from _{1} and the first 100 alignments generated from _{2} . We denoted the two sets of 100 alignments _{1},…,_{10} and _{1},…,_{10} where _{
i
} and _{
i
} are the _{
i
},_{
i
}) of two sets of ten alignments from _{
i
} and _{
i
} , resulting in 10 tests. We performed these ten tests using the NJ tree reconstruction method and performed them for all fifteen species tree depths, using both the dissimilarity and the topological dissimilarity maps. Similarly, we repeated the above experiments with the exception that we selected the first 100 alignments generated from _{2} and the first 100 alignments generated from _{3}.

ROC Curves and False positive plots

To assess the overall accuracy of our statistical test, we used receiver operating characteristic (ROC)

We also calculated the area under the curve (AUC) for each ROC curve to provide a summary statistic of classification accuracy. In general terms, the AUC is the probability that a binary classifier will rank a randomly chosen positive example higher than a randomly chosen negative example; therefore the AUC is equivalent to the Wilcoxon rank-sum (Mann-Whitney U) statistic. In our simulation study, the classifier was the
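The ranking interpretation of the AUC can be computed directly from pairwise comparisons, as in the following Python sketch. The function name `auc` and the example scores are hypothetical. Counting a tie as one half is a common convention and is assumed here.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative.

    Ties count one half. The result equals the Mann-Whitney U statistic
    divided by the number of positive-negative pairs.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical classifier scores: higher means "more likely positive".
area = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.4])
print(area)  # 7.5 of 9 pairs are ranked correctly, about 0.833
```

An AUC of 0.5 corresponds to a classifier that ranks positives and negatives no better than chance, and 1.0 to perfect ranking.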

To assess how our statistical test controls false positive rates, we created graphical representations of the false positive rates vs.

We computed all empirical plots for false positive rates vs.

As described below, NJ reconstruction exhibited competitive performance with ML and BI reconstruction methods in empirical ROC curves and AUCs, and also controlled false positive rates at the desired

Data sets with large numbers of taxa

To evaluate the scalability of our methods for larger numbers of taxa, we tested three larger simulated data sets, with 30, 50, and 75 taxa. We ran _{
e
}) of 100,000 and a tree depth of 100_{
e
} . Within each species tree, we simulated 10 gene trees along with simulated DNA sequence data (again using a process similar to the 8 taxa data), using scaling factors of 3 × ^{−9}, 3 × ^{−10} , 3 × ^{−10} for the 30, 50, and 75 taxa data sets, respectively. Because this particular exercise was performed primarily to evaluate the computational time required to scale to larger numbers of taxa, species tree depths were chosen to create “tight” distributions of gene trees with low discordance. For tree reconstruction we used NJ and we vectorized gene trees using the dissimilarity map. We used training and testing set sizes of 100 and 200 and also 200 and 400.

Simulation results

Trees in space

The first two principal components of the PCA indicated that, at all species tree depths, there was substantial variation in the spread of vectorized gene trees generated under each species tree, and that the amount of overlap between sets of vectorized gene trees simulated under different species trees decreased as species tree depth increased (Additional file
_{
e
} and lower we observed that between-species tree FLD was less than 0.3106 , indicating very little separation of the gene trees. Thus, our statistical test applied to gene trees generated from species trees with species depths of 0.4_{
e
} and lower were omitted when constructing ROC curves and curves for false positive rates vs.

Simulation study using simulated gene trees

The application of _{
e
} ≥ 0.1 . However, when trees were vectorized using dissimilarity maps the null hypothesis was rejected for all trees with _{
e
} ≥ 0.6 . Furthermore, when gene trees were generated under the same species tree as input for

**Our statistical test directly applied to sets of simulated gene trees.** Our statistical test was applied to two sets of 10,000 gene trees, using the dissimilarity and topological dissimilarity maps, and across the fifteen species tree depths. In the first test shown in a line with “D” and in a line with “T”, the two sets of gene trees were generated under different species trees. In the second test shown in a line with “d” and a line with “t”, the two sets of gene trees were generated under the same species tree.

Simulation study using simulated DNA sequences

In an initial application of our statistical test, using an alignment sampling strategy of 1 vs. 10, all three tree reconstruction methods produced ROC curves that were well above the diagonal and empirical AUCs derived from these curves were all greater than 0.805 (Figure

**Comparison of our statistical test performance for three choices of tree reconstruction methods: NJ (red/crosses), ML (blue/circles), and BI (red/X’s).** Trees were reconstructed using **A** and **B** show comparisons of ROC curves on simulated data. See the section **C** and **D** show comparison of curves on false positive rates (**A** and **C** are for dissimilarity map-based tree space; panels **B** and **D** are for topological dissimilarity map. In **C** and **D**, the

In the evaluation of the performance of our statistical test across different alignment sampling strategies (1 vs. 10; 1 vs. 50; 10 vs. 10), the ROC curves were well above the diagonal and produced larger empirical AUCs (

Computation Time

The running of

The running of

Conclusions

Easier access to the genome now provides the opportunity to collect genetic data, either intentionally or unintentionally, from loci that reflect different underlying evolutionary processes. Analysis of trees in multidimensional space has been used previously as a statistical test of trees in a multi-dimensional vector space; however, this has largely been performed as a test for congruence between two given trees

Our use of gene trees simulated across a range of species-tree depths provided us with an opportunity to evaluate the performance of our statistical test across a range of multidimensional tree distributions, from those that were virtually indistinguishable from each other (e.g. at species tree depths of 0.1 _{
e
}; Additional file
_{
e
} ; Additional file
_{
e
} . This result at this species tree depth was particularly surprising due to the exceptional amount of visually-perceived overlap between tree distributions in PCA ordination space (presumably as a function of substantial incomplete lineage sorting). This accuracy at low species tree depths may be due to the large sample sizes (10,000 vs. 10,000). Such large sample sizes are unlikely to be used in empirical tests where smaller numbers of genes are compared and where tree reconstruction will be employed. However, even when these conditions were factored into the performance of our statistical test, the ROC and AUC results indicated that it is a robust method for detecting differences between tree distributions. Equally important in the discussion of our statistical test’s performance is its control of false positive rates. When testing sets of gene trees generated within the same species tree, our statistical test consistently did not reject the null hypothesis. This was evident in high p-values in the application of our statistical test directly to simulated gene trees (Figure

**Graphs depicting the performance of the SVM-based test in detecting differences between gene trees reconstructed from simulated data using NJ.** Trees were reconstructed using **A** and **D**), one gene tree from species 1 versus 50 gene trees from species 2 (**B** and **E**), and 10 gene trees from species 1 versus 10 gene trees from species 2 (**C** and **F**). In all graphs, we denote red lines with crosses topological dissimilarity maps and blue lines with circles standard dissimilarity maps of trees. **A**, **B**, and **C** show ROC curves on the simulated data from gene trees generated under different species trees. See the section **D**, **E**, and **F** show curves for false positive rates vs. **D**, **E**, and **F**, the

From our simulation study it seems that our statistical test has more power with topological dissimilarity maps than with dissimilarity maps. Ané discussed in

The generality of our statistical test and its implementation provide a number of benefits. First, the core of our statistical test is based on a non-parametric test, which provides a relatively fast method of analysis. Even when using model-based BI reconstruction methods, the majority of our tests required only a couple of hours of computation time. Expanded taxon sampling to as many as 75 taxa pushed computation times into the 1–3 day range, which we see as a very acceptable computation time in the current field of model-based multi-locus phylogenetics. Second, our statistical test’s use of reconstructed tree distributions through bootstrapping or sampling from a posterior distribution is expected to help mitigate the problem of tree reconstruction error. This is a likely contributor to the low probability of false positives seen in the ROC plots. Additional file

Systematists often aim to statistically evaluate competing phylogenetic hypotheses with a single gene or concatenated set of genes by comparing trees reconstructed with and without a topological constraint

While the non-parametric nature of our statistical test has the upside that it can be applied to tests of discordance between two sets of trees caused by a range of reasons, the flip-side is that it does not provide an ability to draw specific conclusions about the underlying cause for significant differences between tree distributions. Subsequent model-based analyses that can identify specific genetic processes (e.g. selection

Software

The software

Appendix

**Input:** Two sets of alignments, **Output:** p-value under the null hypothesis that the trees underlying

Set _{
A
} := _{
B
} :=

For each alignment in _{
A
} trees.

For each alignment in _{
B
} trees.

Let _{
A
} := set of trees generated from

Let _{
B
} := set of trees generated from

Train SVM on data (_{
A
},_{
B
}).

Set _{
A
} := _{
B
} :=

For each alignment in _{
A
} trees.

For each alignment in _{
B
} trees.

Let _{
A
} := set of trees generated from

Let _{
B
} := set of trees generated from

Let _{0} := Separation percentage between _{
A
} and _{
B
}.

Set count := 0.

**for**
**do**

Order the alignment sets arbitrarily,

(_{1},…,_{
ℓ
}),_{1},…,_{
m
}).

Randomly permute set membership labels of

alignments in ^{
′
},^{
′
}.

For each
_{
i
}| columns

of

For each
_{
i
}| columns

of

For each alignment in ^{
′
}, reconstruct _{
A
} trees

For each alignment in ^{
′
}, reconstruct _{
B
} trees.

Let
^{
′
}.

Let
^{
′
}.

Train SVM on data

For each alignment in ^{
′
}, reconstruct _{
A
}trees.

For each alignment in ^{
′
}, reconstruct _{
B
} trees.

Let
^{
′
}.

Let
^{
′
}.

Let

**if**
_{0} **then**

count := count + 1.

**end if**

**end for**

Output p-value := count / k.
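The counting logic of the procedure above (compute the observed statistic, permute set membership k times, and report count / k) can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the authors' code: the bootstrap resampling of alignment columns and the tree reconstruction steps are omitted, the inputs are plain numbers instead of alignments, and `midpoint_separation` is a hypothetical one-dimensional stand-in for the SVM separation percentage that is scored on the same data it is fitted to.

```python
import random

def permutation_p_value(set_a, set_b, statistic, k=200, seed=0):
    """Estimate a p-value by permuting set-membership labels k times."""
    rng = random.Random(seed)
    observed = statistic(set_a, set_b)
    pooled = list(set_a) + list(set_b)
    count = 0
    for _ in range(k):
        # Randomly reassign items to two sets of the original sizes.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(set_a)], pooled[len(set_a):]
        if statistic(perm_a, perm_b) >= observed:
            count += 1
    return count / k

def midpoint_separation(a, b):
    """Stand-in statistic (not an SVM): pooled fraction of values lying on
    their own side of the midpoint between the two group means."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    mid = (mean_a + mean_b) / 2.0
    sign = 1.0 if mean_a >= mean_b else -1.0
    correct = sum(sign * (x - mid) > 0 for x in a)
    correct += sum(sign * (x - mid) < 0 for x in b)
    return correct / (len(a) + len(b))

# Two clearly different sets: the observed separation is rarely matched.
different_a = [float(i) for i in range(10)]
different_b = [float(i) for i in range(20, 30)]
p_diff = permutation_p_value(different_a, different_b, midpoint_separation)

# Two interleaved sets: permuted separations are as large as the observed one.
same_a = [float(i) for i in range(0, 20, 2)]
same_b = [float(i) for i in range(1, 20, 2)]
p_same = permutation_p_value(same_a, same_b, midpoint_separation)
print(p_diff, p_same)
```

Because the statistic is passed in as a function, the same loop could wrap a full pipeline that bootstraps alignments, reconstructs trees, and trains and tests an SVM inside each iteration.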

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

DH developed methods and algorithms, wrote all software and testing scripts, generated simulation data, ran all simulations, and drafted and revised the manuscript. PH developed methods and algorithms, and drafted and revised the manuscript. EO designed simulation, and drafted and revised the manuscript. DW supervised and coordinated this project, designed simulation, analyzed the simulation results, and drafted and revised the manuscript. RY developed methods and algorithms, designed statistical analysis on the simulation results, supervised and coordinated this project, analyzed the simulation results, and drafted and revised the manuscript. All authors read and approved the final manuscript.

Authors’ information

Joint first authors: David C. Haws and Peter Huggins. Joint last authors: David W. Weisrock and Ruriko Yoshida.

Acknowledgements

This work was supported by a grant from the National Institutes of Health to D.H., P.H., and R.Y. (5R01GM086888), a National Science Foundation grant to D.W.W., E.M.O., and R.Y. (DEB-0949532), and through the Lane Fellowship in Computational Biology to P.H. We thank the University of Kentucky’s High Power Computing resources.