Département de Sciences biologiques, Université de Montréal, C.P. 6128, Succ. Centre-ville, Montréal, Québec, H3C 3J7, Canada
Abstract
Background
CADM is a statistical test used to estimate the level of Congruence Among Distance Matrices. It has been shown in previous studies to have a correct rate of type I error and good power when applied to dissimilarity matrices and to ultrametric distance matrices. Contrary to most other tests of incongruence used in phylogenetic analysis, the null hypothesis of the CADM test assumes complete incongruence of the phylogenetic trees instead of congruence. In this study, we performed computer simulations to assess the type I error rate and power of the test. It was applied to additive distance matrices representing phylogenies and to genetic distance matrices obtained from nucleotide sequences of different lengths that were simulated on randomly generated trees of varying sizes, and under different evolutionary conditions.
Results
Our results showed that the test has an accurate type I error rate and good power. As expected, power increased with the number of objects (i.e., taxa), the number of partially or completely congruent matrices and the level of congruence among distance matrices.
Conclusions
Based on our results, we suggest that CADM is an excellent candidate to test for congruence and, when present, to estimate its level in phylogenomic studies where numerous genes are analysed simultaneously.
Background
In phylogenetic studies, data matrices are assembled and analysed to infer evolutionary relationships among species or higher taxa. Depending on the study, character-state data or distance matrices may be used, and several different types of data may be available to estimate the phylogeny of a particular group
Different approaches have been proposed to analyse the growing amount of information that may originate from different sources. The total evidence approach
The approach used often depends on the level of congruence or incongruence in the data. In phylogenetic analysis, "incongruence" can be defined as differences in phylogenetic trees. It is observed when different partitions, or data sets, sampled on the same taxa suggest different evolutionary histories
Numerous factors have been described to explain differences in phylogenetic trees obtained from the analysis of data sets containing the same species. A wide range of evolutionary processes may cause nucleotides at different sites to evolve differently, for example due to their codon positions or to different functional constraints
Alternatively, the term "congruence" is often used to describe data sets, characters or trees that correspond to identical (or compatible) relationships among taxa
As described above, the terms "congruence" and "incongruence" can have a more or less strict meaning with regard to the level of similarity. The definitions used in this paper are in concordance with the test of congruence among distance matrices (CADM). CADM was introduced by
More specifically, given two or more data sets (e.g., different genes) studied on the same species, a concordance statistic [Kendall's
Previously published simulations have shown that the global and
Results
Type I error rate
Type I error rate was evaluated by calculating the proportion of replicated simulations that rejected the null hypothesis when H_{0} was true by construction. To construct data sets under a true H_{0} of complete incongruence among matrices, incongruent matrices (IM) were compared using CADM. Table
Type I error rates for CADM simulations with nucleotide sequence matrices simulated on independently generated additive trees under a GTR + Γ + I model of evolution.
| n | L | IM = 2 | IM = 3 | IM = 4 | IM = 5 | IM = 10 |
|---|---|---|---|---|---|---|
| 10 | 1000 | 0.052 (0.038, 0.066) | 0.047 (0.034, 0.060) | 0.042 (0.030, 0.054) | 0.049 (0.036, 0.062) | 0.046 (0.033, 0.059) |
| 10 | 5000 | 0.050 (0.036, 0.064) | 0.050 (0.036, 0.064) | 0.038 (0.026, 0.050) | 0.046 (0.033, 0.059) | 0.046 (0.033, 0.059) |
| 10 | 10 000 | 0.049 (0.036, 0.062) | 0.048 (0.035, 0.061) | 0.046 (0.033, 0.059) | 0.046 (0.033, 0.059) | 0.047 (0.034, 0.060) |
| 10 | 20 000 | 0.047 (0.034, 0.060) | 0.047 (0.034, 0.060) | 0.039 (0.027, 0.051) | 0.045 (0.032, 0.058) | 0.043 (0.030, 0.056) |
| 25 | 1000 | 0.054 (0.040, 0.068) | 0.056 (0.042, 0.070) | 0.054 (0.040, 0.068) | 0.056 (0.042, 0.070) | 0.040 (0.028, 0.052) |
| 25 | 5000 | 0.053 (0.039, 0.070) | 0.048 (0.035, 0.061) | 0.046 (0.033, 0.059) | 0.050 (0.036, 0.064) | 0.042 (0.030, 0.054) |
| 25 | 10 000 | 0.046 (0.033, 0.059) | 0.054 (0.040, 0.068) | 0.050 (0.036, 0.064) | 0.049 (0.036, 0.062) | 0.050 (0.036, 0.064) |
| 25 | 20 000 | 0.043 (0.030, 0.056) | 0.050 (0.036, 0.064) | 0.054 (0.040, 0.068) | 0.047 (0.034, 0.060) | 0.040 (0.028, 0.052) |
| 50 | 1000 | 0.048 (0.035, 0.061) | 0.062 (0.047, 0.077) | 0.059 (0.044, 0.074) | 0.050 (0.036, 0.064) | 0.049 (0.036, 0.062) |
| 50 | 5000 | 0.056 (0.042, 0.070) | 0.049 (0.036, 0.062) | 0.055 (0.041, 0.069) | 0.053 (0.039, 0.070) | 0.050* (0.007, 0.093) |
| 50 | 10 000 | 0.041 (0.029, 0.053) | 0.048 (0.035, 0.061) | 0.053 (0.039, 0.067) | 0.051 (0.037, 0.065) | 0.050* (0.007, 0.093) |
| 50 | 20 000 | 0.050 (0.036, 0.064) | 0.053 (0.039, 0.067) | 0.050 (0.036, 0.064) | 0.056 (0.042, 0.070) | 0.060* (0.012, 0.107) |
| 100 | 1000 | 0.051 (0.037, 0.065) | 0.042 (0.030, 0.054) | 0.040 (0.028, 0.052) | 0.044 (0.031, 0.057) | 0.030* (0.004, 0.064) |
| 100 | 5000 | 0.066 (0.051, 0.081) | 0.040* (0.001, 0.079) | 0.030* (0.004, 0.064) | 0.050* (0.007, 0.093) | 0.060* (0.013, 0.107) |
| 100 | 10 000 | 0.030* (0.004, 0.064) | 0.060* (0.013, 0.107) | 0.070* (0.019, 0.120) | 0.050* (0.007, 0.093) | 0.040* (0.001, 0.079) |
| 100 | 20 000 | 0.060* (0.013, 0.107) | 0.050* (0.007, 0.093) | 0.040* (0.001, 0.079) | 0.060* (0.013, 0.107) | 0.070* (0.019, 0.120) |
Rejection rates are given at a significance level of 0.05, with 95% confidence intervals in parentheses. Calculated from 1000 replicates, except for cells with * (100 replicates). IM = incongruent matrix, n = number of taxa, L = nucleotide sequence length.
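As an illustration of how a rejection rate and its 95% confidence interval can be derived from a count of rejections, the following minimal sketch uses a normal-approximation (Wald) interval for a binomial proportion. The text does not state which interval was used, so the choice of method and the function name `rejection_rate_ci` are assumptions; the approach does reproduce, for example, the interval reported for 52 rejections out of 1000 replicates.

```python
import math


def rejection_rate_ci(n_rejected, n_replicates, z=1.96):
    """Rejection rate with a normal-approximation (Wald) confidence interval.

    Assumption: a Wald interval; z = 1.96 corresponds to a 95% interval.
    The bounds are clipped to the [0, 1] range.
    """
    rate = n_rejected / n_replicates
    half_width = z * math.sqrt(rate * (1.0 - rate) / n_replicates)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


# 52 rejections of a true H0 out of 1000 replicates
rate, lower, upper = rejection_rate_ci(52, 1000)
print(f"{rate:.3f} ({lower:.3f}, {upper:.3f})")  # prints 0.052 (0.038, 0.066)
```

With only 100 replicates the same rate yields a much wider interval, which is why the cells marked with * in the table have broader confidence limits.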
Power: Different levels of congruence among matrices
The estimated power is the proportion of replicates for which the null hypothesis is rejected when H_{0} is false by construction. For 1000 replicates, a power of 1.0 (i.e., a rejection rate of 1.0) indicates that all replicates rejected the false null hypothesis, and thus power is maximal. Figure
Rejection rates of H_{0} for the
In
Rejection rates of H_{0} for
Rejection rates of H_{0 }for CADM comparing data sets simulated on
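| n | L | CM_{I} = 2 | CM_{I} = 3 | CM_{I} = 4 | CM_{I} = 5 |
|---|---|---|---|---|---|
| 10 | 1000 | 0.308 (0.279-0.337) | 0.928 (0.912-0.944) | 1.000 (-) | 1.000 (-) |
| 10 | 5000 | 0.363 (0.333-0.393) | 0.973 (0.963-0.983) | 1.000 (-) | 1.000 (-) |
| 10 | 10 000 | 0.383 (0.353-0.413) | 0.966 (0.955-0.977) | 1.000 (-) | 1.000 (-) |
| 10 | 20 000 | 0.380 (0.350-0.410) | 0.974 (0.964-0.984) | 1.000 (-) | 1.000 (-) |
| 25 | 1000 | 0.569 (0.538-0.600) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 25 | 5000 | 0.662 (0.633-0.691) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 25 | 10 000 | 0.675 (0.646-0.704) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 25 | 20 000 | 0.682 (0.653-0.711) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 1000 | 0.740 (0.715-0.769) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 5000 | 0.851 (0.829-0.873) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 10 000 | 0.869 (0.848-0.890) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 20 000 | 0.898 (0.880-0.917) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 100 | 1000 | 0.890* (0.828-0.952) | 1.000* (-) | 1.000* (-) | 1.000* (-) |
| 100 | 5000 | 0.970* (0.936-1.000) | 1.000* (-) | 1.000* (-) | 1.000* (-) |
| 100 | 10 000 | 0.970* (0.936-1.000) | 1.000* (-) | 1.000* (-) | 1.000* (-) |
| 100 | 20 000 | 0.970* (0.936-1.000) | 1.000* (-) | 1.000* (-) | 1.000* (-) |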
A false H_{0} was constructed by including a different number of completely congruent matrices (CM_{I}) together with a different number of incongruent matrices (IM), for a total of five distance matrices (M = 5). When CM_{I} = 5, all matrices included in the test are congruent. Rejection rates are given at a significance level of 0.05, with 95% confidence intervals in parentheses. Calculated from 1000 replicates, except for cells with * (100 replicates). Dashes (-) correspond to a CI of 1.000-1.000.
Rejection rates of H_{0 }for CADM comparing data sets simulated on
| n | L | CM_{P} = 2 | CM_{P} = 3 | CM_{P} = 4 | CM_{P} = 5 |
|---|---|---|---|---|---|
| 10 | 1000 | 0.106 (0.087-0.125) | 0.263 (0.236-0.290) | 0.523 (0.492-0.554) | 0.802 (0.777-0.827) |
| 10 | 5000 | 0.105 (0.086-0.124) | 0.300 (0.272-0.328) | 0.586 (0.555-0.617) | 0.866 (0.845-0.887) |
| 10 | 10 000 | 0.113 (0.093-0.133) | 0.311 (0.282-0.340) | 0.608 (0.578-0.638) | 0.872 (0.851-0.893) |
| 10 | 20 000 | 0.122 (0.102-0.142) | 0.314 (0.285-0.343) | 0.615 (0.585-0.645) | 0.875 (0.854-0.896) |
| 25 | 1000 | 0.130 (0.109-0.151) | 0.409 (0.378-0.440) | 0.805 (0.780-0.830) | 0.977 (0.968-0.986) |
| 25 | 5000 | 0.158 (0.135-0.181) | 0.495 (0.464-0.526) | 0.893 (0.874-0.912) | 0.993 (0.988-0.998) |
| 25 | 10 000 | 0.151 (0.129-0.173) | 0.508 (0.477-0.539) | 0.902 (0.884-0.920) | 0.997 (0.994-1.000) |
| 25 | 20 000 | 0.153 (0.131-0.175) | 0.514 (0.483-0.545) | 0.907 (0.889-0.925) | 0.996 (0.992-1.000) |
| 50 | 1000 | 0.163 (0.140-0.186) | 0.560 (0.529-0.591) | 0.960 (0.948-0.972) | 1.000 (-) |
| 50 | 5000 | 0.206 (0.181-0.231) | 0.701 (0.673-0.729) | 0.991 (0.985-1.000) | 1.000 (-) |
| 50 | 10 000 | 0.218 (0.192-0.244) | 0.730 (0.702-0.758) | 0.996 (0.992-1.000) | 1.000 (-) |
| 50 | 20 000 | 0.229 (0.203-0.255) | 0.748 (0.721-0.775) | 0.997 (0.994-1.000) | 1.000 (-) |
| 100 | 1000 | 0.210* (0.129-0.291) | 0.730* (0.641-0.819) | 0.990* (0.970-1.000) | 1.000* (-) |
| 100 | 5000 | 0.260* (0.173-0.347) | 0.880* (0.815-0.945) | 1.000* (-) | 1.000* (-) |
| 100 | 10 000 | 0.270* (0.181-0.359) | 0.900* (0.840-0.960) | 1.000* (-) | 1.000* (-) |
| 100 | 20 000 | 0.310* (0.218-0.402) | 0.920* (0.866-0.974) | 1.000* (-) | 1.000* (-) |
A different number of partially congruent matrices (CM_{P}) and a different number of incongruent matrices (IM) were included in each test, for a total of five distance matrices (M = 5). To generate CM_{P}, nucleotide sequences were simulated on partly similar trees (with permutations of 40% of n). Rejection rates are given at a significance level of 0.05, with 95% confidence intervals in parentheses. Calculated from 1000 replicates, except for cells with * (100 replicates). Dashes (-) correspond to a CI of 1.000-1.000.
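The idea of deriving a partly similar matrix by permuting a fixed proportion of the taxa can be sketched as follows. This is only an illustration under stated assumptions: the function name `permute_taxa_subset` is hypothetical, only taxon labels are permuted (the branch-length permutations mentioned in the Methods are not reproduced), and a random shuffle may by chance leave some of the chosen taxa in place.

```python
import random


def permute_taxa_subset(matrix, proportion, rng):
    """Return a copy of a square distance matrix in which a random subset
    of taxa (about `proportion` of them) have exchanged positions.

    Also returns the mapping used: new position i holds old taxon mapping[i].
    """
    n = len(matrix)
    k = round(proportion * n)           # number of taxa to permute
    chosen = rng.sample(range(n), k)    # taxa that will be moved
    shuffled = chosen[:]
    rng.shuffle(shuffled)               # new order among the chosen taxa
    mapping = list(range(n))            # all other taxa stay in place
    for source, target in zip(chosen, shuffled):
        mapping[source] = target
    permuted = [[matrix[mapping[i]][mapping[j]] for j in range(n)]
                for i in range(n)]
    return permuted, mapping


# Example: permute 40% of the taxa of a random symmetric 10 x 10 matrix
rng = random.Random(3)
n = 10
base = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        base[i][j] = base[j][i] = rng.random()
partly_similar, mapping = permute_taxa_subset(base, 0.4, rng)
```

Rows and columns are permuted together, so the result is still a symmetric distance matrix with the same set of distances; only the assignment of distances to taxon pairs changes for the permuted taxa.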
Power: Effect of different evolutionary parameters
Power was also calculated for distance matrices obtained from nucleotide sequences simulated on identical trees under a GTR model, with identical or different evolutionary parameters. In Table
Rejection rates of H_{0 }for CADM comparing data sets simulated on
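| n | L | α = 0.06, s = 0.02 | α = 0.06, s = 0.4 | α = 0.8168, s = 0.02 | α = 0.8168, s = 0.4 | α = 200, s = 0.02 | α = 200, s = 0.4 |
|---|---|---|---|---|---|---|---|
| 10 | 1000 | 0.789 (0.764-0.814) | 0.958 (0.946-0.970) | 0.944 (0.930-0.958) | 1.000 (-) | 0.940 (0.925-0.955) | 1.000 (-) |
| 10 | 5000 | 0.999 (0.997-1.000) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 10 | 10 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 10 | 20 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 1000 | 0.891 (0.872-0.910) | 0.997 (0.994-1.000) | 0.976 (0.966-0.986) | 1.000 (-) | 0.978 (0.969-0.987) | 1.000 (-) |
| 50 | 5000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 10 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 20 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |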
Results are shown for a GTR + Γ + I model with different s and α. Rejection rates are given at a significance level of 0.05, with 95% confidence intervals in parentheses. Calculated from 1000 replicates.
Rejection rates of H_{0 }for CADM comparing data sets simulated on
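| n | L | s = 0.02; α: 200 vs. 0.06 | s = 0.02; α: 200 vs. 0.8168 | s = 0.4; α: 200 vs. 0.06 | s = 0.4; α: 200 vs. 0.8168 | α = 0.06; s: 0.02 vs. 0.4 | α = 0.8168; s: 0.02 vs. 0.4 | α = 200; s: 0.02 vs. 0.4 |
|---|---|---|---|---|---|---|---|---|
| 10 | 1000 | 0.866 (0.845-0.887) | 0.939 (0.924-0.954) | 0.993 (0.988-0.998) | 1.000 (-) | 0.949 (0.935-0.963) | 0.998 (0.995-1.000) | 0.999 (0.997-1.000) |
| 10 | 5000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 10 | 10 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 10 | 20 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 25 | 1000 | 0.927 (0.911-0.943) | 0.965 (0.954-0.976) | 1.000 (-) | 1.000 (-) | 0.992 (0.986-0.998) | 1.000 (-) | 1.000 (-) |
| 25 | 5000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 25 | 10 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 25 | 20 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 1000 | 0.945 (0.931-0.959) | 0.980 (0.971-0.989) | 1.000 (-) | 1.000 (-) | 0.999 (0.997-1.000) | 1.000 (-) | 1.000 (-) |
| 50 | 5000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 10 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |
| 50 | 20 000 | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) | 1.000 (-) |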
Rejection rates are given at a significance level of 0.05, with 95% confidence intervals in parentheses. Calculated from 1000 replicates.
Discussion
Incongruence among data sets is widespread in phylogenetic analyses
To investigate the type I error rate, which is the proportion of replicates that rejected H_{0} when it was true by construction, incongruent distance matrices were compared. In every case, the 95% CI of the rejection rate included the nominal significance level of 0.05 used for the test (Table
Numerous congruence tests have also been designed recently, such as principal component analysis on log-likelihood ratios or p-values
As observed in previous simulation studies
When the overall level of congruence decreases among congruent matrices, so does power (Figure
One of the main advantages of CADM lies in its ability to test several matrices in a single analysis, and identify partially or completely congruent and incongruent members of a set of matrices. This is achieved through
Conclusions
In the light of our results, CADM has proven to be statistically valid to detect partial or complete congruence among distance matrices and estimate its level in a phylogenetic context. One important advantage of this permutation method is its computational efficiency in significance testing. CADM offers several other advantages with respect to previously described incongruence tests: (1) The statistic is calculated directly from the distance matrices, so different types of data can be compared after conversion to distance matrices using an appropriate function. (2) Data that readily come in the form of distance matrices do not have to be further transformed into character-state data matrices. (3) Given that distances can be calculated directly from the raw data without inferring a phylogenetic tree, possible biases introduced by the use of an inappropriate phylogenetic method can be reduced. (4) Appropriate distances can be chosen for each individual data set to accurately model its evolutionary parameters. (5) If needed, path-length distances calculated on phylogenetic trees can also be used, which provides an interesting method to test for congruence among different trees in a supertree approach. (6) Distance matrices can be weighted differentially to account for different numbers of characters. (7)
Methods
CADM test
The null hypothesis (H_{0}) of the global CADM test is the complete incongruence of the matrices under study, whether these matrices contain pairwise genetic distances, pairwise path-length distances, or pairwise topological distances. Rejecting H_{0} indicates that at least two matrices contain a certain amount of congruent information. The global statistic value measures the level of congruence for partially congruent matrices, with a maximum value of 1 indicating complete congruence among the matrices (i.e., identical rankings of distance matrices). One advantage of the test is that congruence can easily be detected and measured at different steps of the analysis, since CADM can be applied to any type of distance matrices (Figure
Performing tests of incongruence. Three different incongruence tests are possible: directly on the pairwise genetic distance matrices (Test 1), on the path-length distance matrices corresponding to the phylogenetic trees (Test 2), and on the topological distance matrices obtained by setting all branch lengths to 1 in the phylogenies (Test 3).
Application of the CADM test. Graphical and numerical example showing a particular case for which two phylogenetic trees are incongruent in their path-length distances (Test 2, Figure 3) but topologically congruent (Test 3, Figure 3).
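The distinction between path-length and topological distances can be made concrete with a small sketch. The tree encoding (a list of edges with branch lengths), the function name `leaf_distances` and the two four-taxon trees below are illustrative assumptions, not the example shown in the figure. Both trees share one topology but have different branch lengths, so their topological distance matrices are identical while their path-length matrices differ.

```python
def leaf_distances(edges, leaves, unit_lengths=False):
    """Pairwise distances among the leaves of an unrooted tree.

    edges: iterable of (node_a, node_b, branch_length) tuples.
    unit_lengths=True sets every branch length to 1, which turns
    path-length distances into topological distances.
    """
    adjacency = {}
    for a, b, length in edges:
        weight = 1.0 if unit_lengths else float(length)
        adjacency.setdefault(a, []).append((b, weight))
        adjacency.setdefault(b, []).append((a, weight))

    def distances_from(start):
        # Depth-first traversal; in a tree each node is reached by one path.
        dist = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour, weight in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + weight
                    stack.append(neighbour)
        return dist

    rows = []
    for leaf in leaves:
        dist = distances_from(leaf)
        rows.append([dist[other] for other in leaves])
    return rows


leaves = ["A", "B", "C", "D"]
# Same topology ((A,B),(C,D)), different branch lengths (hypothetical values)
tree_1 = [("A", "x", 0.1), ("B", "x", 0.2), ("x", "y", 0.3),
          ("C", "y", 0.4), ("D", "y", 0.5)]
tree_2 = [("A", "x", 0.9), ("B", "x", 0.1), ("x", "y", 0.2),
          ("C", "y", 0.1), ("D", "y", 0.6)]

path_1 = leaf_distances(tree_1, leaves)
path_2 = leaf_distances(tree_2, leaves)
topo_1 = leaf_distances(tree_1, leaves, unit_lengths=True)
topo_2 = leaf_distances(tree_2, leaves, unit_lengths=True)
print(topo_1 == topo_2)  # prints True
print(path_1 == path_2)  # prints False
```

Either kind of matrix can then be supplied to a congruence test, which is what allows Test 2 and Test 3 to give different answers for the same pair of trees.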
• The upper off-diagonal section of each distance matrix is unfolded and written into a vector corresponding to row
• The entries of each row are transformed into ranks according to their values.
• The sum of ranks (
• The mean (
• The Kendall coefficient of concordance (
where
and
in which
• The observed Friedman's
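The steps listed above can be sketched in a few functions. This is a minimal illustration, not the authors' implementation: it uses the textbook form of Kendall's coefficient of concordance with a correction for tied ranks, W = 12S / (p^2 (m^3 - m) - pT), and Friedman's statistic chi2 = p(m - 1)W, where p is the number of matrices and m the number of distances per matrix. The symbols, the function names and the permutation scheme (taxa of every matrix permuted independently) are assumptions.

```python
import random


def average_ranks(values):
    """Ranks from 1 to len(values); tied values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def upper_triangle(matrix):
    """Unfold the upper off-diagonal section of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def kendall_w(rank_rows):
    """Kendall's W for p rows of ranks over m columns, corrected for ties."""
    p, m = len(rank_rows), len(rank_rows[0])
    col_sums = [sum(row[j] for row in rank_rows) for j in range(m)]
    mean_sum = sum(col_sums) / m
    s = sum((r - mean_sum) ** 2 for r in col_sums)
    t = 0.0  # tie correction: sum of (t_k^3 - t_k) over all groups of ties
    for row in rank_rows:
        counts = {}
        for r in row:
            counts[r] = counts.get(r, 0) + 1
        t += sum(c ** 3 - c for c in counts.values())
    return 12.0 * s / (p ** 2 * (m ** 3 - m) - p * t)


def cadm_global(matrices, n_perm=999, seed=1):
    """Return (W, Friedman chi2, permutational p-value) for the global test."""
    rng = random.Random(seed)
    n_taxa = len(matrices[0])

    def statistics(ms):
        rows = [average_ranks(upper_triangle(m)) for m in ms]
        w = kendall_w(rows)
        return w, len(rows) * (len(rows[0]) - 1) * w

    w_obs, chi2_obs = statistics(matrices)
    n_ge = 1  # the observed value is counted in the reference distribution
    for _ in range(n_perm):
        permuted = []
        for m in matrices:
            idx = list(range(n_taxa))
            rng.shuffle(idx)  # permute rows and columns together
            permuted.append([[m[a][b] for b in idx] for a in idx])
        if statistics(permuted)[1] >= chi2_obs - 1e-9:
            n_ge += 1
    return w_obs, chi2_obs, n_ge / (n_perm + 1)


# Example: three copies of one random matrix are completely congruent
rng = random.Random(0)
n = 6
d = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        d[i][j] = d[j][i] = rng.random()
w, chi2, p_value = cadm_global([d, d, d], n_perm=199)
```

Permuting whole taxa rather than individual distances keeps the internal structure of each matrix intact, which is the usual design choice for permutation tests on distance matrices.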
For the simulations described below, one thousand replicates were simulated for each combination of parameters, unless stated otherwise. For each replicate, 999 random permutations were computed to estimate the reference distribution of the CADM statistic. We calculated the rate of rejection of H_{0} with its 95% confidence interval (CI), at a nominal significance level of 0.05, for cases where H_{0} was true (type I error rate) and for cases where H_{0} was false (power). All the analyses were performed on ten Power Mac G5 computers with PowerPC 970 MP processors (2 × 2.5 GHz).
Type I error rate
The type I error rate, which is the probability of rejecting H_{0 }when the data conform to this hypothesis, was assessed for both the global and
Simulation protocol to generate distance matrices. The simulation protocol involves three steps: 1) additive distance matrices (A to X) are generated, 2) phylogenetic trees are inferred, and 3) DNA sequences are simulated on the trees.
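Step 1 of the protocol relies on the matrices being additive, that is, exactly representable as path-length distances on a tree. A standard way to verify this property is the four-point condition: for every four taxa, the two largest of the three pairwise sums of distances must be equal. The sketch below is an illustration only; the function name `is_additive`, the tolerance and the example matrix are assumptions.

```python
from itertools import combinations


def is_additive(matrix, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix."""
    n = len(matrix)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([
            matrix[i][j] + matrix[k][l],
            matrix[i][k] + matrix[j][l],
            matrix[i][l] + matrix[j][k],
        ])
        # The two largest sums must coincide (up to rounding error).
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Path-length distances on a four-taxon tree (hypothetical branch lengths)
tree_like = [
    [0.0, 0.3, 0.8, 0.9],
    [0.3, 0.0, 0.9, 1.0],
    [0.8, 0.9, 0.0, 0.9],
    [0.9, 1.0, 0.9, 0.0],
]
print(is_additive(tree_like))  # prints True

# Changing a single distance breaks additivity
distorted = [row[:] for row in tree_like]
distorted[0][2] = distorted[2][0] = 0.5
print(is_additive(distorted))  # prints False
```

Distances estimated from simulated sequences will generally satisfy this condition only approximately, which is one reason the test is evaluated on both additive matrices and sequence-derived genetic distances.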
Power
Power, which is the rate of rejection of a false H_{0}, was evaluated for different conditions of application of CADM. Rejection rates of H_{0 }were calculated with sets of distance matrices that included varying numbers of congruent matrices (CM) with different levels of similarity and different evolutionary parameters. The number of matrices (M) varied in a set and included incongruent matrices (IM) in addition to CM, for cases where CM < M.
Power: Different levels of congruence among matrices
Nucleotide sequences were simulated under a GTR + Γ + I model on the NJ trees obtained from partly similar matrices (CM_{P}) and from identical matrices (CM_{I}). CM_{P }were generated by random permutations of different numbers of taxa and branch lengths from a random additive distance matrix. As the number of permuted taxa increases, so does the distortion of the original matrix, whereas the level of congruence among matrices decreases. The number of taxa permuted varied according to the total number of taxa (n) included in each matrix, in order to maintain the same proportion of the taxa permuted regardless of the matrix size. The effect of the level of congruence on power was tested for CM_{P }= 3, out of a total of five matrices (M = 5), with n = 10 or 50, and L = 10 000 bp. The power of
Power: Effect of different evolutionary parameters
Because genes controlled by different evolutionary processes can share an identical evolutionary history (i.e., branching pattern), we investigated the effect of different evolutionary parameters on the power of the CADM test. Following
Authors' contributions
FJL and VC conceived the simulation protocol and participated in its design and realization. PL originally conceived and programmed the CADM method. For this paper, he participated in the elaboration of the simulation protocol and wrote R-language functions for CADM. VC performed the simulations and wrote the manuscript. All authors read, commented, and approved the final manuscript.
Acknowledgements
We would like to thank the members of the Laboratoire d'Écologie Moléculaire et d'Évolution (LEMEE) for their constructive comments on a preliminary version of this manuscript. For the phylogenetic analyses, we used the computational resources located in the Laboratoire Interfacultaires de MicroInformatique de l'Université de Montréal and we thank Marie-Hélène Duplain for granting access to the lab outside business hours. This study was supported by NSERC and FQRNT scholarships to VC and by NSERC grant OGP0155251 to FJL.