Selecting informative subsets of sparse supermatrices increases the chance to find correct trees

Misof, Bernhard; Meyer, Benjamin; von Reumont, Björn Marcus; Kück, Patrick; Misof, Katharina; Meusemann, Karen

doi:10.1186/1471-2105-14-348

Methodology article
Open access
Published: 03 December 2013


BMC Bioinformatics volume 14, Article number: 348 (2013)


Abstract

Background

Character matrices with extensive missing data are frequently used in phylogenomics with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. The drawback of these selections is their exclusive reliance on data coverage without consideration of actual signal in the data, so they might not deliver optimal data matrices in terms of potential phylogenetic signal. In order to circumvent this problem, we have developed a heuristic implemented in the software mare which (1) assesses information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10-30%.

Results

With matrices of 50 taxa × 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10-30%, Maximum Likelihood (ML) tree reconstructions failed to recover correct trees. A selection of a data subset with the herein proposed approach increased the chance to recover correct partial trees more than 10-fold. The selection of data subsets with the herein proposed simple hill climbing procedure performed well either considering the information content or just a simple presence/absence information of genes. We also applied our approach to an empirical data set, addressing questions of vertebrate systematics. With this empirical dataset, selecting a data subset with high information content and supporting a tree with high average bootstrap support was most successful if information content of genes was considered.

Conclusions

Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis outperforming the usually used simple selections of taxa and genes with high data coverage.

Background

In most phylogenomic studies supermatrices of concatenated presumably orthologous genes are used for tree inference [1-18]. Due to the failure of consistently identifying orthologous genes among taxa [2] and/or due to general sparse sequence data availability these supermatrices frequently display a low data coverage, down to 8% [2]. Simulation studies showed that in these instances chances of recovering a correct and robust tree can drastically decrease [1, 19]. Additionally, Wiens [20, 21], Philippe [22], Sanderson [1, 19, 23], Driskell [2], Hartmann [24] and colleagues showed that low gene data coverage of single taxa can already impede the success of tree reconstructions. In contrast, other simulation studies demonstrated that single taxa with low gene data coverage can help breaking up long branches and thus improve tree reconstructions [20, 21, 25-28]. These mentioned analyses of empirical and simulated data demonstrate that confounding effects of low gene data coverage on tree inference can hardly be generalized [1, 3, 11, 29-36].

Despite these unresolved issues many investigators select sets of taxa with high gene data coverage assuming that the high gene data coverage will improve the robustness of tree inferences [3, 4, 9, 11, 16, 17]. However, these threshold criteria are arbitrary and do not take into account potential phylogenetic signal of the data. Those approaches might not lead to the desired increase of tree robustness. For example, tree robustness will not increase if high gene data coverage is achieved by selecting highly conservative orthologous genes with low phylogenetic signal. Alternatively, a robust tree might result if taxa with low gene data coverage but highly informative genes have been selected. Driskell et al. [2], for example, report plausible tree reconstructions based on a supermatrix with a gene data coverage of just 8-16%. Both cases illustrate that gene data coverage and phylogenetic resolution are not necessarily correlated. Consequently, the practice of selecting data based solely on data coverage is potentially problematic. Therefore, we have developed an approach which focuses on the analyses of selected optimal data subsets (SOS) which have high data coverage and phylogenetic signal. Crucial for this approach is the assessment of potential signal of genes and the development of a heuristic to select such an SOS.

Different quartet mapping approaches have been used to assess potential signal within genes [37, 38]. Among these, geometry mapping is demonstrably the most conservative estimator [37] and the application to genes of supermatrices is straightforward. Consequently, we have chosen the geometry mapping approach [37-40] to assess potential signal of genes in the development of our heuristics.

In order to select an optimal set of taxa and genes, Sanderson and colleagues [23] suggested selecting sets of full data coverage (maximal bicliques [41, 42]). However, the identification of the maximal (maximum) biclique is an NP-complete problem [42, 43] and, thus, there is no guarantee to find the maximal (maximum) biclique. Additionally, Sanderson et al. [23] found that selections of maximal bicliques resulted in very small subsets of size < 15 taxa and < 10 genes. Sanderson’s approach is, thus, not suitable to reconstruct phylogenetic relationships of many taxa. A possible solution might be the selection of quasi-bicliques [44, 45], which potentially combine a much larger set of taxa and genes accepting a predefined level of missing data. This promising direction, however, has the drawback that it is not time-efficient.

Alternatively, Hartmann et al. [24] and Cheng et al. [46] introduced two approaches directly applicable to sequence data. The approach of Hartmann et al. [24] is a masking technique (REAP) which masks multiple sequence alignments according to predefined thresholds of gap frequencies of sites. The approach of Cheng et al. [46] is a statistical correction for missing data (SIA). A comparison of these two approaches demonstrated that REAP performed better, a result which is compatible with the results of Sanderson’s biclique approach. However, both alignment masking (REAP) and the biclique approach optimize data only with respect to data coverage, without considering potential signal among genes.

Here, we introduce a simple hill climbing algorithm to select optimal data subsets (SOS) which are assembled by considering data coverage and potential signal of genes. We start with the assumption that any taxon and gene can potentially contribute to the total signal of the matrix. However, taxa or genes with incomplete data coverage and low signal can potentially also contribute noise or cause biases to the total signal of the supermatrix. Therefore, we successively mask taxa and genes of low signal and/or data coverage generating a submatrix of higher data coverage and signal. With this approach we deliberately discard taxa and genes because of their low data coverage and/or potential low signal. The proposed hill climbing algorithm delivers an optimal solution of this trade-off. Using simulated and empirical data, we compare the performance of the herein proposed approach with an often applied approach of simply selecting data subsets using predefined thresholds of data coverage only.

Methods

The approach can be separated into two parts: (1) the determination of information content of genes, taxa and the concatenated supermatrix and (2) the selection of an optimal subset (SOS) of taxa and genes.

Information content of genes, taxa and matrices

Before we define the information content of genes, taxa and matrices used in our approach, we have to introduce the concepts of data coverage representation matrices.

A concatenated supermatrix of N taxa and n gene nucleotide/amino acid sequence alignments can be represented as a matrix B with entries b_ij

B : b_{ij} = (1 ∣ 0), \forall (taxa : i : 1 \dots N, genes : j : 1 \dots n)

(1)

with b_ij = 1 for a present and b_ij = 0 for an absent gene nucleotide/amino acid sequence j for a taxon i. We call this matrix B the data coverage representation matrix.

We define the information content of a gene j, q_j, as the relative data coverage of this gene, defined as

q_{j} = \frac{\sum_{i = 1}^{N} b_{ij}}{N}, \forall taxa : i : 1 \dots N.

(2)

Likewise, the information content of a taxon i, p_i, is defined as

p_{i} = \frac{\sum_{j = 1}^{n} b_{ij}}{n}, \forall genes : j : 1 \dots n.

(3)

We define the information content, P, of a matrix B as

P (B) = \frac{\sum_{i = 1}^{N} \sum_{j = 1}^{n} p_{i}}{N \times n} = \frac{\sum_{i = 1}^{N} \sum_{j = 1}^{n} q_{j}}{N \times n}

(4)

with 0 ≤ P (B), p_i, q_j ≤ 1. To determine the potential signal of genes we use geometry mapping [37] extended to the amino acid level. Nieselt-Struwe et al. [37] showed that for a given quartet of sequences, relative support for each of the three possible topologies s₁, s₂, s₃ can be computed as

s_{i} = δ_{i} / (δ_{1} + δ_{2} + δ_{3})

(5)

with δ_i the support for tree T_i, 0 ≤ s_i ≤ 1 and $\sum_{i} s_{i} = 1$. Support values δ_i can be computed with any optimality criterion. Relative support values can be interpreted as barycentric coordinates of a bipartite simplex graph S with vectors s = (s₁, s₂, s₃):

S = \{\sum_{i = 1}^{3} s_{i} e_{i} | s_{1} + s_{2} + s_{3} = 1, 0 \leq s_{1}, s_{2}, s_{3} \leq 1\}

(6)

with e_i as unit vectors. Within S, areas T₁, T₂, T₃ at vertices can be defined for resolved quartets, T_{1,2}, T_{1,3}, T_{2,3} for partly resolved quartets, and T_∗ for star-like, unresolved topologies of quartets [37, see Figure 1]. For all possible quartets k_j of a gene j, $k_{j} = \binom{N}{4}$ with N the number of taxa, all vectors s_m = (s₁, s₂, s₃), (∀ m : 1 … k_j) can be calculated, and the frequency of vectors in areas T₁, T₂, and T₃ determines the potential signal t_j of a gene j [37]:

t_{j} = \frac{T_{1} + T_{2} + T_{3}}{T_{1} + T_{2} + T_{3} + T_{1, 2} + T_{1, 3} + T_{2, 3} + T_{*}}

(7)

We relaxed the definition of signal by calculating the frequency of vectors in areas T₁, T₂, T₃, T_{1,2}, T_{1,3}, T_{2,3}:

\hat{t_{j}} = \frac{T_{1} + T_{2} + T_{3} + T_{1, 3} + T_{2, 3} + T_{1, 2}}{T_{1} + T_{2} + T_{3} + T_{1, 2} + T_{1, 3} + T_{2, 3} + T_{*}}

(8)

Our approach will, thus, be a more optimistic estimator of potential signal. Signal $\hat{t_{j}}$ will be $0 \leq \hat{t_{j}} \leq 1$ (examples of simulated data, Figure 1).
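Equations (5), (7) and (8) can be illustrated with a minimal Python sketch. It assumes that the assignment of quartet vectors to the seven areas of the simplex has already been done following [37] and that only the resulting counts are available. The function names, the dictionary keys and the example counts are hypothetical and are not taken from mare.

```python
from typing import Mapping, Sequence

# Hypothetical labels for the seven areas of the simplex.
AREAS_RESOLVED = ("T1", "T2", "T3")
AREAS_PARTIAL = ("T12", "T13", "T23")
AREA_STAR = ("Tstar",)


def relative_support(deltas: Sequence[float]) -> list:
    """Eq. (5): s_i = delta_i / (delta_1 + delta_2 + delta_3)."""
    total = sum(deltas)
    if len(deltas) != 3 or total <= 0:
        raise ValueError("need three support values with a positive sum")
    return [d / total for d in deltas]


def signal(counts: Mapping[str, int], relaxed: bool = True) -> float:
    """Eq. (7) if relaxed is False, Eq. (8) if relaxed is True."""
    all_areas = AREAS_RESOLVED + AREAS_PARTIAL + AREA_STAR
    total = sum(counts.get(a, 0) for a in all_areas)
    if total == 0:
        raise ValueError("no quartets counted")
    numerator_areas = AREAS_RESOLVED + (AREAS_PARTIAL if relaxed else ())
    return sum(counts.get(a, 0) for a in numerator_areas) / total


# Made-up counts for 100 quartets of one gene.
counts = {"T1": 50, "T2": 20, "T3": 10,
          "T12": 5, "T13": 3, "T23": 2, "Tstar": 10}
t_strict = signal(counts, relaxed=False)   # resolved quartets only
t_relaxed = signal(counts, relaxed=True)   # resolved plus partly resolved
s = relative_support([2.0, 1.0, 1.0])
```

Because the relaxed numerator adds the partly resolved areas to the strict one, the relaxed value can never be smaller than the strict value for the same counts.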

Geometry mapping is a conservative estimator of $\hat{t_{j}}$. However, within a narrow range of short internal and long terminal branch lengths, geometry mapping opts for the wrong tree, a classical case of long branch attraction [37]. This phenomenon might inflate the estimation of $\hat{t_{j}}$ under certain circumstances.

Nieselt-Struwe and colleagues [37] showed that for any alphabet of characters of finite length, e.g. nucleotides or amino acids, an enumeration of character states among four sequences can be used to calculate support for all three possible topologies. They further showed that a weight matrix M, defining dissimilarity measures between characters, can equivalently be used to calculate distances between sequences. Therefore, we used BLOSUM62, the amino acid substitution matrix introduced by Henikoff [47], to calculate distances between sequences in correspondence to equation (8) in Nieselt-Struwe et al. [37].

We use $\hat{t_{j}}$ of each gene j to update entries of matrix B. For each gene j, entries of matrix B = (b_ij) are scaled with the corresponding $\hat{t_{j}}$ values. We call this matrix a weighted data coverage representation matrix B^∗, in short, a weighted matrix B^∗, in the following:

B^{*} : b_{ij}^{*} = b_{ij} \hat{t_{j}}, \quad 0 \leq b_{ij}^{*} \leq 1, \quad \forall (taxa : i : 1 \dots N, genes : j : 1 \dots n)

(9)

Substituting $b_{ij}^{*}$ for b_ij results in weighted forms of equations (2), (3) and (4). The information content of a gene j, q_j, represents in its weighted form a product of relative data coverage and potential signal of genes.
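The definitions of q_j, p_i and P, in both their unweighted and weighted forms, can be sketched in a few lines of Python with NumPy. This is an illustration under the assumption that the presence/absence matrix and the per-gene signal values are already available; the function name and the toy values are hypothetical.

```python
from typing import Optional

import numpy as np


def information_content(B, t_hat: Optional[np.ndarray] = None):
    """Return (p, q, P) for a taxa x genes matrix.

    B     : 0/1 data coverage representation matrix, shape (N, n)
    t_hat : optional per-gene signal in [0, 1]; if given, each gene
            column of B is scaled by it, giving the weighted matrix B*
    """
    M = np.asarray(B, dtype=float)
    if t_hat is not None:
        M = M * np.asarray(t_hat, dtype=float)[np.newaxis, :]
    q = M.mean(axis=0)   # information content of each gene
    p = M.mean(axis=1)   # information content of each taxon
    P = float(M.mean())  # information content of the matrix
    return p, q, P


# Toy matrix with 4 taxa (rows) and 3 genes (columns).
B = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 1, 0]])
p, q, P = information_content(B)
t_hat = np.array([1.0, 0.5, 0.2])          # made-up signal values
p_w, q_w, P_w = information_content(B, t_hat)
```

In the unweighted case P is simply the overall data coverage; in the weighted case each gene contributes its coverage multiplied by its signal.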

Selection of an optimal subset (SOS) of taxa and genes

We consider a subset (= submatrix) of taxa and genes optimal if it has a high information content, P (B), and contains as many taxa and genes as possible. If we discard genes or taxa with low q_j or p_i, respectively, we will increase P of the matrix, but will lose information on the excluded taxa and genes. A simple optimization can be performed, searching for the highest possible P while excluding as few taxa/genes as possible.

First, a data coverage representation matrix B is generated from the concatenated supermatrix of multiple gene nucleotide/amino acid sequences corresponding to equation (1). Secondly, for each gene j, up to 20,000 quartets are randomly drawn without duplication and $\hat{t_{j}}$ is calculated. For each gene j, entries of B = (b_ij) are scaled with the corresponding $\hat{t_{j}}$ values, generating a weighted matrix B^∗ corresponding to equation (9). Thirdly, we use a simple hill climbing procedure to select an optimal subset (SOS) of taxa and genes. Elimination of taxa or genes starts with dropping either a taxon or gene with the lowest information content p_i or q_j, generating a new matrix B^′ with P^′ (B^′). In case of ties between q_j and p_i, genes will be excluded. Since taxa or genes with lowest information content will be dropped, P^′ (B^′) > P (B) (it is trivial to show that this will always be the case). After each elimination step, information content of taxa (p_i) and genes (q_j) are recalculated. Every gene represented by fewer than four taxa is automatically dropped from the matrix. Gene overlap between taxa is monitored to a minimum of three taxa and two genes. If the matrix B^′ does not fulfill this criterion, the next best B^′ in terms of P^′ is selected.

Continuous elimination of taxa or genes with low p_i or q_j will generate a ‘trivial’ SOS containing few taxa and one gene. Therefore, we define an optimality function f (P)

f (P) = 1 - | (λ - P^{α \times (1 - P)}) | if P < 1

(10)

with α as a scaling factor (default set to α = 3) and λ as the size ratio between reduced B^′ and original matrix B

λ = \frac{N_{B^{'}} \times n_{B^{'}}}{N_{B} \times n_{B}} .

(11)

During the process of elimination of taxa and/or genes, P^′ will continually increase, and λ will continually decrease. f(P^′) will reach a maximum of 1. With a scaling factor α = 2, the maximum will be at the intersection of P^′ and λ, with α = 3 it will be reached later, favoring an SOS with a higher P (Figures 2 and 3). If f (P^′) = 1 the process of elimination stops.

The outlined procedure is a simple hill climbing heuristic without guarantee of finding a globally optimal solution due to the interaction of p_i and q_j. The approach can be applied either to B or B^∗. It should be pointed out that removal of taxa will have an influence on the calculation of $\hat{t_{j}}$, which is not recalculated during the process of matrix reduction. This simplification greatly speeds up the heuristic. An iterative recalculation of $\hat{t_{j}}$ can potentially improve the selection of an informative dataset and will be further studied.
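The elimination loop can be sketched in Python as follows. This is a simplified illustration, not the mare implementation: it drops the taxon or gene with the lowest information content, removes the gene in case of a tie, and recalculates p_i and q_j after every step. As assumptions of this sketch, the automatic removal of genes with fewer than four taxa is omitted, the overlap criterion is reduced to fixed minimum sizes (`min_taxa`, `min_genes`), and the stopping point is not chosen by the optimality function; instead the whole trajectory of P and λ is returned so that any optimality function can be evaluated on it afterwards. The function name and parameters are hypothetical.

```python
import numpy as np


def hill_climb(M, min_taxa=3, min_genes=2):
    """Greedy reduction of a (weighted or unweighted) taxa x genes matrix.

    Returns a list of steps; each step is a tuple
    (taxon_indices, gene_indices, P, lam), where lam is the size ratio
    between the reduced and the original matrix.
    """
    M = np.asarray(M, dtype=float)
    taxa = list(range(M.shape[0]))
    genes = list(range(M.shape[1]))
    full_size = M.shape[0] * M.shape[1]
    steps = []
    while True:
        sub = M[np.ix_(taxa, genes)]
        steps.append((tuple(taxa), tuple(genes),
                      float(sub.mean()), sub.size / full_size))
        p = sub.mean(axis=1)   # information content of remaining taxa
        q = sub.mean(axis=0)   # information content of remaining genes
        can_drop_taxon = len(taxa) > min_taxa
        can_drop_gene = len(genes) > min_genes
        if not (can_drop_taxon or can_drop_gene):
            break
        # A tie between the worst gene and the worst taxon removes the gene.
        drop_gene = can_drop_gene and (not can_drop_taxon
                                       or q.min() <= p.min())
        if drop_gene:
            genes.pop(int(q.argmin()))
        else:
            taxa.pop(int(p.argmin()))
    return steps


# Toy matrix: 4 taxa x 3 genes.
M = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 1, 1],
              [0, 1, 0]])
steps = hill_climb(M)
final_taxa, final_genes, final_P, final_lam = steps[-1]
```

In this toy example the sparsest gene is removed first and the sparsest taxon second, after which the submatrix is complete while covering half of the original cells.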

Calculation time for this heuristic grows with the number of taxa (N) and genes (n). It is time efficient, with a complexity of O((N + n)²). The algorithm reduces matrices in a deterministic way, which makes matrix reduction reproducible. However, different equally optimal solutions will not be found under identical parameter settings.

By varying the scaling parameter α, however, an SOS of high P (α ≥ 3), versus an SOS of more taxa and genes with lower P (α ≤ 3) can be found.

Simulated data

Our simulations were not set up with the intention of fully exploring the performance of matrix reductions depending on super matrix characteristics, but were set up in order to illustrate the potential of the method in four different cases, resembling observed situations of empirical data.

Simulated data with random distribution of missing data

For two different sets of genes, differing in relative evolutionary rates among genes (Figure 4), we simulated 100 (50 taxa × 50 genes) supermatrices each, composed of genes with 400 amino acids (aa), concatenated for each taxon to 20,000 aa length using Seq-Gen [48] and the BLOSUM62 matrix. For these simulations, we used a topology derived from empirical data with realistic distribution of branch lengths (Figure 5A). Evolutionary rates of genes varied from 0.001 to 15.00 relative rate differences, to mimic different signal strength (Figures 4 and 6). Within each gene, site rates were homogeneous. In order to generate supermatrices with missing data, we removed amino acid sequences of taxa using a Binomial distribution with a probability of removing the data entry for each taxon and gene of 0.7 (average data coverage of 0.29, Table 1). This setup generated supermatrices with randomly distributed missing data, closely resembling the observed data coverage of published concatenated supermatrices of Dunn and colleagues [4].
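The random removal of gene sequences can be mimicked with an independent draw for every taxon and gene. The sketch below is an assumption-laden illustration, not the script used for the simulations: it takes a per-entry retention probability of 0.3, chosen because it matches the reported average data coverage of about 0.29, and it only generates the presence/absence pattern, not the sequences. The function name and the seed are hypothetical.

```python
import numpy as np


def random_coverage_matrix(n_taxa, n_genes, p_keep, rng):
    """Draw a 0/1 presence matrix with independent entries.

    Each taxon/gene entry is kept with probability p_keep.
    """
    return (rng.random((n_taxa, n_genes)) < p_keep).astype(int)


rng = np.random.default_rng(seed=1)   # arbitrary seed for reproducibility
B = random_coverage_matrix(50, 50, p_keep=0.3, rng=rng)
coverage = float(B.mean())            # realised data coverage of this draw
```

With 2,500 entries the realised coverage of a single draw stays close to the retention probability, so repeated draws give supermatrices of similar sparsity but different patterns of missing data.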

Table 1 Summary of simulation results

Full size table

Simulated data with power-law and non-random distribution of missing data

For two different sets of genes, differing in relative evolutionary rates among genes (Figure 4), we further simulated 100 (50 taxa × 50 genes) supermatrices each, composed of genes with 400 aa, concatenated for each taxon to 20,000 aa length. We used again the topology derived from empirical data with realistic distribution of branch lengths (Figure 5B). We changed seven branch lengths to introduce potential long branch attraction (Figure 5B). In order to generate supermatrices with missing data, we followed a proposal of Li and colleagues [49]. These authors showed that the distribution of missing data in many empirical supermatrices is best described by applying a power law function of the probability of having data. Following their observation, we assigned to each taxon and gene a probability of having data randomly drawn from $f(x) = \frac{1}{10} x^{-1/2} - 0.1$, for x randomly selected with equal probability from 0 ≤ x ≤ ∞. Additionally, we constrained data assignment to having at least one gene for each taxon. Following this approach, we concatenated supermatrices with a distribution of missing data approximately similar to empirically observed supermatrices (Misof, unpubl.) (average data coverage 0.13, Table 1). Finally, we raised the probability of data coverage for four predefined taxa, mimicking the often seen high coverage of a few taxa for which genomes are available.

Selecting subsets from simulated data and tree reconstructions

Selecting subsets with the hill climbing algorithm

SOS’s were selected using the mare software (mare: matrix reduction) which implements the herein described novel approach. For each supermatrix, trees were reconstructed 1) using the original supermatrix (data coverage 0.3), 2) an SOS of B and 3) an SOS of B^∗. Trees were reconstructed with RAxML 7.0.0 [50, 51]. The BLOSUM62 amino acid substitution matrix with Γ distributed among site rate heterogeneity was used to account for different substitution rates among genes.

To compare reconstructed trees with the correct trees used in data simulations, we used standardized quartet distances between shared taxa [24, 52-55]. QDistances (d_QD) were standardized in relation to all quartets of shared taxa. We recorded d_QD’s of trees inferred from the unreduced matrix and of the two SOS’s derived from B and B^∗.
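The standardized quartet distance can be illustrated with a small brute-force Python sketch. It is not the program used in this study and is only practical for small trees. As assumptions of the sketch, trees are given as nested tuples of taxon names, the quartet topology is read from the clades of each tree, and a quartet that is resolved in one tree but unresolved in the other is counted as different. All function names are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """Leaf set of a tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree, out=None):
    """Leaf sets of all internal nodes; each one defines a split."""
    if out is None:
        out = []
    if not isinstance(tree, str):
        for child in tree:
            clades(child, out)
        out.append(leaves(tree))
    return out


def quartet_topology(clade_sets, quartet):
    """Pairing of the quartet induced by the tree, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_sets:
        inside = q & clade
        if len(inside) == 2:          # split separates two taxa from two
            return frozenset([inside, q - inside])
    return None


def quartet_distance(tree_a, tree_b):
    """Fraction of quartets of shared taxa with different topologies."""
    shared = sorted(leaves(tree_a) & leaves(tree_b))
    quartets = list(combinations(shared, 4))
    if not quartets:
        raise ValueError("fewer than four shared taxa")
    ca, cb = clades(tree_a), clades(tree_b)
    differing = sum(quartet_topology(ca, q) != quartet_topology(cb, q)
                    for q in quartets)
    return differing / len(quartets)


tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
d_qd = quartet_distance(tree_a, tree_b)
```

In the example the two trees differ only in the placement of B and C, which changes two of the five possible quartets.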

Selecting subsets with predefined thresholds of data coverage

From supermatrices with power-law and non-random distribution of missing data we selected subsets in two different ways: (1) we selected all genes with data coverage above or equal to 0.4 and (2) we selected all taxa with data coverage above or equal to 0.04 and all genes with data coverage above or equal to 0.4 (adapted to the new number of taxa). We recorded d_QD’s of trees inferred from unreduced matrices and from subsets.
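The two threshold selections can be sketched in Python as follows. This is an illustration under the assumption that taxa are filtered first and that gene coverage is then recomputed over the retained taxa, which is how we read "adapted to the new number of taxa". The function name and the toy matrix are hypothetical; the toy thresholds are chosen for readability and are not the ones used above.

```python
import numpy as np


def select_by_threshold(B, gene_min, taxon_min=None):
    """Return indices of taxa and genes passing coverage thresholds.

    B         : 0/1 presence matrix, taxa in rows, genes in columns
    gene_min  : minimum gene coverage (fraction of retained taxa)
    taxon_min : optional minimum taxon coverage (fraction of all genes)
    """
    B = np.asarray(B)
    taxa = np.arange(B.shape[0])
    if taxon_min is not None:
        taxa = taxa[B.mean(axis=1) >= taxon_min]
    # Gene coverage is computed over the retained taxa only.
    gene_cov = B[taxa].mean(axis=0) if taxa.size else np.zeros(B.shape[1])
    genes = np.flatnonzero(gene_cov >= gene_min)
    return taxa, genes


# Toy matrix: 5 taxa x 4 genes, the fourth taxon has no data at all.
B = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 0],
              [1, 1, 0, 1]])
taxa_a, genes_a = select_by_threshold(B, gene_min=0.25)
taxa_b, genes_b = select_by_threshold(B, gene_min=0.25, taxon_min=0.25)
```

Removing the empty taxon raises the relative coverage of every gene, so genes that fail the threshold on the full matrix can pass it on the reduced one.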

Selecting subsets from empirical data and tree reconstructions

We studied the performance of using the hill climbing algorithm with matrices B and B^∗ using the published empirical metazoan data set of Driskell et al. [2] comprising 1,131 putative orthologous genes for 70 taxa (Metazoa, Fungi + outgroup). Additionally, we selected data subsets of the Driskell supermatrix applying predefined thresholds of gene and taxon coverage (Table 2). All ML analyses using RAxML v7.2.6 or 7.2.8 were executed with rapid bootstrapping (PROTCAT) and best tree search (PROTGAMMA) in one step (-f a, 500 or 1,000 BS replicates) and the empirical substitution matrix WAG [56]. A posteriori bootstop tests were performed to test for a sufficient number of bootstrap replicates [57]. All analyses were conducted using RAxML HYBRID and PTHREADS versions on HPC Linux clusters, 8 nodes with 8 or 12 cores each, at the Regionales Rechenzentrum Köln (RRZK) using Cologne High Efficient Operating Platform for Science (CHEOPS). Further, we compared the effects of data reduction on tree robustness with the resolution score as introduced by Holland and colleagues [58]. This resolution score, RS, calculated as the sum of bootstrap support values ≥50 divided by the number of taxa N - 3, represents a measure of average bootstrap support and, thus, robustness of trees.
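The resolution score as described here can be computed in a few lines. The sketch assumes that bootstrap support values are given in percent, one value per internal branch of the tree; the function name and the example values are hypothetical.

```python
def resolution_score(supports, n_taxa, cutoff=50.0):
    """Sum of bootstrap supports >= cutoff, divided by n_taxa - 3."""
    if n_taxa <= 3:
        raise ValueError("need more than three taxa")
    return sum(s for s in supports if s >= cutoff) / (n_taxa - 3)


# Made-up example: a tree of 6 taxa has 3 internal branches.
rs = resolution_score([100, 80, 40], n_taxa=6)
```

In the example the branch with 40% support is ignored, so RS is (100 + 80) / 3 = 60. A fully resolved tree with 100% support on every internal branch reaches the maximum of 100.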

Table 2 Comparison of matrix reductions with empirical data using mare and simple predefined thresholds

Full size table

Results

Performance with simulated data

Tree reconstructions based on unreduced supermatrices with a Gaussian distribution of missing data did not yield correct trees except for one case in set 1 (columns (org) for set1 and set2, Gaussian distribution of missing data in Figure 7A,B, Table 1). The variability of d_QD values was low (columns (org) for set1 and set2, Gaussian distribution of missing data in Figure 7A, Table 1). Tree reconstructions based on all SOSs (unweighted and weighted reductions of set1 and set2) of these supermatrices performed much better (columns (w), (uw) for set1 and set2, Gaussian distribution of missing data in Figure 7A,B, Table 1). Compared with trees derived from unreduced supermatrices, SOSs supported correct trees more often, but had a higher frequency of wrong quartets (columns (w), (uw) for set1 and set2, Gaussian distribution of missing data in Figure 7A,B, Table 1). However, there was no clear difference of mean d_QD values between trees based on SOSs derived from B (uw) or B^∗ (w) (columns (w), (uw) for set1 and set2, Gaussian distribution of missing data in Figure 7A, Table 1). Trees based on SOSs of B^∗ (w) had a much lower amplitude of d_QD values (columns (w), (uw) for set1 and set2, Gaussian distribution of missing data in Figure 7A, Table 1). SOSs derived from B^∗ contained on average more taxa (Table 1).

Tree reconstructions based on the unreduced matrix with power-law non-random distribution of missing data did not recover correct trees for set 1 and set 2. In both cases variability of d_QD values was high (columns (w), (uw) for set1 and set2, power-law non-random distribution of missing data in Figure 7A,B, Table 1). Tree reconstructions based on all SOSs (unweighted and weighted reductions of set 1 and set2) clearly outperformed reconstructions based on the unreduced matrices (columns (org), (w), (uw) for set1 and set2, power-law non-random distribution of missing data in Figure 7A,B, Table 1). The absolute number of correct trees was again higher for all SOSs (unweighted and weighted reductions of set 1 and set2) compared with the number of correct trees inferred from the unreduced matrices. In cases of low relative rate differences among genes, set 1, SOSs derived from B (uw) performed worse compared to SOSs derived from B^∗ (w); in cases of high relative rate differences among genes, set 2, the opposite was observed (columns (org), (w), (uw) for set1 and set2, power-law non-random distribution of missing data in Figure 7B, Table 1).

Data subsets derived from matrices with power-law non-random distribution of missing data using predefined thresholds of gene coverage supported trees with lower mean d_QD values (columns (ca), (cb) in Figure 7A) in comparison with mean d_QD values of trees inferred from SOSs selected with our approach (column (w), (uw) for set 1 and set 2 of the power-law data in Figure 7A, Table 1). The mean d_QD values were higher and the amplitude of d_QD was large (columns (ca), (cb) in Figure 7A). Data subsets from matrices with power-law non-random distribution of missing data using combined thresholds of data coverage for genes and taxa did support trees with mean d_QD values (columns (cc), (cd) in Figure 7A) comparable with mean d_QD values of trees inferred from SOSs of set 1 and set 2 selected with our approach (column (w), (uw) for set 1 and set 2 of the power-law data in Figure 7A, Table 1). The amplitude of d_QD values, however, was large (columns (cc), (cd) in Figure 7A). Applying only thresholds for gene data coverage yielded a lower absolute number of correct trees (columns (ca), (cb) in Figure 7B) compared with our approach, but the absolute number of correct trees was comparable or even higher if combined thresholds of taxa and genes were used (columns (cc), (cd) in Figure 7, Table 1).

In summary, reduction of supermatrices often increased the chance to find a correct tree, but not consistently. SOSs derived from B^∗ did not always support correct trees more often compared with SOSs derived from B, but had a much smaller amplitude of d_QD values. Data subsets derived from predefined thresholds supported fewer correct trees if only applied to genes but supported comparable numbers of correct trees if used with combined thresholds of data coverage for taxa and genes.

Performance with empirical data

We applied our approach to the published metazoan data set of Driskell et al. [2] comprising 1,131 genes for 70 taxa (Metazoa, Fungi + outgroup). Both the data coverage (0.0836) and the matrix information content (P = 0.0657) were low. Most genes are represented only by few taxa (e.g. Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Sus scrofa). From the original matrix we excluded the six taxa with the highest coverage, for which the complete genome was available (Homo sapiens, Mus musculus, Rattus norvegicus, Sus scrofa, Bos taurus, and Gallus gallus), and selected an SOS from these data. With this procedure we removed the most extreme heterogeneity of data coverage among taxa prior to the selection of an SOS.

Selecting an SOS resulted in a data subset of 48 taxa and 45 genes with a data coverage of 0.316 and P = 0.223. Thus, an SOS was found with a 10.24% loss of taxa and a 9.08-fold increase in data coverage and a 16.043-fold gain in P. However, all outgroup taxa including slime molds, fungi and nematodes had been excluded. We compared tree reconstructions based on 1) the original unreduced supermatrix with 64 taxa (1,000 BS replicates, 469,480 aa) and 2) the SOS of 48 taxa and 45 genes (1,000 BS replicates, 11,198 aa). An a posteriori bootstop test (default MR-based bootstopping criterion, WRF average of 100 random splits) revealed that 1,000 BS replicates were by far sufficient for both analyzed data sets.

Tree reconstructions with the 64-taxa set resulted in trees with polyphyletic Tetrapoda, Actinopterygii, monophyletic Marsupialia + Monotrema, and largely unresolved basal splits within Theria (Figure 8A).

The tree based on the SOS was more congruent with general taxonomic views. The topology showed moderately supported monophyletic Tetrapoda, and resolution within Ungulates and Carnivora (Figure 8). However, Actinopterygii, for example, remained paraphyletic and relationships of Marsupialia and Monotrema were not resolved. The resolution score RS increased from 82.148% (unreduced supermatrix including 64 taxa and 1,131 genes) to 87.38% (SOS). We also compared reductions of the original Driskell supermatrix using different parameter settings in our approach and simple thresholds of data masking (Table 2). Applying predefined thresholds of gene and taxa coverage never resulted in matrices with comparable resolution scores and comparable number of taxa. Our approach outperformed the application of simple thresholds.

Discussion

We show that supermatrices of simulated amino acid sequence data with low data coverage and relative rate differences among genes can support biased tree inference or low robustness of trees. It can be suspected that these effects will even be stronger for empirical data. These conclusions corroborate results of Hartmann [24], in many aspects Philippe [22] and Wiens and colleagues [28]. Effective techniques to reduce these potential biases in tree inference are therefore clearly needed.

Masking supermatrices and deleting rogue taxa after tree reconstructions, as applied by Dunn and colleagues [4], could be suitable measures. In their analysis these authors selected taxa and genes according to predefined cutoff values of data coverage. The application of cutoff values considers only the extent of missing data, which might favor the selection of the most conserved genes readily identified among all taxa in the data. Additionally, Dunn et al. [4] deleted rogue taxa after tree reconstruction based on an idea introduced by Thorley and colleagues [59, 60]. The major drawback of their approach is that robustly misplaced taxa will not be identified. In this respect, a formal approach to masking of supermatrices as proposed here could be an alternative worth considering.

We propose to select a subset of taxa and genes with a maximal information content. In doing so, it is necessary to first assess potential signal of genes, for which we use extended geometry mapping (eGM) [37-40]. We opted for geometry mapping, because it tends to be more conservative in discriminating between resolved and star-like trees in contrast to likelihood mapping [61]. Additionally, eGM is easily applied to nucleotide and amino acid sequence data without the need of tree reconstructions. It is, thus, a technically convenient but, admittedly, coarse way of estimating potential signal.

Secondly, it is necessary to select optimal subsets of supermatrices based on the information content of taxa and genes. The information content of taxa and genes is calculated as the product of potential signal and data coverage. By introducing this optimality criterion we can select taxa and genes which contribute most signal in tree reconstructions. We select a data subset in a stepwise function penalizing size reduction of the supermatrix and favoring higher matrix information content, monitoring but ignoring optimization of connectivity in the matrix. Our approach is time efficient but will not be effective in discovering a globally optimal subset in terms of taxa/gene overlap (‘connectivity’) and information content. This is in contrast to the approach of Yan [44] in which the quasi-biclique with the highest level of connectivity (‘largest grove’) is searched for.

Improved heuristics considering information content and connectivity in our approach are certainly conceivable. However, the distribution of missing data following a power-law distribution in empirical data suggests that simple hill climbing procedures will be effective in identifying a good (optimal) subset of taxa and genes in terms of matrix information content. The flexibility of our approach offers even the chance to use different parameter settings of the optimality function to identify alternative SOSs.

We observed high amplitudes of d_QD values of trees based on SOSs in our simulations. These amplitudes were even higher in SOSs based on simple data coverage representations. We interpret this occasional high error rate as a possible consequence of insufficient taxon sampling in SOSs, which might accentuate long branch attraction (LBA), or, alternatively, of connectivity in SOSs that was not sufficient to support just one tree [62]. This interpretation highlights a problem of all methods of data reduction. Every reduction process, at least partially, counteracts efforts to reduce biases in tree reconstructions due to insufficient taxon or gene sampling. The analyses of Wiens and colleagues [20, 21, 28] showed that LBA effects can disappear if data exhibiting LBA are recoded as missing. This implies that an identification of LBA taxa before concatenation and reduction of data would be important. However, we do not yet have a reliable way to identify biases in tree reconstructions which could guide a preselection of taxa. An immediate, albeit unsatisfying, solution is probably the reconstruction of trees with and without suspect taxa.
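For readers unfamiliar with quartet-based tree comparison, the following brute-force sketch shows how a quartet distance between two trees on the same taxa can be computed. It is an illustration only, not the QDist software: the nested-tuple tree encoding, the function names, and the normalisation by the total number of quartets are assumptions, and the d_QD values reported here may be scaled differently.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set, list of leaf sets of all internal subtrees)
    for a tree written as nested tuples of leaf names."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def quartet_topology(quartet, clade_list):
    """Return the 2|2 split a tree induces on four leaves, or None if
    no clade separates them (unresolved quartet)."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = q & clade
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None

def quartet_distance(tree_a, tree_b):
    """Fraction of four-taxon subsets on which the two trees disagree.
    A resolved quartet versus an unresolved one counts as a difference."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same taxa")
    quartets = list(combinations(sorted(leaves_a), 4))
    differing = sum(
        quartet_topology(q, clades_a) != quartet_topology(q, clades_b)
        for q in quartets
    )
    return differing / len(quartets)
```

The enumeration grows with the fourth power of the number of taxa, so it is suitable only for small examples; dedicated algorithms [52-55] are needed for larger trees.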

Our simulations showed that in the presence of heterogeneous signal among genes the new heuristic increased the chance of finding a correct tree. It is, thus, an alternative to the computationally much more demanding quasi-biclique approach [44, 45]. SOSs derived from B or B^∗ matrices did not differ extensively in their success rate of correct tree reconstructions with simulated data, with small advantages for B in cases of power-law non-random distribution of missing data. However, the analyses of the empirical data imply that tree reconstructions based on SOSs derived from B^∗ will result in improved tree robustness.

Conclusions

Our analyses of simulated and empirical data demonstrate that sparse supermatrices can be reduced on a formal basis, outperforming the commonly used simple selections of taxa and genes with high data coverage. The approach presented here will be of general importance in phylogenomic studies based on large concatenated superalignments with incomplete data coverage. It clearly offers an alternative to threshold-based data selection.

References

Sanderson MJ, Driskell AC: The challenge of constructing large phylogenetic trees. Trends Plant Sci. 2003, 8: 374-379. 10.1016/S1360-1385(03)00165-1.
Driskell AC, Ané C, Burleigh JG, McMahon MM, O’Meara BC, Sanderson MJ: Prospects for building the tree of life from large sequence databases. Science. 2004, 306: 1172-1174. 10.1126/science.1102036.
Philippe H, Delsuc F, Brinkmann H, Lartillot N: Phylogenomics. Annu Rev Ecol Evol Syst. 2005, 36: 541-562. 10.1146/annurev.ecolsys.35.112202.130205.
Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, Sorensen MV, Haddock SHD, Schmidt-Rhaesa A, Okusu A, Kristensen RM, Wheeler WC, Martindale MQ, Giribet G: Broad phylogenomic sampling improves resolution of the animal tree of life. Nature. 2008, 452: 745-749. 10.1038/nature06614.
Bourlat SJ, Nielsen C, Economou AD, Telford MJ: Testing the new animal phylogeny: a phylum level molecular analysis of the animal kingdom. Mol Phylogenet Evol. 2008, 49: 23-31. 10.1016/j.ympev.2008.07.008.
de Queiroz A, Gatesy J: The supermatrix approach to systematics. Trends Ecol Evol (Amst). 2006, 22: 34-41.
Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005, 6: 361-375.
Galtier N, Daubin V: Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond, B, Biol Sci. 2008, 363: 4023-4029. 10.1098/rstb.2008.0144. [http://dx.doi.org/10.1098/rstb.2008.0144],
Hausdorf B, Helmkampf M, Meyer A, Witek A, Herlyn H, Bruchhaus I, Hankeln T, Struck TH, Lieb B: Spiralian phylogenomics supports the resurrection of Bryozoa comprising Ectoprocta and Entoprocta. Mol Biol Evol. 2007, 24: 2723-2729. 10.1093/molbev/msm214.
Murphy WJ, Pevzner PA, O’Brien SJ: Mammalian phylogenomics comes of age. Trends Genet. 2004, 20: 631-639. 10.1016/j.tig.2004.09.005.
Philippe H, Derelle R, Lopez P, Pick K, Borchiellini C, Boury-Esnault N, Vacelet J, Renard E, Houliston E, Quéinnec E, Da Silva C, Wincker P, Le Guyader H, Leys S, Jackson DJ, Schreiber F, Erpenbeck D, Morgenstern B, Wörheide G, Manuël M: Phylogenomics revives traditional views on deep animal relationships. Curr Biol. 2009, 19: 706-712. 10.1016/j.cub.2009.02.052.
Regier JC, Shultz JW, Ganley ARD, Hussey A, Shi D, Ball B, Zwick A, Stajich JE, Cummings MP, Martin JW, Cunningham CW: Resolving arthropod phylogeny: exploring phylogenetic signal within 41 kb of protein-coding nuclear gene sequence. Syst Biol. 2008, 57: 920-938. 10.1080/10635150802570791.
Shedlock AM, Botka CW, Zhao S, Shetty J, Zhang T, Liu JS, Deschavanne PJ, Edwards SV: Phylogenomics of nonavian reptiles and the structure of the ancestral amniote genome. Proc Natl Acad Sci USA. 2007, 104: 2767-2772. 10.1073/pnas.0606204104.
Smith SA, Beaulieu JM, Donoghue MJ: Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evol Biol. 2009, 9: 37-10.1186/1471-2148-9-37.
Roeding F, Hagner-Holler S, Ruhberg H, Ebersberger I, von Haeseler A, Kube M, Reinhardt R, Burmester T: EST sequencing of Onychophora and phylogenomic analysis of Metazoa. Mol Phylogenet Evol. 2007, 45: 942-951. 10.1016/j.ympev.2007.09.002.
Simon S, Strauss S, von Haeseler A, Hadrys H: A phylogenomic approach to resolve the basal pterygote divergence. Mol Biol Evol. 2009, 12: 2719-2730.
Struck T, Paul C, Hill N, et al: Phylogenomic analyses unravel annelid evolution. Nature. 2011, 471: 452-456. 10.1038/471452a.
Kocot K, Cannon J, Todt C, et al: Phylogenomics reveals deep molluscan relationships. Nature. 2011, 477: 452-456. 10.1038/nature10382.
Sanderson MJ: Construction and annotation of large phylogenetic trees. Aust Syst Bot. 2007, 20: 287-301. 10.1071/SB07006.
Wiens JJ: Missing data, incomplete taxa, and phylogenetic accuracy. Syst Biol. 2003, 52: 528-538. 10.1080/10635150390218330.
Wiens JJ: Missing data and the design of phylogenetic analyses. J Biomed Inform. 2006, 39: 34-42. 10.1016/j.jbi.2005.04.001.
Philippe H, Snell EA, Bapteste E, Lopez P, Holland PWH, Casane D: Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol. 2004, 21: 1740-1752. 10.1093/molbev/msh182.
Sanderson MJ, Driskell AC, Ree RH, Eulenstein O, Langley S: Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Mol Biol Evol. 2003, 20: 1036-1042. 10.1093/molbev/msg115.
Hartmann S, Vision TJ: Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment?. BMC Evol Biol. 2008, 8: 95-10.1186/1471-2148-8-95.
Poe S: Sensitivity of phylogeny estimation to taxonomic sampling. Syst Biol. 1998, 47: 18-31. 10.1080/106351598261003.
Kearney M, Clark JM: Problems due to missing data in phylogenetic analyses including fossils: a critical review. J Vertebr Paleontology. 2003, 23: 263-274. 10.1671/0272-4634(2003)023[0263:PDTMDI]2.0.CO;2.
Wiens JJ: Can incomplete taxa rescue phylogenetic analyses from long-branch attraction?. Syst Biol. 2005, 54: 731-742. 10.1080/10635150500234583.
Wiens JJ, Moen DS: Missing data and the accuracy of Bayesian phylogenetics. J Syst Evol. 2008, 46: 307-314.
Phillips MJ, Delsuc F, Penny D: Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 2004, 21: 1455-1458. 10.1093/molbev/msh137.
Jeffroy O, Brinkmann H, Delsuc F, Philippe H: Phylogenomics: the beginning of incongruence?. Trends Genet. 2006, 22: 225-231. 10.1016/j.tig.2006.02.003.
Rodríguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF, Philippe H: Detecting and overcoming systematic errors in genome-scale phylogenies. Syst Biol. 2007, 56: 389-399. 10.1080/10635150701397643.
Ho SYW, Jermiin LS: Tracing the decay of the historical signal in biological sequence data. Syst Biol. 2004, 53: 623-637. 10.1080/10635150490503035.
Inagaki Y, Nakajima Y, Sato M, Sakaguchi M, Hashimoto T: Gene sampling can bias multi-gene phylogenetic inferences: the relationship between red algae and green plants as a case study. Mol Biol Evol. 2009, 26: 1171-1178. 10.1093/molbev/msp036.
Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD: The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst Biol. 2004, 53: 638-643. 10.1080/10635150490468648.
Rosenberg MS, Kumar S: Taxon sampling, bioinformatics, and phylogenomics. Syst Biol. 2003, 52: 119-124. 10.1080/10635150390132894.
Leigh JW, Susko E, Baumgartner M, Roger AJ: Testing congruence in phylogenomic analysis. Syst Biol. 2008, 57: 104-115. 10.1080/10635150801910436.
Nieselt-Struwe K, von Haeseler A: Quartet-Mapping, a generalization of the likelihood-mapping procedure. Mol Biol Evol. 2001, 18: 1204-1219. 10.1093/oxfordjournals.molbev.a003907.
Grünewald S, Forslund K, Dress A, Moulton V: QNet: An agglomerative method for the construction of phylogenetic networks from weighted quartets. Mol Biol Evol. 2007, 24: 532-538.
Eigen M, Winkler-Oswatitsch R, Dress A: Statistical geometry in sequence space: a method of quantitative comparative sequence analysis. Proc Natl Acad Sci USA. 1988, 85: 5913-5917. 10.1073/pnas.85.16.5913.
Nieselt-Struwe K: Graphs in sequence spaces: a review of statistical geometry. Biophys Chem. 1997, 66: 111-131. 10.1016/S0301-4622(97)00064-1.
Alexe G, Alexe S, Crama Y, Foldes S, Hammer PL, Simeone B: Consensus algorithms for the generation of all maximal bicliques. DIMACS Technical Reports 2002-52, Rutgers University, Piscataway, NJ, USA 2002. [http://dimacs.rutgers.edu/TechnicalReports/2002.html],
Dias VM, de Figueiredo CM, Szwarcfiter JL: On the generation of bicliques of a graph. Discrete Appl Math. 2007, 155: 1826-1832. 10.1016/j.dam.2007.03.017.
Dawande M, Keskinocak P, Swaminathan J, Tayur S: On bipartite and multipartite clique problems. J Algorithms. 2001, 41: 388-403. 10.1006/jagm.2001.1199.
Yan C, Burleigh JG, Eulenstein O: Identifying optimal incomplete phylogenetic data sets from sequence databases. Mol Phylogenet Evol. 2005, 30 (3): 528-535.
Li J, Sim K, Liu G, Wong L: Maximal quasi-bicliques with balanced noise tolerance: concepts and co-clustering applications. Proceedings of the SIAM International Conference on Data Mining SDM 2008, April 24-26, 2008. 2008, Atlanta, Georgia, USA: SIAM,
Cheng F, Hartmann S, Gupta M, Ibrahim JG, Vision TJ: A hierarchical model for incomplete alignments in phylogenetic inference. Bioinformatics. 2009, 25: 592-598. 10.1093/bioinformatics/btp015. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/5/592],
Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992, 89: 10915-10919. 10.1073/pnas.89.22.10915.
Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997, 13: 235-238.
Li W, Liu Y: Modeling species-genes data for efficient phylogenetic inference. Proceedings LSS Computational Systems Bioinformatics Conference, August, 2007., Volume 6. 2007, LSS - Life Sciences Society, 429-440. [http://www.lifesciencessociety.org/CSB2007/toc/429.2007.html],
Stamatakis A: 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25-29 April 2006. 2006, Rhodes Island, Greece: IEEE,
Ott M, Zola J, Aluru S, Stamatakis A: Large-scale Maximum Likelihood-based phylogenetic analysis on the IBM BlueGene/L. Proceedings of ACM/IEEE Supercomputing conference 2007. 2007, New York, Reno, Nevada: ACM,
Mailund T, Pedersen CNS: QDist-quartet distance between evolutionary trees. Bioinformatics. 2004, 20: 1636-1637. 10.1093/bioinformatics/bth097.
Christiansen C, Mailund T, Pedersen CNS, Randers M: Algorithms for computing the quartet distance between trees of arbitrary degree. Edited by: Casadio R, Myers G. 2005, Springer, 77-88.
Christiansen C, Mailund T, Pedersen CNS, Randers M, Stissing MS: Fast calculation of the quartet distance between trees of arbitrary degrees. Algorithms Mol Biol. 2006, 1: 16-10.1186/1748-7188-1-16.
Stissing M, Mailund T, Pedersen CN, Brodal GS, Fagerberg R: Computing the all-pairs quartet distance on a set of evolutionary trees. J Bioinform Comput Biol. 2008, 6: 37-50. 10.1142/S0219720008003266.
Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001, 18: 691-699. 10.1093/oxfordjournals.molbev.a003851.
Pattengale N, Alipour M, Bininda-Emonds O, Moret B, Gottlieb E, Stamatakis A: How many bootstrap replicates are necessary?. J Comput Biol. 2010, 17: 337-354. 10.1089/cmb.2009.0179.
Holland B, Clarke A, Meudt H: Optimizing Automated AFLP Scoring Parameters to Improve Phylogenetic Resolution. Syst Biol. 2008, 57: 347-366. 10.1080/10635150802044037.
Thorley JL, Wilkinson M: Testing the phylogenetic stability of early tetrapods. J Theor Biol. 1999, 200 (3): 343-344. 10.1006/jtbi.1999.0999.
Thorley JL, Page RDM: RadCon: phylogenetic tree comparison and consensus. Bioinformatics. 2000, 16: 486-487. 10.1093/bioinformatics/16.5.486.
Strimmer K, von Haeseler A: Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc Natl Acad Sci USA. 1997, 94: 6815-6819. 10.1073/pnas.94.13.6815. [http://www.pnas.org/cgi/content/abstract/94/13/6815],
Steel M, Sanderson MJ: Characterizing phylogenetically decisive taxon coverage. Applied Mathematics Letters. 2009,

Acknowledgements

We acknowledge the important input from all lab members of the zmb, in particular members of the bioinformatics group, including Harald Letsch, Christoph Mayer, Roman Stocsits and Wolfgang Wägele. We also thank John G. Burleigh and Mike Sanderson for kindly providing the metazoan data set of Driskell et al. 2004. The manuscript profited from many constructive comments, in particular those of the anonymous reviewers. B.Mi. and K.Me. were supported by the DFG grant MI 649/6-3, B.M.v.R. was supported by grant WA530/34, and P.K. was supported by grant WA530/33. This is a publication of the Molecular Biology Unit (zmb) of the ZFMK, Bonn.

We provide a software package to perform the proposed matrix reduction. mare is open source software; a C++ executable is available from http://mare.zfmk.de.

Author information

Authors and Affiliations

Zoologisches Forschungsmuseum Alexander Koenig, zmb, Adenauerallee 160, 53113, Bonn, Germany
Bernhard Misof, Patrick Kück, Katharina Misof & Karen Meusemann
Institut für Systematische Neurowissenschaften, Universitätsklinikum Hamburg Eppendorf, Martinistr. 52, 20246, Hamburg, Germany
Benjamin Meyer
Natural History Museum, London Department of Life Sciences, Cromwell Road, London, SW7 5BD, UK
Björn Marcus von Reumont
CSIRO Ecosystem Sciences, Australian National Insect Collection, Clunies Ross Street, Acton, ACT, Australia
Karen Meusemann

Corresponding author

Correspondence to Bernhard Misof.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

B.Mi., B.Me. conceived the study, designed the setup and performed all analyses. B.Mi. wrote the paper with comments and revisions from K.Me., K.Mi., P.K., B.v.R. and B.Me. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Misof, B., Meyer, B., von Reumont, B.M. et al. Selecting informative subsets of sparse supermatrices increases the chance to find correct trees. BMC Bioinformatics 14, 348 (2013). https://doi.org/10.1186/1471-2105-14-348

Received: 29 May 2013
Accepted: 17 September 2013
Published: 03 December 2013
DOI: https://doi.org/10.1186/1471-2105-14-348
