Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA

Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

Department of Biology, University of Florida, Gainesville, FL, USA

Abstract

Background

We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.

Results

We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.

Conclusions

Although this heuristic is not guaranteed to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

Background

Phylogenetic trees represent the evolutionary relationships of organisms. While recent advances in genomic sequencing technology and computational methods have enabled construction of extremely large phylogenetic trees (e.g.,

**(a) A collection of five input trees.** The same subtree with taxa A, B, C, and D is present in all input trees, and only the position of taxon E changes. **(b)** The majority rule consensus and maximum agreement subtrees of the five input trees in Figure

Alternate approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees

A less restrictive problem is to find frequent agreement subtrees (FAST), or subtrees that are found in many, but not necessarily all, of the input trees (see

Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa

In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. The heuristic works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees, which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step, which ensures that the size (i.e., number of taxa) of the FAST found cannot be increased further by adding a new taxon without reducing its frequency below a user-supplied frequency threshold. We test the effectiveness of this approach on simulated data sets with up to 1000 taxa and then demonstrate its performance on large, empirical data sets. Although our heuristic is not guaranteed to find all MFASTs or the largest MFAST, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets. Its performance is robust with respect to the number and size of the input trees.

Methods

In this section we describe our method that aims to find _{1}, _{2}, …, _{m}}. Our method follows from the observation that an MFAST is present in a large number of trees in

• **Phase 1.** Seed generation (Section “Phase one: Seed generation”). In the first phase, we identify small subtrees from the input trees that have a potential to be a part of an MFAST. We call each such subtree a

• **Phase 2.** Seed combination (Section “Phase two: Seed combination”). In the second phase, we construct an initial FAST by combining the seeds found in the first phase.

• **Phase 3.** Post-processing (Section “Phase three: Post-processing”). In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST.

First, we present the basic definitions needed for this paper in Section “Preliminaries and notation”. We then discuss each of the three phases above in detail.

Preliminaries and notation

In this section, we present the key definitions and notations needed to understand the rest of the paper. We describe our method using rooted and bifurcating phylogenetic trees. However, our method and definitions can easily be applied to unrooted or multifurcating trees with minor or no modifications. Also, we assume that all the taxa are placed at the leaf level nodes of the phylogenetic tree, and all the internal nodes are inferred ancestors. Figure

Definition 1 (**Clade**)

Let

**(a) A rooted, bifurcating phylogenetic tree built on five taxa labeled with and e.** The internal nodes are shown with

Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure _{1}.

Definition 2 (**Contraction**)

Let

The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term

Definition 3 (**Subtree**)

Let

If a tree has n taxa, then it has 2^{n} − 1 subtrees of any size including itself.
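The contraction operation and the subtree count can be illustrated with a small sketch. It assumes a representation that is not part of the paper: a rooted tree is a nested Python tuple whose leaves are taxon labels, and the helper names `leaves`, `canon` and `restrict` are hypothetical. Contracting every taxon outside a chosen set yields one subtree per non-empty taxon set, which is where the count 2^{n} − 1 comes from.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def canon(tree):
    """Sort children recursively so that unordered trees compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(child) for child in tree), key=repr))


def restrict(tree, keep):
    """Contract every taxon not in `keep` and suppress unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(sorted(kids, key=repr))


def all_subtrees(tree):
    """One subtree per non-empty subset of the taxa of `tree`."""
    taxa = sorted(leaves(tree))
    return [restrict(tree, set(subset))
            for size in range(1, len(taxa) + 1)
            for subset in combinations(taxa, size)]


# Hypothetical four-taxon tree (((a, b), c), d).
tree = ((("a", "b"), "c"), "d")
subtrees = all_subtrees(tree)
print(len(subtrees), 2 ** len(leaves(tree)) - 1)
```

Enumerating subtrees this way is exponential in the number of taxa, which is why the method restricts itself to small seeds rather than all subtrees.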

Definition 4 (**Frequency**)

Let _{1}, _{2}, … , _{m}} be a set of

Definition 5 (**FAST**)

Let _{1}, _{2}, … , _{m}} be a set of

We say that a FAST is

Definition 6 (**MFAST**)

Let _{1}, _{2}, …, _{m}} be a set of

Formally, given a set of phylogenetic trees _{1}, _{2}, …, _{m}} and a frequency cutoff,

A set of phylogenetic trees

_{
i
}

Number of trees in

Number of taxa in each input tree

_{
i
}

Frequency of the subtree

Frequency cutoff

_{
i
}

Size of a seed

Number of contractions used to create a seed

Phase one: Seed generation

The first phase extracts small subtrees from the given set of trees. From these subtrees we extract the basic building blocks which are used to construct MFASTs. We call these building blocks seeds. Conceptually each seed is a phylogenetic tree that contains a small subset of the taxa that make up the trees in

1.

2. Number of contractions (

3.

We explain the seed features with the help of Figures _{1} can be extracted from _{1} by choosing the clade rooted at _{2}. When _{1}, _{2} and _{3} can be obtained using one contraction (_{3}, _{2} and _{1} respectively) from the clade rooted at _{1}.

**_{1} is an input tree built on four taxa _{1}, _{2}, _{3} and _{4}.** The internal nodes of _{1} are labeled as _{0}, _{1} and _{2}. _{1} is the only seed obtained from _{1} when _{1} is identical to the clade rooted at _{2}. _{1}, _{2} and _{3} are the seeds extracted from _{1} when _{1} by contracting _{3}, _{2} and _{1} respectively.

**The set of input trees _{1}, _{2}, _{3} and the set of all nine potential seeds _{1}, _{2}, …, _{9} when the seed characteristics are set to k = 3 and c = 1.** All the potential seeds have three taxa as k = 3. We need one contraction from the input tree to obtain each seed.

The last feature denotes the number of trees in _{1}, _{2}, …, _{9} extracted from the three input trees using only one contraction. Among these, the frequency of _{1} is 1 as it is present in all the trees. Frequency of _{2} is about 0.67 for it is present in only two out of three trees (_{1} and _{2}). The frequency of the rest of the seeds is only about 0.33. Recall that, by definition, an MFAST is present in at least a fraction

Given the values of

Once we build our potential seed list for all the trees in

In Figure _{1} that has four taxa. For _{1} itself. We extract four potential seeds, each having three leaves from this tree. The potential seeds in this figure are given by _{1}, _{2}, _{5} and _{7} which we extract by contracting _{4}, _{3}, _{2} and _{1} respectively from _{1}.
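The four-taxon example above can be reproduced with a short sketch. It assumes, as the example suggests, that a seed of size k with c contractions is taken from a clade with exactly k + c taxa; the nested-tuple tree encoding, the taxon labels `a1` to `a4`, and the function names are hypothetical and only serve the illustration.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon not in `keep` and suppress unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(sorted(kids, key=repr))


def clades(tree):
    """Yield the subtree rooted at every internal node of `tree`."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree:
        yield from clades(child)


def potential_seeds(tree, k, c):
    """Seeds of size k obtained by contracting c taxa from clades of size k + c."""
    seeds = set()
    for clade in clades(tree):
        taxa = leaves(clade)
        if len(taxa) != k + c:
            continue
        for dropped in combinations(sorted(taxa), c):
            seeds.add(restrict(clade, taxa - set(dropped)))
    return seeds


# Hypothetical input tree with four taxa.
tree = ((("a1", "a2"), "a3"), "a4")
one_contraction = potential_seeds(tree, k=3, c=1)
no_contraction = potential_seeds(tree, k=3, c=0)
print(len(one_contraction), len(no_contraction))
```

With k = 3 and c = 1 the only clade of size four is the tree itself, so each of the four seeds corresponds to contracting one taxon, in line with the example above.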

Phase two: Seed combination

At the end of the first phase, we obtain a set of frequent seeds from the input trees. Notice that each seed is a FAST as each seed is present in sufficient number of trees specified by

We first define what it means to combine two seeds. In order to combine two seeds, it is a necessary condition that both seeds are present in at least one common tree _{1} and _{2} be two seeds and let _{1}, _{2} and _{1}, _{2} and _{1} and _{2} results in the tree that is equivalent to the one obtained by contracting the taxa in _{1} ∪ _{2}) from _{T} symbol. For instance we denote combining _{1} and _{2} with _{1} ⊕_{T} _{2}. To simplify our notation, whenever the identity of the reference tree is irrelevant, we will use the symbol ⊕ instead of ⊕_{T}.

Figure _{1} and _{2} are combined with the help of the reference tree _{1} and _{2} are subtrees of _{1} = {_{1},_{3},_{4}}, _{2} = {_{1},_{2},_{5},_{7}}. Thus, we build _{1} ⊕_{T} _{2} by contracting the taxa in _{1} ∪ _{2}) = {_{6},_{8}} from

**is the reference tree. **_{1} and _{2} are the seeds to be combined, both are present in _{1}, _{2}, _{3}, _{4}, _{5} and _{7} from

So far, we have explained how to combine two seeds _{1} and _{2} using a reference tree. It is possible that many trees in _{1} and _{2} using each such reference tree one by one exhaustively without ignoring any of such trees. We explain them next.

Consider two trees _{1} and _{2} from _{1} and _{2}.

• Case 1: _{1} ⊕_{T_{1}} _{2} = _{1} ⊕_{T_{2}} _{2}. In this case, it does not matter whether we use _{1} or _{2} as the reference tree. They will both lead to the same combined subtree. Thus, we use only one.

• Case 2: _{1} ⊕_{T_{1}} _{2} ≠ _{1} ⊕_{T_{2}} _{2}. In this case, the trees _{1} and _{2} lead to alternative combination topologies. So, we consider both of them separately.
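The two cases can be made concrete with a minimal sketch of the combine operation. It assumes a nested-tuple encoding of rooted trees and hypothetical helper names (`restrict`, `contains`, `combine`); none of these come from the paper's implementation. Combining two seeds with a reference tree is modeled as contracting from that tree every taxon that belongs to neither seed.

```python
def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def canon(tree):
    """Sort children recursively so that unordered trees compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(child) for child in tree), key=repr))


def restrict(tree, keep):
    """Contract every taxon not in `keep` and suppress unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(sorted(kids, key=repr))


def contains(tree, subtree):
    """True if `subtree` is obtained from `tree` by contracting taxa."""
    taxa = leaves(subtree)
    return taxa <= leaves(tree) and restrict(tree, taxa) == canon(subtree)


def combine(seed1, seed2, reference):
    """Combine two seeds using `reference`, which must contain both of them."""
    if not (contains(reference, seed1) and contains(reference, seed2)):
        raise ValueError("both seeds must be present in the reference tree")
    return restrict(reference, leaves(seed1) | leaves(seed2))


# Hypothetical seeds and reference trees.
s1 = (("a", "b"), "c")
s2 = (("a", "b"), "d")
T1 = ((("a", "b"), "c"), "d")
T2 = ((("a", "b"), "d"), "c")
T3 = ((("a", "b"), "c"), ("d", "e"))

via_T1 = combine(s1, s2, T1)
via_T2 = combine(s1, s2, T2)
via_T3 = combine(s1, s2, T3)
print(via_T1 == via_T3)  # Case 1: same combined subtree
print(via_T1 == via_T2)  # Case 2: alternative topologies
```

In this toy input T1 and T3 fall under Case 1, so only one of them needs to be used, while T1 and T2 fall under Case 2 and yield two candidate topologies.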

We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in

The next question we need to answer is which seed pairs should we combine? To answer this question we first make the following proposition.

Proposition 1

Assume that we are given a set of phylogenetic trees _{1} and _{2} be two seeds constructed from the trees in

Proof

For any _{1} and _{2} are subtrees of _{1} ⊕_{T} _{2}. Thus if _{1} ⊕_{T} _{2} is present in a tree, then both _{1} and _{2} are present in that tree. As a result, _{1} ⊕_{T} _{2}, _{1}, _{1} ⊕_{T} _{2}, _{2},

□

Proposition 1 states that as we combine pairs of seeds to grow them, their frequency monotonically decreases. This suggests that it is desirable to combine two seeds if both of them have large frequencies. This is because if one of them has a small frequency, regardless of the frequency of the other, the combined tree will have a small frequency. As a result its chance to grow into a larger tree through additional combine operations gets smaller. Following this intuition, we develop two approaches for combining the seeds.
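The monotonicity stated by Proposition 1 can be checked numerically on a toy input. The sketch below is an illustration under assumptions: trees are nested tuples, frequency is the fraction of input trees that contain a subtree, and all function names and the four example trees are hypothetical.

```python
def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def canon(tree):
    """Sort children recursively so that unordered trees compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(child) for child in tree), key=repr))


def restrict(tree, keep):
    """Contract every taxon not in `keep` and suppress unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(sorted(kids, key=repr))


def contains(tree, subtree):
    """True if `subtree` is obtained from `tree` by contracting taxa."""
    taxa = leaves(subtree)
    return taxa <= leaves(tree) and restrict(tree, taxa) == canon(subtree)


def frequency(subtree, trees):
    """Fraction of the input trees in which `subtree` is present."""
    return sum(contains(tree, subtree) for tree in trees) / len(trees)


# Hypothetical input: four trees on the taxa a, b, c, d.
trees = [
    ((("a", "b"), "c"), "d"),
    ((("a", "b"), "d"), "c"),
    ((("a", "b"), "c"), ("d", "e")),
    ((("a", "c"), "b"), "d"),
]
s1 = (("a", "b"), "c")
s2 = (("a", "b"), "d")
# Combine the two seeds using the first tree as the reference tree.
combined = restrict(trees[0], leaves(s1) | leaves(s2))

f1, f2, f12 = (frequency(x, trees) for x in (s1, s2, combined))
print(f1, f2, f12)
```

Here the combined subtree is less frequent than either seed, which is the behavior the proposition bounds: the frequency of a combination can never exceed the smaller of the two seed frequencies.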

1.

2.

Both approaches accept the list of seeds computed in the first phase as input and produce a larger FAST that is a combination of multiple seeds. Both of them also assume that the list of input seeds is already sorted in decreasing order of frequency. We discuss these approaches next.

In-order combination

The in-order combination approach follows from Proposition 1. It assumes that the seeds with higher frequencies have greater potential to be a part of an

Algorithm 1 In order combination

**for all** seeds _{i} **do**

^{′} ← _{i}

Mark _{i} as considered

**repeat**

_{j} ← seed with highest frequency among unconsidered seeds

Mark _{j} as considered

CUTOFF ←

^{′} ← ^{′}

**repeat**

Pick the next unconsidered tree

Mark all the trees that contain ^{′} ⊕_{T} _{j} as considered

**if** freq( **then**

^{′} ← ^{′} ⊕_{T} _{j}

CUTOFF ← freq(

**end if**

**until** Less than

^{′} ← ^{′}

Unmark all trees in

**until** all seeds are considered

**if** size of ^{′} ≥ size of **then**

^{′}

**end if**

Unmark all seeds

**end for**
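The greedy loop of the in-order combination can be sketched in Python as follows. This is a simplified illustration and not the authors' C implementation: trees are nested tuples, every reference tree is tried instead of stopping early, a set of already-tried topologies stands in for marking trees as considered, and all names (`best_combination`, `in_order_combination`, `cutoff`) are hypothetical.

```python
def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def canon(tree):
    """Sort children recursively so that unordered trees compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(child) for child in tree), key=repr))


def restrict(tree, keep):
    """Contract every taxon not in `keep` and suppress unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(sorted(kids, key=repr))


def contains(tree, subtree):
    """True if `subtree` is obtained from `tree` by contracting taxa."""
    taxa = leaves(subtree)
    return taxa <= leaves(tree) and restrict(tree, taxa) == canon(subtree)


def frequency(subtree, trees):
    """Fraction of the input trees in which `subtree` is present."""
    return sum(contains(tree, subtree) for tree in trees) / len(trees)


def best_combination(current, seed, trees, cutoff):
    """Most frequent topology of current combined with seed, or None if below cutoff."""
    best, best_freq, tried = None, cutoff, set()
    for ref in trees:
        if not (contains(ref, current) and contains(ref, seed)):
            continue
        candidate = restrict(ref, leaves(current) | leaves(seed))
        if candidate in tried:  # same topology as an earlier reference tree
            continue
        tried.add(candidate)
        freq = frequency(candidate, trees)
        if freq >= best_freq:
            best, best_freq = candidate, freq
    return best


def in_order_combination(seeds, trees, cutoff):
    """Grow a FAST from each frequent seed, adding seeds in decreasing frequency."""
    frequent = [s for s in seeds if frequency(s, trees) >= cutoff]
    ordered = sorted(frequent, key=lambda s: frequency(s, trees), reverse=True)
    best_fast = None
    for i, start in enumerate(ordered):
        fast = canon(start)
        for j, seed in enumerate(ordered):
            if i == j:
                continue
            grown = best_combination(fast, seed, trees, cutoff)
            if grown is not None:
                fast = grown
        if best_fast is None or len(leaves(fast)) >= len(leaves(best_fast)):
            best_fast = fast
    return best_fast


# Hypothetical input: taxon e changes position, a, b, c, d agree in every tree.
trees = [
    (((("a", "b"), "c"), "d"), "e"),
    (((("a", "b"), "c"), "e"), "d"),
    (((("a", "b"), "e"), "c"), "d"),
    (((("a", "e"), "b"), "c"), "d"),
]
seeds = [(("a", "b"), "c"), (("b", "c"), "d"), (("a", "b"), "e")]
result = in_order_combination(seeds, trees, cutoff=0.8)
print(sorted(leaves(result)), frequency(result, trees))
```

On this toy input the seed containing e is discarded as infrequent, and the two remaining seeds combine into a subtree on a, b, c and d that is present in every tree.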

In Algorithm 1 we first initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree denoted by _{i} _{i} as considered. We combine the FAST’ with a seed _{j} which has the highest frequency amongst the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed _{j} _{j} leading to different topologies. We use the trees in _{j} as guides to try only the topologies that exist in _{j} _{i} _{i}

Notice that in Algorithm 1 each seed _{i} can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top

Minimum overlap combination

The purpose of combining seeds is to construct a FAST that is large in size. Our in-order combination approach (Section “In-order combination”) aimed to maximize the frequency of the combined seeds. In this section, we develop our second approach, named

When we combine two seeds, the size of the resulting tree becomes at least as big as the size of each of these seeds. Formally let _{1} and _{2} be two seeds (i.e., trees). Let _{1} and _{2} be the set of taxa combined in _{1} and _{2}. We denote the size of a set, say _{1}, with |_{1}|. The size of the tree resulting from combination of _{1} and _{2} is |_{1}| + |_{2}| − |_{1} ∩ _{2}|. For a given fixed seed size, the first two terms of this formulation remain unchanged regardless of the seed. The last term determines the growth in the size of the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa.

Our second approach follows from the observation above. We introduce a criterion called the _{j} that will be combined with the current temporary FAST (i.e., FAST’). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST’ among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it uses the frequency as the tie breaker and chooses the one with the largest frequency among those.
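The selection rule of the minimum overlap approach reduces to a comparison on taxon sets. The following sketch assumes that each candidate seed is summarized by its taxon set and its frequency; the function names and the example values are hypothetical.

```python
def combined_size(taxa1, taxa2):
    """Size of the combined tree: |L1| + |L2| - |L1 ∩ L2|."""
    return len(taxa1) + len(taxa2) - len(taxa1 & taxa2)


def pick_min_overlap(current_taxa, candidates):
    """Pick the (taxa, frequency) pair with the least overlap with the current FAST.

    Ties on overlap are broken in favor of the higher frequency.
    """
    return min(candidates,
               key=lambda cand: (len(current_taxa & cand[0]), -cand[1]))


# Hypothetical temporary FAST and frequent, unconsidered seeds.
current = frozenset({"a", "b", "c"})
candidates = [
    (frozenset({"a", "b", "d"}), 0.90),  # overlap 2
    (frozenset({"c", "d", "e"}), 0.80),  # overlap 1
    (frozenset({"d", "e", "f"}), 0.70),  # overlap 0
    (frozenset({"e", "f", "g"}), 0.85),  # overlap 0, higher frequency
]

chosen_taxa, chosen_freq = pick_min_overlap(current, candidates)
print(sorted(chosen_taxa), chosen_freq, combined_size(current, chosen_taxa))
```

Choosing the seed with no shared taxa grows the FAST from three to six taxa in one step, whereas the most frequent seed would add only one taxon.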

Phase three: Post-processing

So far we described how to obtain seeds (Section “Phase one: Seed generation”) and how to combine them to construct FAST (Section “Phase two: Seed combination”). The two approaches we developed for combining seeds aim to maximize the size of FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e. false negatives). As a result no combination of seeds will lead to that maximal FAST. Second, even if all the taxa of a maximal FAST are parts of at least one seed, our algorithms will reject combining that seed with the FAST of the seeds if those seeds contain other taxa that are not part of the maximal FAST (i.e. false positives).

In the post-processing phase, we tackle the above-mentioned problems. Algorithm 2 describes the post-processing phase in detail. We do this by considering all taxa which are not already present in the

Algorithm 2 Post processing

INPUT = FAST from the seed combination phase

INPUT =

OUTPUT = Maximal FAST

RESULT ← FAST

**for all** _{i} not in FAST **do**

CUTOFF ←

t_RESULT ←

**repeat**

Pick the next unconsidered tree

RESULT’ ← RESULT ⊕_{T} _{i}

Mark all the trees that contain RESULT’ as considered

**if** frequency of RESULT’ ≥ CUTOFF **then**

t_RESULT ← RESULT’

CUTOFF ← frequency of RESULT’

**end if**

**until** Less than

RESULT ← t_RESULT

Unmark all trees in

**end for**

**return** RESULT

We expect the post-processing step to quickly identify the taxa that have a potential to be in an MFAST but might not have been considered during the seed generation and seed combination phases. At the end of the post-processing step we obtain a maximal FAST, which we report as a possible MFAST.
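A minimal Python sketch of the post-processing loop is given below. It is a simplified illustration rather than the authors' Perl implementation: trees are nested tuples, each missing taxon is tried against every reference tree that contains the current result, and the names `post_process` and `cutoff` are hypothetical.

```python
def leaves(tree):
    """Return the set of taxon labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def canon(tree):
    """Sort children recursively so that unordered trees compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(child) for child in tree), key=repr))


def restrict(tree, keep):
    """Contract every taxon not in `keep` and suppress unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(sorted(kids, key=repr))


def contains(tree, subtree):
    """True if `subtree` is obtained from `tree` by contracting taxa."""
    taxa = leaves(subtree)
    return taxa <= leaves(tree) and restrict(tree, taxa) == canon(subtree)


def frequency(subtree, trees):
    """Fraction of the input trees in which `subtree` is present."""
    return sum(contains(tree, subtree) for tree in trees) / len(trees)


def post_process(fast, trees, cutoff):
    """Add taxa one by one while the frequency stays at or above `cutoff`."""
    result = canon(fast)
    all_taxa = set().union(*(leaves(tree) for tree in trees))
    for taxon in sorted(all_taxa - leaves(result)):
        best, best_freq, tried = result, cutoff, set()
        for ref in trees:
            if taxon not in leaves(ref) or not contains(ref, result):
                continue
            candidate = restrict(ref, leaves(result) | {taxon})
            if candidate in tried:  # topology already evaluated
                continue
            tried.add(candidate)
            freq = frequency(candidate, trees)
            if freq >= best_freq:
                best, best_freq = candidate, freq
        result = best
    return result


# Hypothetical input: the FAST from phase two misses taxon d.
trees = [
    (((("a", "b"), "c"), "d"), "e"),
    (((("a", "b"), "c"), "e"), "d"),
    (((("a", "b"), "e"), "c"), "d"),
    (((("a", "e"), "b"), "c"), "d"),
]
fast = (("a", "b"), "c")
maximal = post_process(fast, trees, cutoff=0.75)
print(sorted(leaves(maximal)), frequency(maximal, trees))
```

In this example taxon d is added because the enlarged subtree is present in every tree, while taxon e is rejected because each of its placements occurs in only one tree.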

Complexity analysis of our method

In this section we discuss the complexity of our method in terms of the three phases involved in it. Let

Phase one

Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size

The complexity of finding whether a seed is present in a single tree is ^{2}
^{2} log

Phase two

Consider a set of

^{2}) iterations. The complexity of checking the frequency of each combined subtree is ^{2}
^{2}

^{2}
^{2} + ^{2}
^{2}

Phase three

Here, we consider the FAST obtained from each of the

Notice that each frequent seed has to appear in at least

That using minimum overlap combination is

In the two summations above, the second term is asymptotically larger than the first and the last terms. Thus, we can simplify the asymptotic time complexity of inorder and minimum overlap combinations as

and

respectively.

Results and discussion

This section evaluates the performance of our MFAST algorithm experimentally.

Implementation details

We implemented our MFAST algorithm using C and Perl. More specifically, we implemented the first two phases (seed generation and seed combination) in C and the third phase (post-processing) in Perl. We utilize the functions provided in the Newick Utilities

Methods compared against

We have compared our method against Phylominer

Evaluation Criteria

We evaluate our algorithm based on the size of the MFAST found. Larger MFASTs are preferable. When possible, we report the size of the optimal solution as well.

Test Environment

We ran our experiments on Linux servers equipped with dual AMD Opteron dual core processors running at 2.2 GHz and 3 GB of main memory to test the performance of our method.

Datasets

We test the performance and verify the results of our method on synthetic datasets and real datasets.

• Synthetic datasets. We built synthetic datasets in which we embedded an MFAST as described below. We characterize each synthetic dataset using five parameters. The first two parameters denote the size and number of trees in ^{′} taxa randomly in the MFAST. With probability ∊ we insert each taxon within the clade that contains the MFAST. With probability 1 − ∊ we insert it outside that clade. We then created

1. Tree size (

2. Number of trees (

3. MFAST frequency (

4. MFAST size (

5. Noise percentage (∊).

• Real datasets. We use two empirical datasets to evaluate the performance of our heuristic. The data sets contain 200 bootstrap trees generated from phylogenetic analysis of the Gymnosperm

Effects of number of input trees

In our first experiment, we analyze how the number of input trees in

We ran our algorithm on each of these datasets to find the size of the MFAST for

| Number of trees | MFAST size before post processing | MFAST size after post processing |
| --- | --- | --- |
| 50 | 14.5 | 16.0 |
| 100 | 15.3 | 15.8 |
| 200 | 14.4 | 15.4 |

The number of trees is set to 50, 100 and 200. For each number of trees we run our experiments on ten datasets. Each dataset contains trees with 100 taxa and an embedded MFAST of size 15. We report the average size of the MFAST obtained by our method across the ten datasets.

Effects of tree size

Our second experiment considers the impact of the number of taxa in the input trees contained in

Table

| Number of taxa | Embedded MFAST size | Reported MFAST size before post processing | Reported MFAST size after post processing |
| --- | --- | --- | --- |
| 100 | 15 | 15.3 | 15.8 |
| 250 | 38 | 32.3 | 38.8 |
| 500 | 75 | 43.7 | 76.0 |
| 1000 | 150 | 69.8 | 151.0 |

The tree size is set to 100, 250, 500 and 1000. For each tree size we run our experiments on ten datasets. Each dataset contains 100 trees with an embedded MFAST of size 15% of the input tree size. The second column shows the embedded MFAST size. The last two columns list the average size of the MFAST found by our method across the ten datasets.

The results also suggest that our method identifies a significant percentage of the taxa in the embedded MFAST after the second phase (i.e., before post-processing) when the tree size is small. As the tree size grows, it starts missing some taxa at this phase. It however recovers the missing taxa during the post-processing phase even for the largest tree size. This indicates that at the end of phase two our method could identify a backbone of the actual MFAST. The unidentified taxa at this phase are scattered throughout the clades in the input trees. Thus, there is no clade of size

Effects of noise percentage

Recall that the noise percentage ∊ denotes the percentage of taxa that are added inside the clade that contains the MFAST. As ∊ increases, the pairs of taxa in the MFAST get farther away from each other in the tree that contains it. As a result, fewer taxa from the MFAST will be contained in small clades of size

In this experiment, we answer the question above and analyze the effect of the noise percentage on the success of our method. We create synthetic datasets with various ∊ values. Particularly, we use ∊ = 20, 40 and 60%. We set the size of the embedded MFAST to ^{′} = 15, the tree size to

| Noise (%) | MFAST size before post processing | MFAST size after post processing |
| --- | --- | --- |
| 20 | 15.3 | 15.8 |
| 40 | 13.6 | 15.0 |
| 60 | 12.7 | 15.0 |

The size of the embedded MFAST in all the experiments is 15. We list the average size of the MFAST found by our method before and after the post processing phase.

The results suggest that our method can identify the embedded MFAST successfully even when the noise percentage is very high. We observe that the size of the MFAST found by our method before post processing decreases slowly as the amount of noise increases. This is not surprising as the taxa contained in the embedded MFAST get more spread out (and thus farther away from each other) in the trees in

Impact of seed creation

So far, in our experiments we consistently observed two major points for all the parameter settings (see Sections “Effects of number of input trees” to “Effects of noise percentage”): (i) Our method always finds a large subtree of the embedded MFAST after phase two. (ii) Our method always recovers the entire embedded MFAST after phase three. The second observation follows from the first: the outcome of phase two is large enough to build the entire MFAST precisely. The first observation, however, indicates that the set of seeds generated in phase one contains a significant percentage of the taxa in the embedded MFAST. In this section, we take a closer look into this phenomenon and explain why this is the case even for small values of seed size

The number of rooted bifurcating trees for a given set of

Consider a clade with

Let us denote one of these clades by U(

Recall that it suffices for our algorithm to have a

Assume that the MFAST size in the given set of trees

A lower bound to NS(

Figure

**The probability of finding at least one seed which contains a part of an MFAST**. The number of contractions **a**), we set the total number of trees **b**) we set

Overall, we conclude from this experiment that even small values of

Evaluation of state of the art methods

So far, we have shown that our method could successfully find the MFASTs contained in sets of trees

When we fix the number of trees and the number of taxa to 100, PAUP* was able to find the MAST for all datasets. As we grow the number of taxa to 250 or larger while keeping the number of trees at 100, PAUP* runs out of memory and fails to return any results. After reducing the number of trees to 50, PAUP* still runs out of memory and cannot report any results for more than 100 taxa.

The scalability problem of Phylominer is even more severe. Phylominer is able to compute the MFASTs on datasets with up to 20 taxa. However, as we increase the number of taxa further, its performance deteriorates quickly. When we set the number of taxa to 100, even with as few as 100 trees, Phylominer takes more than a week to report a result. Moreover, in our experiments, the maximum size of the subtrees it found on average contained fewer than 7 taxa, even though the size of the true MAST was 10.

Another interesting question about existing methods is whether the majority consensus rule can be used to find MFASTs. To evaluate this, we used the same three synthetic datasets used in Section “Effects of noise percentage”. Recall that each of these three datasets contains an MFAST of size 15 which is embedded in 80% of the trees. The datasets are created with 20%, 40% and 60% noise, indicating different levels of difficulty in recovering the embedded MFAST. We computed the 70% majority consensus tree. Notice that if the majority consensus rule can identify an MFAST, that would correspond to a bifurcating subtree topology in the consensus tree. In other words, a subtree is bifurcating in this experiment only if 70% or more of the input trees agree on the topology of that subtree. The resulting tree, however, was multifurcating for all three datasets. This means that the majority consensus rule could not recover even a smaller portion of the embedded tree while our method was able to locate the entire MFAST successfully (see Table

These results demonstrate that neither PAUP* nor Phylominer is well suited to finding agreement subtrees in larger datasets, whereas our method scales better in terms of both the number of taxa and the number of trees. When PAUP* runs to completion, we observed that it reports the true results. Recall from previous experiments that our method always found the true results on the same datasets as well as on larger datasets.

Empirical dataset experiments

To examine the performance of the MFAST method on real data, we performed experiments using 200 maximum likelihood bootstrap trees from a phylogenetic analysis of gymnosperms (959 taxa) and Saxifragales (950 taxa). Specifically, we evaluated how the performance of the MFAST algorithm was affected by the number of input trees and the size of the input trees.

Effects of number of input trees

We first examined the effect of input tree number on the size of MFAST. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 50 and 100 trees by randomly sampling from the original 200 trees without replacement. We compared the average size of the MFAST in the 50 and 100 tree data sets with the size of the MFAST in the original 200 tree data set. First, in all analyses, the post-processing step greatly increases the size of the MFAST, sometimes more than doubling it (Table

| Number of trees | Gymnosperms: Before | Gymnosperms: After | Gymnosperms: Only | Saxifragales: Before | Saxifragales: After | Saxifragales: Only |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 78.5 | 129.8 | 99.5 | 64.7 | 122.0 | 84.1 |
| 100 | 68.4 | 119.2 | 83.1 | 55.4 | 112.8 | 74.7 |
| 200 | 76.0 | 118.0 | 84.0 | 40.0 | 105.0 | 75.0 |

MFAST size before and after the post processing step. The size of the MFAST found by running only the post processing step is also shown. We run our method on the entire dataset that contains 200 trees as well as randomly selected subsets of 50 and 100 trees. We repeated the 50 and 100 tree experiments 10 times by randomly selecting the trees from the entire dataset and reported the average value.

The large gap between the MFAST sizes before and after the post processing suggests that phase three is the main reason behind the success of our method, and thus, the costly seed combination phase (i.e., phase two) may be unnecessary. To answer whether this conjecture is correct, we ran a variant of our method by disabling the second phase; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table

Effects of size of input tree

Next, we examined the effect of number of leaves in the input trees on the size of MFASTs. For both the gymnosperm and Saxifragales trees, we generated 10 sets of 200 input trees with 100, 250, and 500 taxa. To make each set, we randomly selected 100, 250, or 500 taxa, and we deleted all other taxa from the original sets of 200 trees. Thus, these sets of trees with 100, 250, or 500 taxa are subtrees of the original data sets. The size of the average MFAST increases with more taxa in the original trees (Table

| Number of leaves | Gymnosperms: Before | Gymnosperms: After | Gymnosperms: Only | Saxifragales: Before | Saxifragales: After | Saxifragales: Only |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | 41.2 | 56.1 | 43.5 | 43.5 | 50.7 | 38.5 |
| 250 | 67.2 | 88.5 | 63.0 | 62.3 | 76.2 | 54.6 |
| 500 | 91.6 | 123.0 | 74.9 | 52.0 | 86.7 | 62.9 |
| All | 76.0 | 118.0 | 84.0 | 40.0 | 105.0 | 75.0 |

MFAST size before and after the post processing step. The size of the MFAST found by running only the post processing step is also shown. We run our method on the entire dataset that contains all the taxa (last row) as well as randomly selected taxa subsets of size 100, 250 and 500. We repeated the 100, 250 and 500 taxa experiments 10 times by randomly selecting the taxa from the entire dataset and reported the average value.

Similar to the experiments in Section “Effects of number of input trees”, we investigated the gap between the MFAST sizes before and after the post processing step. We ran a variant of our method by disabling the second phase; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table

Effects of sample size

In our final experiment, we evaluated the effect of the

We carried out this experiment as follows. For both the gymnosperm and Saxifragales trees, we ran 10 sets of experiments for each sampling percentage of 2, 5, 10, 25, 50 and 100%. Thus, in total we ran 60 (6 × 10) experiments. Table

| Sampling percentage | MFAST size: Gymnosperms | MFAST size: Saxifragales |
| --- | --- | --- |
| 2 | 85.9 | 74.3 |
| 5 | 87.5 | 75.6 |
| 10 | 88.4 | 75.2 |
| 25 | 87.5 | 75.5 |
| 50 | 88.5 | 76.2 |
| 100 | 88.5 | 76.2 |

We run our method by randomly picking 2%, 5%, 10%, 25%, 50%, 100% of the seeds found in phase one for combination in phase two.

Conclusion

In this paper, we present a heuristic for finding maximum frequent agreement subtrees. The heuristic uses a multi-step approach which first identifies small candidate subtrees (called seeds) from the set of input trees, combines the seeds to build larger candidate MFASTs, and then performs a post-processing step to increase the size of the candidate MFASTs. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Although this heuristic is not guaranteed to find all MFASTs, it performs well using both simulated and empirical data sets. Its performance is relatively robust to the number of input trees and the size of the input trees, although with the larger data sets, the post-processing step becomes more important. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

Although the method we developed is described and implemented for the rooted and bifurcating trees, it can be trivially extended to multifurcating as well as unrooted trees. The central technical difference in the case of unrooted trees would be the definition of clade (see Definition 1) as the definition requires a root. A clade in an unrooted tree encompasses two sets of nodes; (i) a given set of taxa

Abbreviations

MAST: Maximum agreement subtree; FAST: Frequent agreement subtree; MFAST: Maximum frequent agreement subtree.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AR participated in algorithm development, implementation, experimental evaluation and writing of the paper. TK participated in algorithm development, experiment design and writing of the paper. GB participated in experiment design, dataset collection and writing of the paper. All authors read and approved the final manuscript.

Acknowledgments

This work was supported partially by the National Science Foundation (grants CCF-0829867 and IIS-0845439).