### Abstract

#### Background

We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.

#### Results

We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.

#### Conclusions

Although this heuristic is not guaranteed to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.

##### Keywords:

Phylogenetic trees; Frequent subtree

### Background

Phylogenetic trees represent the evolutionary relationships of organisms. While recent advances in genomic sequencing technology and computational methods have enabled construction of extremely large phylogenetic trees (e.g., [1-3]), assessing the support for phylogenetic hypotheses, and ultimately identifying well-supported relationships, remains a major challenge in phylogenetics. Support for a tree often is determined by methods such as nonparametric bootstrapping [4], jackknifing [5], or Bayesian MCMC sampling (e.g., [6]), which generate a collection of trees with identical taxa representing the range of possible phylogenetic relationships. These trees can be summarized in a consensus tree (see [7]). Consensus methods can highlight support for specific nodes in a tree, but they also may obscure highly supported subtrees. For example, in Figure 1, the subtree containing taxa A, B, C, and D is present in all five input trees. However, due to the uncertain placement of taxon E, the majority rule consensus tree implies that the clades in the tree have relatively low (60%) support.

Alternative approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees [8]. For example, in Figure 1 the MAST includes taxa A, B, C, and D. Finding the MAST is an NP-hard problem [9], although efficient algorithms exist to compute the MAST in some cases (e.g., [9-17]). In practice, since any difference in any single tree will reduce the size of the MAST, the MAST is often quite small, limiting its usefulness.

A less restrictive problem is to find frequent agreement subtrees (FASTs), or subtrees that are found in many, but not necessarily all, of the input trees (see [18]). In this problem, a subtree is declared frequent if it is present in at least as many trees as a user-supplied frequency threshold. Several algorithmic approaches have been suggested to identify FASTs, and specifically the maximum FASTs (MFASTs), or FASTs that contain the largest number of taxa. A variant of this problem seeks the maximal FASTs, i.e., FASTs that are not contained in any other FAST. Notice that an MFAST is a *maximal* FAST; however, the converse is not necessarily true. Zhang and Wang defined algorithms, implemented in Phylominer, to identify FASTs from a collection of phylogenetic trees [19,20]. These algorithms are guaranteed to find all FASTs but they may be prohibitively slow for data sets larger than 20 taxa. Cranston and Rannala implemented Metropolis-Hastings and Threshold Accepting searches to identify large FASTs from a Bayesian posterior distribution of phylogenetic trees [21]. This approach can handle thousands of input trees but it may not be feasible if the trees have more than 100 taxa [21].

Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa [22-24]. However, unlike MAST or FAST approaches, they do not provide guarantees about the support for the remaining taxa.

In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. Towards this goal, we develop a heuristic solution that works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, it performs a post-processing step. This step ensures that the size (i.e., number of taxa) of the FAST found cannot be increased further by adding a new taxon without reducing its frequency below a user-supplied frequency threshold. We demonstrate that this heuristic can easily handle data sets with 1000 taxa. We test the effectiveness of this approach on simulated data sets and then demonstrate its performance on large, empirical data sets. Although our heuristic is not guaranteed to find all MFASTs or the largest MFAST in theory, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets. Its performance is robust with respect to the number of input trees and the size of the input trees.

### Methods

In this section we describe our method that aims to find *Maximum Frequent Agreement SubTrees (MFASTs)* in a given set of *m* phylogenetic trees $\mathcal{T}$ = {*T*_{1}, *T*_{2}, …, *T*_{m}}. Our method follows from the observation that an MFAST is present in a large number of trees in $\mathcal{T}$. The method builds MFASTs bottom up from small subtrees of taxa in the trees in $\mathcal{T}$. Briefly, it works in three phases.

• **Phase 1.** Seed generation (Section “Phase one: Seed generation”). In the first phase, we identify small subtrees from the input trees that have a potential to be a part of an MFAST. We call each such subtree a *seed*.

• **Phase 2.** Seed combination (Section “Phase two: Seed combination”). In the second phase, we construct an initial FAST by combining the seeds found in the first phase.

• **Phase 3.** Post-processing (Section “Phase three: Post-processing”). In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST.

First, we present the basic definitions needed for this paper in Section “Preliminaries and notation”. We then discuss each of the three phases above in detail.

#### Preliminaries and notation

In this section, we present the key definitions and notations needed to understand
the rest of the paper. We describe our method using rooted and bifurcating phylogenetic
trees. However, our method and definitions can easily be applied to unrooted or multifurcating
trees with minor or no modifications. Also, we assume that all the taxa are placed
at the leaf level nodes of the phylogenetic tree, and all the internal nodes are inferred
ancestors. Figure 2(a) shows a sample phylogenetic tree built on five taxa. We define the *size* of a tree as the number of taxa in that tree. We start by defining key terms.

#### Definition 1 (**Clade**)

Let *T* be a phylogenetic tree. Given an internal node of *T*, we define the set of all nodes and edges of *T* contained under that node as the *clade* rooted at that node.

**Figure 2.** **(a)** A rooted, bifurcating phylogenetic tree *T* built on five taxa labeled with *a*, *b*, *c*, *d* and *e*. The internal nodes are shown with *x*_{0}, *x*_{1}, *x*_{2} and *x*_{3}. **(b)** A clade of *T* rooted at *x*_{1}. **(b)** and **(c)** Two subtrees of *T* obtained by contracting the taxa sets {*d, e*} and {*a, e*}.

Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure
2(b) depicts the clade of the tree in Figure 2(a) rooted at *x*_{1}.

#### Definition 2 (**Contraction**)

Let *T* be a phylogenetic tree with *n* taxa. The contraction operation transforms *T* into a tree with *n*−1 taxa by removing a given taxon in *T* along with the edge that connects that taxon to *T*.

The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term *subtree* to denote a tree that is obtained by applying contractions to an arbitrary set of taxa in a given tree. The formal definition is as follows.

#### Definition 3 (**Subtree**)

Let *T* and *T’* be two phylogenetic trees. We say that *T’* is a *subtree* of *T* if *T* can be transformed into *T’* by applying a series of contractions on *T*.

If a tree *T’* is a subtree of another tree *T*, we say that *T’* is *present* in *T*. Notice that a clade is always a subtree, but the converse is not always true. Figures 2(b) and 2(c) illustrate two subtrees of the tree in Figure 2(a). Let us denote the number of combinations of *k* taxa from a set of *n* taxa with $\binom{n}{k}$. In general, if a tree has *n* taxa, then that tree contains $\binom{n}{k}$ subtrees with *k* taxa. As a consequence, that tree contains $\sum_{k=1}^{n} \binom{n}{k}$ = 2^{n} − 1 subtrees of any size including itself.
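The contraction and presence definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: trees are assumed to be nested tuples with string leaves, the helper names (`leaves`, `restrict`, `canonical`, `is_present`) are hypothetical, and the five-taxon tree is an invented example rather than the exact topology of Figure 2(a).

```python
from itertools import combinations


def leaves(tree):
    """Set of taxon labels in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; None if no taxon survives."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if len(kids) > 1:
        return tuple(kids)
    return kids[0] if kids else None  # drop internal nodes left with one child


def canonical(tree):
    """Sort children recursively so equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    """True if `sub` can be obtained from `tree` by a series of contractions."""
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


# Hypothetical five-taxon tree; not necessarily the topology of Figure 2(a).
T = ((("a", "b"), "c"), ("d", "e"))

without_d_e = restrict(T, leaves(T) - {"d", "e"})  # contract d and e
without_a_e = restrict(T, leaves(T) - {"a", "e"})  # contract a and e

# One subtree per non-empty subset of taxa, hence 2**n - 1 subtrees in total.
taxa = sorted(leaves(T))
all_subtrees = {
    canonical(restrict(T, set(subset)))
    for size in range(1, len(taxa) + 1)
    for subset in combinations(taxa, size)
}
```

Comparing canonical forms treats two rooted trees as equal regardless of the left-to-right order of children, which is one simple way to test presence on small examples; it is not meant to match the *O*(*n* log *n*) presence test assumed in the complexity analysis.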

#### Definition 4 (**Frequency**)

Let $\mathcal{T}$ = {*T*_{1}, *T*_{2}, … , *T*_{m}} be a set of *m* phylogenetic trees and *T* be a phylogenetic tree. Let us denote the number of trees in $\mathcal{T}$ in which *T* is present with the variable *m’*. We define the frequency of *T* in $\mathcal{T}$ as

$$\mathit{freq}(T, \mathcal{T}) = \frac{m'}{m}$$

#### Definition 5 (**FAST**)

Let $\mathcal{T}$ = {*T*_{1}, *T*_{2}, … , *T*_{m}} be a set of *m* phylogenetic trees and *T* be a phylogenetic tree. Let *γ* be a number in the [0, 1] interval that denotes the frequency cutoff. We say that *T* is a Frequent Agreement SubTree (FAST) of $\mathcal{T}$ if its frequency in $\mathcal{T}$ is at least *γ* (i.e., $\mathit{freq}(T, \mathcal{T}) \ge \gamma$).
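As an illustration of Definitions 4 and 5, the following sketch computes the frequency of a candidate subtree and tests it against a cutoff. It assumes nested-tuple trees and hypothetical helper names; the four input trees are invented for this example (taxon E is placed inconsistently and the last tree disagrees on the remaining taxa).

```python
def leaves(tree):
    """Set of taxon labels in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; None if no taxon survives."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if len(kids) > 1:
        return tuple(kids)
    return kids[0] if kids else None


def canonical(tree):
    """Sort children recursively so equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    """True if `sub` can be obtained from `tree` by contractions."""
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def frequency(sub, trees):
    """freq(T, trees) = m' / m, the fraction of trees in which `sub` is present."""
    return sum(is_present(sub, t) for t in trees) / len(trees)


def is_fast(sub, trees, gamma):
    """Definition 5: `sub` is a FAST if its frequency is at least gamma."""
    return frequency(sub, trees) >= gamma


# Invented input collection (m = 4).
trees = [
    ((("A", "B"), ("C", "D")), "E"),
    ((("A", "E"), "B"), ("C", "D")),
    (("A", "B"), (("C", "E"), "D")),
    ((("A", "C"), "B"), ("D", "E")),
]
candidate = (("A", "B"), ("C", "D"))
f = frequency(candidate, trees)  # present in the first three trees only
```

With these trees the candidate is a FAST for any cutoff up to 0.75, and it stops being one for larger cutoffs.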

We say that a FAST is *maximal* if there is no other FAST that contains all the taxa in that FAST. Clearly, larger
FASTs indicate biologically more relevant consensus patterns. The following definition
summarizes this.

#### Definition 6 (**MFAST**)

Let $\mathcal{T}$ = {*T*_{1}, *T*_{2}, …, *T*_{m}} be a set of *m* phylogenetic trees. Let *γ* be a number in the [0, 1] interval that denotes the frequency cutoff. A FAST *T* of $\mathcal{T}$ is a Maximum Frequent Agreement SubTree (MFAST) of $\mathcal{T}$ if there is no other FAST *T’* of $\mathcal{T}$ that has a larger size than *T*.

Formally, given a set of phylogenetic trees $\mathcal{T}$ = {*T*_{1}, *T*_{2}, …, *T*_{m}} and a frequency cutoff *γ*, we would like to find the MFASTs in $\mathcal{T}$. We develop an algorithm that aims to solve this problem. Table 1 lists the variables used throughout the rest of this paper.

**Table 1.** Commonly used variables and functions in this paper

#### Phase one: Seed generation

The first phase extracts small subtrees from the given set of trees. From these subtrees we extract the basic building blocks which are used to construct MFASTs. We call these building blocks seeds. Conceptually each seed is a phylogenetic tree that contains a small subset of the taxa that make up the trees in $\mathcal{T}$. We characterize each seed with three features that are listed below. We elaborate on each feature later in this section.

1. *Seed size* (*k*) is the number of taxa in the seed.

2. *Number of contractions* (*c*) is the number of taxa we prune from a clade taken from an input tree in order to extract the seed.

3. *Frequency* (*f*) is the fraction of input trees in which the seed is present.

We explain the seed features with the help of Figures 3 and 4. The first two characteristics explain how a seed can be found in one of the trees in $\mathcal{T}$. They indicate that there is a clade of a tree in $\mathcal{T}$ such that this clade contains *k* + *c* taxa and it can be transformed into that seed after *c* contractions from that clade. For instance in Figure 3, when *k* = 2 and *c* = 0, only seed *S*_{1} can be extracted from *T*_{1} by choosing the clade rooted at *x*_{2}. When *k* = 2 and *c* = 1, seeds *S*_{1}, *S*_{2} and *S*_{3} can be obtained using one contraction (*a*_{3}, *a*_{2} and *a*_{1} respectively) from the clade rooted at *x*_{1}.

**Figure 3.** *T*_{1} is an input tree built on four taxa *a*_{1}, *a*_{2}, *a*_{3} and *a*_{4}. The internal nodes of *T*_{1} are labeled as *x*_{0}, *x*_{1} and *x*_{2}. *S*_{1} is the only seed obtained from *T*_{1} when *k* = 2 and *c* = 0. That is, *S*_{1} is identical to the clade rooted at *x*_{2}. *S*_{1}, *S*_{2} and *S*_{3} are the seeds extracted from *T*_{1} when *k* = 2 and *c* = 1. They are all extracted from the clade rooted at *x*_{1} by contracting *a*_{3}, *a*_{2} and *a*_{1} respectively.

**Figure 4.** The set of input trees *T*_{1}, *T*_{2}, *T*_{3} and the set of all nine potential seeds *S*_{1}, *S*_{2}, …, *S*_{9} when the seed characteristics are set to *k* = 3 and *c* = 1. All the potential seeds have three taxa as *k* = 3. We need one contraction from the input tree to obtain each seed. *S*_{1} has frequency 1.0 as it is present in *T*_{1}, *T*_{2} and *T*_{3}. Seed *S*_{2} has frequency ∼0.67 as it is present in *T*_{1} and *T*_{2}. The remaining seeds have frequency ∼0.33 as each appears in only one of the three trees.

The last feature denotes the fraction of trees in $\mathcal{T}$ in which the seed is present. For example in Figure 4, there are nine seeds *S*_{1}, *S*_{2}, …, *S*_{9} extracted from the three input trees using only one contraction. Among these, the frequency of *S*_{1} is 1 as it is present in all the trees. The frequency of *S*_{2} is about 0.67 as it is present in only two out of three trees (*T*_{1} and *T*_{2}). The frequency of each of the remaining seeds is only about 0.33. Recall that, by definition, an MFAST is present in at least a fraction *γ* of the trees in $\mathcal{T}$. Therefore, we consider only the seeds whose frequency values are equal to or greater than this number (i.e., *f* ≥ *γ*).

Given the values of *k*, *c* and *γ*, we extract all the seeds which possess the desired feature values from the set of input trees as follows. In the Newick string representation of a tree, a pair of matching parentheses corresponds to an internal node in the tree. The number of taxa in the clade rooted at this internal node is given by the number of labels between the two matching parentheses. Following from this observation, we scan the Newick string of each tree one by one. For each such tree, we identify the clades which have *k* + *c* taxa. Notice that, if a tree contains *n* taxa, then it contains at most $\frac{n}{k+c}$ clades of size *k* + *c* as no two such clades can contain common taxa. We then extract all combinations of *k* taxa from each of these clades by contracting the remaining *c* taxa. The number of ways this can be done is $\binom{k+c}{k}$. Notice that all the small trees extracted this way possess the first two characteristics explained above. At this point, however, we do not know their frequencies. Therefore, we call them *potential seeds*. It is worth mentioning that the same seed might be extracted from different trees. As we extract a new potential seed, before storing it in the list of potential seeds, we check if it is already present there. We include it in the potential seed list only if it does not exist there yet. Otherwise, we ignore it. This way, we maintain only one copy of each seed.

Once we build our potential seed list for all the trees in $\mathcal{T}$, we go over the potential seeds one by one and compute the frequency of each in $\mathcal{T}$ as the fraction of trees that contain it. We filter out all the potential seeds whose frequencies are less than the frequency cutoff. We keep the remaining ones as the list of seeds along with the frequency of each seed.

In Figure 4, consider the tree *T*_{1} that has four taxa. For *k* = 3 and *c* = 1, there is only one clade of size *k* + *c* = 4 which is the tree *T*_{1} itself. We extract four potential seeds, each having three leaves from this tree.
The potential seeds in this figure are given by *S*_{1}, *S*_{2}, *S*_{5} and *S*_{7} which we extract by contracting *a*_{4}, *a*_{3}, *a*_{2} and *a*_{1} respectively from *T*_{1}.
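The seed generation phase described above can be sketched as follows. This is a minimal illustration under stated assumptions: trees are nested tuples rather than Newick strings, all function names are hypothetical, and the caterpillar tree `T1` is only consistent with the textual description of Figure 3 while `T2` is invented. A Python set plays the role of the duplicate check on the potential seed list.

```python
from itertools import combinations


def leaves(tree):
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; None if no taxon survives."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if len(kids) > 1:
        return tuple(kids)
    return kids[0] if kids else None


def canonical(tree):
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def clades(tree):
    """Yield the clade rooted at every internal node of the tree."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree:
        yield from clades(child)


def potential_seeds(trees, k, c):
    """All distinct k-taxon trees obtained by c contractions from a (k+c)-taxon clade."""
    seeds = set()  # the set keeps only one copy of each seed
    for tree in trees:
        for clade in clades(tree):
            taxa = leaves(clade)
            if len(taxa) == k + c:
                for kept in combinations(sorted(taxa), k):
                    seeds.add(canonical(restrict(clade, set(kept))))
    return seeds


def frequent_seeds(trees, k, c, gamma):
    """Map each seed with frequency >= gamma to its frequency."""
    result = {}
    for seed in potential_seeds(trees, k, c):
        freq = sum(is_present(seed, t) for t in trees) / len(trees)
        if freq >= gamma:
            result[seed] = freq
    return result


T1 = ((("a1", "a2"), "a3"), "a4")  # caterpillar tree, as described for Figure 3
T2 = ((("a1", "a2"), "a4"), "a3")  # invented second tree

seeds_c0 = potential_seeds([T1], k=2, c=0)
seeds_c1 = potential_seeds([T1], k=2, c=1)
pool = potential_seeds([T1, T2], k=3, c=1)
kept = frequent_seeds([T1, T2], k=3, c=1, gamma=0.75)
```

For `T1` alone the sketch yields one seed with *k* = 2, *c* = 0 and three seeds with *k* = 2, *c* = 1, matching the counts given for Figure 3.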

#### Phase two: Seed combination

At the end of the first phase, we obtain a set of frequent seeds from the input trees. Notice that each seed is a FAST, as each seed is present in at least the fraction of trees specified by *γ*. These seeds are the basic building blocks of our method. In the second phase of our method, we combine subsets of these seeds to construct larger FASTs.

We first define what it means to combine two seeds. In order to combine two seeds, it is a necessary condition that both seeds are present in at least one common tree *T* in $\mathcal{T}$. We call such a tree *T* the *reference tree*. We combine two seeds with the guidance of a reference tree. Let *S*_{1} and *S*_{2} be two seeds and let *T* be their reference tree. Let *L*_{1}, *L*_{2} and *L* be the set of taxa in *S*_{1}, *S*_{2} and *T* respectively. Combining *S*_{1} and *S*_{2} results in the tree that is equivalent to the one obtained by contracting the taxa in *L* − (*L*_{1} ∪ *L*_{2}) from *T*. For simplicity, we will denote the combine operation using *T* as the reference tree with the ⊕_{T} symbol. For instance, we denote combining *S*_{1} and *S*_{2} with *T* being the reference tree as *S*_{1} ⊕_{T} *S*_{2}. To simplify our notation, whenever the identity of the reference tree is irrelevant, we will use the symbol ⊕ instead of ⊕_{T}.

Figure 5 demonstrates how two seeds *S*_{1} and *S*_{2} are combined with the help of the reference tree *T*. In this figure, both *S*_{1} and *S*_{2} are subtrees of *T*. Thus, it is possible to use *T* as the reference tree. We have *L*_{1} = {*a*_{1},*a*_{3},*a*_{4}}, *L*_{2} = {*a*_{1},*a*_{2},*a*_{5},*a*_{7}}. Thus, we build *C* = *S*_{1} ⊕ _{T}*S*_{2} by contracting the taxa in *L* − (*L*_{1} ∪ *L*_{2}) = {*a*_{6},*a*_{8}} from *T*.

**Figure 5.** *T* is the reference tree. *S*_{1} and *S*_{2} are the seeds to be combined; both are present in *T*. *C* is obtained by extracting from *T* the subtree containing taxa *a*_{1}, *a*_{2}, *a*_{3}, *a*_{4}, *a*_{5} and *a*_{7}.
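The combine operation ⊕_{T} can be sketched as a restriction of the reference tree to the union of the two seeds' taxa. The code below is a minimal illustration with hypothetical helper names and nested-tuple trees; the eight-taxon reference tree is invented, and only the taxon sets *L*_{1} and *L*_{2} are taken from the Figure 5 example.

```python
def leaves(tree):
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; None if no taxon survives."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if len(kids) > 1:
        return tuple(kids)
    return kids[0] if kids else None


def canonical(tree):
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def combine(s1, s2, ref):
    """S1 (+)_T S2: contract the taxa in L - (L1 | L2) from the reference tree."""
    if not (is_present(s1, ref) and is_present(s2, ref)):
        raise ValueError("both seeds must be present in the reference tree")
    return canonical(restrict(ref, leaves(s1) | leaves(s2)))


# Invented reference tree on a1..a8; the topology of Figure 5 may differ.
T = (((("a1", "a2"), "a3"), ("a4", "a5")), (("a6", "a7"), "a8"))
S1 = restrict(T, {"a1", "a3", "a4"})        # L1 from the text
S2 = restrict(T, {"a1", "a2", "a5", "a7"})  # L2 from the text
C = combine(S1, S2, T)                      # a6 and a8 are contracted
```

Both seeds remain present in the combined tree `C`, which is the observation that Proposition 1 builds on.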

So far, we have explained how to combine two seeds *S*_{1} and *S*_{2} using a reference tree. It is possible that many trees in $\mathcal{T}$ have both seeds present in them. Thus, one question is: which of these trees should we use as the reference tree to combine the two seeds? The brief answer is that all such trees need to be considered. However, we make several observations that help us avoid combining *S*_{1} and *S*_{2} exhaustively with each such reference tree one by one, while still accounting for all of these trees. We explain them next.

Consider two trees *T*_{1} and *T*_{2} from $\mathcal{T}$ in which both seeds are present. There are two cases for *T*_{1} and *T*_{2}.

• Case 1: *S*_{1} ⊕_{T1}*S*_{2} = *S*_{1} ⊕_{T2}*S*_{2}. In this case, it does not matter whether we use *T*_{1} or *T*_{2} as the reference tree. They will both lead to the same combined subtree. Thus, we
use only one.

• Case 2: *S*_{1} ⊕_{T1}*S*_{2} ≠ *S*_{1} ⊕_{T2}*S*_{2}. In this case, the trees *T*_{1} and *T*_{2} lead to alternative combination topologies. So, we consider both of them separately.

We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in $\mathcal{T}$. We mark the trees that contain it as already considered as reference trees and never use them as reference for the same seed pair again. This is because those trees fall into the first case described above. This way, we also obtain the frequency of the combined subtree in $\mathcal{T}$. If the number of unmarked trees is too small (i.e., less than *γ* × *m*), then even if all the remaining trees agree on the same combined topology for the two seeds under consideration, they are not sufficient to make it a FAST. Thus, we do not use any of the remaining trees as reference for those two seeds. Otherwise, we pick another unmarked tree arbitrarily and repeat the same process until we run out of reference trees.
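The marking strategy above can be sketched as a loop over the trees that contain both seeds. This is a minimal illustration with hypothetical names (`combined_candidates` and the helpers) and invented four-taxon examples; it groups trees by the combined topology they induce, which is one straightforward way to realize the marking.

```python
def leaves(tree):
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; None if no taxon survives."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if len(kids) > 1:
        return tuple(kids)
    return kids[0] if kids else None


def canonical(tree):
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def combined_candidates(s1, s2, trees, gamma):
    """Map each combined topology with frequency >= gamma to its frequency."""
    taxa = leaves(s1) | leaves(s2)
    needed = gamma * len(trees)
    # Only trees that contain both seeds can serve as reference trees.
    unmarked = [t for t in trees if is_present(s1, t) and is_present(s2, t)]
    found = {}
    while unmarked and len(unmarked) >= needed:
        combo = canonical(restrict(unmarked[0], taxa))  # pick a reference tree
        rest = [t for t in unmarked if canonical(restrict(t, taxa)) != combo]
        support = len(unmarked) - len(rest)             # trees that contain combo
        if support >= needed:
            found[combo] = support / len(trees)
        unmarked = rest                                 # mark the agreeing trees
    return found


# Invented collection: two trees support each of two alternative topologies.
trees = [
    (("a", "b"), ("c", "d")),
    ((("a", "b"), "c"), "d"),
    ((("a", "b"), ("c", "d")), "e"),
    (((("a", "b"), "c"), "e"), "d"),
]
both = combined_candidates(("a", "b"), ("c", "d"), trees, gamma=0.5)
none = combined_candidates(("a", "b"), ("c", "d"), trees, gamma=0.75)
```

With *γ* = 0.5 both alternative topologies are reported. With *γ* = 0.75 the loop stops after the first reference tree, because the two unmarked trees that remain could not reach the cutoff.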

The next question we need to answer is which seed pairs should we combine? To answer this question we first make the following proposition.

#### Proposition 1

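Assume that we are given a set of phylogenetic trees $\mathcal{T}$. Let *S*_{1} and *S*_{2} be two seeds constructed from the trees in $\mathcal{T}$. For all trees *T* ∈ $\mathcal{T}$, we have the following inequality

$$\mathit{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \min\{\mathit{freq}(S_1, \mathcal{T}), \mathit{freq}(S_2, \mathcal{T})\}$$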

#### Proof

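For any *T*, both *S*_{1} and *S*_{2} are subtrees of *S*_{1} ⊕_{T} *S*_{2}. Thus if *S*_{1} ⊕_{T} *S*_{2} is present in a tree, then both *S*_{1} and *S*_{2} are present in that tree. As a result, *freq*(*S*_{1} ⊕_{T} *S*_{2}, $\mathcal{T}$) ≤ *freq*(*S*_{1}, $\mathcal{T}$) and *freq*(*S*_{1} ⊕_{T} *S*_{2}, $\mathcal{T}$) ≤ *freq*(*S*_{2}, $\mathcal{T}$). Hence,

$$\mathit{freq}(S_1 \oplus_T S_2, \mathcal{T}) \le \min\{\mathit{freq}(S_1, \mathcal{T}), \mathit{freq}(S_2, \mathcal{T})\}$$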

□

Proposition 1 states that as we combine pairs of seeds to grow them, their frequency never increases. This suggests that it is desirable to combine two seeds if both of them have large frequencies. This is because if one of them has a small frequency, regardless of the frequency of the other, the combined tree will have a small frequency. As a result, its chance to grow into a larger tree through additional combine operations gets smaller. Following this intuition, we develop two approaches for combining the seeds.

1. *In-order Combination* (Section “In-order combination”).

2. *Minimum Overlap Combination* (Section “Minimum overlap combination”).

Both approaches accept the list of seeds computed in the first phase as input and produce a larger FAST that is a combination of multiple seeds. Both of them also assume that the list of input seeds is already sorted in decreasing order of frequency. We discuss these approaches next.

#### In-order combination

The in-order combination approach follows from Proposition 1. It assumes that the seeds with higher frequencies have greater potential to be a part of an *MFAST*. It exploits this assumption as follows. First, it picks a seed as the starting point to create a FAST. It then grows this seed by combining it with other seeds, starting from the most frequent one, as long as the frequency of the resulting tree remains at least as large as the given cutoff *γ*. It repeats this process by trying each seed as the starting point. Algorithm 1 presents this approach.

#### Algorithm 1 In order combination

*FAST* ← *∅*

**for all** seeds *S*_{i} **do**

*FAST*^{′} ← *S*_{i}

Mark *S*_{i} as considered

**repeat**

*S*_{j} ← seed with highest frequency among unconsidered seeds

Mark *S*_{j} as considered

CUTOFF ← *γ*

*t*_*FAST*^{′} ← *FAST*^{′}

**repeat**

Pick the next unconsidered tree *T* as reference

Mark all the trees that contain *FAST*^{′} ⊕_{T} *S*_{j} as considered

**if** frequency of *FAST*^{′} ⊕_{T} *S*_{j} ≥ CUTOFF **then**

*t*_*FAST*^{′} ← *FAST*^{′} ⊕_{T} *S*_{j}

**end if**

**until** Less than *γ* × *m* unmarked reference trees are left in $\mathcal{T}$

*FAST*^{′} ← *t*_*FAST*^{′}

**until** all seeds are considered

**if** size of *FAST*^{′} ≥ size of *FAST* **then**

*FAST* ← *FAST*^{′}

**end if**

Unmark all seeds

**end for**

In Algorithm 1 we first initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree, denoted by *FAST’*, with the seed *S*_{i} under consideration and mark *S*_{i} as considered. We combine FAST’ with a seed *S*_{j} which has the highest frequency amongst the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed *S*_{j} as added to FAST’. There can be alternative ways to combine FAST’ with *S*_{j}, leading to different topologies. We use the trees in $\mathcal{T}$ that contain both FAST’ and *S*_{j} as guides to try only the topologies that exist in $\mathcal{T}$. We stop constructing alternative topologies as soon as we ensure that there is not a sufficient number of trees left to yield a frequency of *γ*. We set FAST’ to the combined seed if the combined seed has a large enough frequency. We then consider the seed with the next highest frequency for addition and repeat this step until all *S*_{j} have been considered. If the resulting temporary FAST is larger than FAST, we replace the smaller FAST with the larger one. In the next iteration, we initialize *FAST’* with the next *S*_{i}. Using this approach we can initialize *FAST’* with all *S*_{i}. Alternatively, if the user wishes to limit the amount of time spent using a *maximum time cutoff*, we stop the outermost loop (i.e., alternative initializations of FAST’) as soon as the allowed running time budget is reached.

Notice that in Algorithm 1 each seed *S*_{i} can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top *k* FASTs with the largest size instead if the user is looking for *k* alternative maximal FASTs.
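The in-order combination loop can be sketched as follows. This is an illustrative reading of Algorithm 1, not the authors' code: trees are nested tuples, all names are hypothetical, the seed list is assumed to be pre-sorted by decreasing frequency, and raising the cutoff after each accepted topology is an assumption borrowed from the post-processing algorithm. The input trees and seeds are invented.

```python
def leaves(tree):
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; None if no taxon survives."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if len(kids) > 1:
        return tuple(kids)
    return kids[0] if kids else None


def canonical(tree):
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def best_combination(fast, seed, trees, gamma):
    """Most frequent combined topology with frequency >= gamma, or None."""
    taxa = leaves(fast) | leaves(seed)
    unmarked = [t for t in trees if is_present(fast, t) and is_present(seed, t)]
    best, cutoff = None, gamma
    while unmarked and len(unmarked) >= gamma * len(trees):
        combo = canonical(restrict(unmarked[0], taxa))  # next reference tree
        rest = [t for t in unmarked if canonical(restrict(t, taxa)) != combo]
        freq = (len(unmarked) - len(rest)) / len(trees)
        if freq >= cutoff:
            best, cutoff = combo, freq  # cutoff update is an assumption
        unmarked = rest                 # trees agreeing with combo are marked
    return best


def in_order_combination(seeds, trees, gamma):
    """Grow a FAST from every starting seed; keep the largest result."""
    best = None
    for i, start in enumerate(seeds):
        fast = start
        for j, seed in enumerate(seeds):  # seeds in decreasing frequency order
            if j == i:
                continue
            combo = best_combination(fast, seed, trees, gamma)
            if combo is not None:
                fast = combo
        if best is None or len(leaves(fast)) >= len(leaves(best)):
            best = fast
    return best


trees = [
    ((("A", "B"), ("C", "D")), "E"),
    ((("A", "E"), "B"), ("C", "D")),
    (("A", "B"), (("C", "E"), "D")),
    ((("A", "C"), "B"), ("D", "E")),
]
seeds = [("A", "B"), ("C", "D"), ("A", "E")]  # each has frequency 1.0 here
result = in_order_combination(seeds, trees, gamma=0.75)
```

In this toy input the seed containing the inconsistently placed taxon E cannot be merged with the other two without dropping below the cutoff, so the reported FAST covers A, B, C and D only.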

#### Minimum overlap combination

The purpose of combining seeds is to construct a FAST that is large in size. Our in-order combination approach (Section “In-order combination”) aimed to maximize the frequency of the combined seeds. In this section, we develop our second approach, named *Minimum Overlap Combination*. This approach picks seeds so that their combination produces as large a subtree as possible. We elaborate on this approach next.

When we combine two seeds, the size of the resulting tree becomes at least as big as the size of each of these seeds. Formally, let *S*_{1} and *S*_{2} be two seeds (i.e., trees). Let *L*_{1} and *L*_{2} be the sets of taxa contained in *S*_{1} and *S*_{2}, respectively. We denote the size of a set, say *L*_{1}, with |*L*_{1}|. The size of the tree resulting from the combination of *S*_{1} and *S*_{2} is |*L*_{1}| + |*L*_{2}| − |*L*_{1} ∩ *L*_{2}|. For a given fixed seed size, the first two terms of this formulation remain unchanged regardless of the seed. The last term determines the growth in the size of the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa.

Our second approach follows from the observation above. We introduce a criterion called the *overlap* between two subtrees, defined as the number of taxa common to both. Our minimum overlap combination approach works the same as Algorithm 1, with a minor difference in selecting the seed *S*_{j} that will be combined with the current temporary FAST (i.e., FAST’). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST’ among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it uses the frequency as the tie breaker and chooses the one with the largest frequency among those.
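The seed selection rule of the minimum overlap approach can be sketched as a single `min` over the unconsidered seeds. The function name `pick_min_overlap` and the data layout (each seed given as a pair of its taxon set and its frequency) are assumptions made for this example, and the taxon sets and frequencies are invented.

```python
def pick_min_overlap(fast_taxa, candidates):
    """Choose the seed with the fewest taxa in common with the current FAST.

    `candidates` is a list of (taxon_set, frequency) pairs for the unconsidered
    frequent seeds. Ties on overlap are broken by the larger frequency.
    """
    return min(candidates, key=lambda seed: (len(fast_taxa & seed[0]), -seed[1]))


fast_taxa = frozenset({"a1", "a2", "a3"})
candidates = [
    (frozenset({"a1", "a2", "a4"}), 0.9),  # overlap 2
    (frozenset({"a3", "a5", "a6"}), 0.7),  # overlap 1
    (frozenset({"a1", "a7", "a8"}), 0.8),  # overlap 1, higher frequency
]
chosen_taxa, chosen_freq = pick_min_overlap(fast_taxa, candidates)

# Size after combining: |L1| + |L2| - |L1 & L2|
new_size = len(fast_taxa) + len(chosen_taxa) - len(fast_taxa & chosen_taxa)
```

Here the most frequent seed is passed over because it shares two taxa with the current FAST, and the tie between the two seeds with overlap 1 is resolved in favor of the more frequent one.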

#### Phase three: Post-processing

So far we described how to obtain seeds (Section “Phase one: Seed generation”) and how to combine them to construct a FAST (Section “Phase two: Seed combination”). The two approaches we developed for combining seeds aim to maximize the size of the FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing a maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e., false negatives). As a result, no combination of seeds will lead to that maximal FAST. Second, even if every taxon of a maximal FAST is part of at least one seed, our algorithms will reject combining such a seed with the current FAST if that seed also contains taxa that are not part of the maximal FAST (i.e., false positives).

In the post-processing phase, we tackle the above-mentioned problems. Algorithm 2 describes the post-processing phase in detail. We do this by considering, one by one, all taxa which are not already present in the *FAST*. We iteratively grow the current FAST by including one more taxon at a time if the frequency of the resulting FAST remains at least as large as the frequency cutoff *γ*. We repeat these iterations until no new taxon can be included in the FAST. Thus the resulting FAST is guaranteed to be maximal.

#### Algorithm 2 Post processing

INPUT = FAST from the seed combination phase

OUTPUT = Maximal FAST

RESULT ← FAST

**for all ***a*_{i} not in FAST **do**

CUTOFF ← *γ*

t_RESULT ← *RESULT*

**repeat**

Pick the next unconsidered tree as reference

RESULT’ ← RESULT ⊕_{T}*a*_{i}

Mark all the trees that contain RESULT’ as considered

**if** frequency of RESULT’ ≥ CUTOFF **then**

t_RESULT ← RESULT’

CUTOFF ← frequency of RESULT’

**end if**

**until** Less than *γ* × *m* unmarked reference trees are left in $\mathcal{T}$

RESULT ← t_RESULT

**end for**

**return** RESULT

We expect the post-processing step to quickly identify the taxa that have the potential to be in an MFAST but that were not considered during the seed generation and seed combination phases. At the end of the post-processing step we obtain a maximal FAST, which we report as a possible MFAST.
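The post-processing phase can be sketched as repeated single-taxon extensions. This is an illustrative reading of Algorithm 2 combined with the "repeat until no new taxon can be included" description, with hypothetical function names, nested-tuple trees, and an invented input collection.

```python
def leaves(tree):
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Contract every taxon outside `keep`; None if no taxon survives."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if len(kids) > 1:
        return tuple(kids)
    return kids[0] if kids else None


def canonical(tree):
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def is_present(sub, tree):
    taxa = leaves(sub)
    return taxa <= leaves(tree) and canonical(restrict(tree, taxa)) == canonical(sub)


def extend_once(fast, taxon, trees, gamma):
    """Best way to add `taxon` to `fast` with frequency >= gamma, or None."""
    taxa = leaves(fast) | {taxon}
    unmarked = [t for t in trees if taxon in leaves(t) and is_present(fast, t)]
    best, cutoff = None, gamma
    while unmarked and len(unmarked) >= gamma * len(trees):
        grown = canonical(restrict(unmarked[0], taxa))  # next reference tree
        rest = [t for t in unmarked if canonical(restrict(t, taxa)) != grown]
        freq = (len(unmarked) - len(rest)) / len(trees)
        if freq >= cutoff:
            best, cutoff = grown, freq
        unmarked = rest                                 # mark agreeing trees
    return best


def post_process(fast, trees, gamma):
    """Add taxa one at a time until no taxon can be added at frequency >= gamma."""
    all_taxa = set().union(*(leaves(t) for t in trees))
    grew = True
    while grew:
        grew = False
        for taxon in sorted(all_taxa - leaves(fast)):
            bigger = extend_once(fast, taxon, trees, gamma)
            if bigger is not None:
                fast, grew = bigger, True
    return fast


trees = [
    ((("A", "B"), ("C", "D")), "E"),
    ((("A", "E"), "B"), ("C", "D")),
    (("A", "B"), (("C", "E"), "D")),
    ((("A", "C"), "B"), ("D", "E")),
]
maximal = post_process(("A", "B"), trees, gamma=0.75)
```

Starting from the two-taxon FAST on A and B, the sketch adds C and then D, and it stops because adding E would drop the frequency below 0.75.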

#### Complexity analysis of our method

In this section we discuss the complexity of our method in terms of the three phases involved in it. Let $\mathcal{T}$ be a set of *m* phylogenetic trees having *n* leaves each. The complexities of the different phases of our method are as follows.

#### Phase one

Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size *k* and number of contractions *c*, each tree will contain at most $\frac{n}{k+c}$ clades, each leading to $\binom{k+c}{k}$ alternative subtrees. Thus, in total there can be up to $m \frac{n}{k+c} \binom{k+c}{k}$ seeds (possibly many of them identical) from all the trees in $\mathcal{T}$. Typically, the values of *k* and *c* are fixed and small (in our experiments we have *k* ∈ {3, 4, 5} and *c* ∈ {0, 1, 2, 3, 4, 5}), leading to *O*(*mn*) seeds.

The complexity of finding whether a seed is present in a single tree is *O*(*n* log *n*). Given that there are *m* trees in $\mathcal{T}$, the cost of computing the frequency of a single seed is *O*(*mn* log *n*). Thus, the time complexity for finding the frequency of all the seeds is this expression multiplied by the number of seeds, which is *O*(*m*^{2}*n*^{2} log *n*).
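As a worked instance of the phase one bound, the short sketch below evaluates the maximum number of potential seeds for one hypothetical parameter setting (*m* = 100 trees, *n* = 1000 taxa, *k* = 3, *c* = 2). The function name is an assumption, and integer division is used for the number of disjoint clades.

```python
from math import comb


def max_potential_seeds(m, n, k, c):
    """Upper bound m * n/(k+c) * C(k+c, k) on potential seeds, counting repeats."""
    clades_per_tree = n // (k + c)    # disjoint clades with exactly k + c taxa
    seeds_per_clade = comb(k + c, k)  # ways to keep k of the k + c taxa
    return m * clades_per_tree * seeds_per_clade


bound = max_potential_seeds(m=100, n=1000, k=3, c=2)
```

For these values each tree contributes at most 200 clades of five taxa and 10 seeds per clade, so the bound grows linearly in both *m* and *n* once *k* and *c* are fixed.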

#### Phase two

Consider a set of *p* frequent seeds that will be considered for combining in this phase. Recall that we
have two approaches to combine them. Below, we focus on each.

INORDER COMBINATION We try to combine each seed with every other seed, leading to *O*(*p*^{2}) iterations. The complexity of checking the frequency of each combined subtree is *O*(*mn* log *n*). Also, there can be up to *O*(*m*) different reference trees for guiding the combine operation. Multiplying these terms, we obtain the complexity of this phase using this approach as *O*(*p*^{2}*m*^{2}*n* log *n*).

MINIMUM OVERLAP COMBINATION The complexity of combining the frequent seeds using the minimum overlap combination
approach is very similar to the inorder approach except for an additional term. The
additional complexity is because we maintain the overlap between the subtrees. This
leads to the complexity *O*(*p*^{2}*n*^{2} + *p*^{2}*m*^{2}*n* log *n*).

#### Phase three

Here, we consider the FAST obtained from each of the *p* frequent seeds in phase two. For each FAST, we go over the taxa one by one, leading to *O*(*n*) iterations. There can be up to *O*(*γ* × *m*) references to add a taxon. So the cost of extending all *p* FASTs is *O*(*γ* × *mnp*).

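Notice that each frequent seed has to appear in at least *γ* × *m* trees. Thus, the number of unique frequent seeds *p* is bounded by $\frac{m \frac{n}{k+c} \binom{k+c}{k}}{\gamma m}$ = $\frac{n \binom{k+c}{k}}{\gamma (k+c)}$. Thus, adding the cost of all the three phases, the overall time complexity of our method using inorder combination is

$$O(m^2 n^2 \log n + p^2 m^2 n \log n + \gamma m n p)$$

That using minimum overlap combination is

$$O(m^2 n^2 \log n + (p^2 n^2 + p^2 m^2 n \log n) + \gamma m n p)$$

In the two summations above, the second term (the cost of phase two) is asymptotically larger than the first and the last terms. Thus, we can simplify the asymptotic time complexity of inorder and minimum overlap combinations as

$$O(p^2 m^2 n \log n)$$

and

$$O(p^2 n^2 + p^2 m^2 n \log n)$$

respectively.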