Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX, USA

Department of Computer Science, Rice University, Houston, TX, USA

Abstract

Background

Maximum likelihood has been widely used for over three decades to infer phylogenetic trees from molecular data. When reticulate evolutionary events occur, several genomic regions may have conflicting evolutionary histories, and a phylogenetic network may provide a more adequate model for representing the evolutionary history of the genomes or species. A maximum likelihood (ML) model has been proposed for this case and accounts for both mutation within a genomic region and reticulation across the regions. However, neither the performance of this model in inferring information about reticulate evolution nor the properties that affect this performance have been studied.

Results

In this paper, we study the effect of the evolutionary diameter and height of a reticulation event on its identifiability under ML. We find that both, particularly the diameter, have a significant effect. Further, we find that the number of genes (which can be generalized to the concept of "non-recombining genomic regions") that are transferred across a reticulation edge affects its detectability. Finally, a fundamental challenge with phylogenetic networks is that they allow an arbitrary level of complexity, giving rise to the model selection problem. We investigate the performance of two information criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), for addressing this problem. We find that BIC performs well in general for controlling the model complexity and preventing ML from grossly overestimating the number of reticulation events.

Conclusion

Our results demonstrate that BIC provides a good framework for inferring reticulate evolutionary histories. Nevertheless, the results call for caution when interpreting the accuracy of the inference, especially for data sets with particular evolutionary features.

Introduction

W. Maddison proposed a likelihood framework for inferring species trees by simultaneously accounting for evolutionary events within loci (that is, mutations at the nucleotide level) and across loci (that is, gene tree incongruence)

When reticulate evolutionary events occur among species, the species phylogeny takes the shape of a

Our results show that, under the conditions we investigate, ML performs well in terms of estimating inheritance probabilities, and less so in determining the location, or placement, of reticulation edges. They also show that the diameter, the inheritance probability, and the number of gene data sets used together have a significant effect on the performance. We find that BIC, and to a lesser extent AIC, performs very well in terms of model selection and preventing ML from grossly overestimating the amount of reticulation.

Methods

Phylogenetic networks and trees

While the phylogenetic network model is general enough to allow for modeling all types of reticulate evolutionary events, such as hybrid speciation, recombination, and horizontal gene transfer (HGT), the semantics of the model change based on the specific evolutionary events allowed

**Definition 1**

• V = V_{T} ∪ V_{H}, where V_{T} (tree nodes) is the set that contains the root (node r with in-degree 0 and out-degree 2), the set V_{L} of leaves (nodes with in-degree 1 and out-degree 0), and the set V_{I} of internal nodes other than the root (nodes with in-degree 1 and out-degree 2); V_{H} (reticulation nodes) is the set of nodes with in-degree 2 and out-degree 1; and E is the set of the network's edges (we distinguish between the set E_{T} of tree edges, whose heads are tree nodes, and the set E_{H} of reticulation edges, whose heads are reticulation nodes)

• _{L }

• _{H }_{1 }_{2 }

As the name implies, the interpretation of _{H }_{1},R_{2}_{k}_{i}

Phylogenetic networks and maximum likelihood

Given a collection _{1},_{2},..., _{k }_{1}, _{2}, ..., _{k}_{i }_{i}

where **P**(_{i}**P**(

Information criteria

Given a phylogenetic network

To address this issue, we explore in this paper two information criteria, AIC

where

where
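As a point of reference, the two criteria can be sketched with their standard textbook definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, n is the sample size, and L is the maximized likelihood. How k and n are counted for a phylogenetic network in this study is not recoverable from the text above, so the parameter counts, sample size, and log-likelihood values below are hypothetical and serve only to illustrate the comparison.

```python
import math


def aic(log_likelihood: float, num_params: int) -> float:
    """Standard Akaike Information Criterion: 2k - 2 ln L."""
    return 2.0 * num_params - 2.0 * log_likelihood


def bic(log_likelihood: float, num_params: int, sample_size: int) -> float:
    """Standard Bayesian Information Criterion: k ln n - 2 ln L."""
    return num_params * math.log(sample_size) - 2.0 * log_likelihood


def select_model(candidates, sample_size: int, criterion: str = "bic") -> str:
    """Return the name of the candidate with the lowest criterion value.

    candidates: list of (name, log_likelihood, num_params) tuples.
    """
    def score(candidate):
        _, log_likelihood, num_params = candidate
        if criterion == "bic":
            return bic(log_likelihood, num_params, sample_size)
        return aic(log_likelihood, num_params)

    return min(candidates, key=score)[0]


# Hypothetical values: each extra reticulation adds parameters and
# raises the log-likelihood, but by a shrinking amount.
candidates = [
    ("tree", -1060.0, 30),
    ("one_reticulation", -1040.0, 33),
    ("two_reticulations", -1039.0, 36),
]

if __name__ == "__main__":
    for name in ("aic", "bic"):
        print(name, select_model(candidates, sample_size=1000, criterion=name))
```

In this made-up example the second reticulation improves the log-likelihood by only one unit, which is not enough to offset the penalty for its additional parameters under either criterion.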

Searching the phylogenetic network space

We implemented a heuristic search procedure that starts from an initial tree _{1}, then all networks obtained from _{1 }by adding a single reticulation node, and so on. When analyzing a real data set, _{1}, _{1}) and (_{2}, _{2}), subdivides each edge into two edges of equal length (each of the two edges is half the length of the original edge that was subdivided), such that we have (_{1}, _{1}), (_{1}, _{1}), (_{2}, _{2}), and (_{2}, _{2}), and finally, it adds a reticulation edge between _{1 }and _{2 }(in either direction). It is important to note that in this procedure, when the pair of edges is picked for adding a reticulation node, cycles are excluded, as well as reticulation edges between two tree edges emanating from the same node ("sibling edges"). In our search procedure, we begin with a tree (the species tree), and then consider networks with higher numbers of reticulation nodes. The set of all networks with **P**(_{i}
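The edge-addition step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the network is assumed to be stored as a dictionary mapping (parent, child) pairs to branch lengths, the new node names `x1` and `x2` are hypothetical, and the length assigned to the new reticulation edge (zero by default) is an assumption, since the text does not state it.

```python
from collections import defaultdict

Edge = tuple[str, str]


def reachable(edges: dict[Edge, float], source: str, target: str) -> bool:
    """Return True if a directed path leads from source to target."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    return False


def add_reticulation(edges, donor_edge, recipient_edge,
                     x1="x1", x2="x2", ret_length=0.0):
    """Subdivide two edges at their midpoints and connect the new nodes.

    Returns a new edge dictionary with a reticulation edge x1 -> x2.
    Swap donor_edge and recipient_edge to get the opposite direction.
    """
    (u1, _), (u2, _) = donor_edge, recipient_edge
    if donor_edge == recipient_edge:
        raise ValueError("two distinct edges are required")
    if u1 == u2:
        raise ValueError("sibling edges are excluded")
    new_edges = dict(edges)
    for (u, v), x in ((donor_edge, x1), (recipient_edge, x2)):
        half = new_edges.pop((u, v)) / 2.0  # each half keeps half the length
        new_edges[(u, x)] = half
        new_edges[(x, v)] = half
    if reachable(new_edges, x2, x1):
        raise ValueError("this reticulation edge would create a cycle")
    new_edges[(x1, x2)] = ret_length
    return new_edges


# Small hypothetical tree: ((A, B), C) with root r and internal node i.
tree = {("r", "i"): 1.0, ("r", "C"): 2.0, ("i", "A"): 1.0, ("i", "B"): 1.0}

if __name__ == "__main__":
    network = add_reticulation(tree, ("i", "A"), ("r", "C"))
    for edge, length in sorted(network.items()):
        print(edge, length)
```

After the call, `x2` has in-degree 2 and out-degree 1, which matches the definition of a reticulation node, while `x1` remains a tree node.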

To put it all together, given a phylogenetic network

where (^{∗}, ^{∗}) is identified, the phylogenetic network ^{∗ }to ^{∗}.

Results

In this section, we investigate the effects of topological properties of reticulation events on the performance of an ML approach to phylogenetic network inference. Further, we study the performance of ML in terms of estimating the inheritance probabilities from sequence data, and then investigate how the two information criteria perform in terms of estimating the number of reticulation events in a data set. For the synthetic data we analyze here, we used the PhyloGen program

Effect of the diameter and height of reticulation events

Consider a set _{1}, _{2},..., _{100 }down the 16-taxon tree _{i}


**Effect of the diameter of an HGT edge on the change in the likelihood score**. The diameter of an HGT edge from node

For our second experiment, we generated data as above, yet scored the probabilities of the sequence data on trees that differ from the true underlying tree in a single reticulation event that varies across trees in terms of its height. Unlike the diameter, the height does not seem to have much of an effect on the probabilities beyond the decrease as compared to the probability of the sequences on the true tree (height 0). Results are omitted due to space limitations.

Performance of ML in determining the placement and probability of reticulation edges

In our second set of experiments, we set out to investigate how ML performs in terms of identifying the location of a reticulation edge as well as the inheritance probability that indicates the fraction of genes (non-recombining regions) that were transferred across that edge. We considered three independent evolutionary scenarios, each involving a single reticulation edge of a certain diameter, as shown in Fig. _{1}, which is formed by adding only reticulation edge 1 to the underlying tree _{1}, where _{1 }differs from _{2}, which is formed by adding only reticulation edge 2 to the underlying tree _{2}, where _{2 }differs from _{3}, which is formed by adding only reticulation edge 3 to the underlying tree _{3}, where _{3 }differs from


**(a) Three evolutionary histories, each involving the same underlying tree (black lines) and a single reticulation edge from the set of three reticulation edges 1, 2, and 3**. The diameters of the three reticulation edges 1, 2, 3 are 0.5, 1.0, and 1.5, respectively. (b,c) The performance of ML for estimating the inheritance probabilities on data simulated with a single reticulation event. The genome size corresponds to the number of gene data sets used in the inference. Each panel contains three segments, corresponding to three different values of true inheritance probabilities: 0.1, 0.3, and 0.5. The inheritance probabilities _{e }

To answer the two questions, we generated sequence data as follows: For an inheritance probability _{i}_{i}

To investigate how ML performs in terms of estimating the inheritance probability, we fixed all elements of the model and only inferred the inheritance probability. That is, in this part, we assumed knowledge of the correct placement of the reticulation edge, and inferred the value of its associated

There are several points to make. The diameter of the reticulation edge has a great effect on the accuracy of the estimated probabilities. For the largest diameter (

For studying the performance of ML in terms of placing the postulated reticulation edges, we used the data generated as described above along with the underlying (species) tree, as shown in Fig. _{i }_{i }_{i }_{i }_{i}

The accuracy of the placement of the inferred reticulation edge in terms of the RF distance

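| Diameter | Inheritance probability | Genome size 10 | Genome size 20 | Genome size 40 | Genome size 80 |
|---|---|---|---|---|---|
| 0.5 | 0.1 | 0.6 | 0 | 0 | 0 |
| 0.5 | 0.3 | 0 | 0 | 0 | 0 |
| 0.5 | 0.5 | 0 | 0 | 0 | 0 |
| 1 | 0.1 | 2.3 | 2.6 | 1.2 | 0.3 |
| 1 | 0.3 | 1.2 | 0.1 | 0 | 0 |
| 1 | 0.5 | 0.2 | 0 | 0 | 0 |
| 1.5 | 0.1 | 5.6 | 5.7 | 5.6 | 5.5 |
| 1.5 | 0.3 | 5.0 | 3.6 | 2.3 | 1.7 |
| 1.5 | 0.5 | 3.0 | 3.2 | 1.5 | 0 |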

The genome size corresponds to the number of gene data sets used in the inference. The three diameters correspond to the three networks of Fig. 2.
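For readers unfamiliar with the error measure in the table, the RF (Robinson-Foulds) distance between two rooted trees can be sketched as the number of clusters (sets of leaves below an internal node) found in one tree but not the other. The sketch below assumes trees written as nested tuples of leaf names and reports the unnormalized count. How the study derived the trees being compared from the true and inferred networks, and whether it normalized the distance, is not recoverable from the text, so this illustrates the measure only.

```python
def clusters(tree) -> set:
    """Collect the non-trivial clusters of a rooted tree.

    A tree is either a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def leaves_below(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    all_leaves = leaves_below(tree)
    found.discard(all_leaves)  # the root cluster is shared by all trees
    return found


def rf_distance(tree_a, tree_b) -> int:
    """Unnormalized RF distance: size of the symmetric difference of clusters."""
    return len(clusters(tree_a) ^ clusters(tree_b))


# Two hypothetical 4-taxon caterpillar trees that differ in one grouping.
t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")

if __name__ == "__main__":
    print(rf_distance(t1, t1))  # identical trees
    print(rf_distance(t1, t2))  # clusters {A,B} and {A,C} are unshared
```

A distance of 0 means the two trees contain exactly the same clusters; larger values indicate more disagreement.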

The results show a very strong effect of the diameter of the true reticulation event on the postulated placement of the inferred one. Holding the inheritance probability and genome size constant, we observe a significant increase in the error as the diameter increases. For example, when using 10 genes and with inheritance probability of 0.1, the error in the placement of the reticulation event increases from 0.6 for diameter 0.5 to 5.6 for diameter 1.5. The same trend holds across all parameter values. This result indicates that confidence in the placement of an inferred reticulation event based on ML decreases as the diameter of the inferred event increases. On the more positive side, and with the exception of diameter 1.5 and inheritance probability of 0.1, increasing the number of genes drastically improves the accuracy of the placement. It is not surprising that for

These results highlight an important issue in detecting reticulations using ML. If reticulation is a hybridization or hybrid speciation event, where a large number of genes may be exchanged or transferred across a reticulation edge (that is, a high value of

Model selection under ML and the performance of information criteria

Now that we have explored the effect of diameter on the performance of ML in terms of estimating the placement of reticulation edges along with their associated probabilities, we turn our attention to a most crucial issue with this model, as well as with phylogenetic networks in general, namely model selection. Here, we will investigate how ML does in estimating the correct number of reticulation edges and how, when augmented with information criteria, it performs. Let us denote by

In our first experiment, we set out to investigate how both criteria perform when the data set has no reticulations. We used an experimental setup as above, where we generated 50 sequence data sets based on the (species) tree of Fig.

We now turn our attention to the case of a single reticulation, yet with three different diameters and three different inheritance probabilities, as shown in Fig.

These results, combined with the analysis above, indicate that inspecting both the change in the likelihood score itself and the information criterion values may be valuable in determining, for real data sets, the true number of reticulations. An important trend to notice is that the improvement in the likelihood score decreases when overestimated reticulations are added. Further, the inheritance probability has a clear effect on the performance: the higher the probability, the greater the improvement in the likelihood score, especially as compared to the improvements when overestimating. This again points to the conclusion that it is easier to detect hybridization or hybrid speciation events, where many genes support a reticulation edge, than horizontal gene transfer events involving a very small number of genes.

Results on a biological data set

Unlike synthetic data, where the full evolutionary history is known, biological data sets pose several challenges, including the often unknown evolutionary history. In this section, we analyze a 15-taxon dataset of plastids, cyanobacteria, and proteobacteria, which is a subset of the dataset considered by


**Results on the rbcL gene data set**. (Left) The underlying species tree, as reported in

Discussion

In this paper, we studied the performance of ML for identifying reticulation events from sequence data, based on the formulation given in Eq. (1). We showed through simulation studies that the evolutionary diameter and, to a lesser extent, the height of a reticulation edge affect the performance in terms of estimating the inheritance probability (which reflects the proportion of genes transferred across a reticulation edge) and postulating a placement for the reticulation edge. We showed that increasing the number of genes improves the performance as well. We then investigated the performance of two information criteria, AIC and BIC, and found that BIC in general performs well in terms of model selection and preventing ML from overestimating the number of reticulation edges. Both AIC and BIC produced reasonable results on a biological data set. In this paper, we simulated data on "caterpillar" trees. We will conduct analyses that use other tree shapes to study whether the results hold there as well.

It is important to stress again that the framework, as given by Eq. (1), that we investigated here assumes reticulation as the only source of heterogeneity in the evolution of the sequence data. However, in practice, other events may take place and the model needs to be modified accordingly. In particular, if events such as

Another issue that is of great significance when dealing with reticulation is taxon sampling. As we showed above, the location of the donor node has a significant impact on the detectability of a reticulation edge. When analyzing data sets in practice, particularly prokaryotic data, it may easily be the case that the true donor of the horizontally transferred gene is not in the data set being analyzed. Therefore, beyond our findings here about the power of ML to infer the placement of a reticulation edge, one has to be cautious about interpreting the placement of a computationally inferred reticulation edge.

A third issue is that while the term reticulation encompasses all types of evolutionary events that are not vertical, there is a clear distinction between, for example, the exchange of a genomic region through homologous recombination in bacteria and a hybrid speciation event that gives rise to a new species in plants. The amount of genetic material transferred across a reticulation edge in the latter case is much larger than in the former. In a phylogenomic study involving thousands of gene families, identifying a reticulation edge that might have been used in the transfer of a single gene might be confounded by the overwhelming vertical signal supported by the remaining genes. Consequently, more confidence can be associated with inferences in cases where a large number of genes support a reticulation edge.

When gene trees are estimated with confidence, one can replace Eq. (1) by _{i }**P**(_{i }| N, γ

Finally, we showed in this manuscript that if the improvement ratio in the likelihood score by adding a reticulation edge is beyond

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed equally.

Acknowledgements

This work was supported in part by NSF grant DBI-1062463, grant R01LM009494 from the National Library of Medicine, and an Alfred P. Sloan Research Fellowship to L.N., and by the Shared University Grid at Rice funded by NSF under Grant EIA-0216467, and a partnership between Rice University, Sun Microsystems, and Sigma Solutions, Inc. The contents are solely the responsibility of the authors and do not necessarily represent the official views of the NSF, National Library of Medicine, the National Institutes of Health, or the Alfred P. Sloan Foundation.

This article has been published as part of