Research article

The taming of an impossible child: a standardized all-in approach to the phylogeny of Hymenoptera using public database sequences

Ralph S Peters1*, Benjamin Meyer2, Lars Krogmann3, Janus Borner4, Karen Meusemann1, Kai Schütte5, Oliver Niehuis1 and Bernhard Misof1

Author Affiliations

1 Zoologisches Forschungsmuseum Alexander Koenig, Adenauerallee 160, D-53113 Bonn, Germany

2 Institut für Systemische Neurowissenschaften, Universitätsklinikum Hamburg-Eppendorf, Martinistrasse 52, D-20246 Hamburg, Germany

3 Staatliches Museum für Naturkunde Stuttgart, Rosenstein 1, D-70191 Stuttgart, Germany

4 Zoologisches Institut der Universität Hamburg, Martin-Luther-King-Platz 3, D-20146 Hamburg, Germany

5 Zoologisches Museum Hamburg, Martin-Luther-King-Platz 3, D-20146 Hamburg, Germany

BMC Biology 2011, 9:55  doi:10.1186/1741-7007-9-55

Published: 18 August 2011



Enormous amounts of molecular sequence data have accumulated over the past several years and continue to grow exponentially as sequencing techniques become faster and cheaper. There is high and widespread interest in using these data for phylogenetic analyses. However, the amount of data that one can retrieve from public sequence repositories is virtually impossible to tame without dedicated software that automates the processing steps. Here we present a novel bioinformatics pipeline for downloading, formatting, filtering and analyzing public sequence data deposited in GenBank. It combines some well-established programs with numerous newly developed software tools (available at webcite).


We used the bioinformatics pipeline to investigate the phylogeny of the megadiverse insect order Hymenoptera (sawflies, bees, wasps and ants) by retrieving and processing more than 120,000 sequences and by selecting subsets under the criteria of compositional homogeneity and defined levels of density and overlap. We reconstructed the tree with a partitioned maximum likelihood analysis of a supermatrix with more than 80,000 sites and more than 1,100 species. In the inferred tree, consistent with previous studies, "Symphyta" is paraphyletic. Within Apocrita, our analysis suggests a topology of Stephanoidea + (Ichneumonoidea + (Proctotrupomorpha + (Evanioidea + Aculeata))). Despite the huge amount of data, we identified several persistent problems in the Hymenoptera tree. Data coverage is still extremely low, and additional data have to be collected to reliably infer the phylogeny of Hymenoptera.
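
To make the notions of density and overlap concrete, the following is a minimal illustrative sketch and not the pipeline's actual code. It assumes a simple working definition: density is the fraction of filled cells in a taxon-by-gene presence matrix, and overlap is the fraction of taxon pairs that share at least one gene. The function names, the `min_genes` threshold and the example taxa are hypothetical, and the definitions used in the study may differ.

```python
from itertools import combinations


def density(presence):
    """Fraction of taxon-by-gene cells that contain a sequence.

    `presence` maps each taxon name to the set of genes sequenced for it
    (an assumed, simplified stand-in for a supermatrix).
    """
    genes = {g for gene_set in presence.values() for g in gene_set}
    total_cells = len(presence) * len(genes)
    filled_cells = sum(len(gene_set) for gene_set in presence.values())
    return filled_cells / total_cells if total_cells else 0.0


def pairwise_overlap(presence):
    """Fraction of taxon pairs that share at least one gene."""
    pairs = list(combinations(sorted(presence), 2))
    if not pairs:
        return 0.0
    sharing = sum(1 for a, b in pairs if presence[a] & presence[b])
    return sharing / len(pairs)


def filter_taxa(presence, min_genes):
    """Keep only taxa with at least `min_genes` genes (hypothetical criterion)."""
    return {t: g for t, g in presence.items() if len(g) >= min_genes}


# Toy data set: four placeholder taxa and four commonly sequenced markers.
example = {
    "taxon_A": {"COI", "28S", "EF1a"},
    "taxon_B": {"COI", "28S"},
    "taxon_C": {"EF1a"},
    "taxon_D": {"18S"},
}

subset = filter_taxa(example, min_genes=2)
print(density(example), pairwise_overlap(example))
print(density(subset), pairwise_overlap(subset))
```

In this toy case, dropping the two sparsely sampled taxa raises both values, which illustrates the trade-off between taxon sampling and matrix completeness that subset selection has to balance.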


While we applied our bioinformatics pipeline to Hymenoptera, we designed the approach to be as general as possible. With this pipeline, it is possible to produce phylogenetic trees for any taxonomic group and to monitor new data and tree robustness in a taxon of interest. It therefore has great potential to meet the challenges of the phylogenomic era and to deepen our understanding of the tree of life.