Reconstructing the phylogeny of organisms, the tree of life, is one of the major goals in biology and is essential for research in other biological disciplines ranging from evolutionary biology and systematics to biological control and conservation. In phylogenetics, molecular characters have become an indispensable tool, since they can be collected in a standardized and automated way. This is indicated by the exponential growth of published data, with a current doubling time of approximately 30 months [1
] and by the massive acceleration in data generation expected over the next several years. The sequencing of expressed sequence tags (ESTs), complete genomes and countless single-gene fragments has resulted in enormous, yet highly incomplete and unbalanced, data sets accessible via public databases such as the National Center for Biotechnology Information (NCBI) GenBank, the European Molecular Biology Laboratory (EMBL) and the DNA Database of Japan (DDBJ).
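To make the doubling-time figure concrete, the following minimal sketch converts it into growth factors over longer periods. It assumes idealized exponential growth with a constant doubling time of 30 months; the constant rate and the function name `growth_factor` are illustrative assumptions and are not part of the cited estimate.

```python
def growth_factor(months: float, doubling_time_months: float = 30.0) -> float:
    """Multiplicative increase in data volume after `months`.

    Assumes idealized exponential growth with a constant doubling time,
    which is a simplification made for illustration only.
    """
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    # Print the growth factor after 1, 5 and 10 years.
    for years in (1, 5, 10):
        print(f"{years:>2} yr: x{growth_factor(12 * years):.2f}")
```

Under this assumption, the amount of published data grows by roughly a third per year and 16-fold per decade.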
The accumulation of new data is, of course, important, but the potential of the currently available data for phylogenetic analysis has not yet been sufficiently explored. McMahon and Sanderson [2
], Sanderson et al.
] and Thomson and Shaffer [4
] have published attempts to retrieve molecular data from public databases and to process them for phylogenetic analysis. However, these approaches, while valuable and trend-setting, did not offer thorough solutions; they call for extension, improvement and updating in terms of generalization, detail, analysis and degree of automation. Any new approach must deal with data scarcity, poor data overlap, nonstationary substitution processes, base compositional heterogeneity and data quality deficits. In this study, we address these problems with a newly developed bioinformatics pipeline. We use a large exemplar taxon for which far more than 100,000 sequences have been published and show that comprehensive analyses have the potential to deliver new results that were not obtainable from any of the included data sets alone.
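The notions of data scarcity and poor data overlap can be illustrated with a small sketch. The example below is not the pipeline presented in this study; the taxon-to-gene assignments, the gene labels and both function names are hypothetical and serve only to show how the completeness of a taxon-by-gene matrix and the number of taxon pairs without any shared gene might be quantified.

```python
from itertools import combinations

# Hypothetical toy data: which gene fragments are available for which taxon.
genes_by_taxon = {
    "Apis": {"COI", "28S", "EF1a"},
    "Nasonia": {"COI", "28S"},
    "Xyela": {"COI"},
    "Orussus": {"18S"},
}


def matrix_completeness(genes_by_taxon):
    """Fraction of filled cells in the taxon-by-gene presence matrix."""
    all_genes = set().union(*genes_by_taxon.values())
    filled = sum(len(genes) for genes in genes_by_taxon.values())
    return filled / (len(genes_by_taxon) * len(all_genes))


def taxon_pairs_without_overlap(genes_by_taxon):
    """Pairs of taxa that share no gene and cannot be compared directly."""
    return [
        (a, b)
        for a, b in combinations(sorted(genes_by_taxon), 2)
        if not genes_by_taxon[a] & genes_by_taxon[b]
    ]


if __name__ == "__main__":
    print(matrix_completeness(genes_by_taxon))
    print(taxon_pairs_without_overlap(genes_by_taxon))
```

In this toy matrix fewer than half of the cells are filled, and one taxon shares no gene with any other, which is the kind of imbalance that concatenated analyses of public data have to cope with.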
As an exemplary taxon, we chose the insect order Hymenoptera, which comprises prominent groups such as bees, ants and wasps, the latter including the overwhelming armada of parasitoid species [5
]. The Hymenoptera seem well-suited to demonstrate the power of our approach, since the taxon is megadiverse and offers a number of phylogenetic challenges, including many unresolved relationships and well-known problems that are associated with so-called long-branch taxa and rapid radiations (see, for example, [6
]). For a long time, comparatively few authors attempted to resolve the phylogenetic relationships of the major lineages of Hymenoptera (see, for example, [9
]). In recent years, however, interest and effort in solving higher-level relationships within the Hymenoptera have notably increased and led to the publication of an extensive analysis based exclusively on morphological characters [17
], a study using complete mitochondrial genomes [18
], a supertree approach using previously published trees [19
], a phylogenetic estimate based on EST data [20
] and a taxon-rich four-gene study [21
]. In the past five years, complete nuclear genomes of several Hymenoptera species have been sequenced. Most noteworthy in this context are the genomes of the honey bee Apis mellifera
] and the jewel wasp Nasonia vitripennis, with its sibling species N. giraulti and N. longicornis
]. These genomes contributed significantly to the amount of sequence data available for Hymenoptera. However, their number is still too small to profitably augment phylogenetic analyses.
Overall, only a few phylogenetic hypotheses on major lineages within Hymenoptera are generally accepted. These are as follows: (1) "Symphyta" (sawflies) are paraphyletic, with the absence of the constriction between the first and second abdominal segments (that is, the wasp waist) as a symplesiomorphic character; (2) Apocrita (wasp-waisted wasps) are monophyletic (see, for example, [24
]); (3) Xyelidae are sister group to all other Hymenoptera (see, for example, [25
]); (4) Orussidae are sister group to Apocrita (see, for example, [17
]) and (5) Aculeata (stinging wasps; Apoidea, Chrysidoidea and Vespoidea) are monophyletic (see, for example, [28
]). In addition, most of the 22 currently recognized superfamilies are presumed to be monophyletic (see [29
] for a synopsis). Numerous relationships within Hymenoptera remain unresolved. Among the most intriguing are the phylogeny of the major lineages within Apocrita, in particular the identity of the sister group of Aculeata, and the monophyly and phylogeny of Proctotrupomorpha sensu Rasnitsyn 1988 [13
] (Chalcidoidea, Cynipoidea, Diaprioidea, Mymarommatoidea, Platygastroidea and Proctotrupoidea).
In this study, we present a standardized, fast and transparent bioinformatics pipeline to collect, filter and analyze public sequence data deposited in GenBank. The pipeline is designed to be generally applicable in terms of taxa, genes and the variety of potential users. We apply this pipeline to sequences of Hymenoptera and discuss our results against the background of current hypotheses on two selected questions: the phylogeny of the major lineages within Apocrita and the monophyly and phylogeny of Proctotrupomorpha. Additionally, we use the results to diagnose persistent problems in the hymenopteran tree. Finally, we illustrate the merit of being able to easily generate trees from available sequence data at a time when data sets are accumulating at an ever-increasing speed.