Conceived and designed the experiments: KRP ACM. Performed the experiments: KRP. Analyzed the data: KRP ACM. Contributed reagents/materials/analysis tools: KRP. Wrote the paper: KRP ACM. Implementation: KRP LR.
Metagenome sequencing is becoming common and there is an increasing need for easily accessible tools for data analysis. An essential step is the taxonomic classification of sequence fragments. We describe a web server for the taxonomic assignment of metagenome sequences with PhyloPythiaS. PhyloPythiaS is a fast and accurate sequence composition-based classifier that utilizes the hierarchical relationships between clades. Taxonomic assignments with the web server can be made with a generic model, or with sample-specific models that users can specify and create. Several interactive visualization modes and multiple download formats allow quick and convenient analysis and downstream processing of taxonomic assignments. Here, we demonstrate usage of our web server by taxonomic assignment of metagenome samples from an acidophilic biofilm community of an acid mine and of a microbial community from cow rumen.
A metagenome sequence sample is obtained by sequencing the DNA of a mixture of microorganisms from an environment of interest. Identification of the taxonomic affiliation of DNA sequences, either for individual reads or assembled contigs, is an essential step prior to further analysis, such as characterization of the functional and metabolic capabilities of the sequenced microbial community. Various taxonomic assignment methods exist, which can be divided into three categories: sequence composition-based, sequence alignment-based and hybrid methods. Sequence composition-based methods use short substrings (k-mers) to represent a sequence as a vector of fixed length, which is used to assess similarity among sequences. Such a representation is known as a “genomic signature” and is more conserved between evolutionarily close species than between distant species. Sequence alignment and phylogeny-based methods use sequence similarity as a measure of evolutionary relatedness between sequences. This approach is computationally more expensive than sequence composition-based analysis, and thus requires more hardware resources for analysis of large datasets. Hybrid methods combine information from both sequence composition and alignment to assess similarity between sequences. From another perspective, taxonomic assignment methods can be categorized as either unsupervised or supervised methods. Unsupervised methods cluster the sequences based on a similarity measure and then assign a taxonomic affiliation to the clusters. Supervised methods, on the other hand, infer a taxonomic model using sequences of known taxonomic origin, which is then used for taxonomic assignment of novel metagenome sequences.
Given that sufficient reference data for modeling are available, supervised methods are likely to be more accurate in taxonomic assignment than clustering techniques, as the effect of non-taxonomic signals, such as guanine and cytosine strand biases, on taxonomic assignment is minimized during model induction.
Recently we developed a new method, PhyloPythiaS, which is a successor to the previously published software PhyloPythia. PhyloPythiaS exhibits high prediction accuracy and allows a rapid analysis of datasets with several hundred mega-bases or giga-bases. PhyloPythiaS was benchmarked on simulated and real data sets and shows good predictive performance. PhyloPythiaS shows notably reduced execution times in comparison to MEGAN and PhymmBL (85-fold and 106-fold, respectively, on a 13 Mb assembled metagenome sample), as no similarity searches are performed against large databases. It also shows better predictive performance on both simulated and real metagenome samples, in particular when only a limited amount of reference sequence (approximately 100 kb) is available for a particular species. While all methods perform less favorably for short fragments than for fragments of 1 kb in length or more, similarity-based assignment with MEGAN has the lowest error rate for short fragments. PhyloPythiaS is freely available for non-commercial users and can be installed on a Linux-based machine.
PhyloPythiaS can be used in two different modes – generic and sample-specific. The generic model is suitable for the analysis of a metagenome sample, if no further information on the sample's taxonomic composition or relevant reference data are available. Assignment accuracy can be improved by creation and use of a sample-specific model, which includes clades for the abundant sample population that are inferred from the appropriate reference sequences. A sample-specific model is inferred from public sequence data combined with sequences with known taxonomic affiliation identified from the metagenome sample, along with a customized taxonomy. If a better match to the taxa in the metagenome sample is achieved, sample-specific models exhibit higher predictive accuracy, and have improved resolution to low-ranking clades and higher coverage in terms of assigned sequences compared to the generic model. Accurate assignments can be obtained based on ~100 kb of reference sequence for a modeled sample population .
Here we present a web server that provides web-based access to PhyloPythiaS for taxonomic sequence assignment. The underlying functionality of the software is as we have described it before. For researchers with limited computational resources or who are not familiar with command line usage under Unix/Linux, web servers provide computational resources and a graphical user interface for convenient use. Furthermore, they allow a visual presentation of results for a quick overview and exploration of data sets. Several web servers for taxonomic assignment are available, such as the MG-RAST, WebCARMA and the Naïve Bayes Classification (NBC) web servers. Our server is unique in that it provides the ability to construct and use sample-specific models, besides enabling assignment with generic models. We illustrate taxonomic metagenome assignment with the generic and sample-specific modes of the web server by analyzing metagenome samples of an acidophilic biofilm community from acid mine drainage (AMD) and of a cow rumen microbial community.
We demonstrate the functionality of the web server based on a taxonomic assignment of two metagenome sequence samples. For performance analysis, we assessed the consistency and taxonomic distance of assignments, as defined previously. A prediction for a sequence fragment was considered to be consistent if the fragment was assigned either to the correct clade or to a parental clade of the correct taxonomic label. The consistency was measured as the percentage of sequence fragments or base-pairs assigned consistently. Higher consistency in terms of assigned base-pairs than number of sequence fragments indicates that longer sequences are classified more consistently than short ones. As the consistency also considers assignments to parental clades to be correct, it is a ‘coarse’ performance measure. As a more ‘fine grained’ performance measure, we also calculated the taxonomic distance, based on the geodesic distance between the correct and predicted nodes in the reference taxonomy. Taken together, these two measures provide a good qualitative assessment. A well performing method will produce assignments with both high consistency and low taxonomic distance. For clades of the analyzed samples, the average values of consistencies and taxonomic distances over the corresponding sequences are reported, in addition to average values for the entire sample. We also calculated accuracy values for clades at the genus-level and for higher taxonomic ranks, both in terms of the fraction of sequences and based on the fraction of base-pairs correctly assigned.
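The two measures described above can be illustrated with a minimal sketch. It assumes that the taxonomy is stored as a child-to-parent mapping and that the geodesic distance is counted as the number of edges on the path between the two nodes; the toy taxonomy, the function names and the example lengths are hypothetical and are not taken from the PhyloPythiaS implementation.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def is_consistent(predicted, correct, parent):
    """True if the prediction is the correct clade or one of its parental clades."""
    return predicted in path_to_root(correct, parent)


def taxonomic_distance(predicted, correct, parent):
    """Number of edges between the predicted and the correct node (assumed metric)."""
    steps_from_correct = {n: i for i, n in enumerate(path_to_root(correct, parent))}
    for steps_from_predicted, node in enumerate(path_to_root(predicted, parent)):
        if node in steps_from_correct:
            return steps_from_predicted + steps_from_correct[node]
    raise ValueError("nodes do not share a common ancestor")


def consistency(assignments, parent):
    """assignments: list of (predicted, correct, length_bp) tuples.

    Returns the consistent fraction of fragments and of base-pairs.
    """
    ok = [(p, c, n) for p, c, n in assignments if is_consistent(p, c, parent)]
    total_bp = sum(n for _, _, n in assignments)
    return len(ok) / len(assignments), sum(n for _, _, n in ok) / total_bp


# Toy taxonomy with intermediate ranks omitted (illustration only).
PARENT = {
    "Ferroplasma": "Euryarchaeota", "Euryarchaeota": "Archaea", "Archaea": "root",
    "Leptospirillum": "Nitrospirae", "Nitrospirae": "Bacteria", "Bacteria": "root",
}
EXAMPLE = [("Euryarchaeota", "Ferroplasma", 3000), ("Bacteria", "Ferroplasma", 1000)]
frag_consistency, bp_consistency = consistency(EXAMPLE, PARENT)
```

In this toy example the longer fragment is assigned consistently and the shorter one is not, so the base-pair consistency exceeds the fragment consistency, which mirrors the interpretation given in the text.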
We used our web server for the taxonomic assignment of a well-studied metagenome of an acidophilic biofilm community, sequenced with Sanger sequencing technology. The AMD community comprises five abundant species: Ferroplasma Types I and II, a Thermoplasmatales species (all Euryarchaeota), and Leptospirillum sp. Group II and III of the phylum Nitrospirae. The test scaffolds for the AMD metagenome were downloaded from the IMG/M portal (http://img.jgi.doe.gov/, taxon object ID 2001200000). These data comprise 1183 scaffolds and ~10.83 Mb of DNA sequence. Draft genome assemblies, comprising 908 scaffolds overall, were created using sequencing coverage and nucleotide composition for the five populations of the AMD sample; the genome assemblies were then deposited at NCBI (accession numbers CH003520–CH004435). We mapped the AMD scaffolds to these reference assemblies with BLASTN and used the best match, i.e. the hit with the lowest E-value, for each scaffold of the AMD data set as an estimate of its correct taxonomic affiliation.
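The best-match selection described above can be sketched as follows. The sketch assumes that the BLASTN results are available in the standard 12-column tabular format (BLAST+ `-outfmt 6`), in which the E-value is the eleventh column; the function name, the tie handling (the first hit encountered is kept) and the example rows are illustrative assumptions rather than the procedure actually used in the study.

```python
def best_hits(tabular_lines, max_evalue=None):
    """Return {query: (subject, evalue)} keeping the lowest E-value hit per query.

    Assumes BLAST tabular columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if max_evalue is not None and evalue > max_evalue:
            continue  # optional E-value cut-off
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Made-up rows for illustration; the identifiers are hypothetical.
ROWS = [
    "scaffold_1\trefA\t99.1\t1200\t10\t0\t1\t1200\t5\t1204\t1e-150\t2100",
    "scaffold_1\trefB\t85.0\t300\t45\t0\t1\t300\t9\t308\t2e-20\t150",
    "scaffold_2\trefC\t97.0\t800\t24\t0\t1\t800\t1\t800\t0.0\t1400",
]
HITS = best_hits(ROWS, max_evalue=0.1)
```

Each scaffold then inherits the taxonomic label of the reference assembly it maps to, which serves as the estimate of its correct affiliation.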
We compared the PhyloPythiaS generic and sample-specific model assignments with predictions from the NBC web server (http://nbc.ece.drexel.edu/), MEGAN and the best BLASTN hit approach of MG-RAST (see Text S1). As MG-RAST and WebCARMA incorporate AMD sequences as reference data, a comparative evaluation by direct submission to these servers would not have ensured strict separation of the reference data and test data. Taxonomic scaffold assignments with PhyloPythiaS and the other tested methods were evaluated based on draft genome assemblies for the five strains and the fluorescence in situ hybridization cell counts published in the original AMD study (Figure 1d, e).
The PhyloPythiaS generic model returned the assignments in less than 5 minutes. Most scaffolds were assigned to high taxonomic ranks (taxonomic assignments are shown in Figure 1a, base-pair accuracy is given in Table 1; see Figures S1, S6 and Table S1). As no reference data were available in model construction for the sample populations, this was expected. Euryarchaeota were identified, but many scaffolds were assigned to the phyla Proteobacteria and Verrucomicrobia, instead of to Nitrospirae. The generic model assignments were similar to those of BLASTN in terms of population abundance (Figure S3). In contrast, the NBC web server overestimated the abundance of Firmicutes and underestimated that of Euryarchaeota (Figure 1f, Figures S4 and S5).
For assignment using a sample-specific model, we randomly selected ~100 kb of contiguous sequences from the five populations as sample-specific training sequences. Specifically, the five strains and corresponding amounts of sample-specific data used were 70 kb for Leptospirillum sp. Group III, 100 kb for Ferroplasma acidarmanus Type I, 100 kb for Leptospirillum sp. Group II '5-way CG', 100 kb for Ferroplasma sp. Type II and 70 kb for Thermoplasmatales archaeon Gpl (G-plasma). Construction of the sample-specific model took slightly less than 7 hours. The Newick tree and sample-specific data used to train the model are available on the web server as exemplary data. Assignments with this model (Figure 1b, c and Figure S2) agree well with the taxonomic makeup of this dataset. Both the generic and sample-specific models of PhyloPythiaS produced assignments that were taxonomically consistent and closer to the draft assemblies than those of the BLASTN approach, MEGAN and the NBC server (Table 2, Figure S6). Low scaffold consistency for the Leptospirillum sp. Group II '5-way CG' population (0.76) accompanied by low taxonomic distance between correct and predicted taxonomic affiliations (1.73) suggests that there was a certain degree of ‘back-and-forth’ in assignments between the Leptospirillum clades. In contrast, assignments for the Ferroplasma populations showed high scaffold consistency (>0.95) and higher taxonomic distance between correct and predicted affiliation (>3.7), suggesting that assignments were made to higher ranks (Table S1).
We furthermore performed taxonomic assignments for 26,042 metagenomic scaffolds (568 Mbp) of a microbial community adherent to switchgrass incubated in a bovine rumen with a twofold objective: first, to demonstrate usage of the server on a large dataset and, second, to verify usability of the method for sequences generated by Illumina sequencing technology. The data were downloaded from the DOE Joint Genome Institute website (ftp://ftp.jgi-psf.org/pub/rnd2/Cow_Rumen/). The majority of the scaffolds were found to have no similarity to sequenced genomes in the original study, suggesting uncharacterized microbes as their origin. We submitted the scaffolds to the web server in the generic mode as a multiplex sample and visualized the combined predictions. The majority of the scaffolds were assigned to the orders Bacteroidales, Clostridiales, Bacillales, Spirochaetales, Methanomicrobiales, Methanosarcinales, Sulfolobales, Selenomonadales and Rhizobiales (Figure 2).
Fifteen near-complete ‘genome bins’ of abundant populations from five orders were identified in the original study from the cow rumen sample, based on analysis of tetranucleotide frequency and assembly information. We used these genome bins, comprising 466 scaffolds overall, as the correct taxonomic affiliation for comparison with the taxonomic assignments of PhyloPythiaS. The partial genome bins published in the original article are not guaranteed to be entirely correct, but provide a qualitative reference point, as they were generated based on multiple sources of information and verified by in-depth human inspection. We measured the assignment consistency as the number of base-pairs of these scaffolds consistently assigned by the PhyloPythiaS generic model to the order-level clades of the respective genome bins. Taxonomic distances of the predictions were calculated relative to the reported orders for the genome bins (Table 3). Overall, the generic model made consistent assignments for the majority of scaffolds. In particular, this was the case for genome bins of order-level clades with substantial numbers of reference genomes available, while assignment consistency was lower for clades covered by fewer reference genomes. Seven of the 15 bins were more than 90% consistent, four of them 100% consistent. Five bins showed low consistency. In particular, we observed that the Clostridiales and Myxococcales genome bins were less consistent than bins of the other three orders. For Myxococcales this is likely because fewer sequenced genomes were available for training of the generic model (given the number of species with sequenced genomes for all five clades). For the Clostridiales, this might be due to genomic differences between the species represented by the genome bins and the sequenced Clostridiales genomes used as reference (mean GC content of 50% versus a mean GC content of 36%). 
However, regardless of the exact nature of the assigned taxonomic affiliation, scaffolds of a particular bin tended to be homogeneously assigned to the same clade by the generic model, varying from 44% to 100% of the scaffolds for the different bins. The predictive accuracy of the overall assignment can likely be further improved by construction of a sample-specific model, as we showed for the AMD sample.
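The homogeneity of assignments within a bin can be quantified, for example, as the fraction of a bin's scaffolds that share the most frequent assigned clade. The following is a minimal sketch of that calculation; the function name and the example labels are hypothetical, and the exact computation behind the reported 44% to 100% range is not specified in the text.

```python
from collections import Counter


def bin_homogeneity(assigned_clades):
    """Return (most frequent clade, fraction of scaffolds assigned to it)."""
    if not assigned_clades:
        raise ValueError("bin contains no scaffolds")
    clade, count = Counter(assigned_clades).most_common(1)[0]
    return clade, count / len(assigned_clades)


# Made-up assignments for the scaffolds of one genome bin.
EXAMPLE_BIN = ["Bacteroidales", "Bacteroidales", "Bacteroidales", "Clostridiales"]
TOP_CLADE, FRACTION = bin_homogeneity(EXAMPLE_BIN)
```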
We provide a web server for taxonomic assignment of metagenome sequences with PhyloPythiaS. Software updates and custom-made models will be easily accessible to the community through the web server. Our server is unique in that it provides, in addition to generic models, the ability to build and use sample-specific models. The sample-specific mode allows additional sequences to be incorporated as a reference and relevant clades to be defined for a given community, e.g. based on accompanying 16S rRNA sample surveys. By taxonomic assignment of the AMD metagenome sample, we have shown how creation of such a sample-specific model allowed us to increase the coverage, resolution and accuracy of taxonomic assignments, with only a small amount (~100 kb) of reference data being used. Due to computational limitations, no cross-validation for estimation of the hyperparameters is provided for sample-specific model construction, but our experiments show that default parameters produce accurate assignments on both simulated and real metagenome samples. Furthermore, the assignments can be visualized and downloaded through an easy-to-use interactive interface. For the AMD metagenome, we found BLASTN (the strategy implemented by the MG-RAST server) to perform similarly to the generic model (both had an accuracy of >86% at the domain level), and the sample-specific model to show considerably improved assignment accuracy, in particular for lower taxonomic ranks. The NBC server mis-assigned a considerable fraction of the sequences and had an accuracy of ~45% at domain level. MEGAN performed well on these data in terms of specificity, but showed lower sensitivity. To demonstrate use of the server and generic model for exploratory analysis of a large metagenome sample generated with the Illumina sequencing technology, we assigned scaffolds from the cow rumen metagenome in the generic mode. 
This showed high assignment consistency for the majority of the genome bins in comparison to the manually refined reference binning of the original study.
With many high-throughput sequencing technologies being developed, it is important to assess how taxonomic assignment methods cope with the different technology-specific errors and read lengths. The technologies produce reads of different lengths and qualities, potentially affecting the performance of taxonomic assignment methods. We have previously shown that PhyloPythiaS works well with assembled contigs from Sanger and Roche/454 sequencing technologies using metagenome samples from the Tammar wallaby gut and from the guts of obese human twins, respectively. In the current study, we analyzed two datasets, one sequenced with Sanger and another with Illumina sequencing technology. We found that regardless of the technology used, both of these datasets were characterized consistently. We expect the web server models to work equally well with assembled sequence data from other technologies with similar sequencing error rates, such as the SOLiD (Applied Biosystems) platform. It should be noted that the performance of PhyloPythiaS on sequence fragments with high error rates is still unexplored. Furthermore, we advise that short reads should be assembled into longer contigs before submitting them to the PhyloPythiaS web server. Although the server produces assignments for short sequences (<1000 bp), as with other methods, these assignments are less accurate than those for longer sequences and are often made to higher ranking taxa only. For scientists without access to large computing resources or familiarity with Unix/Linux, our server provides a novel, easily accessible resource for taxonomic assignment of metagenome sequence fragments.
The PhyloPythiaS model, referred to in short as the model hereafter, consists of an ensemble of structural support vector machines (SSVM). Each SSVM is induced using a sequence-composition derived input space and a taxonomy-based output space. By default, the model comprises six SSVMs, each induced on an input space that has been derived from training fragments of a different length: 1 kb, 3 kb, 5 kb, 10 kb, 15 kb and 50 kb. The input space is a combination of counts of substrings of length 4, 5 and 6 (k-mers), normalized based on the fragment length. The output space for each SSVM, defined by the taxonomy, is the same. At prediction time, a test fragment is classified using an ensemble of at most three SSVMs, namely those built with fragments of the same length as the test fragment or longer.
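As an illustration of the input space and of the ensemble selection described above, the following sketch computes a combined 4-, 5- and 6-mer count vector and picks the SSVMs eligible for a test fragment. Several details are assumptions, since the text does not specify them: counts are normalized by dividing by the fragment length, k-mers containing ambiguous bases are skipped, and when more than three SSVMs qualify, the three with the shortest training fragment lengths are chosen. All names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"
TRAINING_LENGTHS = (1000, 3000, 5000, 10000, 15000, 50000)  # the six default SSVMs


def kmer_features(sequence, ks=(4, 5, 6)):
    """Concatenated k-mer counts for each k, divided by the fragment length."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    features = []
    for k in ks:
        index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
        counts = [0] * len(index)
        for start in range(len(seq) - k + 1):
            i = index.get(seq[start:start + k])
            if i is not None:  # k-mers with non-ACGT characters are ignored
                counts[i] += 1
        features.extend(c / len(seq) for c in counts)
    return features


def select_models(fragment_length, max_models=3):
    """Training lengths of the SSVMs used for a test fragment (assumed rule).

    Returns an empty list for fragments longer than the longest training
    length; how such fragments are handled is not stated in the text.
    """
    eligible = [n for n in TRAINING_LENGTHS if n >= fragment_length]
    return eligible[:max_models]


VECTOR = kmer_features("ACGT" * 10)
```

With k = 4, 5 and 6 the vector has 256 + 1024 + 4096 = 5376 entries regardless of the fragment length, which is what makes fragments of different sizes comparable in one input space.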
PhyloPythiaS has two modes: generic and sample-specific. The generic mode uses a model trained with publicly available prokaryotic genomes and the taxonomy available at NCBI (http://www.ncbi.nlm.nih.gov/). Bacterial and archaeal taxa of seven major taxonomic ranks (species, genus, family, order, class, phylum and domain), for which at least three sequenced genomes were available, were included in the reference taxonomy. The genomes were mapped to the lowest corresponding taxa of the model taxonomy and equal amounts of non-overlapping sequence fragments were selected for each taxon to create a training data set for each SSVM.
Lack of appropriate reference data can cause taxonomic assignments to be either of low resolution (i.e. assignments to high ranking taxa) or inaccurate. There are two reasons why the appropriate reference data might be lacking. First, the vast majority of microbial diversity has not been cultured and sequenced, and therefore metagenome samples often represent novel species for which no sequences of closely related organisms are available in public databases. Second, although genomic signature is informative for species and higher-level taxonomic clades, it is also known that sequence characteristics are dependent upon environmental factors. In this case, the genomic signature of the organisms in the metagenome sample can deviate from the genomic signature of the evolutionarily close organisms available in public databases. A sample-specific model (i.e. a model that includes training data from the metagenome sample itself in addition to public data) is better suited in such scenarios. By including sample-specific sequences and taxonomy in the training of the SSVMs, the dataset shift problem can be reduced. Suitable sample-specific training sequences can be obtained from the metagenome sample itself, based on sequence homology to 16S rRNA or other phylogenetic marker genes, or by targeted sequencing of fosmids with such marker genes. Trained with appropriate reference data, PhyloPythiaS allows the accurate assignment of sequence fragments with lengths of more than 1 kb, and is particularly well suited for the analysis of assembled sequence datasets. For shorter fragments, there is a loss in sensitivity, particularly at lower taxonomic ranks, which is a trend observed for all taxonomic assignment methods.
As previously described, the web server can be used in two different modes: generic or sample-specific. The generic mode accepts sequences as a multi-FASTA file of up to 100 Mb in size and performs taxonomic assignments using a generic model. The generic model is constructed from prokaryotic genome sequences available at NCBI and models sufficiently covered clades from domain to species level (see Introduction). The sample-specific mode allows the user to specify the clades for a model and upload representative sequences for construction of a user-defined model. In this mode, the user has to provide three files: (1) a tree file: a plain text file with NCBI identifiers for the clades to be modeled or a rooted Newick tree with non-negative integer node names; (2) a sample-specific fasta file: a multi-FASTA file with sample-specific sequences, where each sequence header must contain a valid node identifier X as “label:X”; and (3) a prediction fasta file: a multi-FASTA file with the sequences for which taxonomic assignments are to be made. The sample-specific data provided by the user are pooled with the reference data used for the generic model to build a model with default parameters as described previously. This model is then used for taxonomic assignment of the test sequences provided in the prediction fasta file.
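Before uploading, it can be useful to check locally that the sample-specific fasta file is compatible with the tree file. The sketch below covers only the Newick variant of the tree file and assumes a simple tree without quoted labels or comments; the only header convention taken from the description above is the “label:X” tag, and everything else (function names, example tree, example headers) is hypothetical and unrelated to the server's own validation.

```python
import re


def newick_node_ids(newick):
    """Collect the integer node names of a simple (unquoted) Newick tree."""
    ids = set()
    for token in re.split(r"[(),;]", newick):
        name = token.split(":")[0].strip()  # drop an optional branch length
        if not name:
            continue
        if not re.fullmatch(r"[0-9]+", name):
            raise ValueError(f"node name is not a non-negative integer: {name!r}")
        ids.add(int(name))
    return ids


def check_sample_headers(fasta_lines, node_ids):
    """Return a list of problems found in the headers of a multi-FASTA file."""
    problems = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line
        match = re.search(r"label:([0-9]+)", line)
        if match is None:
            problems.append(f"missing label: {line.strip()}")
        elif int(match.group(1)) not in node_ids:
            problems.append(f"label not in tree: {line.strip()}")
    return problems


# Hypothetical tree and headers for illustration.
TREE = "((11,12)10,(21)20)1;"
FASTA = [">contig_a label:11", "ACGTACGT", ">contig_b label:99", "GGCC", ">contig_c", "TTAA"]
PROBLEMS = check_sample_headers(FASTA, newick_node_ids(TREE))
```

Running such a check first avoids waiting for a model construction job that would fail on a mislabeled header.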
The generic and sample-specific models produce output in the same format. The output page shows an assignments table with a maximum of 100 entries, as well as a pie chart and the model taxonomy. The pie chart shows the abundance of the taxa and can be interactively changed to visualize different taxonomic ranks and to display either the number of sequences or number of bases. The taxonomy shows the modeled tree along with the assignment information for each node. The taxonomy can be interactively changed to display either the taxonomic identifiers or the NCBI scientific names. This allows the user to easily visualize the distribution of the assignments over the taxonomy. Every node in the tree contains additional information, such as the number of sequences/bases assigned to the node or its subtree. Additionally, a link is provided to obtain the sequences assigned to each node. The assignments can be downloaded, possibly with additional data, or received via email. If the server was invoked in the sample-specific mode then additional assignments on separate data can be obtained using the same model.
Metagenome samples can be larger than the upload limitations of the web server. For this reason, the ability to visualize and download combined assignments from multiple submissions for classification with the same model is provided. One uploads a large sample in the form of multiple non-overlapping FASTA files, each as a different process, and retains the corresponding process identifiers. Once all the processes are finished, the process identifiers can then be provided to the ‘multiplex-sample’ utility, which combines the predictions from all processes and generates visualizations and download files.
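Preparing a large sample for such a multiplex submission amounts to splitting the multi-FASTA file into non-overlapping parts that each stay below the upload limit. The following is a minimal sketch of that preprocessing step; it assumes ASCII FASTA content, keeps every record whole, and uses hypothetical function names. Writing the parts to files and submitting them to the server is left out.

```python
def read_records(lines):
    """Yield (header, sequence) pairs from the lines of a multi-FASTA file."""
    header, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records, max_bytes):
    """Group records into consecutive chunks whose size stays within max_bytes.

    The size of a record is estimated as header + sequence + two newlines.
    """
    chunk, size = [], 0
    for header, seq in records:
        record_size = len(header) + len(seq) + 2
        if record_size > max_bytes:
            raise ValueError(f"record exceeds the size limit: {header}")
        if chunk and size + record_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append((header, seq))
        size += record_size
    if chunk:
        yield chunk


# Tiny illustration with a 30-byte limit; a real limit would be far larger.
LINES = [">a\n", "AAAAA\n", "AAAAA\n", ">b\n", "CCCCCCCCCC\n", ">c\n", "GGGGGGGGGG\n"]
CHUNKS = list(split_records(read_records(LINES), max_bytes=30))
```

Because every record ends up in exactly one chunk, the predictions from the separate processes can be combined without duplicates.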
The PhyloPythiaS web server is freely available for non-commercial use at http://binning.bioinf.mpi-inf.mpg.de/.
Assignments for the AMD metagenome scaffolds at different taxonomic ranks by the PhyloPythiaS generic model. This model does not assign sequences to any of the genus level clades. This is expected behavior as none of the genera (Leptospirillum and Ferroplasma) were present in the generic model. The existence of Deltaproteobacteria (in Actual and Proteobacteria in Phylum) has been previously reported (reference in Text S1) and is due to the provisional assignment of Leptospirillum to the delta subdivision (reference in Text S1).
Assignments for the AMD metagenome scaffolds at different taxonomic ranks by PhyloPythiaS sample-specific model. Sample-specific data (approximately 100 kb from each of the five strains) from the two genera (Leptospirillum and Ferroplasma) was used.
Assignments for the AMD metagenome scaffolds at different taxonomic ranks by best BLASTN hit analysis. An E-value cut-off of 0.1 was used. The BLAST database comprised the same genomes that were used for creating the PhyloPythiaS generic model, i.e. all 1076 complete genomes available from NCBI as of April 2010.
Assignments for the AMD metagenome scaffolds at different taxonomic ranks by the NBC webserver. The default N-mer length of 15 and Bacteria/Archaea genomes were used. The webserver was accessed at http://nbc.ece.drexel.edu/ in April 2011.
Assignments for the AMD metagenome scaffolds fragmented at 500 bp at different taxonomic ranks by the NBC webserver. To check for the possible effect of test sequence length on the taxonomic assignment of the AMD metagenome using the NBC webserver, we created fragments of length 500 bp from the scaffolds and obtained their assignments. The default N-mer length of 15 and Bacteria/Archaea genomes were used. Bacteria were overestimated, while Archaea were underestimated. The NBC webserver was accessed at http://nbc.ece.drexel.edu/ in May 2011.
Performance of different methods at six major taxonomic ranks on the AMD data-set. All the methods except PhyloPythiaS in sample-specific mode and BLASTN made only incorrect assignments at genus and family levels. The performance measures are used as defined in Patil et al. (reference  in the main text). The methods compared are the PhyloPythiaS generic model (PPS G), PhyloPythiaS sample-specific model (PPS SS), BLAST best hit (BLASTN), MEGAN and naïve Bayesian classifier (NBC).
Taxonomic distance analysis for AMD metagenome scaffolds assignment to draft genome assemblies generated for five strains in the AMD metagenome project. The most specific assignments provided by each method were used for this analysis. The correct scaffold assignments, i.e. Population, were obtained using whole genome shotgun sequences of the five strains (three species) obtained from NCBI. The methods are PhyloPythiaS sample-specific model (PPS SS), PhyloPythiaS generic model (PPS G), BLASTN, MEGAN and naïve Bayesian classifier (NBC). The populations are Thermoplasmatales archaeon Gpl (T), Leptospirillum sp. Group III (L1), Leptospirillum sp. Group II '5-way CG' (L2), Ferroplasma acidarmanus (F1) and Ferroplasma sp. Type II (F2). The numbers in brackets after each population name show the number of correct scaffolds. The rows signify the number of assigned scaffolds (Assigned), the fraction of assignments in the same lineage as the correct taxon (Const_n_scaff), the fraction of base-pairs in the same lineage as the correct taxon (Const_n_bp) and the average taxonomic distance with respect to the draft reference genomes (Tax Dist).
We thank J. Büch and G. Friedrich for technical support in server implementation, I. Gregor for testing and J. Dröge for comments on the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Funding: KRP, LR and ACM were supported by the Max-Planck society. ACM was also supported by the Heinrich-Heine University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.