A completely new public metagenome annotation system has been developed and released. The system is the result of several years of planning and engineering. Designed to leverage the SEED microbial genome annotation platform, the mg-RAST platform provides seamless integration of metagenome data, microbial genomics, and manually curated annotations. Each metagenome project has its own requirements for stringency, datasets to be analyzed, and output format for results. The metagenomics SEED pipeline was therefore designed so that the parameters of the sequence matches underlying both the phylogenetic and metabolic reconstructions can be altered to restrict which matches are accepted. It was built using an extensible format that allows new datasets and algorithms to be integrated without recomputing existing results.
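The idea of restricting matches by adjusting parameters can be illustrated with a minimal sketch. This is not the mg-RAST implementation: the `Hit` record, the `filter_hits` function, and the default thresholds are all hypothetical and chosen only for illustration. The sketch assumes that similarity hits are stored once and that stricter or looser cutoffs are applied afterwards, which is one way a project could change stringency without recomputing the underlying searches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One stored similarity match (hypothetical record layout)."""
    read_id: str
    subject: str
    evalue: float
    align_len: int
    pct_identity: float


def filter_hits(hits, max_evalue=1e-5, min_align_len=50, min_identity=0.0):
    """Keep only hits that satisfy every user-chosen cutoff.

    The default values are illustrative assumptions, not mg-RAST defaults.
    """
    return [
        h for h in hits
        if h.evalue <= max_evalue
        and h.align_len >= min_align_len
        and h.pct_identity >= min_identity
    ]


# Example: the same stored hits viewed at two different stringencies.
stored = [
    Hit("read1", "geneA", 1e-30, 120, 95.0),
    Hit("read2", "geneB", 1e-6, 60, 70.0),
    Hit("read3", "geneC", 1e-2, 40, 55.0),
]
relaxed = filter_hits(stored)
strict = filter_hits(stored, max_evalue=1e-10, min_identity=90.0)
```

Because the filter operates on stored hits, tightening a cutoff only re-selects from existing results rather than rerunning the sequence comparison.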
The mg-RAST service handles both assembled and unassembled data. Each approach has advantages that should be considered when comparing metagenomes. For example, if one is carrying out comparative metagenomics or if statistics are being used to compare samples [18, 19], the sequences cannot be assembled, since the assembly process loses the frequency information critical for determining differences between samples. In contrast, assembled sequences tend to be longer and therefore more likely to accurately identify gene function or phylogenetic source from binning [20].
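The loss of frequency information on assembly can be shown with a small sketch. The function name `relative_abundance` and the functional labels are hypothetical, and the example assumes the simplest case, in which all reads carrying the same function collapse into a single contig.

```python
from collections import Counter


def relative_abundance(annotations):
    """Return the fraction of sequences assigned to each functional label."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Unassembled reads: each read contributes one count to its function.
reads = ["photosynthesis"] * 8 + ["respiration"] * 2

# After assembly (simplified assumption): overlapping reads merge, leaving
# one contig per function, so the 8:2 ratio is no longer visible.
contigs = ["photosynthesis", "respiration"]

read_profile = relative_abundance(reads)
contig_profile = relative_abundance(contigs)
```

In this toy case the read-level profile preserves the 80% to 20% split between the two functions, whereas the contig-level profile reports them as equally abundant, which is why statistical comparisons between samples rely on unassembled data.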
The analytical methods integrated into the pipeline provide core annotation and analysis tools to compare and contrast a diverse set of metagenomes [21-24]. The approach underlying the subsystems-based functional analysis of metagenomes has been validated with 90 different samples from nine major biomes. The analysis demonstrated that the biomes could clearly be separated by their functional composition [25]. All of the metagenomes present in that study are included in the publicly available datasets visible in the mg-RAST server.
Although the service contains core functionality for the annotation and analysis of metagenomes, many of the techniques traditionally used for genome analysis (e.g., approaches for the prediction of coding sequences) either do not work with metagenomes or show a significant performance degradation [26]. Many of the differences between complete genome annotation and metagenome annotation are reminiscent of those encountered previously with the analysis of expressed sequence tags [27]. Therefore, new analytical methods are needed to fully understand metagenomics data. The most obvious problem is the large number of unknown sequences in any sample. Depending on the specific sample processed, as few as 10% or as many as 98% of the sequences may have no known similarity to anything in the database [28]. We and others are developing new binning, clustering, and coding region prediction tools to handle these unknown sequences, and effective tools will be incorporated into the pipeline when available. Another problem is that sequence data are being generated faster than computational speed is increasing, and therefore improvements in common search algorithms are required to ensure that sequence space can be accurately and efficiently searched. A third problem, common to all annotation platforms, is that metabolic reconstructions and analyses are dependent on the underlying quality of the data. The SEED has the most consistent and accurate microbial genome annotations of any publicly available source because of the subsystems approach to annotation. However, the SEED subsystems are necessarily focused on core metabolism and pathogenesis of a select few organisms. Comprehensive subsystem coverage of secondary metabolism, and especially of metabolism specific to diverse environments, is required to truly comprehend those data sets.