Traditional and classical methods of genomics and microbiology allow researchers to study an individual microbial species obtained from the environment by isolating the organism into pure colonies using microbial culture techniques. However, this approach cannot capture the structure of the broader microbial community within the environmental sample, the relative representation of multiple genomes, and their interaction with each other and with the environment. Additionally, a large number of microbial species are very difficult, or impossible, to culture
in vitro in the laboratory setting. The development of next-generation sequencing has advanced the field of metagenomics by enabling scientists to simultaneously study multiple genomes recovered directly from an environmental sample, thereby bypassing the need for microbial isolation through culturing (see
[1] for a review).
In a metagenomic experiment, a sample is usually taken from a natural (e.g., soil and seawater) or a host-associated (e.g., human gut) environment containing micro-organisms organized into communities or microbiomes. DNA is extracted from the environmental sample containing a mixture of multiple genomes and then sequenced without prior separation. The resulting dataset comprises millions of mixed sequence reads from the multiple genomes contained in the sample. Traditionally, DNA has been sequenced using Sanger sequencing technology
[2] and the reads generated are routinely 800–1000 base pairs long. However this technology is extremely cumbersome and costly. Recently next-generation sequencers, e.g., Illumina/Solexa, Applied Biosystems' SOLiD, and Roche's 454 Life Sciences sequencing systems, have emerged as the future of genomics with incredible ability to rapidly generate large amounts of sequence data
[3],
[4]. These new technologies greatly facilitate high-throughput while lowering the cost of metagenomic studies. However, the reads generated are of much shorter length making reads assembly and alignment more challenging. For example, Illumina/Solexa and SOLiD generate reads ranging between 35–100 base pairs while Roche 454 reads are approximately 100–400 base pairs in length.
One goal of metagenomic studies is to identify what genomes are contained in the environmental sample and to estimate their relative abundance. Identification of genomes is complicated by the mixed nature of multiple genomes in the sample. A widely used approach is assigning the sequence reads to NCBI’s taxonomy tree based on sequence read homology alignment with known sequences catalogued in reference databases. The sequence reads are first aligned to the reference sequence databases using a sequence comparison program such as BLAST
[5]. Reads which have hits in the database are then assigned to the taxonomy tree based on the best match or multiple high-scoring hits. The challenge of this approach is that hits may be found in multiple genomes for a single read at a given threshold of bit-score or Expect value, due to sequence homology and overlaps associated with similarity among species. Strategy of weighting similarities for multiple BLAST hits has been used to estimate the relative genomic abundance and average size
[6]. Another representative and stand-alone analysis tool, MEGAN
[7], assigns a read with hits in multiple genomes to their lowest common ancestor (LCA) in the NCBI taxonomy tree. Thus assignments of reads to different ranks of taxonomy tree depend on what threshold for bit-score or Expect value is used. Furthermore, MEGAN assigns reads one at a time. As a consequence, the results have less false positives but lack specificity. Various methods have been proposed to improve the taxonomic assignment of reads by assigning more reads to the lower ranks of taxonomy tree
[8]–
[12]. In particular, CARMA3
[10] which is BLAST-based but not LCA-based, uses reciprocal search technique as in SOrt-ITEMS
[13] to reduce the number of hits and hence further improves the accuracy of the taxonomic classification.
In this paper, we propose a statistical approach, TAMER, for taxonomic assignment of metagenomic sequence reads. In this approach we first identify a list of candidate genomes using homology searches. A mixture model is then employed to estimate the proportion of reads generated by each candidate genome. Finally, instead of assigning reads one at a time to the taxonomy tree as done by LCA-based methods, reads are assigned to the genomes in a global manner by taking into account both sequence alignment scores and estimated proportion of reads generated by each genome. The proposed method is comprehensively tested on simulated metagenomic data with diverse complexity of microbial community structure and with various read length and also applied to several real world metagenomic datasets. Compared with other available algorithms and tools designated for metagenomic analysis, the proposed approach demonstrates greater accuracy in identification and quantification of multiple genomes in a given sample.