With the astonishing rate at which genomic and metagenomic sequence data sets are accumulating, there are many reasons to constrain the data analyses. One approach to such constrained analyses is to focus on select subsets of gene families that are particularly well suited for the tasks at hand. Such gene families have generally been referred to as “marker” genes. We are particularly interested in identifying and using such marker genes for phylogenetic and phylogeny-driven ecological studies of microbes and their communities (e.g., construction of species trees, phylogeny-based assignment of metagenomic sequence reads to taxonomic groups, phylogeny-based assessment of alpha- and beta-diversity of microbial communities from metagenomic data). We therefore refer to these as PhyEco (for phylogenetic and phylogenetic ecology) markers. The dual use of these PhyEco markers means that we needed to develop and apply a set of somewhat novel criteria for identification of the best candidates for such markers. The criteria we focused on included universality across the taxa of interest, ability to be used to produce robust phylogenetic trees that reflect as much as possible the evolution of the species from which the genes come, and low variation in copy number across taxa.
We describe here an automated protocol for identifying potential PhyEco markers from a set of complete genome sequences. The protocol combines rapid searching, clustering and phylogenetic tree building algorithms to generate protein families that meet the criteria listed above. We report here the identification of PhyEco markers for different taxonomic levels including 40 for “all bacteria and archaea”, 114 for “all bacteria” (greatly expanding on the ∼30 commonly used), and 100s to 1000s for some of the individual phyla of bacteria. This new list of PhyEco markers should allow much more detailed automated phylogenetic and phylogenetic ecology analyses of these groups than was previously possible.
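Two of the criteria named above, universality and low copy-number variation, can be illustrated with a small filter over clustered protein families. This is a minimal sketch rather than the protocol itself: the function name `candidate_markers`, the input layout, and both thresholds are assumptions made for illustration, and the third criterion (robust, species-tree-like phylogenies) is not covered because it requires tree building.

```python
from collections import Counter


def candidate_markers(family_members, genomes,
                      min_universality=0.9, max_mean_copies=1.1):
    """Return families that are near-universal and near-single-copy.

    family_members: dict mapping a family id to a list of genome ids,
                    one entry per gene copy found in that genome.
    genomes:        set of genome ids that define the taxa of interest.
    The two thresholds are hypothetical defaults, not values from the text.
    """
    n_genomes = len(genomes)
    selected = []
    for family, members in family_members.items():
        # Count gene copies per genome, ignoring genomes outside the set.
        copies = Counter(g for g in members if g in genomes)
        if not copies:
            continue
        universality = len(copies) / n_genomes            # fraction of genomes hit
        mean_copies = sum(copies.values()) / len(copies)  # copies per genome hit
        if universality >= min_universality and mean_copies <= max_mean_copies:
            selected.append(family)
    return sorted(selected)
```

A family present once in every genome passes both tests, whereas a family with paralogs in some genomes, or one missing from many genomes, is dropped.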
It is unlikely that there is any single objective measure of merit, so research assessment therefore requires new multivariate metrics that reflect the context of research, regardless of discipline.
Because both subjective post-publication review and the number of citations are highly error prone and biased measures of merit of scientific papers, journal-based metrics may be a better surrogate.
The assessment of scientific publications is an integral part of the scientific process. Here we investigate three methods of assessing the merit of a scientific paper: subjective post-publication peer review, the number of citations gained by a paper, and the impact factor of the journal in which the article was published. We investigate these methods using two datasets in which subjective post-publication assessments of scientific publications have been made by experts. We find that there are moderate, but statistically significant, correlations between assessor scores, when two assessors have rated the same paper, and between assessor score and the number of citations a paper accrues. However, we show that assessor score depends strongly on the journal in which the paper is published, and that assessors tend to over-rate papers published in journals with high impact factors. If we control for this bias, we find that the correlation between assessor scores and between assessor score and the number of citations is weak, suggesting that scientists have little ability to judge either the intrinsic merit of a paper or its likely impact. We also show that the number of citations a paper receives is an extremely error-prone measure of scientific merit. Finally, we argue that the impact factor is likely to be a poor measure of merit, since it depends on subjective assessment. We conclude that the three measures of scientific merit considered here are poor; in particular subjective assessments are an error-prone, biased, and expensive method by which to assess merit. We argue that the impact factor may be the most satisfactory of the methods we have considered, since it is a form of pre-publication review. However, we emphasise that it is likely to be a very error-prone measure of merit that is qualitative, not quantitative.
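One common way to ask whether two measures remain correlated after controlling for a third is the first-order partial correlation. The sketch below is an illustration only and is not necessarily the analysis used in the study; the roles assigned to `x`, `y` and `z` (assessor score, citation count, and a journal-level covariate such as impact factor) are assumptions.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z.

    Hypothetical roles: x = assessor score, y = number of citations,
    z = journal-level covariate (for example, impact factor).
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

If both scores and citations are driven mainly by the journal, `pearson(x, y)` can be large while `partial_corr(x, y, z)` is close to zero, which is the pattern the abstract describes in words.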
Helicobacter pylori colonization is highly prevalent among humans and causes significant gastric disease in a subset of those infected. When present, this bacterium dominates the gastric microbiota of humans and induces antimicrobial responses in the host. Since the microbial context of H. pylori colonization influences the disease outcome in a mouse model, we sought to assess the impact of H. pylori challenge upon the pre-existing gastric microbial community members in the rhesus macaque model. Deep sequencing of the bacterial 16S rRNA gene identified a community profile of 221 phylotypes that was distinct from that of the rhesus macaque distal gut and mouth, although there were taxa in common. High proportions of both H. pylori and H. suis were observed in the post-challenge libraries, but at a given time, only one Helicobacter species was dominant. However, the relative abundance of non-Helicobacter taxa was not significantly different before and after challenge with H. pylori. These results suggest that while different gastric species may show competitive exclusion in the gastric niche, the rhesus gastric microbial community is largely stable despite immune and physiological changes due to H. pylori infection.
Artificial human gut microbial communities implanted into germ-free mice provide insights into how species-level responses to changes in diet give rise to community-level structural and functional reconfiguration and how types of bacteria prioritize use of available nutrients in vivo.
The human gut microbiota is an important metabolic organ, yet little is known about how its individual species interact, establish dominant positions, and respond to changes in environmental factors such as diet. In this study, gnotobiotic mice were colonized with an artificial microbiota comprising 12 sequenced human gut bacterial species and fed oscillating diets of disparate composition. Rapid, reproducible, and reversible changes in the structure of this assemblage were observed. Time-series microbial RNA-Seq analyses revealed staggered functional responses to diet shifts throughout the assemblage that were heavily focused on carbohydrate and amino acid metabolism. High-resolution shotgun metaproteomics confirmed many of these responses at a protein level. One member, Bacteroides cellulosilyticus WH2, proved exceptionally fit regardless of diet. Its genome encoded more carbohydrate active enzymes than any previously sequenced member of the Bacteroidetes. Transcriptional profiling indicated that B. cellulosilyticus WH2 is an adaptive forager that tailors its versatile carbohydrate utilization strategy to available dietary polysaccharides, with a strong emphasis on plant-derived xylans abundant in dietary staples like cereal grains. Two highly expressed, diet-specific polysaccharide utilization loci (PULs) in B. cellulosilyticus WH2 were identified, one with characteristics of xylan utilization systems. Introduction of a B. cellulosilyticus WH2 library comprising >90,000 isogenic transposon mutants into gnotobiotic mice, along with the other artificial community members, confirmed that these loci represent critical diet-specific fitness determinants. Carbohydrates that trigger dramatic increases in expression of these two loci and many of the organism's 111 other predicted PULs were identified by RNA-Seq during in vitro growth on 31 distinct carbohydrate substrates, allowing us to better interpret in vivo RNA-Seq and proteomics data. 
These results offer insight into how gut microbes adapt to dietary perturbations at both a community level and from the perspective of a well-adapted symbiont with exceptional saccharolytic capabilities, and illustrate the value of artificial communities.
Our intestines are populated by an almost unimaginably large number of microbial cells, most of which are bacteria. This species assemblage operates as a microbial metabolic organ, performing myriad tasks that contribute to our well-being, including processing components of our diet. The way this incredible machine assembles itself and operates remains mysterious. One approach to understanding its properties is to create artificial communities composed of a limited number of sequenced human gut bacterial species and to install them in the guts of germ-free mice that are then fed different diets. In this report, we adopt this approach. We describe the genome sequence of a new gut bacterial isolate, Bacteroides cellulosilyticus WH2, which is equipped with an unprecedented number of carbohydrate active enzymes. Deploying four different “omics” technologies, we characterize the response to diet, the relative stability, and the temporal dynamics of a 12-species artificial bacterial assemblage (including B. cellulosilyticus WH2) implanted in germ-free mouse guts. We also combine high-throughput substrate utilization screens and RNA-Seq to generate reference data analogous to a “Rosetta stone” in order to decipher what types of carbohydrates B. cellulosilyticus encounters and uses within the gut, and how it interacts with other organisms that have similar and/or distinct “professions.” This work sets the stage for future ecological and metabolic studies of more complex assemblages that more fully emulate the properties of our native gut communities.
We find that genome-wide DNA transfer by conjugation in mycobacteria affords bacteria that reproduce by binary fission the same advantages of sexual reproduction, and may explain the genomic evolution of Mycobacterium tuberculosis.
Horizontal gene transfer (HGT) in bacteria generates variation and drives evolution, and conjugation is considered a major contributor as it can mediate transfer of large segments of DNA between strains and species. We previously described a novel form of chromosomal conjugation in mycobacteria that does not conform to classic oriT-based conjugation models, and whose potential evolutionary significance has not been evaluated. Here, we determined the genome sequences of 22 F1-generation transconjugants, providing the first genome-wide view of conjugal HGT in bacteria at the nucleotide level. Remarkably, mycobacterial recipients acquired multiple, large, unlinked segments of donor DNA, far exceeding expectations for any bacterial HGT event. Consequently, conjugal DNA transfer created extensive genome-wide mosaicism within individual transconjugants, which generated large-scale sibling diversity approaching that seen in meiotic recombination. We exploited these attributes to perform genome-wide mapping and introgression analyses to map a locus that determines conjugal mating identity in M. smegmatis. Distributive conjugal transfer offers a plausible mechanism for the predicted HGT events that created the genome mosaicism observed among extant Mycobacterium tuberculosis and Mycobacterium canettii species. Mycobacterial distributive conjugal transfer permits innovative genetic approaches to map phenotypic traits and confers the evolutionary benefits of sexual reproduction in an asexual organism.
Bacteria reproduce by binary fission, generating two clones of the original; this restricts the genomic diversity of the population, which brings with it inherent evolutionary drawbacks. This problem can be eased by conjugation, which transfers DNA from a donor to a recipient bacterium. Understanding the potential of conjugal DNA transfer for generating genetic diversity is necessary for estimating gene flow through populations and for predicting rates of bacterial evolution. The influence of chromosomal conjugal DNA transfer on mycobacterial diversity has not been previously addressed. Here, we determine and compare the complete genome sequences of independent progeny from bacterial matings between defined donor and recipient strains of Mycobacterium smegmatis. We find the resulting hybrid bacteria to be extremely diverse blends of the parental strains, reminiscent of the genetic mixing that occurs through meiotic recombination in sexual organisms. This novel mechanism of conjugation can create genome-wide mosaicism in a single event, generating segments of donor DNA that range from small (∼0.05 kb) to large (∼250 kb), widely distributed around the recipient chromosome. We exploit this mixing by using genetic tools originally developed for finding mammalian disease genes to locate the genes that confer a donor phenotype in M. smegmatis. We speculate that similar genomic mosaicism observed in pathogenic mycobacteria arose from conjugation between ancestral progenitor strains.
The capacity to form endospores is unique to certain members of the low-G+C group of Gram-positive bacteria (Firmicutes) and requires signature sporulation genes that are highly conserved across members of distantly related genera, such as Clostridium and Bacillus. Using gene conservation among endospore-forming bacteria, we identified eight previously uncharacterized genes that are enriched among endospore-forming species. The expression of five of these genes was dependent on sporulation-specific transcription factors. Mutants of none of the genes exhibited a conspicuous defect in sporulation, but mutants of two, ylxY and ylyA, were outcompeted by a wild-type strain under sporulation-inducing conditions, but not during growth. In contrast, a ylmC mutant displayed a slight competitive advantage over the wild type specific to sporulation-inducing conditions. The phenotype of a ylyA mutant was ascribed to a defect in spore germination efficiency. This work demonstrates the power of combining phylogenetic profiling with reverse genetics and gene-regulatory studies to identify unrecognized genes that contribute to a conserved developmental process.
There are 10× more bacterial cells from the microbiome in our bodies than human cells. Viral DNA is known to integrate in the human genome, but the integration of bacterial DNA has not been described. Using publicly available sequence data from the human genome project, the 1000 Genomes Project, and The Cancer Genome Atlas (TCGA), we examined bacterial DNA integration into the human somatic genome. Here we present evidence that bacterial DNA integrates into the human somatic genome through an RNA intermediate, and that such integrations are detected more frequently in (a) tumors than normal samples, (b) RNA than DNA samples, and (c) the mitochondrial genome than the nuclear genome. Hundreds of thousands of paired reads support random integration of Acinetobacter-like DNA in the human mitochondrial genome in acute myeloid leukemia samples. Numerous read pairs across multiple stomach adenocarcinoma samples support specific integration of Pseudomonas-like DNA in the 5′-UTR and 3′-UTR of four proto-oncogenes that are up-regulated in their transcription, consistent with conversion to an oncogene. These data support our hypothesis that bacterial integrations occur in the human somatic genome and may play a role in carcinogenesis. We anticipate that the application of our approach to additional cancer genome projects will lead to the more frequent detection of bacterial DNA integrations in tumors that are in close proximity to the human microbiome.
There are 10× more bacterial cells in the human body, as part of the human microbiome, than there are human cells. Many of those bacteria are in constant, intimate contact with human cells. We sought to establish whether bacterial cells insert their own DNA into the human genome. Such random mutations could cause disease in the same manner that mutagens like UV rays from the sun or chemicals in cigarettes induce mutations. We detected the integration of bacterial DNA in the human genome more readily in tumors than in normal samples. In particular, extensive amounts of DNA with similarity to Acinetobacter DNA were fused to human mitochondrial DNA in acute myeloid leukemia samples. We also identified specific integrations of DNA with similarity to Pseudomonas DNA near the untranslated regulatory regions of four proto-oncogenes. This supports our hypothesis that bacterial integrations occur in the human somatic genome and may play a role in carcinogenesis. Further study in this area may provide new avenues for cancer prevention.
Here we present the draft genome of Leucobacter sp. strain UCD-THU. The genome contains 3,317,267 bp in 11 scaffolds. This strain was isolated from a residential toilet as part of an undergraduate project to sequence reference genomes of microbes from the built environment.
Leptonema illini Hovind-Hougen 1979 is the type species of the genus Leptonema, family Leptospiraceae, phylum Spirochaetes. Organisms of this family have a Gram-negative-like cell envelope consisting of a cytoplasmic membrane and an outer membrane. The peptidoglycan layer is associated with the cytoplasmic rather than the outer membrane. The two flagella of members of Leptospiraceae extend from the cytoplasmic membrane at the ends of the bacteria into the periplasmic space and are necessary for their motility. Here we describe the features of the L. illini type strain, together with the complete genome sequence and annotation. This is the first genome sequence (finished at the level of Improved High Quality Draft) to be reported from a member of the genus Leptonema and a representative of the third genus of the family Leptospiraceae for which complete or draft genome sequences are now available. The three scaffolds of the 4,522,760 bp draft genome sequence reported here, and its 4,230 protein-coding and 47 RNA genes are part of the Genomic Encyclopedia of Bacteria and Archaea project.
Gram-negative; flexible; motile; cytoplasmatic tubules; non-sporulating; axial flagella; aerobic; chemoorganotrophic; Leptospiraceae; GEBA
Turneriella parva Levett et al. 2005 is the only species of the genus Turneriella which was established as a result of the reclassification of Leptospira parva Hovind-Hougen et al. 1982. Together with Leptonema and Leptospira, Turneriella constitutes the family Leptospiraceae, within the order Spirochaetales. Here we describe the features of this free-living aerobic spirochete together with the complete genome sequence and annotation. This is the first complete genome sequence of a member of the genus Turneriella and the 13th member of the family Leptospiraceae for which a complete or draft genome sequence is now available. The 4,409,302 bp long genome with its 4,169 protein-coding and 45 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
Gram-negative; motile; axial filaments; helical; flexible; non-sporulating; aerobic; mesophile; Leptospiraceae; GEBA
The apparently complex membrane organization of Gemmata obscuriglobus, and probably all PVC superphylum members, comprises interconnected invaginations and is topologically identical to the “classical” Gram-negative bacterial membrane system.
The division of cellular space into functionally distinct membrane-defined compartments has been one of the major transitions in the history of life. Such compartmentalization has been claimed to occur in members of the Planctomycetes, Verrucomicrobiae, and Chlamydiae bacterial superphylum. Here we have investigated the three-dimensional organization of the complex endomembrane system in the planctomycete bacterium Gemmata obscuriglobus. We reveal that the G. obscuriglobus cells are neither compartmentalized nor nucleated as none of the spaces created by the membrane invaginations are closed; instead, they are all interconnected. Thus, the membrane organization of G. obscuriglobus, and most likely all PVC members, is not different from, but an extension of, the “classical” Gram-negative bacterial membrane system. Our results have implications for our definition and understanding of bacterial cell organization, the genesis of complex structure, and the origin of the eukaryotic endomembrane system.
The compartmentalization of cellular space has been an important evolutionary innovation, allowing for the functional specialization of cellular space. This compartmentalization is extensively developed in eukaryotes and although not as complex and developed, compartments with specialized function are known to occur in bacteria and can be surprisingly sophisticated. Nevertheless, members of the Planctomycetes, Verrucomicrobiae, and Chlamydiae (PVC) bacterial superphylum are exceptional in displaying diverse and extensive intracellular membranous organization. We investigated the three-dimensional organization of the complex endomembrane system in the planctomycete bacterium Gemmata obscuriglobus. We reveal that the G. obscuriglobus cells are neither compartmentalized nor nucleated, contrary to previous claims, as none of the spaces created by the membrane invaginations is topologically closed; instead, they are all interconnected. The organization of cellular space is similar to that of a classical Gram-negative bacterium modified by the presence of large invaginations of the inner membrane inside the cytoplasm. Thus, the membrane organization of G. obscuriglobus, and most likely all PVC members, is not fundamentally different from, but is rather an extension of, the “classical” Gram-negative bacterial membrane system.
Spirochaeta africana Zhilina et al. 1996 is an anaerobic, aerotolerant, spiral-shaped bacterium that is motile via periplasmic flagella. The type strain of the species, Z-7692T, was isolated in 1993 or earlier from a bacterial bloom in the brine under the trona layer in a shallow lagoon of the alkaline equatorial Lake Magadi in Kenya. Here we describe the features of this organism, together with the complete genome sequence, and annotation. Considering the pending reclassification of S. caldaria to the genus Treponema, S. africana is only the second 'true' member of the genus Spirochaeta with a genome-sequenced type strain to be published. The 3,285,855 bp long genome of strain Z-7692T with its 2,817 protein-coding and 57 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
anaerobic; aerotolerant; mesophilic; halophilic; spiral-shaped; motile; periplasmic flagella; Gram-negative; chemoorganotrophic; Spirochaetaceae; GEBA
Here we present the draft genome of an actinobacterium, Curtobacterium flaccumfaciens strain UCD-AKU, isolated from a residential carpet. The genome assembly contains 3,692,614 bp in 130 contigs. This is the first member of the Curtobacterium genus to be sequenced.
Here, we present the draft genome of Kocuria sp. strain UCD-OTCP, a member of the phylum Actinobacteria, isolated from a restaurant chair cushion. The assembly contains 3,791,485 bp (G+C content of 73%) and is contained in 68 scaffolds.
Here, we present the draft genome sequence of an actinobacterium, Dietzia sp. strain UCD-THP, isolated from a residential toilet handle. The assembly contains 3,915,613 bp. The genome sequences of only two other Dietzia species have been published, those of Dietzia alimentaria and Dietzia cinnamea.
Over 3000 microbial (bacterial and archaeal) genomes have been made publicly available to date, providing an unprecedented opportunity to examine evolutionary genomic trends and offering valuable reference data for a variety of other studies such as metagenomics. The utility of these genome sequences is greatly enhanced when we have an understanding of how they are phylogenetically related to each other. Therefore, we here describe our efforts to reconstruct the phylogeny of all available bacterial and archaeal genomes. We identified 24 single-copy, ubiquitous genes suitable for this phylogenetic analysis. We used two approaches to combine the data for the 24 genes. First, we concatenated alignments of all genes into a single alignment from which a Maximum Likelihood (ML) tree was inferred using RAxML. Second, we used a relatively new approach to combining gene data, Bayesian Concordance Analysis (BCA), as implemented in the BUCKy software, in which the results of 24 single-gene phylogenetic analyses are used to generate a “primary concordance” tree. A comparison of the concatenated ML tree and the primary concordance (BUCKy) tree reveals that the two approaches give similar results, relative to a phylogenetic tree inferred from the 16S rRNA gene. After comparing the results and the methods used, we conclude that the current best approach for generating a single phylogenetic tree, suitable for use as a reference phylogeny for comparative analyses, is to perform a maximum likelihood analysis of a concatenated alignment of conserved, single-copy genes.
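The concatenation step described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the pipeline used in the study: the function name `concatenate_alignments`, the dict-based input, and the choice to pad genomes lacking a gene with gap characters are all assumptions.

```python
def concatenate_alignments(gene_alignments, gap="-"):
    """Join per-gene alignments into one alignment per taxon.

    gene_alignments: list of dicts, one per gene, each mapping a taxon
                     name to its aligned sequence. Sequences within one
                     gene must share the same length.
    Taxa missing from a gene are padded with gap characters (an assumed
    convention; other treatments of missing data are possible).
    """
    taxa = sorted(set().union(*gene_alignments))
    parts = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

The resulting dictionary can be written out in whatever alignment format the chosen maximum likelihood program accepts; every row has the same length, equal to the sum of the individual gene alignment widths.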
Coriobacterium glomerans Haas and König 1988, is the only species of the genus Coriobacterium, family Coriobacteriaceae, order Coriobacteriales, phylum Actinobacteria. The bacterium thrives as an endosymbiont of pyrrhocorid bugs, i.e. the red fire bug Pyrrhocoris apterus L. The rationale for sequencing the genome of strain PW2T is its endosymbiotic lifestyle, which is rare among members of Actinobacteria. Here we describe the features of this symbiont, together with the complete genome sequence and its annotation. This is the first complete genome sequence of a member of the genus Coriobacterium and the sixth member of the order Coriobacteriales for which complete genome sequences are now available. The 2,115,681 bp long single replicon genome with its 1,804 protein-coding and 54 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
Gram-positive; non-motile; non-sporulating; obligatory anaerobic; chemoorganotroph; mesophile; endosymbiont; insect intestinal tract; Coriobacteriaceae; Actinobacteria; GEBA
At present, Joostella marina Quan et al. 2008 is the sole species with a validly published name in the genus Joostella, family Flavobacteriaceae, phylum Bacteroidetes. It is a yellow-pigmented, aerobic, marine organism about which little has been reported other than the chemotaxonomic features required for initial taxonomic description. The genome of J. marina strain En5T complements a list of 16 Flavobacteriaceae strains for which complete genomes and draft genomes are currently available. Here we describe the features of this bacterium, together with the complete genome sequence, and annotation. This is the first member of the genus Joostella for which a complete genome sequence has become available. The 4,508,243 bp long single replicon genome with its 3,944 protein-coding and 60 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
Gram-negative; non-motile; aerobic; mesophile; Flavobacteriaceae; Bacteroidetes; GEBA
Anaerobaculum mobile Menes and Muxí 2002 is one of three described species of the genus Anaerobaculum, family Synergistaceae, phylum Synergistetes. This anaerobic and motile bacterium ferments a range of carbohydrates and mono- and dicarboxylic acids with acetate, hydrogen and CO2 as end products. A. mobile NGAT is the first member of the genus Anaerobaculum and the sixth member of the phylum Synergistetes with a completely sequenced genome. Here we describe the features of this bacterium, together with the complete genome sequence, and annotation. The 2,160,700 bp long single replicon genome with its 2,053 protein-coding and 56 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
Gram-negative; rod-shaped; motile; flagellum; non-spore forming; anaerobic; chemoorganotrophic; crotonate-reducer; Synergistetes; Synergistaceae; GEBA
Alistipes finegoldii Rautio et al. 2003 is one of five species of Alistipes with a validly published name: family Rikenellaceae, order Bacteroidales, class Bacteroidia, phylum Bacteroidetes. This rod-shaped and strictly anaerobic organism has been isolated mostly from human tissues. Here we describe the features of the type strain of this species, together with the complete genome sequence, and annotation. A. finegoldii is the first member of the genus Alistipes for which the complete genome sequence of its type strain is now available. The 3,734,239 bp long single replicon genome with its 3,302 protein-coding and 68 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
Gram-negative; rod-shaped; non-sporulating; non-motile; mesophile; strictly anaerobic; chemoorganotrophic; Rikenellaceae; GEBA
Spirochaeta caldaria Pohlschroeder et al. 1995 is an obligately anaerobic, spiral-shaped bacterium that is motile via periplasmic flagella. The type strain, H1T, was isolated in 1990 from cyanobacterial mat samples collected at a freshwater hot spring in Oregon, USA, and is of interest because it enhances the degradation of cellulose when grown in co-culture with Clostridium thermocellum. Here we provide a taxonomic re-evaluation for S. caldaria based on phylogenetic analyses of 16S rRNA sequences and whole genomes, and propose the reclassification of S. caldaria and two other Spirochaeta species as members of the emended genus Treponema. Whereas genera such as Borrelia and Sphaerochaeta possess well-distinguished genomic features related to their divergent lifestyles, the physiological and functional genomic characteristics of Spirochaeta and Treponema appear to be intermixed and are of little taxonomic value. The 3,239,340 bp long genome of strain H1T with its 2,869 protein-coding and 59 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
obligately anaerobic; thermophilic; spiral-shaped; motile; periplasmic flagella; Gram-negative; chemoorganotrophic; Spirochaetaceae; Spirochaeta; Treponema; GEBA
Here, we present the draft genome sequence of Microbacterium sp. strain UCD-TDU, a member of the phylum Actinobacteria. The assembly contains 3,746,321 bp (in 8 scaffolds). This strain was isolated from a residential toilet as part of an undergraduate student research project to sequence reference genomes of microbes from the built environment.
Here we present the draft genome of an actinobacterium, Brachybacterium muris UCD-AY4. The assembly contains 3,257,338 bp and has a GC content of 70%. This strain was isolated from a residential bath towel and has a 16S rRNA gene 99.7% identical to that of the original B. muris strain, C3H-21.
DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction. This makes it possible to estimate the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an Empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model can greatly increase DNA modification detection accuracy and reduce the required coverage of control data. For some DNA modifications that have a strong signal, a control sample is not even needed when historical data are used as an alternative to the control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in an R package named seqPatch, which is available at https://github.com/zhixingfeng/seqPatch.
DNA modifications have been found in a wide range of living organisms, from bacteria to humans. Many existing studies have shown that they play important roles in development, disease, bacterial virulence, etc. However, for many types of DNA modification, for example N6-methyladenine and 8-oxoG, there is no efficient and accurate detection method. Single molecule real time (SMRT) sequencing not only generates DNA sequences, but also generates DNA polymerase kinetic information. The kinetic information is sensitive to DNA modifications in the sequenced DNA template, and therefore can be used for detecting a wide range of DNA modification types. The usual detection strategy is a case-control method, which compares kinetic information between a native sample and a control sample whose modifications have been removed. However, generating a control sample doubles the cost. We propose a hierarchical model that can incorporate existing SMRT sequencing data to increase detection accuracy and reduce the coverage required of the control sample, or even avoid the need for a control sample in some cases. We tested our method on SMRT sequencing data from plasmids with known modified sites and from E. coli K-12 to demonstrate that it can greatly increase detection accuracy and reduce sequencing cost.
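The idea of borrowing strength from historical data can be illustrated with a deliberately simplified normal-normal empirical Bayes calculation. This sketch is not the model implemented in seqPatch: the function names, the Gaussian assumptions, and the known observation variance `obs_var` are all assumptions chosen to keep the example short.

```python
from math import sqrt
from statistics import mean, variance


def expected_kinetics(historical_means, control_obs, obs_var):
    """Posterior mean and variance of the expected kinetic value.

    historical_means: per-dataset means for the same sequence context
                      (at least two distinct values), used to estimate
                      the prior.
    control_obs:      control-sample observations at this site (may be empty).
    obs_var:          assumed known variance of a single observation.
    """
    prior_mean = mean(historical_means)
    prior_var = variance(historical_means)
    n = len(control_obs)
    if n == 0:
        # No control data: fall back on the historical prior alone.
        return prior_mean, prior_var
    precision = 1.0 / prior_var + n / obs_var
    post_mean = (prior_mean / prior_var + sum(control_obs) / obs_var) / precision
    return post_mean, 1.0 / precision


def modification_z(native_obs, expected_mean, expected_var, obs_var):
    """Z-like score for how far the native sample departs from expectation."""
    standard_error = sqrt(expected_var + obs_var / len(native_obs))
    return (mean(native_obs) - expected_mean) / standard_error
```

With no control observations the expectation comes entirely from historical data, and as control coverage grows the estimate moves toward the control mean, which mirrors the trade-off between control coverage and historical data described above.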