In the discovery of secondary metabolites, analysis of sequence data is a promising exploration path that remains largely underutilized due to the lack of computational platforms that enable such a systematic approach on a large scale. In this work, we present IMG-ABC (https://img.jgi.doe.gov/abc), an atlas of biosynthetic gene clusters within the Integrated Microbial Genomes (IMG) system, which is aimed at harnessing the power of “big” genomic data for discovering small molecules. IMG-ABC relies on IMG’s comprehensive integrated structural and functional genomic data for the analysis of biosynthetic gene clusters (BCs) and associated secondary metabolites (SMs). SMs and BCs serve as the two main classes of objects in IMG-ABC, each with a rich collection of attributes. A unique feature of IMG-ABC is the incorporation of both experimentally validated and computationally predicted BCs in genomes as well as metagenomes, thus identifying BCs in uncultured populations and rare taxa. We demonstrate the strength of IMG-ABC’s focused integrated analysis tools in enabling the exploration of microbial secondary metabolism on a global scale, through the discovery of phenazine-producing clusters for the first time in Alphaproteobacteria. IMG-ABC strives to fill the long-existent void of resources for computational exploration of the secondary metabolism universe; its underlying scalable framework enables traversal of uncovered phylogenetic and chemical structure space, serving as a doorway to a new era in the discovery of novel molecules.
IMG-ABC is the largest publicly available database of predicted and experimental biosynthetic gene clusters and the secondary metabolites they produce. The system also includes powerful search and analysis tools that are integrated with IMG’s extensive genomic/metagenomic data and analysis tool kits. As new research on biosynthetic gene clusters and secondary metabolites is published and more genomes are sequenced, IMG-ABC will continue to expand, with the goal of becoming an essential component of any bioinformatic exploration of the secondary metabolism world.
The Genomes OnLine Database (GOLD; http://www.genomesonline.org) is a comprehensive online resource to catalog and monitor genetic studies worldwide. GOLD provides up-to-date status on complete and ongoing sequencing projects along with a broad array of curated metadata. Here we report version 5 (v.5) of the database. The newly designed database schema and web user interface support several new features, including the implementation of a four-level (meta)genome project classification system and a simplified intuitive web interface to access reports and launch search tools. The database currently hosts information for about 19 200 studies, 56 000 Biosamples, 56 000 sequencing projects and 39 400 analysis projects. More than just a catalog of worldwide genome projects, GOLD is a manually curated, quality-controlled metadata warehouse. The problems encountered in integrating disparate and varying quality data into GOLD are briefly highlighted. GOLD fully supports and follows the Genomic Standards Consortium (GSC) Minimum Information standards.
The Genomic Encyclopedia of Bacteria and Archaea (GEBA) project was launched by the JGI in 2007 as a pilot project with the objective of sequencing 250 bacterial and archaeal genomes. The two major goals of that project were (a) to test the hypothesis that there are many benefits to using the phylogenetic diversity of organisms in the tree of life as a primary criterion for generating their genome sequence and (b) to develop the necessary framework, technology and organization for large-scale sequencing of microbial isolate genomes. While the GEBA pilot project has not yet been entirely completed, both of the original goals have already been successfully accomplished, leading the way for the next phase of the project.
Here we propose taking the GEBA project to the next level, by generating high quality draft genomes for 1,000 bacterial and archaeal strains. This represents a combined 16-fold increase in both scale and speed as compared to the GEBA pilot project (250 isolate genomes in 4+ years). We will follow a similar approach for organism selection and sequencing prioritization as was done for the GEBA pilot project (i.e. phylogenetic novelty, availability and growth of cultures of type strains and DNA extraction capability), focusing on type strains as this ensures reproducibility of our results and provides the strongest linkage between genome sequences and other knowledge about each strain. In turn, this project will constitute a pilot phase of a larger effort that will target the genome sequences of all available type strains of the Bacteria and Archaea.
IMG/M (http://img.jgi.doe.gov/m) provides support for comparative analysis of microbial community aggregate genomes (metagenomes) in the context of a comprehensive set of reference genomes from all three domains of life, as well as plasmids, viruses and genome fragments. IMG/M’s data content and analytical tools have expanded continuously since its first version was released in 2007. Since the last report published in the 2012 NAR Database Issue, IMG/M’s database architecture, annotation and data integration pipelines and analysis tools have been extended to cope with the rapid growth in the number and size of metagenome data sets handled by the system. IMG/M data marts provide support for the analysis of publicly available genomes, expert review of metagenome annotations (IMG/M ER: http://img.jgi.doe.gov/mer) and Human Microbiome Project (HMP)-specific metagenome samples (IMG/M HMP: http://img.jgi.doe.gov/imgm_hmp).
Variability in the extent of the descriptions of data (‘metadata’) held in public repositories forces users to assess the quality of records individually, which rapidly becomes impractical. The scoring of records on the richness of their description provides a simple, objective proxy measure for quality that enables filtering that supports downstream analysis. Pivotally, such descriptions should spur on improvements. Here, we introduce such a measure, the ‘Metadata Coverage Index’ (MCI): the percentage of available fields actually filled in a record or description. MCI scores can be calculated across a database, for individual records or for their component parts (e.g., fields of interest). There are many potential uses for this simple metric, for example: to filter, rank or search for records; to assess the metadata availability of an ad hoc collection; to determine the frequency with which fields in a particular record type are filled, especially with respect to standards compliance; to assess the utility of specific tools and resources, and of data capture practice more generally; to prioritize records for further curation; to serve as performance metrics of funded projects; or to quantify the value added by curation. Here we demonstrate the utility of MCI scores using metadata from the Genomes OnLine Database (GOLD), including records compliant with the ‘Minimum Information about a Genome Sequence’ (MIGS) standard developed by the Genomic Standards Consortium. We discuss challenges and address the further application of MCI scores: to show improvements in annotation quality over time, to inform the work of standards bodies and repository providers on the usability and popularity of their products, and to assess and credit the work of curators. Such an index provides a step towards putting metadata capture practices and, in the future, standards compliance into a quantitative and objective framework.
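The MCI definition above (the percentage of available fields actually filled) can be illustrated with a minimal Python sketch. The field names, the sample records and the rule that None or blank strings count as unfilled are assumptions made for illustration only; they are not taken from GOLD or the MIGS checklist.

```python
from typing import Any, Dict, Mapping, Sequence

# Hypothetical field list; a real record type would define its own.
FIELDS = ["organism", "isolation_site", "latitude", "oxygen_requirement"]


def is_filled(value: Any) -> bool:
    """Assumed rule: None and blank strings count as unfilled."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    return True


def mci(record: Mapping[str, Any], fields: Sequence[str] = FIELDS) -> float:
    """MCI of one record: percentage of available fields that are filled."""
    if not fields:
        raise ValueError("at least one field is required")
    filled = sum(1 for f in fields if is_filled(record.get(f)))
    return 100.0 * filled / len(fields)


def database_mci(records: Sequence[Mapping[str, Any]],
                 fields: Sequence[str] = FIELDS) -> float:
    """MCI across a collection: filled cells over all available cells."""
    if not records:
        raise ValueError("at least one record is required")
    filled = sum(1 for r in records for f in fields if is_filled(r.get(f)))
    return 100.0 * filled / (len(records) * len(fields))


def field_fill_rates(records: Sequence[Mapping[str, Any]],
                     fields: Sequence[str] = FIELDS) -> Dict[str, float]:
    """Per-field score: percentage of records in which each field is filled."""
    if not records:
        raise ValueError("at least one record is required")
    return {
        f: 100.0 * sum(1 for r in records if is_filled(r.get(f))) / len(records)
        for f in fields
    }


# Made-up example records.
records = [
    {"organism": "A", "isolation_site": "oil well",
     "latitude": None, "oxygen_requirement": ""},
    {"organism": "B", "isolation_site": "mud",
     "latitude": "48.1", "oxygen_requirement": "anaerobe"},
]

# Rank records by MCI, e.g. to prioritize the sparsest for curation.
ranked = sorted(records, key=mci)
```

Sorting by the per-record score, as in the last line, is one way to prioritize records for further curation; the per-field rates correspond to the frequency with which fields of a record type are filled.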
Thermovirga lienii Dahle and Birkeland 2006 is a member of the genus Thermovirga in the genomically moderately well characterized phylum 'Synergistetes'. Members of this relatively recently proposed phylum ‘Synergistetes’ are of interest because of their isolated phylogenetic position and their diverse habitats, e.g. from humans to oil wells. The genome of T. lienii Cas60314T is the fifth genome sequence (third completed) from this phylum to be published. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 1,999,646 bp long genome (including one plasmid) with its 1,914 protein-coding and 59 RNA genes is a part of the Genomic
anaerobic; chemoorganotrophic; Gram-negative; motile; thermophilic; marine oil well; Synergistaceae; GEBA
Holophaga foetida Liesack et al. 1995 is a member of the phylum Acidobacteria and is of interest for its ability to anaerobically degrade aromatic compounds and for its production of volatile sulfur compounds through a unique pathway. The genome of H. foetida strain TMBS4T is the first to be sequenced for a representative of the class Holophagae. Here we describe the features of this organism, together with the complete genome sequence (improved high quality draft), and annotation. The 4,127,237 bp long chromosome with its 3,615 protein-coding and 57 RNA genes is a part of the Genomic
anaerobic; motile; Gram-negative; mesophilic; chemoorganotrophic; sulfide-methylation; fresh water mud; Acidobacteria; Holophagaceae; GEBA
The Integrated Microbial Genomes (IMG) system serves as a community resource for comparative analysis of publicly available genomes in a comprehensive integrated context. IMG integrates publicly available draft and complete genomes from all three domains of life with a large number of plasmids and viruses. IMG provides tools and viewers for analyzing and reviewing the annotations of genes and genomes in a comparative context. IMG's data content and analytical capabilities have been continuously extended through regular updates since its first release in March 2005. IMG is available at http://img.jgi.doe.gov. Companion IMG systems provide support for expert review of genome annotations (IMG/ER: http://img.jgi.doe.gov/er), teaching courses and training in microbial genome analysis (IMG/EDU: http://img.jgi.doe.gov/edu) and analysis of genomes related to the Human Microbiome Project (IMG/HMP: http://www.hmpdacc-resources.org/img_hmp).
The integrated microbial genomes and metagenomes (IMG/M) system provides support for comparative analysis of microbial community aggregate genomes (metagenomes) in a comprehensive integrated context. IMG/M integrates metagenome data sets with isolate microbial genomes from the IMG system. IMG/M's data content and analytical capabilities have been extended through regular updates since its first release in 2007. IMG/M is available at http://img.jgi.doe.gov/m. A companion IMG/M system provides support for annotation and expert review of unpublished metagenomic data sets (IMG/M ER: http://img.jgi.doe.gov/mer).
Pedobacter saltans Steyn et al. 1998 is one of currently 32 species in the genus Pedobacter within the family Sphingobacteriaceae. The species is of interest for its isolated location in the tree of life. Like other members of the genus, P. saltans is heparinolytic. Cells of P. saltans show a peculiar gliding, dancing motility and can be distinguished from other Pedobacter strains by their ability to utilize glycerol and the inability to assimilate D-cellobiose. The genome presented here is only the second completed genome sequence of a type strain from a member of the family Sphingobacteriaceae to be published. The 4,635,236 bp long genome with its 3,854 protein-coding and 67 RNA genes consists of one chromosome, and is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
strictly aerobic; gliding motility; Gram-negative; heparinolytic; mesophilic; chemoorganotrophic; Sphingobacteriaceae; GEBA
Fluviicola taffensis O'Sullivan et al. 2005 belongs to the monotypic genus Fluviicola within the family Cryomorphaceae. The species is of interest because of its isolated phylogenetic location in the genome-sequenced fraction of the tree of life. Strain RW262T forms a monophyletic lineage with uncultivated bacteria represented in freshwater 16S rRNA gene libraries. A similar phylogenetic differentiation occurs between freshwater and marine bacteria in the family Flavobacteriaceae, a sister family to Cryomorphaceae. Most remarkable is the inability of this freshwater bacterium to grow in the presence of Na+ ions. All other genera in the family Cryomorphaceae are from marine habitats and have an absolute requirement for Na+ ions or natural sea water. F. taffensis is the first member of the family Cryomorphaceae with a completely sequenced and publicly available genome. The 4,633,577 bp long genome with its 4,082 protein-coding and 49 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
strictly aerobic; motile by gliding; Gram-negative; flexirubin-synthesizing; mesophilic; chemoorganotrophic; Cryomorphaceae; GEBA
Nitratifractor salsuginis Nakagawa et al. 2005 is the type species of the genus Nitratifractor, a member of the family Nautiliaceae. The species is of interest because of its high capacity for reducing nitrate, a key compound in plant nutrition, to N2 through respiration. The strain is also of interest because it represents the first mesophilic and facultatively anaerobic member of the Epsilonproteobacteria reported to grow on molecular hydrogen. This is the first completed genome sequence of a member of the genus Nitratifractor and the second sequence from the family Nautiliaceae. The 2,101,285 bp long genome with its 2,121 protein-coding and 54 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
anaerobic; microaerobic; non-motile; Gram-negative; mesophilic; strictly chemolithoautotroph; Nautiliaceae; GEBA
Hydrogenobacter thermophilus Kawasumi et al. 1984 is the type species of the genus Hydrogenobacter. H. thermophilus was the first obligate autotrophic organism reported among aerobic hydrogen-oxidizing bacteria. Strain TK-6T is of interest because of the unusually efficient hydrogen-oxidizing ability of this strain, which results in a faster generation time compared to other autotrophs. It is also able to grow anaerobically using nitrate as an electron acceptor when molecular hydrogen is used as the energy source, and able to aerobically fix CO2 via the reductive tricarboxylic acid cycle. This is the fifth completed genome sequence in the family Aquificaceae, and the second genome sequence determined from a strain derived from the original isolate. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 1,742,932 bp long genome with its 1,899 protein-coding and 49 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
strictly thermophilic; obligately chemolithoautotrophic; Gram-negative; aerobic; hydrogen-oxidizing; nonmotile; non sporeforming; rod shaped; Aquificaceae; Aquificae; GEBA
Riemerella anatipestifer (Hendrickson and Hilbert 1932) Segers et al. 1993 is the type species of the genus Riemerella, which belongs to the family Flavobacteriaceae. The species is of interest because of the position of the genus in the phylogenetic tree and because of its role as a pathogen of commercially important avian species worldwide. This is the first completed genome sequence of a member of the genus Riemerella. The 2,155,121 bp long genome with its 2,001 protein-coding and 51 RNA genes consists of one circular chromosome and is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
capnophilic; non-motile; Gram-negative; poultry pathogen; mesophilic; chemoorganotrophic; Flavobacteriaceae; GEBA
Marivirga tractuosa (Lewin 1969) Nedashkovskaya et al. 2010 is the type species of the genus Marivirga, which belongs to the family Flammeovirgaceae. Members of this genus are of interest because of their gliding motility. The species is of interest because representative strains show resistance to several antibiotics, including gentamicin, kanamycin, neomycin, polymixin and streptomycin. This is the first complete genome sequence of a member of the family Flammeovirgaceae. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 4,511,574 bp long chromosome and the 4,916 bp plasmid with their 3,808 protein-coding and 49 RNA genes are a part of the Genomic Encyclopedia of Bacteria and Archaea project.
mesophilic; chemoorganotrophic; strictly aerobic; Gram-negative; slender and flexible rod-shaped; non-sporeforming; motile by gliding; Flammeovirgaceae; GEBA
Leadbetterella byssophila Weon et al. 2005 is the type species of the genus Leadbetterella of the family Cytophagaceae in the phylum Bacteroidetes. Members of the phylum Bacteroidetes are widely distributed in nature, especially in aquatic environments. They are of special interest for their ability to degrade complex biopolymers. L. byssophila occupies a rather isolated position in the tree of life and is characterized by its ability to hydrolyze starch and gelatine, but not agar, cellulose or chitin. Here we describe the features of this organism, together with the complete genome sequence, and annotation. L. byssophila is already the 16th member of the family Cytophagaceae whose genome has been sequenced. The 4,059,653 bp long single replicon genome with its 3,613 protein-coding and 53 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
non-motile; non-sporulating; aerobic; mesophile; Gram-negative; flexirubin; Cytophagaceae; GEBA
Weeksella virosa Holmes et al. 1987 is the sole member and type species of the genus Weeksella, which belongs to the family Flavobacteriaceae of the phylum Bacteroidetes. Twenty-nine isolates collected from clinical specimens provided the basis for the taxon description. While the species seems to be a saprophyte of the mucous membranes of healthy man and warm-blooded animals, a causal relationship with disease has been reported in a few instances. Except for the ability to produce indole and to hydrolyze Tween and proteins such as casein and gelatin, this aerobic, non-motile, non-pigmented bacterial species is metabolically inert in most traditional biochemical tests. The 2,272,954 bp long genome with its 2,105 protein-coding and 76 RNA genes consists of one circular chromosome and is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
strictly aerobic; slimy; Gram-negative; lyses proteins; inhabitant of mucosa; Flavobacteriaceae; GEBA
Thermomonospora curvata Henssen 1957 is the type species of the genus Thermomonospora. This genus is of interest because members of this clade are sources of new antibiotics, enzymes, and products with pharmacological activity. In addition, members of this genus participate in the active degradation of cellulose. This is the first complete genome sequence of a member of the family Thermomonosporaceae. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 5,639,016 bp long genome with its 4,985 protein-coding and 76 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
chemoorganotroph; facultative aerobe; eurythermal thermophile; mycelium; Gram-positive; cellulose degradation; Thermomonosporaceae; GEBA
Methanothermus fervidus Stetter 1982 is the type strain of the genus Methanothermus. This hyperthermophilic genus is thought to be endemic in Icelandic hot springs. M. fervidus was not only the first characterized organism with a maximal growth temperature (97°C) close to the boiling point of water, but also the first archaeon in which a detailed functional analysis of its histone protein was reported and the first one in which the function of 2,3-cyclodiphosphoglycerate in thermoadaptation was characterized. Strain V24ST is of interest because of its very narrow substrate range: it grows only on H2 + CO2. This is the first completed genome sequence of the family Methanothermaceae. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 1,243,342 bp long genome with its 1,311 protein-coding and 50 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
hyperthermophile; strictly anaerobic; motile; Gram-positive; chemolithoautotroph; Methanothermaceae; Euryarchaeota; GEBA
Spirochaeta smaragdinae Magot et al. 1998 belongs to the family Spirochaetaceae. The species is Gram-negative, motile, obligately halophilic and strictly anaerobic and is of interest because it is able to ferment numerous polysaccharides. S. smaragdinae is the only species of the family Spirochaetaceae known to reduce thiosulfate or elemental sulfur to sulfide. This is the first complete genome sequence in the family Spirochaetaceae. The 4,653,970 bp long genome with its 4,363 protein-coding and 57 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
spiral shaped; corkscrew-like motility; chemoorganotroph; strictly anaerobe; obligately halophile; rhodanese-like protein; Spirochaetaceae; GEBA
High-quality draft genome sequences were determined for 10 Exiguobacterium strains in order to provide insight into their evolutionary strategies for speciation and environmental adaptation. The selected genomes include psychrotrophic and thermophilic species from a range of habitats, which will allow for a comparison of metabolic pathways and stress response genes.
Vulcanisaeta distributa Itoh et al. 2002 belongs to the family Thermoproteaceae in the phylum Crenarchaeota. The genus Vulcanisaeta is characterized by a global distribution in hot and acidic springs. This is the first genome sequence from a member of the genus Vulcanisaeta and seventh genome sequence in the family Thermoproteaceae. The 2,374,137 bp long genome with its 2,544 protein-coding and 49 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
hyperthermophilic; acidophilic; non-motile; microaerotolerant anaerobe; Thermoproteaceae; Crenarchaeota; GEBA
Coraliomargarita akajimensis Yoon et al. 2007 is the type species of the genus Coraliomargarita. C. akajimensis is an obligately aerobic, Gram-negative, non-spore-forming, non-motile, spherical bacterium that was isolated from seawater surrounding the hard coral Galaxea fascicularis. C. akajimensis is of special interest because of its phylogenetic position in a genomically under-studied area of the bacterial diversity. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of a member of the family Puniceicoccaceae. The 3,750,771 bp long genome with its 3,137 protein-coding and 55 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.
sphere-shaped; non-motile; non-spore-forming; aerobic; mesophile; Gram-negative; Puniceicoccaceae; Opitutae; GEBA
Arcobacter nitrofigilis (McClung et al. 1983) Vandamme et al. 1991 is the type species of the genus Arcobacter in the family Campylobacteraceae within the Epsilonproteobacteria. The species was first described in 1983 as Campylobacter nitrofigilis after its detection as a free-living, nitrogen-fixing Campylobacter species associated with Spartina alterniflora Loisel roots. It is of phylogenetic interest because of its lifestyle as a symbiotic organism in a marine environment, in contrast to many other Arcobacter species, which are associated with warm-blooded animals and tend to be pathogenic. Here we describe the features of this organism, together with the complete genome sequence, and annotation. This is the first complete genome sequence of a type strain of the genus Arcobacter. The 3,192,235 bp genome with its 3,154 protein-coding and 70 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
symbiotic; Spartina alterniflora Loisel; nitrogen fixation; micro-anaerophilic; motile; Campylobacteraceae; GEBA
Thermobispora bispora (Henssen 1957) Wang et al. 1996 is the type species of the genus Thermobispora. This genus is of great interest because it is strictly thermophilic and because it has been shown for several of its members that the genome contains substantially distinct (6.4% sequence difference) and transcriptionally active 16S rRNA genes. Here we describe the features of this organism, together with the complete genome sequence and annotation. This is the second completed genome sequence of a member from the suborder Streptosporangineae and the first genome sequence of a member of the genus Thermobispora. The 4,189,976 bp long genome with its 3,596 protein-coding and 63 RNA genes is part of the Genomic Encyclopedia of Bacteria and Archaea project.
Two distinct 16S rRNA genes; strictly thermophilic; non-pathogenic; Streptosporangineae; GEBA