Metagenomics has become an indispensable tool for studying the diversity and metabolic potential of environmental microbes, whose bulk is as yet non-cultivable. Continual progress in next-generation sequencing allows for generating increasingly large metagenomes and studying multiple metagenomes over time or space. Recently, a new type of holistic ecosystem study has emerged that seeks to combine metagenomics with biodiversity, meta-expression and contextual data. Such ‘ecosystems biology’ approaches have the potential to advance our understanding of environmental microbes to a new level, but they also impose challenges due to increasing data complexity, in particular with respect to bioinformatic post-processing. This mini review addresses selected opportunities and challenges of modern metagenomics from a bioinformatics perspective and is intended as a useful resource for microbial ecologists and bioinformaticians alike.
16S rRNA biodiversity; binning; bioinformatics; Genomic Standards Consortium; metagenomics; next-generation sequencing
Members of the Planctomycetes clade share many unusual features for bacteria. Their cytoplasm contains membrane-bound compartments, they lack peptidoglycan and FtsZ, they divide by polar budding, and they are capable of endocytosis. Planctomycete genomes have remained enigmatic, generally being quite large (up to 9 Mb), and on average, 55% of their predicted proteins are of unknown function. Importantly, proteins related to the unusual traits of Planctomycetes remain largely unknown. Thus, we embarked on bioinformatic analyses of these genomes in an effort to predict proteins that are likely to be involved in compartmentalization, cell division, and signal transduction. We used three complementary strategies. First, we defined the Planctomycetes core genome and subtracted genes of well-studied model organisms. Second, we analyzed the gene content and synteny of morphogenesis and cell division genes and combined both methods using a “guilt-by-association” approach. Third, we identified signal transduction systems as well as sigma factors. These analyses provide a manageable list of candidate genes for future genetic studies and provide evidence for complex signaling in the Planctomycetes akin to that observed for bacteria with complex life-styles, such as Myxococcus xanthus.
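The first strategy above (define a core genome, then subtract genes found in well-studied model organisms) can be illustrated with plain set operations. This is only a minimal sketch: it assumes genes have already been grouped into shared family identifiers, and the genome names and family labels are hypothetical placeholders, not data from the study.

```python
def core_genome(genomes):
    """Return the gene families present in every genome of the clade."""
    gene_sets = [set(families) for families in genomes.values()]
    return set.intersection(*gene_sets)


def subtract_models(core, model_genomes):
    """Remove families that also occur in any model organism."""
    known = set().union(*(set(f) for f in model_genomes.values()))
    return core - known


# Hypothetical toy input: genome name -> gene family identifiers.
planctomycetes = {
    "genome_A": {"fam1", "fam2", "fam3", "fam7"},
    "genome_B": {"fam1", "fam2", "fam3", "fam8"},
    "genome_C": {"fam1", "fam2", "fam3", "fam9"},
}
models = {
    "model_X": {"fam1", "fam5"},
    "model_Y": {"fam6"},
}

core = core_genome(planctomycetes)
candidates = subtract_models(core, models)
print(sorted(candidates))
```

In this toy input, the families shared by all three genomes form the core, and the one that also occurs in a model organism is dropped from the candidate list.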
SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive web resource for up to date, quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. The referred database release 111 (July 2012) contains 3 194 778 small subunit and 288 717 large subunit rRNA gene sequences. Since the initial description of the project, substantial new features have been introduced, including advanced quality control procedures, an improved rRNA gene aligner, online tools for probe and primer evaluation and optimized browsing, searching and downloading on the website. Furthermore, the extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches.
16S ribosomal RNA gene (rDNA) amplicon analysis remains the standard approach for the cultivation-independent investigation of microbial diversity. The accuracy of these analyses depends strongly on the choice of primers. The overall coverage and phylum spectrum of 175 primers and 512 primer pairs were evaluated in silico with respect to the SILVA 16S/18S rDNA non-redundant reference dataset (SSURef 108 NR). Based on this evaluation a selection of ‘best available’ primer pairs for Bacteria and Archaea for three amplicon size classes (100–400, 400–1000, ≥1000 bp) is provided. The most promising bacterial primer pair (S-D-Bact-0341-b-S-17/S-D-Bact-0785-a-A-21), with an amplicon size of 464 bp, was experimentally evaluated by comparing the taxonomic distribution of the 16S rDNA amplicons with 16S rDNA fragments from directly sequenced metagenomes. The results of this study may be used as a guideline for selecting primer pairs with the best overall coverage and phylum spectrum for specific applications, therefore reducing the bias in PCR-based microbial diversity studies.
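The idea of in silico primer-pair coverage can be sketched as follows: expand the degenerate IUPAC bases of each primer into a pattern, and count the fraction of reference sequences in which the forward primer is followed by the reverse complement of the reverse primer. This is a simplified illustration that allows no mismatches; the primers and reference sequences are hypothetical toy strings, and the function names are assumptions rather than the evaluation procedure used in the study.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a primer with degenerate bases into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def pair_coverage(forward, reverse, references):
    """Fraction of references matched by both primers in amplicon order."""
    fwd = primer_regex(forward)
    rev = primer_regex(reverse_complement(reverse.upper()))
    hits = 0
    for seq in references:
        match = fwd.search(seq)
        # The reverse primer site must lie downstream of the forward site.
        if match and rev.search(seq, match.end()):
            hits += 1
    return hits / len(references) if references else 0.0


# Hypothetical toy primers and reference sequences.
refs = ["ACGTAAAGCAA", "ACGCTTTGTAA", "ACGATTTGCAA", "TTTTACGTGGG"]
coverage = pair_coverage("ACGY", "TTRC", refs)
print(coverage)
```

A realistic evaluation would additionally tolerate a defined number of mismatches and report coverage per phylum, which this sketch leaves out.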
Motivation: In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements.
Results: In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands.
SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks.
Availability: Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license.
Supplementary data are available at Bioinformatics online.
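The k-mer searching step mentioned in the Results can be pictured as ranking reference sequences by the number of k-mers they share with a query, so that only the closest references need to be considered for alignment. The sketch below is a generic illustration of that idea, not SINA's implementation; the function names, the value of k and the toy sequences are all assumptions, and the partial order alignment step is not shown.

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_references(query, references, k=4, top=2):
    """Rank references by the number of k-mers shared with the query."""
    query_kmers = kmers(query, k)
    scored = [
        (len(query_kmers & kmers(ref, k)), name)
        for name, ref in references.items()
    ]
    # Highest shared k-mer count first; ties broken by name for stability.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top]]


# Hypothetical toy reference set.
references = {
    "r1": "ACGTACGTGG",
    "r2": "ACGTACGAAA",
    "r3": "TTTTTTTTTT",
}
best = rank_references("ACGTACGTGG", references)
print(best)
```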
Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not yet been applied to broad-scale NGS-based metagenomics. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted in classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion.
binning; metagenomics; molecular ecology; self-organizing map (SOM); taxonomic classification; TaxSOM
Planctomycetes represent a remarkable clade in the domain Bacteria because they play crucial roles in global carbon and nitrogen cycles and display cellular structures that closely parallel those of eukaryotic cells. Studies on Planctomycetes have been hampered by the lack of genetic tools, which we developed for Planctomyces limnophilus.
Marine phages have an astounding global abundance and ecological impact. However, little knowledge is derived from phage genomes, as most of the open reading frames in their small genomes are unknown, novel proteins. To infer potential functional and ecological relevance of sequenced marine Pseudoalteromonas phage H105/1, two strategies were used. First, similarity searches were extended to include six viral and bacterial metagenomes paired with their respective environmental contextual data. This approach revealed ‘ecogenomic’ patterns of Pseudoalteromonas phage H105/1, such as its estuarine origin. Second, intrinsic genome signatures (phylogenetic, codon adaptation and tetranucleotide (tetra) frequencies) were evaluated on a resolved intra-genomic level to shed light on the evolution of phage functional modules. On the basis of differential codon adaptation of Phage H105/1 proteins to the sequenced Pseudoalteromonas spp., regions of the phage genome with the most ‘host’-adapted proteins also have the strongest bacterial tetra signature, whereas the least ‘host’-adapted proteins have the strongest phage tetra signature. Such a pattern may reflect the evolutionary history of the respective phage proteins and functional modules. Finally, analysis of the structural proteome identified seven proteins that make up the mature virion, four of which were previously unknown. This integrated approach combines both novel and classical strategies and serves as a model to elucidate ecological inferences and evolutionary relationships from phage genomes that typically abound with unknown gene content.
ecogenomics; genome signatures; genomics; marine; phage; Pseudoalteromonas
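A tetranucleotide signature of the kind referred to above can be approximated by counting all overlapping 4-mers in a sequence and comparing the resulting profiles between genome regions. The sketch below uses raw relative frequencies and a Pearson correlation as a simple similarity measure; this is an assumption made for illustration and not necessarily the statistic used in the study, and the sequences are hypothetical toy strings.

```python
import math
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
TETRAS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(seq):
    """Relative frequency of each tetranucleotide (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases are ignored in the total.
    total = sum(counts[t] for t in TETRAS)
    if total == 0:
        return [0.0] * len(TETRAS)
    return [counts[t] / total for t in TETRAS]


def pearson(x, y):
    """Pearson correlation of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy regions: two with similar composition, one different.
region_a = tetra_frequencies("ACGTACGTAC")
region_b = tetra_frequencies("ACGTACGTACGT")
region_c = tetra_frequencies("TTTTGGGGCCCC")
print(pearson(region_a, region_b) > pearson(region_a, region_c))
```

Real genome regions would need to be much longer than these toy strings for the 256-dimensional profile to be informative.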
DNA-binding transcription factors (TFs) regulate cellular functions in prokaryotes, often in response to environmental stimuli. Thus, the environment exerts constant selective pressure on the TF gene content of microbial communities. Recently a study on marine Synechococcus strains detected differences in their genomic TF content related to environmental adaptation, but so far the effect of environmental parameters on the content of TFs in bacterial communities has not been systematically investigated.
We quantified the effect of environment stability on the transcription factor repertoire of marine pelagic microbes from the Global Ocean Sampling (GOS) metagenome using interpolated physico-chemical parameters and multivariate statistics. Thirty-five percent of the difference in relative TF abundances between samples could be explained by environment stability. Six percent was attributable to spatial distance but none to a combination of both spatial distance and stability. Some individual TFs showed a stronger relationship to environment stability and space than the total TF pool.
Environmental stability appears to have a clearly detectable effect on TF gene content in bacterioplanktonic communities described by the GOS metagenome. Interpolated environmental parameters were shown to compare well to in situ measurements and were essential for quantifying the effect of the environment on the TF content. It is demonstrated that comprehensive and well-structured contextual data will strongly enhance our ability to interpret the functional potential of microbes from metagenomic data.
transcription factors; ecological metagenomics; interpolated environmental data; multivariate statistics
In any sequencing project, the possible depth of comparative analysis is determined largely by the amount and quality of the accompanying contextual data. The structure, content, and storage of this contextual data should be standardized to ensure consistent coverage of all sequenced entities and facilitate comparisons. The Genomic Standards Consortium (GSC) has developed the “Minimum Information about Genome/Metagenome Sequences (MIGS/MIMS)” checklist for the description of genomes, and here we annotate all 30 publicly available marine bacteriophage sequences to the MIGS standard. These annotations build on existing International Nucleotide Sequence Database Collaboration (INSDC) records and confirm, as expected, that current submissions lack most MIGS fields. MIGS fields were manually curated from the literature and placed in XML format as specified by the Genomic Contextual Data Markup Language (GCDML). These “machine-readable” reports were then analyzed to highlight patterns describing this collection of genomes. Completed reports are provided in GCDML. This work represents one step towards the annotation of our complete collection of genome sequences and shows the utility of capturing richer metadata along with raw sequences.
marine phages; contextual data; genome standards; markup language
In the future, we hope to see an open and thriving data market in which users can find and select data from a wide range of data providers. In such an open access market, data are products that must be packaged accordingly. Increasingly, eCommerce sellers present heterogeneous product lines to buyers using faceted browsing. Using this approach we have developed the Ontogrator platform, which allows for rapid retrieval of data in a way that would be familiar to any online shopper. Using Knowledge Organization Systems (KOS), especially ontologies, Ontogrator uses text mining to mark up data and faceted browsing to help users navigate, query and retrieve data. Ontogrator offers the potential to impact scientific research in two major ways: 1) by significantly improving the retrieval of relevant information; and 2) by significantly reducing the time required to compose standard database queries and assemble information for further research. Here we present a pilot implementation developed in collaboration with the Genomic Standards Consortium (GSC) that includes content from the StrainInfo, GOLD, CAMERA, Silva and Pubmed databases. This implementation demonstrates the power of ontogration and highlights that the usefulness of this approach is fully dependent on both the quality of data and the KOS (ontologies) used. Ideally, the use and further expansion of this collaborative system will help to surface issues associated with the underlying quality of annotation and could lead to a systematic means for accessing integrated data resources.
This report details the outcome of the first meeting of the Earth Microbiome Project (EMP) to discuss sample selection and acquisition. The meeting, held at the Argonne National Laboratory on Wednesday, October 6th 2010, focused on how to prioritize environmental samples for sequencing and metagenomic analysis as part of the global effort of the EMP to systematically determine the functional and phylogenetic diversity of microbial communities across the world.
This report summarizes the proceedings of the 9th workshop of the Genomic Standards Consortium (GSC), held at the J. Craig Venter Institute, Rockville, MD, USA. It was the first GSC workshop to have open registration and attracted over 90 participants. This workshop featured sessions that provided overviews of the full range of ongoing GSC projects. It included sessions on Standards in Genomic Sciences, the open access journal of the GSC, building standards for genome annotation, the M5 platform for next-generation collaborative computational infrastructures, building ties with the biodiversity research community and two discussion panels with government and industry participants. Progress was made on all fronts, and major outcomes included the completion of the MIENS specification for publication and the formation of the Biodiversity working group.
Marinobacter adhaerens HP15 is the type strain of a newly identified marine species, which is phylogenetically related to M. flavimaris, M. algicola, and M. aquaeolei. It is of special interest for research on marine aggregate formation because it showed specific attachment to diatom cells. In vitro it led to exopolymer formation and aggregation of these algal cells to form marine snow particles. M. adhaerens HP15 is a free-living, motile, rod-shaped, Gram-negative gammaproteobacterium, which was originally isolated from marine particles sampled in the German Wadden Sea. M. adhaerens HP15 grows heterotrophically on various media, is easy to access genetically, and serves as a model organism to investigate the cellular and molecular interactions with the diatom Thalassiosira weissflogii. Here we describe the complete and annotated genome sequence of M. adhaerens HP15 as well as some details on flagella-associated genes. M. adhaerens HP15 possesses three replicons; the chromosome comprises 4,422,725 bp and codes for 4,180 protein-coding genes, 51 tRNAs and three rRNA operons, while the two circular plasmids are ~187 kb and ~42 kb in size and contain 178 and 52 protein-coding genes, respectively.
marine heterotrophic bacteria; diatoms; attachment; marine aggregate formation
This report summarizes the proceedings of the 8th meeting of the Genomic Standards Consortium (GSC) held at the Department of Energy Joint Genome Institute in Walnut Creek, CA, USA on September 9-11, 2009. This three-day workshop marked the maturing of the Genomic Standards Consortium from an informal gathering of researchers interested in developing standards in the field of genomics and metagenomics to an established community with a defined governance mechanism, its own open access journal, and a family of established standards for describing genomes, metagenomes and marker studies (i.e. ribosomal RNA gene surveys). There will be increased efforts within the GSC to reach out to the wider scientific community via a range of new projects. Further information about the GSC and its activities can be found at http://gensc.org/.
Environmental sequence datasets are increasing at an exponential rate; however, the vast majority of them lack appropriate descriptors like sampling location, time and depth/altitude, generally referred to as metadata or contextual data. The consistent capture and structured submission of these data is crucial for integrated data analysis and ecosystems modeling. The application MetaBar has been developed to support consistent contextual data acquisition.
MetaBar is a spreadsheet and web-based software tool designed to assist users in the consistent acquisition, electronic storage, and submission of contextual data associated with their samples. A preconfigured Microsoft® Excel® spreadsheet is used to initiate structured contextual data storage in the field or laboratory. Each sample is given a unique identifier and at any stage the sheets can be uploaded to the MetaBar database server. To label samples, identifiers can be printed as barcodes. An intuitive web interface provides quick access to the contextual data in the MetaBar database as well as user and project management capabilities. Export functions facilitate contextual and sequence data submission to the International Nucleotide Sequence Database Collaboration (INSDC), comprising the DNA DataBase of Japan (DDBJ), the European Molecular Biology Laboratory database (EMBL) and GenBank. MetaBar requests and stores contextual data in compliance with the Genomic Standards Consortium specifications. The MetaBar open source code base for local installation is available under the GNU General Public License version 3 (GNU GPL3).
The MetaBar software supports the typical workflow from data acquisition and field-sampling to contextual data enriched sequence submission to an INSDC database. The integration with the megx.net marine Ecological Genomics database and portal facilitates georeferenced data integration and metadata-based comparisons of sampling sites as well as interactive data visualization. The ample export functionalities and the INSDC submission support enable exchange of data across disciplines and safeguarding contextual data.
This report summarizes the proceedings of the “Metagenomics, Metadata and Meta-analysis” (M3) Special Interest Group (SIG) meeting held at the Intelligent Systems for Molecular Biology 2009 conference. The Genomic Standards Consortium (GSC) hosted this meeting to explore the bottlenecks and emerging solutions for obtaining biological insights through large-scale comparative analysis of metagenomic datasets. The M3 SIG included 16 talks, half of which were selected from submitted abstracts, a poster session and a panel discussion involving members of the GSC Board. This report summarizes this one-day SIG, attempts to identify shared themes and recapitulates community recommendations for the future of this field. The GSC will also host an M3 workshop at the Pacific Symposium on Biocomputing (PSB) in January 2010. Further information about the GSC and its range of activities can be found at http://gensc.org/.
Megx.net is a database and portal that provides integrated access to georeferenced marker genes, environment data and marine genome and metagenome projects for microbial ecological genomics. All data are stored in the Microbial Ecological Genomics DataBase (MegDB), which is subdivided to hold both sequence and habitat data and global environmental data layers. The extended system provides access to several hundreds of genomes and metagenomes from prokaryotes and phages, as well as over a million small and large subunit ribosomal RNA sequences. With the refined Genes Mapserver, all data can be interactively visualized on a world map and statistics describing environmental parameters can be calculated. Sequence entries have been curated to comply with the proposed minimal standards for genomes and metagenomes (MIGS/MIMS) of the Genomic Standards Consortium. Access to data is facilitated by Web Services. The updated megx.net portal offers microbial ecologists greatly enhanced database content, and new features and tools for data analysis, all of which are freely accessible from our webpage http://www.megx.net.
The marine model organism Rhodopirellula baltica SH1T was the first Planctomycete to have its genome completely sequenced. The genome analysis predicted a complex lifestyle and a variety of genetic opportunities to adapt to the marine environment. Its adaptation to environmental stressors was studied by transcriptional profiling using a whole genome microarray.
Stress responses to salinity and temperature shifts were monitored in time series experiments. Chemostat cultures grown in mineral medium at 28°C were compared to cultures that were shifted to either elevated (37°C) or reduced (6°C) temperatures as well as high salinity (59.5‰) and observed over 300 min. Heat shock showed the induction of several known chaperone genes. Cold shock altered the expression of genes in lipid metabolism and stress proteins. High salinity resulted in the modulation of genes coding for compatible solutes, ion transporters and morphology. In summary, over 3000 of the 7325 genes were affected by temperature and/or salinity changes.
Transcriptional profiling confirmed that R. baltica is highly responsive to its environment. The distinct responses identified here have provided new insights into the complex adaptation machinery of this environmentally relevant marine bacterium. Our transcriptome study and previous proteome data suggest a set of genes of unknown functions that are most probably involved in the global stress response. This work lays the foundation for further bioinformatic and genetic studies which will lead to a comprehensive understanding of the biology of a marine Planctomycete.
Through a newly established Research Coordination Network for the Genomic Standards Consortium (RCN4GSC), the GSC will continue its leadership in establishing and integrating genomic standards through community-based efforts. These efforts, undertaken in the context of genomic and metagenomic research, aim to ensure the electronic capture of all genomic data and to facilitate the achievement of a community consensus around collecting and managing relevant contextual information connected to the sequence data. The GSC operates as an open, inclusive organization, welcoming inspired biologists with a commitment to community service. Within the collaborative framework of the ongoing, international activities of the GSC, the RCN will expand the range of research domains engaged in these standardization efforts and sustain scientific networking to encourage active participation by the broader community. The RCN4GSC, funded for five years by the US National Science Foundation, will primarily support outcome-focused working meetings and the exchange of early-career scientists between GSC research groups in order to advance key standards contributions such as GCDML. Focusing on the timely delivery of the extant GSC core projects, the RCN will also extend the pioneering efforts of the GSC to engage researchers active in developing ecological, environmental and biodiversity data standards. As the initial goals of the GSC are increasingly achieved, promoting the comprehensive use of effective standards will be essential to ensure the effective use of sequence and associated data, to provide access for all biologists to all of the information, and to create interdisciplinary opportunities for discovery. The RCN will facilitate these implementation activities through participation in major scientific conferences and presentations on scientific advances enabled by community usage of genomic standards.
This report summarizes the proceedings of the 6th and 7th workshops of the Genomic Standards Consortium (GSC), held back-to-back in 2008. GSC 6 focused on furthering the activities of GSC working groups, whereas GSC 7 focused on outreach to the wider community. GSC 6 was held October 10-14, 2008 at the European Bioinformatics Institute, Cambridge, United Kingdom and included a two-day workshop focused on the refinement of the Genomic Contextual Data Markup Language (GCDML). GSC 7 was held as the opening day of the International Congress on Metagenomics 2008 in San Diego, California. Major achievements of these combined meetings included an agreement from the International Nucleotide Sequence Database Collaboration (INSDC) to create a “MIGS” keyword for capturing “Minimum Information about a Genome Sequence” compliant information within INSDC (DDBJ/EMBL/GenBank) records, launch of GCDML 1.0, MIGS compliance of the first set of “Genomic Encyclopedia of Bacteria and Archaea” project genomes, approval of a proposal to extend MIGS to 16S rRNA sequences within a “Minimum Information about an Environmental Sequence”, finalization of plans for the GSC eJournal, “Standards in Genomic Sciences” (SIGS), and the formation of a GSC Board. Subsequently, the GSC has been awarded a Research Co-ordination Network (RCN4GSC) grant from the National Science Foundation, held the first SIGS workshop and launched the journal. The GSC will also be hosting outreach workshops at both ISMB 2009 and PSB 2010 focused on “Metagenomics, Metadata and MetaAnalysis” (M3). Further information about the GSC and its range of activities can be found at http://gensc.org, including videos of all the presentations at GSC 7.
Sulfate-reducing bacteria (SRB) belonging to the metabolically versatile Desulfobacteriaceae are abundant in marine sediments and contribute to the global carbon cycle by complete oxidation of organic compounds. Desulfobacterium autotrophicum HRM2 is the first member of this ecophysiologically important group for which a genome sequence is now available. With 5.6 megabasepairs (Mbp) the genome of Db. autotrophicum HRM2 is about 2 Mbp larger than the sequenced genomes of other SRB. A high number of genome plasticity elements (> 100 transposon-related genes), several regions of GC discontinuity and a high number of repetitive elements (132 paralogous genes Mbp−1) point to a different genome evolution compared with Desulfovibrio spp. The metabolic versatility of Db. autotrophicum HRM2 is reflected in the presence of genes for the degradation of a variety of organic compounds including long-chain fatty acids and for the Wood–Ljungdahl pathway, which enables the organism to completely oxidize acetyl-CoA to CO2 but also to grow chemolithoautotrophically. The presence of more than 250 proteins of the sensory/regulatory protein families should enable Db. autotrophicum HRM2 to efficiently adapt to changing environmental conditions. Genes encoding periplasmic or cytoplasmic hydrogenases and formate dehydrogenases have been detected as well as genes for the transmembrane TpII-c3, Hme and Rnf complexes. Genes for subunits A, B, C and D as well as for the proposed novel subunits L and F of the heterodisulfide reductases are present. This enzyme is involved in energy conservation in methanoarchaea and it is speculated that it exhibits a similar function in the process of dissimilatory sulfate reduction in Db. autotrophicum HRM2.
Current sequencing technologies give access to sequence information for genomes and metagenomes at a tremendous speed. Subsequent data processing is mainly performed by automatic pipelines provided by the sequencing centers. Although standardised workflows are desirable and useful in many respects, rational data mining, comparative genomics, and especially the interpretation of the sequence information in the biological context demand intuitive, flexible, and extendable solutions.
The JCoast software tool was primarily designed to analyse and compare (meta)genome sequences of prokaryotes. Based on a pre-computed GenDB database project, JCoast offers a flexible graphical user interface (GUI), as well as an application programming interface (API) that facilitates back-end data access. JCoast offers individual, cross genome-, and metagenome analysis, and assists the biologist in exploration of large and complex datasets.
JCoast combines all functions required for the mining, annotation, and interpretation of (meta)genomic data. The lightweight software solution allows the user to easily take advantage of advanced back-end database structures by providing a programming and graphical user interface to answer biological questions. JCoast is available at the project homepage.
Magnetotactic bacteria (MTB) are a heterogeneous group of aquatic prokaryotes with a unique intracellular organelle, the magnetosome, which orients the cell along magnetic field lines. Magnetotaxis is a complex phenotype, which depends on the coordinate synthesis of magnetosomes and the ability to swim and orient along the direction caused by the interaction with the Earth's magnetic field. Although a number of putative magnetotaxis genes were recently identified within a conserved genomic magnetosome island (MAI) of several MTB, their functions have remained mostly unknown, and it was speculated that additional genes located outside the MAI might be involved in magnetosome formation and magnetotaxis. In order to identify genes specifically associated with the magnetotactic phenotype, we conducted comparisons between four sequenced magnetotactic Alphaproteobacteria including the nearly complete genome of Magnetospirillum gryphiswaldense strain MSR-1, the complete genome of Magnetospirillum magneticum strain AMB-1, the complete genome of the magnetic coccus MC-1, and the comparative-ready preliminary genome assembly of Magnetospirillum magnetotacticum strain MS-1 against an in-house database comprising 426 complete bacterial and archaeal genome sequences. A magnetobacterial core genome of about 891 genes was found shared by all four MTB. In addition to a set of approximately 152 genus-specific genes shared by the three Magnetospirillum strains, we identified 28 genes as group specific, i.e., which occur in all four analyzed MTB but exhibit no (MTB-specific genes) or only remote (MTB-related genes) similarity to any genes from nonmagnetotactic organisms and which besides various novel genes include nearly all mam and mms genes previously shown to control magnetosome formation. 
The MTB-specific and MTB-related genes to a large extent display synteny, partially encode previously unrecognized magnetosome membrane proteins, and are either located within (18 genes) or outside (10 genes) the MAI of M. gryphiswaldense. These genes, which represent less than 1% of the 4,268 open reading frames of the MSR-1 genome, as yet are mostly of unknown functions but are likely to be specifically involved in magnetotaxis and, thus, represent prime targets for future experimental analysis.