1.  FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences 
BMC Research Notes  2014;7:365.
Background
Advances in sequencing technologies challenge the efficient importing and validation of FASTA formatted sequence data, which is still a prerequisite for most bioinformatic tools and pipelines. Comparative analysis of commonly used Bio*-frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy are hampered.
Findings
FastaValidator is a platform-independent, standardized, light-weight software library written in the Java programming language. It targets computer scientists and bioinformaticians writing software that needs to parse large amounts of sequence data quickly and accurately. For end-users, FastaValidator includes an interactive out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines.
Conclusions
The accuracy and performance of the FastaValidator library qualify it for large data sets such as those commonly produced by massively parallel (NGS) technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.
doi:10.1186/1756-0500-7-365
PMCID: PMC4094456  PMID: 24929426
FASTA; Data validation; High-throughput
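The kind of check described in this entry can be illustrated with a minimal Python sketch. It is not FastaValidator's API (that library is written in Java, and its interface is not shown here). The function name `validate_fasta`, the accepted nucleotide alphabet, and the specific rules checked are all assumptions made for illustration.

```python
import io

# Assumed alphabet: IUPAC nucleotide codes plus the gap character.
VALID_NUCLEOTIDES = set("ACGTUNRYKMSWBDHV-")


def validate_fasta(handle, alphabet=VALID_NUCLEOTIDES):
    """Return a list of (line_number, message) problems; an empty list means valid."""
    problems = []
    seen_header = False
    has_sequence = True  # no record is pending before the first header
    for line_number, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # blank lines are tolerated in this sketch
        if line.startswith(">"):
            if seen_header and not has_sequence:
                problems.append((line_number, "previous record has no sequence"))
            if len(line) == 1:
                problems.append((line_number, "empty header"))
            seen_header = True
            has_sequence = False
        else:
            if not seen_header:
                problems.append((line_number, "sequence data before first header"))
            bad = sorted(set(line.upper()) - alphabet)
            if bad:
                problems.append((line_number, "invalid characters: " + "".join(bad)))
            has_sequence = True
    if seen_header and not has_sequence:
        problems.append((line_number, "last record has no sequence"))
    if not seen_header:
        problems.append((0, "no FASTA records found"))
    return problems


if __name__ == "__main__":
    print(validate_fasta(io.StringIO(">seq1\nACGT\n>seq2\nACGX\n")))
```

The function reads the input line by line, so memory use stays constant regardless of file size. That is one plausible way to approach the high-throughput setting the abstract mentions.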
2.  Current opportunities and challenges in microbial metagenome analysis—a bioinformatic perspective 
Briefings in Bioinformatics  2012;13(6):728-742.
Metagenomics has become an indispensable tool for studying the diversity and metabolic potential of environmental microbes, whose bulk is as yet non-cultivable. Continual progress in next-generation sequencing allows for generating increasingly large metagenomes and studying multiple metagenomes over time or space. Recently, a new type of holistic ecosystem study has emerged that seeks to combine metagenomics with biodiversity, meta-expression and contextual data. Such ‘ecosystems biology’ approaches bear the potential to not only advance our understanding of environmental microbes to a new level but also impose challenges due to increasing data complexities, in particular with respect to bioinformatic post-processing. This mini review aims to address selected opportunities and challenges of modern metagenomics from a bioinformatics perspective and hopefully will serve as a useful resource for microbial ecologists and bioinformaticians alike.
doi:10.1093/bib/bbs039
PMCID: PMC3504927  PMID: 22966151
16S rRNA biodiversity; binning; bioinformatics; Genomic Standards Consortium; metagenomics; next-generation sequencing
3.  The Genome of the Alga-Associated Marine Flavobacterium Formosa agariphila KMM 3901T Reveals a Broad Potential for Degradation of Algal Polysaccharides 
Applied and Environmental Microbiology  2013;79(21):6813-6822.
In recent years, representatives of the Bacteroidetes have been increasingly recognized as specialists for the degradation of macromolecules. Formosa constitutes a Bacteroidetes genus within the class Flavobacteria, and the members of this genus have been found in marine habitats with high levels of organic matter, such as in association with algae, invertebrates, and fecal pellets. Here we report on the generation and analysis of the genome of the type strain of Formosa agariphila (KMM 3901T), an isolate from the green alga Acrosiphonia sonderi. F. agariphila is a facultative anaerobe with the capacity for mixed acid fermentation and denitrification. Its genome harbors 129 proteases and 88 glycoside hydrolases, indicating a pronounced specialization for the degradation of proteins, polysaccharides, and glycoproteins. Sixty-five of the glycoside hydrolases are organized in at least 13 distinct polysaccharide utilization loci, where they are clustered with TonB-dependent receptors, SusD-like proteins, sensors/transcription factors, transporters, and often sulfatases. These loci play a pivotal role in bacteroidetal polysaccharide biodegradation and in the case of F. agariphila revealed the capacity to degrade a wide range of algal polysaccharides from green, red, and brown algae and thus a strong specialization toward an alga-associated lifestyle. This was corroborated by growth experiments, which confirmed usage particularly of those monosaccharides that constitute the building blocks of abundant algal polysaccharides, as well as distinct algal polysaccharides, such as laminarins, xylans, and κ-carrageenans.
doi:10.1128/AEM.01937-13
PMCID: PMC3811500  PMID: 23995932
4.  The founding charter of the Genomic Observatories Network 
GigaScience  2014;3:2.
The co-authors of this paper hereby state their intention to work together to launch the Genomic Observatories Network (GOs Network) for which this document will serve as its Founding Charter. We define a Genomic Observatory as an ecosystem and/or site subject to long-term scientific research, including (but not limited to) the sustained study of genomic biodiversity from single-celled microbes to multicellular organisms.
An international group of 64 scientists first published the call for a global network of Genomic Observatories in January 2012. The vision for such a network was expanded in a subsequent paper and developed over a series of meetings in Bremen (Germany), Shenzhen (China), Moorea (French Polynesia), Oxford (UK), Pacific Grove (California, USA), Washington (DC, USA), and London (UK). While this community-building process continues, here we express our mutual intent to establish the GOs Network formally, and to describe our shared vision for its future. The views expressed here are ours alone as individual scientists, and do not necessarily represent those of the institutions with which we are affiliated.
doi:10.1186/2047-217X-3-2
PMCID: PMC3995929  PMID: 24606731
Biodiversity; Genomics; Biocode; Earth observations
5.  Extending Standards for Genomics and Metagenomics Data: A Research Coordination Network for the Genomic Standards Consortium (RCN4GSC) 
Standards in Genomic Sciences  2009;1(1):87-90.
Through a newly established Research Coordination Network for the Genomic Standards Consortium (RCN4GSC), the GSC will continue its leadership in establishing and integrating genomic standards through community-based efforts. These efforts, undertaken in the context of genomic and metagenomic research, aim to ensure the electronic capture of all genomic data and to facilitate the achievement of a community consensus around collecting and managing relevant contextual information connected to the sequence data. The GSC operates as an open, inclusive organization, welcoming inspired biologists with a commitment to community service. Within the collaborative framework of the ongoing, international activities of the GSC, the RCN will expand the range of research domains engaged in these standardization efforts and sustain scientific networking to encourage active participation by the broader community. The RCN4GSC, funded for five years by the US National Science Foundation, will primarily support outcome-focused working meetings and the exchange of early-career scientists between GSC research groups in order to advance key standards contributions such as GCDML. Focusing on the timely delivery of the extant GSC core projects, the RCN will also extend the pioneering efforts of the GSC to engage researchers active in developing ecological, environmental and biodiversity data standards. As the initial goals of the GSC are increasingly achieved, promoting the comprehensive use of effective standards will be essential to ensure the effective use of sequence and associated data, to provide access for all biologists to all of the information, and to create interdisciplinary opportunities for discovery. The RCN will facilitate these implementation activities through participation in major scientific conferences and presentations on scientific advances enabled by community usage of genomic standards.
doi:10.4056/sigs.26218
PMCID: PMC3035207  PMID: 21304642
6.  Meeting Report from the Genomic Standards Consortium (GSC) Workshops 6 and 7 
Standards in Genomic Sciences  2009;1(1):68-71.
This report summarizes the proceedings of the 6th and 7th workshops of the Genomic Standards Consortium (GSC), held back-to-back in 2008. GSC 6 focused on furthering the activities of GSC working groups, whereas GSC 7 focused on outreach to the wider community. GSC 6 was held October 10-14, 2008 at the European Bioinformatics Institute, Cambridge, United Kingdom and included a two-day workshop focused on the refinement of the Genomic Contextual Data Markup Language (GCDML). GSC 7 was held as the opening day of the International Congress on Metagenomics 2008 in San Diego, California. Major achievements of these combined meetings included an agreement from the International Nucleotide Sequence Database Consortium (INSDC) to create a “MIGS” keyword for capturing “Minimum Information about a Genome Sequence” compliant information within INSDC (DDBJ/EMBL/GenBank) records, launch of GCDML 1.0, MIGS compliance of the first set of “Genomic Encyclopedia of Bacteria and Archaea” project genomes, approval of a proposal to extend MIGS to 16S rRNA sequences within a “Minimum Information about an Environmental Sequence”, finalization of plans for the GSC eJournal, “Standards in Genomic Sciences” (SIGS), and the formation of a GSC Board. Subsequently, the GSC has been awarded a Research Co-ordination Network (RCN4GSC) grant from the National Science Foundation, held the first SIGS workshop and launched the journal. The GSC will also be hosting outreach workshops at both ISMB 2009 and PSB 2010 focused on “Metagenomics, Metadata and MetaAnalysis” (M3). Further information about the GSC and its range of activities can be found at http://gensc.org, including videos of all the presentations at GSC 7.
doi:10.4056/sigs.25165
PMCID: PMC3035212  PMID: 21304639
7.  The SILVA and “All-species Living Tree Project (LTP)” taxonomic frameworks 
Nucleic Acids Research  2013;42(D1):D643-D648.
SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive resource for up-to-date quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. SILVA provides a manually curated taxonomy for all three domains of life, based on representative phylogenetic trees for the small- and large-subunit rRNA genes. This article describes the improvements the SILVA taxonomy has undergone in the last 3 years. Specifically, we focus on the curation process, the various resources used for curation and the comparison of the SILVA taxonomy with Greengenes and RDP-II taxonomies. Our comparisons not only revealed a reasonable overlap between the taxa names, but also point to significant differences in both names and numbers of taxa between the three resources.
doi:10.1093/nar/gkt1209
PMCID: PMC3965112  PMID: 24293649
8.  Identification of Proteins Likely To Be Involved in Morphogenesis, Cell Division, and Signal Transduction in Planctomycetes by Comparative Genomics 
Journal of Bacteriology  2012;194(23):6419-6430.
Members of the Planctomycetes clade share many unusual features for bacteria. Their cytoplasm contains membrane-bound compartments, they lack peptidoglycan and FtsZ, they divide by polar budding, and they are capable of endocytosis. Planctomycete genomes have remained enigmatic, generally being quite large (up to 9 Mb), and on average, 55% of their predicted proteins are of unknown function. Importantly, proteins related to the unusual traits of Planctomycetes remain largely unknown. Thus, we embarked on bioinformatic analyses of these genomes in an effort to predict proteins that are likely to be involved in compartmentalization, cell division, and signal transduction. We used three complementary strategies. First, we defined the Planctomycetes core genome and subtracted genes of well-studied model organisms. Second, we analyzed the gene content and synteny of morphogenesis and cell division genes and combined both methods using a “guilt-by-association” approach. Third, we identified signal transduction systems as well as sigma factors. These analyses provide a manageable list of candidate genes for future genetic studies and provide evidence for complex signaling in the Planctomycetes akin to that observed for bacteria with complex life-styles, such as Myxococcus xanthus.
doi:10.1128/JB.01325-12
PMCID: PMC3497475  PMID: 23002222
9.  Ecogenomic Perspectives on Domains of Unknown Function: Correlation-Based Exploration of Marine Metagenomes 
PLoS ONE  2013;8(3):e50869.
Background
The proportion of conserved DNA sequences with no clear function is steadily growing in bioinformatics databases. Studies of sequence and structural homology have indicated that many uncharacterized protein domain sequences are variants of functionally described domains. If these variants promote an organism's ecological fitness, they are likely to be conserved in the genome of its progeny and the population at large. The genetic composition of microbial communities in their native ecosystems is accessible through metagenomics. We hypothesize the co-variation of protein domain sequences across metagenomes from similar ecosystems will provide insights into their potential roles and aid further investigation.
Methodology/Principal findings
We calculated the correlation of Pfam protein domain sequences across the Global Ocean Sampling metagenome collection, employing conservative detection and correlation thresholds to limit results to well-supported hits and associations. We then examined intercorrelations between domains of unknown function (DUFs) and domains involved in known metabolic pathways using network visualization and cluster-detection tools. We used a cautious “guilty-by-association” approach, referencing knowledge-level resources to identify and discuss associations that offer insight into DUF function. We observed numerous DUFs associated with photobiologically active domains and prevalent in the Cyanobacteria. Other clusters included DUFs associated with DNA maintenance and repair, inorganic nutrient metabolism, and sodium-translocating transport domains. We also observed a number of clusters reflecting known metabolic associations and cases that predicted functional reclassification of DUFs.
Conclusion/Significance
Critically examining domain covariation across metagenomic datasets can grant new perspectives on the roles and associations of DUFs in an ecological setting. Targeted attempts at DUF characterization in the laboratory or in silico may draw from these insights, and opportunities to discover new associations and corroborate existing ones will arise as more large-scale metagenomic datasets emerge.
doi:10.1371/journal.pone.0050869
PMCID: PMC3597751  PMID: 23516388
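The correlation-and-threshold step described in this entry can be sketched as follows. The abstract does not state which correlation measure or threshold was used, so the Pearson coefficient, the 0.9 cutoff, the function names, and the toy abundance values below are all assumptions for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long abundance profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-variation signal
    return cov / sqrt(var_x * var_y)


def correlated_pairs(abundance, threshold=0.9):
    """Return (domain_a, domain_b, r) for every pair with r at or above the threshold."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if r >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges


# Hypothetical per-sample abundances (one value per metagenome).
abundance = {
    "DUF_A": [1.0, 2.0, 3.0, 4.0],
    "Photo_domain": [2.0, 4.0, 6.0, 8.0],
    "Repair_domain": [4.0, 1.0, 3.0, 2.0],
}

if __name__ == "__main__":
    print(correlated_pairs(abundance))
```

The resulting edge list is the kind of input a network visualization or cluster-detection tool would take. A conservative threshold keeps only strongly co-varying domain pairs.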
10.  The SILVA ribosomal RNA gene database project: improved data processing and web-based tools 
Nucleic Acids Research  2012;41(D1):D590-D596.
SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive web resource for up to date, quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. The referred database release 111 (July 2012) contains 3 194 778 small subunit and 288 717 large subunit rRNA gene sequences. Since the initial description of the project, substantial new features have been introduced, including advanced quality control procedures, an improved rRNA gene aligner, online tools for probe and primer evaluation and optimized browsing, searching and downloading on the website. Furthermore, the extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches.
doi:10.1093/nar/gks1219
PMCID: PMC3531112  PMID: 23193283
11.  Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies 
Nucleic Acids Research  2012;41(1):e1.
16S ribosomal RNA gene (rDNA) amplicon analysis remains the standard approach for the cultivation-independent investigation of microbial diversity. The accuracy of these analyses depends strongly on the choice of primers. The overall coverage and phylum spectrum of 175 primers and 512 primer pairs were evaluated in silico with respect to the SILVA 16S/18S rDNA non-redundant reference dataset (SSURef 108 NR). Based on this evaluation a selection of ‘best available’ primer pairs for Bacteria and Archaea for three amplicon size classes (100–400, 400–1000, ≥1000 bp) is provided. The most promising bacterial primer pair (S-D-Bact-0341-b-S-17/S-D-Bact-0785-a-A-21), with an amplicon size of 464 bp, was experimentally evaluated by comparing the taxonomic distribution of the 16S rDNA amplicons with 16S rDNA fragments from directly sequenced metagenomes. The results of this study may be used as a guideline for selecting primer pairs with the best overall coverage and phylum spectrum for specific applications, therefore reducing the bias in PCR-based microbial diversity studies.
doi:10.1093/nar/gks808
PMCID: PMC3592464  PMID: 22933715
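An in silico coverage evaluation of a primer pair can be sketched roughly as below. This is only an illustration under strong assumptions: exact matching with no mismatches allowed, hypothetical toy primers and sequences, and function names invented here. The study's actual matching criteria are not given in the abstract.

```python
import re

# IUPAC degenerate nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def reverse_complement(seq):
    """Reverse complement, preserving IUPAC degeneracy."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def pair_coverage(forward, reverse, sequences):
    """Fraction of sequences matched by the forward primer and, downstream, the reverse primer."""
    fwd = primer_regex(forward)
    rev = primer_regex(reverse_complement(reverse))
    hits = 0
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        match = fwd.search(seq)
        if match and rev.search(seq, match.end()):
            hits += 1
    return hits / len(sequences) if sequences else 0.0


# Hypothetical toy data, not taken from the SILVA reference dataset.
toy_sequences = ["ACGTCCGCAA", "ACGACCGCAA", "ACGCCCGCAA", "GCAAACGT"]

if __name__ == "__main__":
    print(pair_coverage("ACGW", "TTGC", toy_sequences))
```

Running the same function per phylum, rather than over the whole reference set, would give the per-phylum spectrum that the abstract refers to.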
12.  Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications 
Yilmaz, Pelin | Kottmann, Renzo | Field, Dawn | Knight, Rob | Cole, James R | Amaral-Zettler, Linda | Gilbert, Jack A | Karsch-Mizrachi, Ilene | Johnston, Anjanette | Cochrane, Guy | Vaughan, Robert | Hunter, Christopher | Park, Joonhong | Morrison, Norman | Rocca-Serra, Philippe | Sterk, Peter | Arumugam, Manimozhiyan | Bailey, Mark | Baumgartner, Laura | Birren, Bruce W | Blaser, Martin J | Bonazzi, Vivien | Booth, Tim | Bork, Peer | Bushman, Frederic D | Buttigieg, Pier Luigi | Chain, Patrick S G | Charlson, Emily | Costello, Elizabeth K | Huot-Creasy, Heather | Dawyndt, Peter | DeSantis, Todd | Fierer, Noah | Fuhrman, Jed A | Gallery, Rachel E | Gevers, Dirk | Gibbs, Richard A | Gil, Inigo San | Gonzalez, Antonio | Gordon, Jeffrey I | Guralnick, Robert | Hankeln, Wolfgang | Highlander, Sarah | Hugenholtz, Philip | Jansson, Janet | Kau, Andrew L | Kelley, Scott T | Kennedy, Jerry | Knights, Dan | Koren, Omry | Kuczynski, Justin | Kyrpides, Nikos | Larsen, Robert | Lauber, Christian L | Legg, Teresa | Ley, Ruth E | Lozupone, Catherine A | Ludwig, Wolfgang | Lyons, Donna | Maguire, Eamonn | Methé, Barbara A | Meyer, Folker | Muegge, Brian | Nakielny, Sara | Nelson, Karen E | Nemergut, Diana | Neufeld, Josh D | Newbold, Lindsay K | Oliver, Anna E | Pace, Norman R | Palanisamy, Giriprakash | Peplies, Jörg | Petrosino, Joseph | Proctor, Lita | Pruesse, Elmar | Quast, Christian | Raes, Jeroen | Ratnasingham, Sujeevan | Ravel, Jacques | Relman, David A | Assunta-Sansone, Susanna | Schloss, Patrick D | Schriml, Lynn | Sinha, Rohini | Smith, Michelle I | Sodergren, Erica | Spor, Aymé | Stombaugh, Jesse | Tiedje, James M | Ward, Doyle V | Weinstock, George M | Wendel, Doug | White, Owen | Whiteley, Andrew | Wilke, Andreas | Wortman, Jennifer R | Yatsunenko, Tanya | Glöckner, Frank Oliver
Nature Biotechnology  2011;29(5):415-420.
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences—the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The ‘environmental packages’ apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere.
doi:10.1038/nbt.1823
PMCID: PMC3367316  PMID: 21552244
13.  Microbial and Chemical Characterization of Underwater Fresh Water Springs in the Dead Sea 
PLoS ONE  2012;7(6):e38319.
Due to its extreme salinity and high Mg concentration, the Dead Sea is characterized by a very low density of cells, most of which are Archaea. We discovered several underwater fresh to brackish water springs in the Dead Sea harboring dense microbial communities. We provide the first characterization of these communities, discuss their possible origin, hydrochemical environment, energetic resources and the putative biogeochemical pathways they are mediating. Pyrosequencing of the 16S rRNA gene and community fingerprinting methods showed that the spring community originates from the Dead Sea sediments and not from the aquifer. Furthermore, it suggested that there is a dense Archaeal community in the shoreline pore water of the lake. Sequences of bacterial sulfate reducers, nitrifiers, iron oxidizers and iron reducers were identified as well. Analysis of white and green biofilms suggested that sulfide oxidation through chemolithotrophy and phototrophy is highly significant. Hyperspectral analysis showed a tight association between abundant green sulfur bacteria and cyanobacteria in the green biofilms. Together, our findings show that the Dead Sea floor harbors diverse microbial communities, part of which is not known from other hypersaline environments. Analysis of the water’s chemistry shows evidence of microbial activity along the path and suggests that the springs supply nitrogen, phosphorus and organic matter to the microbial communities in the Dead Sea. The underwater springs are a newly recognized water source for the Dead Sea. Their input of microorganisms and nutrients needs to be considered in the assessment of possible impact of dilution events of the lake surface waters, such as those that will occur in the future due to the intended establishment of the Red Sea–Dead Sea water conduit.
doi:10.1371/journal.pone.0038319
PMCID: PMC3367964  PMID: 22679498
14.  SINA: Accurate high-throughput multiple sequence alignment of ribosomal RNA genes 
Bioinformatics  2012;28(14):1823-1829.
Motivation: In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements.
Results: In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands.
SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1% accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks.
Availability: Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license.
Contact: epruesse@mpi-bremen.de
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts252
PMCID: PMC3389763  PMID: 22556368
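The k-mer searching mentioned in this entry can be illustrated with a small sketch that ranks reference sequences by the number of k-mers they share with a query. This is not SINA's implementation, and the partial order alignment step is omitted entirely. The function names, the choice of k=4, and the toy sequences are assumptions.

```python
def kmers(seq, k):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def closest_references(query, references, k=4, top=2):
    """Rank reference sequences by the number of distinct k-mers shared with the query."""
    query_kmers = kmers(query, k)
    scored = [
        (len(query_kmers & kmers(ref_seq, k)), name)
        for name, ref_seq in references.items()
    ]
    # Highest shared-k-mer count first; ties broken by name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, shared) for shared, name in scored[:top]]


# Hypothetical toy references, far shorter than real rRNA genes.
toy_references = {
    "ref_a": "ACGTACGTGA",
    "ref_b": "TTTTACGTTT",
    "ref_c": "GGGGGGGGGG",
}

if __name__ == "__main__":
    print(closest_references("ACGTACGTGG", toy_references))
```

In an incremental aligner, a shortlist like this would then feed the actual alignment step, so the expensive computation runs against only a few close references rather than the whole database.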
15.  Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics 
The ISME journal  2010;5(5):918-928.
Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not yet been applied to broad-scale NGS-based metagenomics. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted in classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion.
doi:10.1038/ismej.2010.180
PMCID: PMC3105762  PMID: 21160538
binning; metagenomics; molecular ecology; self-organizing map (SOM); taxonomic classification; TaxSOM
16.  Characterization of Planctomyces limnophilus and Development of Genetic Tools for Its Manipulation Establish It as a Model Species for the Phylum Planctomycetes 
Applied and Environmental Microbiology  2011;77(16):5826-5829.
Planctomycetes represent a remarkable clade in the domain Bacteria because they play crucial roles in global carbon and nitrogen cycles and display cellular structures that closely parallel those of eukaryotic cells. Studies on Planctomycetes have been hampered by the lack of genetic tools, which we developed for Planctomyces limnophilus.
doi:10.1128/AEM.05132-11
PMCID: PMC3165242  PMID: 21724885
17.  Ecogenomics and genome landscapes of marine Pseudoalteromonas phage H105/1 
The ISME journal  2010;5(1):107-121.
Marine phages have an astounding global abundance and ecological impact. However, little knowledge is derived from phage genomes, as most of the open reading frames in their small genomes are unknown, novel proteins. To infer potential functional and ecological relevance of sequenced marine Pseudoalteromonas phage H105/1, two strategies were used. First, similarity searches were extended to include six viral and bacterial metagenomes paired with their respective environmental contextual data. This approach revealed ‘ecogenomic’ patterns of Pseudoalteromonas phage H105/1, such as its estuarine origin. Second, intrinsic genome signatures (phylogenetic, codon adaptation and tetranucleotide (tetra) frequencies) were evaluated on a resolved intra-genomic level to shed light on the evolution of phage functional modules. On the basis of differential codon adaptation of Phage H105/1 proteins to the sequenced Pseudoalteromonas spp., regions of the phage genome with the most ‘host’-adapted proteins also have the strongest bacterial tetra signature, whereas the least ‘host’-adapted proteins have the strongest phage tetra signature. Such a pattern may reflect the evolutionary history of the respective phage proteins and functional modules. Finally, analysis of the structural proteome identified seven proteins that make up the mature virion, four of which were previously unknown. This integrated approach combines both novel and classical strategies and serves as a model to elucidate ecological inferences and evolutionary relationships from phage genomes that typically abound with unknown gene content.
doi:10.1038/ismej.2010.94
PMCID: PMC3105678  PMID: 20613791
ecogenomics; genome signatures; genomics; marine; phage; Pseudoalteromonas
19.  CDinFusion – Submission-Ready, On-Line Integration of Sequence and Contextual Data 
PLoS ONE  2011;6(9):e24797.
State-of-the-art (DNA) sequencing methods applied in “Omics” studies grant insight into the ‘blueprints’ of organisms from all domains of life. Sequencing is carried out around the globe and the data are submitted to the public repositories of the International Nucleotide Sequence Database Collaboration. However, the context in which these studies are conducted often gets lost, because experimental data, as well as information about the environment, are rarely submitted along with the sequence data. If these contextual data or metadata are missing, key opportunities for comparison and analysis across studies and habitats are hampered or even lost. To address this problem, the Genomic Standards Consortium (GSC) promotes checklists and standards to better describe our sequence data collection and to promote the capturing, exchange and integration of sequence data with contextual data. In a recent community effort the GSC has developed a series of recommendations for contextual data that should be submitted along with sequence data. To help the scientific community significantly enhance the quality and quantity of contextual data in the public sequence data repositories, specialized software tools are needed. In this work we present CDinFusion, a web-based tool to integrate contextual and sequence data in (Multi)FASTA format prior to submission. The tool is open source and available under the Lesser GNU Public License 3. A public installation is hosted and maintained at the Max Planck Institute for Marine Microbiology at http://www.megx.net/cdinfusion. The tool may also be installed locally using the open source code available at http://code.google.com/p/cdinfusion.
doi:10.1371/journal.pone.0024797
PMCID: PMC3172294  PMID: 21935468
20.  Quantifying the effect of environment stability on the transcription factor repertoire of marine microbes 
Background
DNA-binding transcription factors (TFs) regulate cellular functions in prokaryotes, often in response to environmental stimuli. Thus, the environment exerts constant selective pressure on the TF gene content of microbial communities. Recently a study on marine Synechococcus strains detected differences in their genomic TF content related to environmental adaptation, but so far the effect of environmental parameters on the content of TFs in bacterial communities has not been systematically investigated.
Results
We quantified the effect of environment stability on the transcription factor repertoire of marine pelagic microbes from the Global Ocean Sampling (GOS) metagenome using interpolated physico-chemical parameters and multivariate statistics. Thirty-five percent of the difference in relative TF abundances between samples could be explained by environment stability. Six percent was attributable to spatial distance but none to a combination of both spatial distance and stability. Some individual TFs showed a stronger relationship to environment stability and space than the total TF pool.
Conclusions
Environmental stability appears to have a clearly detectable effect on TF gene content in bacterioplanktonic communities described by the GOS metagenome. Interpolated environmental parameters were shown to compare well to in situ measurements and were essential for quantifying the effect of the environment on the TF content. It is demonstrated that comprehensive and well-structured contextual data will strongly enhance our ability to interpret the functional potential of microbes from metagenomic data.
doi:10.1186/2042-5783-1-9
PMCID: PMC3372289  PMID: 22587903
transcription factors; ecological metagenomics; interpolated environmental data; multivariate statistics
21.  The Genomic Standards Consortium 
PLoS Biology  2011;9(6):e1001088.
A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe and make available this information that maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities. Here we provide a short history of the GSC, provide an overview of its range of current activities, and make a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.
doi:10.1371/journal.pbio.1001088
PMCID: PMC3119656  PMID: 21713030
22.  Enriching public descriptions of marine phages using the Genomic Standards Consortium MIGS standard 
Standards in Genomic Sciences  2011;4(2):271-285.
In any sequencing project, the possible depth of comparative analysis is determined largely by the amount and quality of the accompanying contextual data. The structure, content, and storage of this contextual data should be standardized to ensure consistent coverage of all sequenced entities and facilitate comparisons. The Genomic Standards Consortium (GSC) has developed the “Minimum Information about Genome/Metagenome Sequences (MIGS/MIMS)” checklist for the description of genomes, and here we annotate all 30 publicly available marine bacteriophage sequences to the MIGS standard. These annotations build on existing International Nucleotide Sequence Database Collaboration (INSDC) records and confirm, as expected, that current submissions lack most MIGS fields. MIGS fields were manually curated from the literature and placed in XML format as specified by the Genomic Contextual Data Markup Language (GCDML). These “machine-readable” reports were then analyzed to highlight patterns describing this collection of genomes. Completed reports are provided in GCDML. This work represents one step towards the annotation of our complete collection of genome sequences and shows the utility of capturing richer metadata along with raw sequences.
doi:10.4056/sigs.621069
PMCID: PMC3111985  PMID: 21677864
marine phages; contextual data; genome standards; markup language
23.  Data shopping in an open marketplace: Introducing the Ontogrator web application for marking up data using ontologies and browsing using facets 
Standards in Genomic Sciences  2011;4(2):286-292.
In the future, we hope to see an open and thriving data market in which users can find and select data from a wide range of data providers. In such an open access market, data are products that must be packaged accordingly. Increasingly, eCommerce sellers present heterogeneous product lines to buyers using faceted browsing. Using this approach we have developed the Ontogrator platform, which allows for rapid retrieval of data in a way that would be familiar to any online shopper. Using Knowledge Organization Systems (KOS), especially ontologies, Ontogrator uses text mining to mark up data and faceted browsing to help users navigate, query and retrieve data. Ontogrator offers the potential to impact scientific research in two major ways: 1) by significantly improving the retrieval of relevant information; and 2) by significantly reducing the time required to compose standard database queries and assemble information for further research. Here we present a pilot implementation developed in collaboration with the Genomic Standards Consortium (GSC) that includes content from the StrainInfo, GOLD, CAMERA, Silva and PubMed databases. This implementation demonstrates the power of ontogration and highlights that the usefulness of this approach is fully dependent on both the quality of data and the KOS (ontologies) used. Ideally, the use and further expansion of this collaborative system will help to surface issues associated with the underlying quality of annotation and could lead to a systematic means for accessing integrated data resources.
doi:10.4056/sigs.1344279
PMCID: PMC3111990  PMID: 21677865
24.  The Earth Microbiome Project: Meeting report of the “1st EMP meeting on sample selection and acquisition” at Argonne National Laboratory October 6th 2010. 
Standards in Genomic Sciences  2010;3(3):249-253.
This report details the outcome of the first meeting of the Earth Microbiome Project to discuss sample selection and acquisition. The meeting, held at the Argonne National Laboratory on Wednesday October 6th 2010, focused on discussion of how to prioritize environmental samples for sequencing and metagenomic analysis as part of the global effort of the EMP to systematically determine the functional and phylogenetic diversity of microbial communities across the world.
doi:10.4056/aigs.1443528
PMCID: PMC3035312  PMID: 21304728
25.  Meeting Report from the Genomic Standards Consortium (GSC) Workshop 9 
Standards in Genomic Sciences  2010;3(3):216-224.
This report summarizes the proceedings of the 9th workshop of the Genomic Standards Consortium (GSC), held at the J. Craig Venter Institute, Rockville, MD, USA. It was the first GSC workshop to have open registration and attracted over 90 participants. This workshop featured sessions that provided overviews of the full range of ongoing GSC projects. It included sessions on Standards in Genomic Sciences, the open access journal of the GSC, building standards for genome annotation, the M5 platform for next-generation collaborative computational infrastructures, building ties with the biodiversity research community and two discussion panels with government and industry participants. Progress was made on all fronts, and major outcomes included the completion of the MIENS specification for publication and the formation of the Biodiversity working group.
doi:10.4056/sigs.1353455
PMCID: PMC3035308  PMID: 21304722
