1.  Toward interoperable bioscience data 
Nature genetics  2012;44(2):121-126.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
PMCID: PMC3428019  PMID: 22281772
2.  MeRy-B: a web knowledgebase for the storage, visualization, analysis and annotation of plant NMR metabolomic profiles 
BMC Plant Biology  2011;11:104.
Improvements in the techniques for metabolomics analyses and growing interest in metabolomic approaches are resulting in the generation of increasing numbers of metabolomic profiles. Platforms are required for profile management, as a function of experimental design, and for metabolite identification, to facilitate the mining of the corresponding data. Various databases have been created, including organism-specific knowledgebases and analytical technique-specific spectral databases. However, there is currently no platform meeting the requirements for both profile management and metabolite identification for nuclear magnetic resonance (NMR) experiments.
MeRy-B, the first platform for plant 1H-NMR metabolomic profiles, is designed (i) to provide a knowledgebase of curated plant profiles and metabolites obtained by NMR, together with the corresponding experimental and analytical metadata, (ii) to support queries and visualization of the data, (iii) to discriminate between profiles with spectrum visualization tools and statistical analysis, and (iv) to facilitate compound identification. It contains lists of plant metabolites and unknown compounds, with information about experimental conditions, the factors studied and metabolite concentrations for several plant species, compiled from more than one thousand annotated NMR profiles for various organs or tissues.
MeRy-B manages all the data generated by NMR-based plant metabolomics experiments, from description of the biological source to identification of the metabolites and determinations of their concentrations. It is the first database allowing the display and overlay of NMR metabolomic profiles selected through queries on data or metadata. MeRy-B is available from
PMCID: PMC3141636  PMID: 21668943
3.  Bioinformatic analysis of ESTs collected by Sanger and pyrosequencing methods for a keystone forest tree species: oak 
BMC Genomics  2010;11:650.
The Fagaceae family comprises about 1,000 woody species worldwide. About half belong to the genus Quercus. These oaks are often a source of raw material for biomass wood and fiber. Pedunculate and sessile oaks are among the most important deciduous forest tree species in Europe. Despite their ecological and economic importance, very few genomic resources have yet been generated for these species. Here, we describe the development of an EST catalogue that will support ecosystem genomics studies, where geneticists, ecophysiologists, molecular biologists and ecologists join their efforts for understanding, monitoring and predicting functional genetic diversity.
We generated 145,827 sequence reads from 20 cDNA libraries using the Sanger method. Unexploitable chromatograms and quality checking led us to eliminate 19,941 sequences. Finally, a total of 125,925 ESTs were retained from 111,361 cDNA clones. Pyrosequencing was also conducted for 14 libraries, generating 1,948,579 reads, from which 370,566 sequences (19.0%) were eliminated, resulting in 1,578,192 sequences. Following clustering and assembly using the TGICL pipeline, 1,704,117 EST sequences collapsed into 69,154 tentative contigs and 153,517 singletons, providing 222,671 non-redundant sequences (including alternative transcripts). We also assembled the sequences using MIRA and PartiGene software and compared the three unigene sets. Gene ontology annotation was then assigned to 29,303 unigene elements. A BLAST search against the SWISS-PROT database revealed putative homologs for 32,810 (14.7%) unigene elements, but a more extensive search with Pfam, Refseq_protein, Refseq_RNA and eight gene indices revealed homology for 67.4% of them. The EST catalogue was examined for putative homologs of candidate genes involved in bud phenology, cuticle formation, phenylpropanoid biosynthesis and cell wall formation. Our results suggest a good coverage of genes involved in these traits. Comparative orthologous sequences (COS) with other plant gene models were identified and allowed us to unravel the oak paleo-history. Simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) were searched, resulting in 52,834 SSRs and 36,411 SNPs. All of these are available through the Oak Contig Browser
This genomic resource provides a unique tool to discover genes of interest, study the oak transcriptome, and develop new markers to investigate functional diversity in natural populations.
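The assembly figures quoted in this abstract can be cross-checked with simple arithmetic. The short sketch below only recomputes totals and percentages from numbers stated above; the variable and function names are illustrative and are not taken from the paper.

```python
# Counts copied from the abstract above (TGICL assembly).
contigs = 69_154
singletons = 153_517

# Non-redundant sequences are the contigs plus the singletons.
unigenes = contigs + singletons


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of unigene elements with a putative SWISS-PROT homolog.
swissprot_pct = pct(32_810, unigenes)

# Share of pyrosequencing reads eliminated during quality filtering.
pyro_removed_pct = pct(370_566, 1_948_579)

print(unigenes, swissprot_pct, pyro_removed_pct)
```

The recomputed values agree with the 222,671 non-redundant sequences, 14.7% and 19.0% reported in the abstract.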
PMCID: PMC3017864  PMID: 21092232
4.  A fast and cost-effective approach to develop and map EST-SSR markers: oak as a case study 
BMC Genomics  2010;11:570.
Expressed Sequence Tags (ESTs) are a source of simple sequence repeats (SSRs) that can be used to develop molecular markers for genetic studies. The availability of ESTs for Quercus robur and Quercus petraea provided a unique opportunity to develop microsatellite markers to accelerate research aimed at studying adaptation of these long-lived species to their environment. As a first step toward the construction of an SSR-based linkage map of oak for quantitative trait locus (QTL) mapping, we describe the mining and survey of EST-SSRs as well as a fast and cost-effective approach (bin mapping) to assign these markers to an approximate map position. We also compared the level of polymorphism between genomic and EST-derived SSRs and addressed the transferability of EST-SSRs to Castanea sativa (chestnut).
A catalogue of 103,000 Sanger ESTs was assembled into 28,024 unigenes, of which 18.6% presented one or more SSR motifs. More than 42% of these SSRs corresponded to trinucleotides. Primer pairs were designed for 748 putative unigenes. Overall 37.7% (283) were found to amplify a single polymorphic locus in a reference full-sib pedigree of Quercus robur. The usefulness of these loci for establishing a genetic map was assessed using a bin mapping approach. Bin maps were constructed for the male and female parental trees, for which framework linkage maps based on AFLP markers were available. The bin set consisted of 14 highly informative offspring selected based on the number and position of crossover sites. The female and male maps comprised 44 and 37 bins, with an average bin length of 16.5 cM and 20.99 cM, respectively. A total of 256 EST-SSRs were assigned to bins and their map position was further validated by linkage mapping. EST-SSRs were found to be less polymorphic than genomic SSRs, but their transferability rate to chestnut, a species phylogenetically related to oak, was higher.
We have generated a bin map for oak comprising 256 EST-SSRs. This resource constitutes a first step toward the establishment of a gene-based map for this genus that will facilitate the dissection of QTLs affecting complex traits of ecological importance.
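To illustrate what mining SSR motifs from EST sequences involves, here is a minimal sketch that scans a sequence for perfect tandem repeats using a regular expression. It is not the tool used in the study: the function name `find_ssrs` and the minimum repeat thresholds in `MIN_REPEATS` are assumptions chosen for illustration only.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of tandem copies.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    found = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # One motif captured, then at least (min_rep - 1) further copies of it.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (for example "AA" or "ATAT").
            if any(
                motif == motif[:d] * (motif_len // d)
                for d in range(1, motif_len)
                if motif_len % d == 0
            ):
                continue
            copies = len(match.group(0)) // motif_len
            found.append((match.start(), motif, copies))
    return sorted(found)


# Toy sequence containing one dinucleotide and one trinucleotide repeat.
example = "GGC" + "AG" * 7 + "TTC" + "CTT" * 5 + "GA"
ssrs = find_ssrs(example)
print(ssrs)
```

In practice, a marker-development workflow would also filter on flanking sequence quality before designing primers, which this sketch does not attempt.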
PMCID: PMC3091719  PMID: 20950475
5.  Life on Arginine for Mycoplasma hominis: Clues from Its Minimal Genome and Comparison with Other Human Urogenital Mycoplasmas 
PLoS Genetics  2009;5(10):e1000677.
Mycoplasma hominis is an opportunistic human mycoplasma. Two other pathogenic human species, M. genitalium and Ureaplasma parvum, reside within the same natural niche as M. hominis: the urogenital tract. These three species have overlapping, but distinct, pathogenic roles. They have minimal genomes and, thus, reduced metabolic capabilities characterized by distinct energy-generating pathways. Analysis of the M. hominis PG21 genome sequence revealed that it is the second smallest genome among self-replicating free living organisms (665,445 bp, 537 coding sequences (CDSs)). Five clusters of genes were predicted to have undergone horizontal gene transfer (HGT) between M. hominis and the phylogenetically distant U. parvum species. We reconstructed M. hominis metabolic pathways from the predicted genes, with particular emphasis on energy-generating pathways. The Embden–Meyerhof–Parnas pathway was incomplete, with a single enzyme absent. We identified the three proteins constituting the arginine dihydrolase pathway. This pathway was found to be essential to promote growth in vivo. The predicted presence of dimethylarginine dimethylaminohydrolase suggested that arginine catabolism is more complex than initially described. This enzyme may have been acquired by HGT from non-mollicute bacteria. Comparison of the three minimal mollicute genomes showed that 247 CDSs were common to all three genomes, whereas 220 CDSs were specific to M. hominis, 172 CDSs were specific to M. genitalium, and 280 CDSs were specific to U. parvum. Within these species-specific genes, two major sets of genes could be identified: one including genes involved in various energy-generating pathways, depending on the energy source used (glucose, urea, or arginine) and another involved in cytadherence and virulence.
Therefore, a minimal mycoplasma cell, not including cytadherence and virulence-related genes, could be envisaged containing a core genome (247 genes), plus a set of genes required for providing energy. For M. hominis, this set would include 247+9 genes, resulting in a theoretical minimal genome of 256 genes.
Author Summary
Mycoplasma hominis, M. genitalium, and Ureaplasma parvum are human pathogenic bacteria that colonize the urogenital tract. They have minimal genomes, and thus have a minimal metabolic capacity. However, they have distinct energy-generating pathways and distinct pathogenic roles. We compared the genomes of these three minimal human pathogens, providing further insight into the composition of hypothetical minimal gene sets needed for life. To this end, we sequenced the whole M. hominis genome and reconstructed its energy-generating pathways from gene predictions. Its unusual major energy-producing pathway through arginine hydrolysis was confirmed by both genome analyses and in vivo assays. Our findings suggest that M. hominis and U. parvum underwent genetic exchange, probably while sharing a common host. We proposed a set of genes likely to represent a minimal genome. For M. hominis, this minimal genome, not including cytadherence and virulence-related genes, can be defined as comprising the 247 genes shared by the three minimal genital mollicutes, combined with a set of nine genes needed for energy production for cell metabolism. This study provides insight for the synthesis of artificial genomes.
PMCID: PMC2751442  PMID: 19816563
6.  Oenococcus oeni Genome Plasticity Is Associated with Fitness 
Oenococcus oeni strains are well-known for their considerable phenotypic variations in terms of tolerance to harsh wine conditions and malolactic activity. Genomic subtractive hybridization (SH) between two isolates with differing enological potentials was used to elucidate the genetic bases of this intraspecies diversity and identify novel genes involved in adaptation to wine. SH revealed 182 tester-specific fragments corresponding to 126 open reading frames (ORFs). A large proportion of the chromosome-related ORFs resembled genes involved in carbohydrate transport and metabolism, cell wall/membrane/envelope biogenesis, and replication, recombination, and repair. Six regions of genomic plasticity were identified, and their analysis suggested that both limited recombination and insertion/deletion events contributed to the vast genomic diversity observed in O. oeni. The association of selected sequences with adaptation to wine was further assessed by screening a large collection of strains using PCR. No sequences were found to be specific to highly performing (HP) strains alone. However, there was a statistically significant positive association between HP strains and the presence of eight gene sequences located on regions 2, 4, and 5. Gene expression patterns were significantly modified in HP strains, following exposure to one or more of the common stresses in wines. Regions 2 and 5 showed no traces of mobile elements and had normal GC content. In contrast, region 4 had the typical hallmarks of horizontal transfer, suggesting that the strategy of acquiring genes from other bacteria enhances the fitness of O. oeni strains.
PMCID: PMC2663225  PMID: 19218413
7.  Observing metabolic functions at the genome scale 
Genome Biology  2007;8(6):R123.
A modular approach is presented that allows the observation of the transcriptional activity of metabolic functions at the genome scale.
High-throughput techniques have multiplied the amount and the types of available biological data, and a global comprehension of the physiology of biological cells has, for the first time, become an achievable goal. This aim requires the integration of large amounts of heterogeneous data at different scales. In particular, the traditional focus on genomic data needs to be extended towards a truly functional focus, where the activity of cells is described in terms of the actual metabolic processes that perform the functions necessary for cells to live.
In this work, we present a new approach for metabolic analysis that allows us to observe the transcriptional activity of metabolic functions at the genome scale. These functions are described in terms of elementary modes, which can be computed in a genome-scale model thanks to a modular approach. We exemplify this new perspective by presenting a detailed analysis of the transcriptional metabolic response of yeast cells to stress. The integration of elementary mode analysis with gene expression data allows us to identify a number of functionally induced or repressed metabolic processes in different stress conditions. The assembly of these elementary modes leads to the identification of specific metabolic backbones.
This study opens a new framework for the cell-scale analysis of metabolism, where transcriptional activity can be analyzed in terms of whole processes instead of individual genes. We furthermore show that the set of active elementary modes exhibits a highly uneven organization, where most of them conduct specialized tasks while a smaller proportion performs multi-task functions and dominates the general stress response.
PMCID: PMC2394767  PMID: 17594483
8.  Large-scale identification of human genes implicated in epidermal barrier function 
Genome Biology  2007;8(6):R107.
Identification of genes expressed in epidermal granular keratinocytes by ORESTES, including a number that are highly specific for these cells.
During epidermal differentiation, keratinocytes progressing through the suprabasal layers undergo complex and tightly regulated biochemical modifications leading to cornification and desquamation. The last living cells, the granular keratinocytes (GKs), produce almost all of the proteins and lipids required for the protective barrier function before their programmed cell death gives rise to corneocytes. We present here the first analysis of the transcriptome of human GKs, purified from healthy epidermis by an original approach.
Using the ORESTES method, 22,585 expressed sequence tags (ESTs) were produced that matched 3,387 genes. Despite normalization provided by this method (mean 4.6 ORESTES per gene), some highly transcribed genes, including that encoding dermokine, were overrepresented. About 330 expressed genes displayed fewer than 100 ESTs in UniGene clusters and are most likely to be specific for GKs and potentially involved in barrier function. This hypothesis was tested by comparing the relative expression of 73 genes in the basal and granular layers of epidermis by quantitative RT-PCR. Among these, 33 were identified as new, highly specific markers of GKs, including those encoding a protease, protease inhibitors and proteins involved in lipid metabolism and transport. We identified filaggrin 2 (also called ifapsoriasin), a poorly characterized member of the epidermal differentiation complex, as well as three new lipase genes clustered with paralogous genes on chromosome 10q23.31. A new gene of unknown function, C1orf81, is specifically disrupted in the human genome by a frameshift mutation.
These data increase the present knowledge of genes responsible for the formation of the skin barrier and suggest new candidates for genodermatoses of unknown origin.
PMCID: PMC2394760  PMID: 17562024
9.  Being Pathogenic, Plastic, and Sexual while Living with a Nearly Minimal Bacterial Genome 
PLoS Genetics  2007;3(5):e75.
Mycoplasmas are commonly described as the simplest self-replicating organisms, whose evolution was mainly characterized by genome downsizing with a proposed evolutionary scenario similar to that of obligate intracellular bacteria such as insect endosymbionts. Thus far, analysis of mycoplasma genomes indicates a low level of horizontal gene transfer (HGT), implying that DNA acquisition is strongly limited in these minimal bacteria. In this study, the genome of the ruminant pathogen Mycoplasma agalactiae was sequenced. Comparative genomic data and phylogenetic tree reconstruction revealed that ∼18% of its small genome (877,438 bp) has undergone HGT with the phylogenetically distinct mycoides cluster, which is composed of significant ruminant pathogens. HGT involves genes often found as clusters, several of which encode lipoproteins that usually play an important role in mycoplasma–host interaction. A decayed form of a conjugative element also described in a member of the mycoides cluster was found in the M. agalactiae genome, suggesting that HGT may have occurred by mobilizing a related genetic element. The possibility of HGT events among other mycoplasmas was evaluated with the available sequenced genomes. Our data indicate marginal levels of HGT among Mycoplasma species except for those described above and, to a lesser extent, for those observed between the two bird pathogens, M. gallisepticum and M. synoviae. This first description of large-scale HGT among mycoplasmas sharing the same ecological niche challenges the generally accepted evolutionary scenario in which gene loss is the main driving force of mycoplasma evolution. The latter clearly differs from that of other bacteria with small genomes, particularly obligate intracellular bacteria that are isolated within host cells.
Consequently, mycoplasmas are not only able to subvert complex hosts but presumably have retained sexual competence, a trait that may prevent them from genome stasis and contribute to adaptation to new hosts.
Author Summary
Mycoplasmas are cell wall–lacking prokaryotes that evolved from ancestors common to Gram-positive bacteria by way of massive losses of genetic material. With their minimal genome, mycoplasmas are considered to be the simplest free-living organisms, yet several species are successful pathogens of humans and animals. In this study, we challenged the commonly accepted view in which mycoplasma evolution is driven only by genome down-sizing. Indeed, we showed that a significant number of genes underwent horizontal transfer among different mycoplasma species that share the same ruminant hosts. In these species, the occurrence of a genetic element that can promote DNA transfer via cell-to-cell contact suggests that some mycoplasmas may have retained or acquired sexual competence. Transferred genes were found to encode proteins that are likely to be associated with mycoplasma–host interactions. Sharing genetic resources via horizontal gene transfer may provide mycoplasmas with a means for adapting to new niches or to new hosts and for avoiding irreversible genome erosion.
PMCID: PMC1868952  PMID: 17511520
10.  New strategy for the representation and the integration of biomolecular knowledge at a cellular scale 
Nucleic Acids Research  2004;32(12):3581-3589.
The combination of sequencing and post-sequencing experimental approaches produces huge collections of data that are highly heterogeneous both in structure and in semantics. We propose a new strategy for the integration of such data. This strategy uses structured sets of sequences as a unified representation of biological information and defines a probabilistic measure of similarity between the sets. Sets can be composed of sequences that are known to have a biological relationship (e.g. proteins involved in a complex or a pathway) or that share similar values for a particular attribute (e.g. expression profile). We have developed a software tool, BlastSets, which implements this strategy. It exploits a database where the sets derived from diverse biological information can be deposited using a standard XML format. For a given query set, BlastSets returns target sets found in the database whose similarity to the query is statistically significant. The tool allowed us to automatically identify verified relationships between correlated expression profiles and biological pathways using publicly available data for Saccharomyces cerevisiae. It was also used to retrieve the members of a complex (ribosome) based on the mining of expression profiles. These first results validate the relevance of the strategy and demonstrate the promising potential of BlastSets.
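The abstract does not spell out the probabilistic similarity measure. As a rough illustration of the general idea, the sketch below scores the overlap between a query set and each target set with a hypergeometric tail probability, which is one common choice for this kind of comparison and is assumed here rather than taken from the paper. The function names, the `alpha` cutoff and the toy sets are all hypothetical.

```python
from math import comb


def hypergeom_tail(universe_size, query_size, target_size, overlap):
    """P(X >= overlap) for a query of query_size items drawn at random
    from a universe in which target_size items belong to the target set."""
    total = comb(universe_size, query_size)
    upper = min(query_size, target_size)
    favourable = sum(
        comb(target_size, k) * comb(universe_size - target_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


def rank_targets(query, targets, universe_size, alpha=0.05):
    """Return (name, overlap, p_value) for targets with p_value <= alpha,
    most significant first. alpha is an illustrative cutoff."""
    hits = []
    for name, members in targets.items():
        overlap = len(query & members)
        p_value = hypergeom_tail(universe_size, len(query), len(members), overlap)
        if p_value <= alpha:
            hits.append((name, overlap, p_value))
    return sorted(hits, key=lambda hit: hit[2])


# Toy example: a universe of 10 sequences and two target sets.
query = {"a", "b", "c"}
targets = {
    "pathway_1": {"a", "b", "c", "d"},
    "pathway_2": {"w", "x", "y", "z"},
}
hits = rank_targets(query, targets, universe_size=10)
print(hits)
```

In this toy case only `pathway_1` is returned, since all three query members fall in a target of four out of ten sequences, while `pathway_2` shares nothing with the query. A real search over many target sets would also need a correction for multiple testing.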
PMCID: PMC484170  PMID: 15240831
11.  MolliGen, a database dedicated to the comparative genomics of Mollicutes 
Nucleic Acids Research  2004;32(Database issue):D307-D310.
Bacteria belonging to the class Mollicutes were among the first ones to be selected for complete genome sequencing because of the minimal size of their genomes and their pathogenicity for humans and a broad range of animals and plants. At this time six genome sequences have been publicly released (Mycoplasma genitalium, Mycoplasma pneumoniae, Ureaplasma urealyticum-parvum, Mycoplasma pulmonis, Mycoplasma penetrans and Mycoplasma gallisepticum) and as the number of available mollicute genomes increases, comparative genomics analysis within this model group of organisms becomes more and more instructive. However, such an analysis is difficult to carry out without a suitable platform gathering not only the original annotations but also relevant information available in public databases or obtained by applying common bioinformatics methods. With the aim of solving these difficulties, we have developed a web-accessible database named MolliGen. After selecting a set of genomes the user can launch various types of search based on annotation, position on the chromosomes or sequence similarity. In addition, relationships of putative orthology have been precomputed to allow differential genome queries. The results are presented in table format with multiple links to public databases and to bioinformatic analyses such as multiple alignments or BLAST search. Specific tools were also developed for the graphical visualization of the results, including a multi-genome browser for displaying dynamic pictures with clickable objects and for viewing relationships of precomputed similarity. MolliGen is designed to integrate all the complete genomes of mollicutes as they become available.
PMCID: PMC308848  PMID: 14681420
13.  RIBDB: An SRS Based Infrastructure for REALIS 
The REALIS project is an EU-funded consortium for the post-genomic analysis of the food pathogen Listeria monocytogenes. The data generated by the consortium members are stored in the RIBDB database, a system built using SRS which integrates consortium data, public databases, and applications for analysis. RIBDB is available to all consortium members through a web server, with the option of installing a local mirror of the main server for local analysis.
PMCID: PMC2447238  PMID: 18628878

