1.  Sequence based analysis of U-2973, a cell line established from a double-hit B-cell lymphoma with concurrent MYC and BCL2 rearrangements 
BMC Research Notes  2012;5:648.
Double-hit lymphoma is a complex and highly aggressive sub-type of B-cell lymphoma, which has recently been classified and is an area of active research interest due to the poor prognosis for patients with this disease. It is characterized by the presence of both an activating MYC chromosomal translocation and a simultaneous additional oncogenic translocation, often of the BCL2 gene. Recently, a cell line was established from a patient with this complex lymphoma and analyzed using conventional tools, revealing that it contains both MYC and BCL2 translocation events.
In this work, we reanalyzed the genome of the cell line using next generation whole genome sequencing technology in order to catalogue translocations, insertions and deletions which may contribute to the pathology of this lymphoma type.
We describe the cell line in much greater detail, and pinpoint the exact locations of the chromosomal breakpoints. We also find several rearrangements within cancer-associated genes, which were not found using conventional tools, suggesting that high throughput sequencing may reveal novel targets for therapy, which could be used concurrently with existing treatments.
PMCID: PMC3534606  PMID: 23171647
Double hit lymphoma; Sequencing; Chromosomal rearrangements
2.  Medusa: A tool for exploring and clustering biological networks 
BMC Research Notes  2011;4:384.
Biological processes such as metabolic pathways, gene regulation or protein-protein interactions are often represented as graphs in systems biology. Understanding, analyzing and visualizing such networks are important challenges in the life sciences today. While a great variety of visualization tools that try to address most of these challenges already exist, only a few of them succeed in bridging the gap between visualization and network analysis.
Medusa is a powerful tool for visualization and clustering analysis of large-scale biological networks. It is highly interactive and it supports weighted and unweighted multi-edged directed and undirected graphs. It combines a variety of layouts and clustering methods for comprehensive views and advanced data analysis. Its main purpose is to integrate visualization and analysis of heterogeneous data from different sources into a single network.
Medusa provides a concise visual tool, which is helpful for network analysis and interpretation. Medusa is offered both as a standalone application and as an applet written in Java. It can be found at:
PMCID: PMC3197509  PMID: 21978489
graph; visualization; biological networks; clustering analysis; data integration
3.  Gene rearrangements in hormone receptor negative breast cancers revealed by mate pair sequencing 
BMC Genomics  2013;14:165.
Chromosomal rearrangements in the form of deletions, insertions, inversions and translocations are frequently observed in breast cancer genomes, and a subset of these rearrangements may play a crucial role in tumorigenesis. To identify novel somatic chromosomal rearrangements, we determined the genome structures of 15 hormone-receptor negative breast tumors by long-insert mate pair massively parallel sequencing.
We identified and validated 40 somatic structural alterations, including the recurring fusion between genes DDX10 and SKA3 and translocations involving the EPHA5 gene. Other rearrangements were found to affect genes in pathways involved in epigenetic regulation, mitosis and signal transduction, underscoring their potential role in breast tumorigenesis. RNA interference-mediated suppression of five candidate genes (DDX10, SKA3, EPHA5, CLTC and TNIK) led to inhibition of breast cancer cell growth. Moreover, downregulation of DDX10 in breast cancer cells led to an increased frequency of apoptotic nuclear morphology.
Using whole genome mate pair sequencing and RNA interference assays, we have discovered a number of novel gene rearrangements in breast cancer genomes and identified DDX10, SKA3, EPHA5, CLTC and TNIK as potential cancer genes with impact on the growth and proliferation of breast cancer cells.
PMCID: PMC3600027  PMID: 23496902
4.  Genome-wide sequencing for the identification of rearrangements associated with Tourette syndrome and obsessive-compulsive disorder 
BMC Medical Genetics  2012;13:123.
Tourette Syndrome (TS) is a neuropsychiatric disorder in children characterized by motor and verbal tics. Although several genes have been suggested in the etiology of TS, the genetic mechanisms remain poorly understood.
Using cytogenetics and FISH analysis, we identified an apparently balanced t(6;22)(q16.2;p13) in a male patient with TS and obsessive-compulsive disorder (OCD). In order to map the breakpoints and to identify additional submicroscopic rearrangements, we performed whole genome mate-pair sequencing and CGH-array analysis on DNA from the proband.
Sequence and CGH array analysis revealed a 400 kb deletion located 1.3 Mb telomeric of the chromosome 6q breakpoint, which has not been reported in controls. The deletion affects three genes (GPR63, NDUFA4 and KLHL32) and overlaps a region previously found deleted in a girl with autistic features and speech delay. The proband’s mother, also a carrier of the translocation, was diagnosed with OCD and shares the deletion. We also describe a further potentially related rearrangement which, while unmapped in Homo sapiens, was consistent with the chimpanzee genome.
We conclude that genome-wide sequencing at relatively low resolution can be used for the identification of submicroscopic rearrangements. We also show that large rearrangements may escape detection using standard analysis of whole genome sequencing data. Our findings further provide a candidate region for TS and OCD on chromosome 6q16.
PMCID: PMC3556158  PMID: 23253088
Tourette syndrome; Paired end sequencing; Chromosomal translocation; Structural variations
5.  Novel Insights into the Diversity of Catabolic Metabolism from Ten Haloarchaeal Genomes 
PLoS ONE  2011;6(5):e20237.
The extremely halophilic archaea are present worldwide in saline environments and have important biotechnological applications. Ten complete genomes of haloarchaea are now available, providing an opportunity for comparative analysis.
Methodology/Principal Findings
We report here the comparative analysis of five newly sequenced haloarchaeal genomes with five previously published ones. Whole genome trees based on protein sequences provide strong support for deep relationships between the ten organisms. Using a soft clustering approach, we identified 887 protein clusters present in all halophiles. Of these core clusters, 112 are not found in any other archaea and therefore constitute the haloarchaeal signature. Four of the halophiles were isolated from water, and four were isolated from soil or sediment. Although there are few habitat-specific clusters, the soil/sediment halophiles tend to have greater capacity for polysaccharide degradation, siderophore synthesis, and cell wall modification. Halorhabdus utahensis and Haloterrigena turkmenica encode over forty glycosyl hydrolases each, and may be capable of breaking down naturally occurring complex carbohydrates. H. utahensis is specialized for growth on carbohydrates and has few amino acid degradation pathways. It uses the non-oxidative pentose phosphate pathway instead of the oxidative pathway, giving it more flexibility in the metabolism of pentoses.
These new genomes expand our understanding of haloarchaeal catabolic pathways, providing a basis for further experimental analysis, especially with regard to carbohydrate metabolism. Halophilic glycosyl hydrolases for use in biofuel production are more likely to be found in halophiles isolated from soil or sediment.
PMCID: PMC3102087  PMID: 21633497
6.  A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea 
Nature  2009;462(7276):1056-1060.
Sequencing of bacterial and archaeal genomes has revolutionized our understanding of the many roles played by microorganisms [1]. There are now nearly 1,000 completed bacterial and archaeal genomes available [2], most of which were chosen for sequencing on the basis of their physiology. As a result, the perspective provided by the currently available genomes is limited by a highly biased phylogenetic distribution [3–5]. To explore the value added by choosing microbial genomes for sequencing on the basis of their evolutionary relationships, we have sequenced and analysed the genomes of 56 culturable species of Bacteria and Archaea selected to maximize phylogenetic coverage. Analysis of these genomes demonstrated pronounced benefits (compared to an equivalent set of genomes randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of new protein families and biological properties, and the prediction of functions for known genes from other organisms. Our results strongly support the need for systematic ‘phylogenomic’ efforts to compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come.
PMCID: PMC3073058  PMID: 20033048
7.  The Complete Multipartite Genome Sequence of Cupriavidus necator JMP134, a Versatile Pollutant Degrader 
PLoS ONE  2010;5(3):e9729.
Cupriavidus necator JMP134 is a Gram-negative β-proteobacterium able to grow on a variety of aromatic and chloroaromatic compounds as its sole carbon and energy source.
Methodology/Principal Findings
Its genome consists of four replicons (two chromosomes and two plasmids) containing a total of 6631 protein coding genes. Comparative analysis identified 1910 core genes common to the four genomes compared (C. necator JMP134, C. necator H16, C. metallidurans CH34, R. solanacearum GMI1000). Although secondary chromosomes found in the Cupriavidus, Ralstonia, and Burkholderia lineages are all derived from plasmids, analyses of the plasmid partition proteins located on those chromosomes indicate that different plasmids gave rise to the secondary chromosomes in each lineage. The C. necator JMP134 genome contains 300 genes putatively involved in the catabolism of aromatic compounds and encodes most of the central ring-cleavage pathways. This strain also shows additional metabolic capabilities towards alicyclic compounds and the potential for catabolism of almost all proteinogenic amino acids. This remarkable catabolic potential seems to be sustained by a high degree of genetic redundancy, most probably providing this catabolically versatile bacterium with the different levels of metabolic response and alternative regulation necessary to cope with a challenging environment. The comparison of Cupriavidus genomes indicates that a broad metabolic capability is a general trait of the genus Cupriavidus; however, specialization towards a nutritional niche (xenobiotics degradation, chemolithoautotrophy or symbiotic nitrogen fixation) seems to be shaped mostly by the acquisition of “specialized” plasmids.
The availability of the complete genome sequence for C. necator JMP134 provides the groundwork for further elucidation of the mechanisms and regulation of chloroaromatic compound biodegradation.
PMCID: PMC2842291  PMID: 20339589
8.  Estimating DNA coverage and abundance in metagenomes using a gamma approximation 
Bioinformatics  2009;26(3):295-301.
Motivation: Shotgun sequencing generates large numbers of short DNA reads from either an isolated organism or, in the case of metagenomics projects, from the aggregate genome of a microbial community. These reads are then assembled based on overlapping sequences into larger, contiguous sequences (contigs). The feasibility of assembly and the coverage achieved (reads per nucleotide or distinct sequence of nucleotides) depend on several factors: the number of reads sequenced, the read length and the relative abundances of their source genomes in the microbial community. A low coverage suggests that most of the genomic DNA in the sample has not been sequenced, but it is often difficult to estimate either the extent of the uncaptured diversity or the amount of additional sequencing that would be most efficacious. In this work, we regard a metagenome as a population of DNA fragments (bins), each of which may be covered by one or more reads. We employ a gamma distribution to model this bin population due to its flexibility and ease of use. When a gamma approximation can be found that adequately fits the data, we may estimate the number of bins that were not sequenced and that could potentially be revealed by additional sequencing. We evaluated the performance of this model using simulated metagenomes and demonstrate its applicability on three recent metagenomic datasets.
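The idea of estimating unsequenced bins from a gamma model can be illustrated with a minimal sketch. It is not the authors' fitting procedure: it assumes, purely for illustration, that the expected number of reads per bin follows a gamma distribution with known shape k and scale θ, and that the reads landing on a bin are Poisson distributed given that expectation. Under those assumptions the probability that a bin receives no reads is (1 + θ)^(-k), and the total number of bins follows from the number observed. The function names and parameters are hypothetical.

```python
def unsequenced_fraction(shape: float, scale: float) -> float:
    """Probability that a bin receives zero reads.

    Assumption (not from the paper): per-bin expected read counts are
    Gamma(shape, scale) and reads are Poisson given that expectation,
    so P(0 reads) = E[exp(-lambda)] = (1 + scale) ** (-shape).
    """
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    return (1.0 + scale) ** (-shape)


def estimate_total_bins(observed_bins: int, shape: float, scale: float) -> float:
    """Estimate the total number of bins (sequenced plus unsequenced).

    observed_bins is the number of bins covered by at least one read.
    """
    p_zero = unsequenced_fraction(shape, scale)
    return observed_bins / (1.0 - p_zero)


if __name__ == "__main__":
    total = estimate_total_bins(observed_bins=100, shape=1.0, scale=1.0)
    print(f"estimated total bins: {total:.1f}")
    print(f"estimated unsequenced bins: {total - 100:.1f}")
```

In practice the shape and scale would have to be fitted to the observed coverage distribution, which this sketch leaves out.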
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2815663  PMID: 20008478
9.  Gene Context Analysis in the Integrated Microbial Genomes (IMG) Data Management System 
PLoS ONE  2009;4(11):e7979.
Computational methods for determining the function of genes in newly sequenced genomes have traditionally been based on sequence similarity to genes whose function has been identified experimentally. Function prediction methods can be extended using gene context analysis approaches such as examining the conservation of chromosomal gene clusters, gene fusion events and co-occurrence profiles across genomes. Context analysis is based on the observation that functionally related genes often have similar gene context, and it relies on the identification of such events across a phylogenetically diverse collection of genomes. We have used the data management system of the Integrated Microbial Genomes (IMG) as the framework to implement and explore the power of gene context analysis methods because it provides one of the largest available genome integrations. Visualization and search tools to facilitate gene context analysis have been developed and applied across all publicly available archaeal and bacterial genomes in IMG. These computations are now maintained as part of IMG's regular genome content update cycle. IMG is available at:
PMCID: PMC2776528  PMID: 19956731
10.  Genomic Characterization of Methanomicrobiales Reveals Three Classes of Methanogens 
PLoS ONE  2009;4(6):e5797.
Methanomicrobiales is the least studied order of methanogens. While these organisms appear to be more closely related to the Methanosarcinales in ribosomal-based phylogenetic analyses, they are metabolically more similar to Class I methanogens.
Methodology/Principal Findings
In order to improve our understanding of this lineage, we have completely sequenced the genomes of two members of this order, Methanocorpusculum labreanum Z and Methanoculleus marisnigri JR1, and compared them with the genome of a third, Methanospirillum hungatei JF-1. Similar to Class I methanogens, Methanomicrobiales use a partial reductive citric acid cycle for 2-oxoglutarate biosynthesis, and they have the Eha energy-converting hydrogenase. In common with Methanosarcinales, Methanomicrobiales possess the Ech hydrogenase and at least some of them may couple formylmethanofuran formation and heterodisulfide reduction to transmembrane ion gradients. Uniquely, M. labreanum and M. hungatei contain hydrogenases similar to the Pyrococcus furiosus Mbh hydrogenase, and all three Methanomicrobiales have anti-sigma factor and anti-anti-sigma factor regulatory proteins not found in other methanogens. Phylogenetic analysis based on seven core proteins of methanogenesis and cofactor biosynthesis places the Methanomicrobiales equidistant from Class I methanogens and Methanosarcinales.
Our results indicate that Methanomicrobiales, rather than being similar to Class I methanogens or Methanosarcinales, share some features of both and have some unique properties. We find that there are three distinct classes of methanogens: the Class I methanogens, the Methanomicrobiales (Class II), and the Methanosarcinales (Class III).
PMCID: PMC2686161  PMID: 19495416
11.  jClust: a clustering and visualization toolbox 
Bioinformatics  2009;25(15):1994-1996.
jClust is a user-friendly application which provides access to a set of widely used clustering and clique finding algorithms. The toolbox allows a range of filtering procedures to be applied and is combined with an advanced implementation of the Medusa interactive visualization module. These implemented algorithms are k-Means, Affinity propagation, Bron–Kerbosch, MULIC, Restricted neighborhood search cluster algorithm, Markov clustering and Spectral clustering, while the supported filtering procedures are haircut, outside–inside, best neighbors and density control operations. The combination of a simple input file format, a set of clustering and filtering algorithms linked together with the visualization tool provides a powerful tool for data analysis and information extraction.
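As an illustration of one of the listed clique-finding methods, the following is a minimal sketch of the Bron–Kerbosch algorithm with pivoting. It is a textbook version, not jClust's implementation, and the adjacency-dictionary input format is an assumption made for the example.

```python
def bron_kerbosch(adjacency):
    """Return all maximal cliques of an undirected graph.

    adjacency maps each node to the set of its neighbours
    (no self-loops; every edge listed in both directions).
    """
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when nothing can extend it.
        if not candidates and not excluded:
            cliques.append(current)
            return
        # Pivot on the node with most neighbours among the candidates
        # to skip branches that cannot produce a new maximal clique.
        pivot = max(candidates | excluded,
                    key=lambda v: len(adjacency[v] & candidates))
        for node in list(candidates - adjacency[pivot]):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


if __name__ == "__main__":
    graph = {
        "a": {"b", "c"},
        "b": {"a", "c"},
        "c": {"a", "b", "d"},
        "d": {"c"},
    }
    for clique in bron_kerbosch(graph):
        print(sorted(clique))
```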
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2712340  PMID: 19454618
12.  Microbial co-habitation and lateral gene transfer: what transposases can tell us 
Genome Biology  2009;10(4):R45.
Interactions between microbial communities are revealed using a network of lateral gene transfer events.
Determining the habitat range for various microbes is not a simple, straightforward matter, as habitats interlace, microbes move between habitats, and microbial communities change over time. In this study, we explore an approach using the history of lateral gene transfer recorded in microbial genomes to begin to answer two key questions: where have you been and who have you been with?
All currently sequenced microbial genomes were surveyed to identify pairs of taxa that share a transposase that is likely to have been acquired through lateral gene transfer. A microbial interaction network including almost 800 organisms was then derived from these connections. Although the majority of the connections are between closely related organisms with the same or overlapping habitat assignments, numerous examples were found of cross-habitat and cross-phylum connections.
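The step of deriving a network from shared transposases can be sketched as follows. This is a simplified illustration only: it assumes the transposases likely to have been laterally transferred have already been identified, and it takes a hypothetical mapping from transposase identifiers to the taxa carrying them. The study's actual detection criteria are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def build_interaction_network(shared_transposases):
    """Connect every pair of taxa that carry the same transposase.

    shared_transposases maps a transposase identifier (hypothetical
    labels here) to the set of taxa in which it was found.
    Returns a dict mapping each taxon to the set of connected taxa.
    """
    network = defaultdict(set)
    for taxa in shared_transposases.values():
        for first, second in combinations(sorted(taxa), 2):
            network[first].add(second)
            network[second].add(first)
    return dict(network)


if __name__ == "__main__":
    example = {
        "tnp1": {"taxon_A", "taxon_B"},
        "tnp2": {"taxon_B", "taxon_C", "taxon_D"},
    }
    for taxon, partners in sorted(build_interaction_network(example).items()):
        print(taxon, "->", sorted(partners))
```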
We present a large-scale study of the distributions of transposases across phylogeny and habitat, and find a significant correlation between habitat and transposase connections. We observed cases where phylogenetic boundaries are traversed, especially when organisms share habitats; this suggests that the potential exists for genetic material to move laterally between diverse groups via bridging connections. The results presented here also suggest that the complex dynamics of microbial ecology may be traceable in the microbial genomes.
PMCID: PMC2688936  PMID: 19393086
13.  OnTheFly: a tool for automated document-based text annotation, data linking and network generation 
Bioinformatics  2009;25(7):977-978.
OnTheFly is a web-based application that applies biological named entity recognition to enrich Microsoft Office, PDF and plain text documents. The input files are converted into the HTML format and then sent to the Reflect tagging server, which highlights biological entity names like genes, proteins and chemicals, and attaches to them JavaScript code to invoke a summary pop-up window. The window provides an overview of relevant information about the entity, such as a protein description, the domain composition, a link to the 3D structure and links to other relevant online resources. OnTheFly is also able to extract the bioentities mentioned in a set of files and to produce a graphical representation of the networks of the known and predicted associations of these entities by retrieving the information from the STITCH database.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2660876  PMID: 19223449
14.  Integration of phenotypic metadata and protein similarity in Archaea using a spectral bipartitioning approach 
Nucleic Acids Research  2009;37(7):2096-2104.
In order to simplify and meaningfully categorize large sets of protein sequence data, it is commonplace to cluster proteins based on the similarity of those sequences. However, it quickly becomes clear that the sequence flexibility allowed a given protein varies significantly among different protein families. The degree to which sequences are conserved not only differs for each protein family, but also is affected by the phylogenetic divergence of the source organisms. Clustering techniques that use similarity thresholds for protein families do not always allow for these variations and thus cannot be confidently used for applications such as automated annotation and phylogenetic profiling. In this work, we applied a spectral bipartitioning technique to all proteins from 53 archaeal genomes. Comparisons between different taxonomic levels allowed us to study the effects of phylogenetic distances on cluster structure. Likewise, by associating functional annotations and phenotypic metadata with each protein, we could compare our protein similarity clusters with both protein function and associated phenotype. Our clusters can be analyzed graphically and interactively online.
PMCID: PMC2673424  PMID: 19223325
15.  Genome Analysis of the Anaerobic Thermohalophilic Bacterium Halothermothrix orenii 
PLoS ONE  2009;4(1):e4192.
Halothermothrix orenii is a strictly anaerobic thermohalophilic bacterium isolated from sediment of a Tunisian salt lake. It belongs to the order Halanaerobiales in the phylum Firmicutes. The complete sequence revealed that the genome consists of one circular chromosome of 2,578,146 bp encoding 2451 predicted genes. This is the first genome sequence of an organism belonging to the Halanaerobiales. Features of both Gram positive and Gram negative bacteria were identified, the most prominent being the presence of both a sporulating mechanism typical of Firmicutes and a characteristic Gram negative lipopolysaccharide. Protein sequence analyses and metabolic reconstruction reveal a unique combination of strategies for thermophilic and halophilic adaptation. H. orenii can serve as a model organism for the study of the evolution of the Gram negative phenotype as well as the adaptation under thermohalophilic conditions and the development of biotechnological applications under conditions that require high temperatures and high salt concentrations.
PMCID: PMC2626281  PMID: 19145256
16.  A Molecular Study of Microbe Transfer between Distant Environments 
PLoS ONE  2008;3(7):e2607.
Environments and their organic content are generally not static and isolated, but in a constant state of exchange and interaction with each other. Through physical or biological processes, organisms, especially microbes, may be transferred between environments whose characteristics may be quite different. The transferred microbes may not survive in their new environment, but their DNA will be deposited. In this study, we compare two environmental sequencing projects to find molecular evidence of transfer of microbes over vast geographical distances.
By studying synonymous nucleotide composition, oligomer frequency and orthology between predicted genes in metagenomics data from two environments, terrestrial and aquatic, and by correlating with phylogenetic mappings, we find that both environments are likely to contain trace amounts of microbes which have been far removed from their original habitat. We also suggest a bias in direction from soil to sea, which is consistent with the cycles of planetary wind and water.
Our findings support the Baas-Becking hypothesis formulated in 1934, which states that due to dispersion and population sizes, microbes are likely to be found in widely disparate environments. Furthermore, the availability of genetic material from distant environments is a possible font of novel gene functions for lateral gene transfer.
PMCID: PMC2442867  PMID: 18612393
17.  Identification of tightly regulated groups of genes during Drosophila melanogaster embryogenesis 
Time-series analysis of whole-genome expression data during Drosophila melanogaster development indicates that up to 86% of its genes change their relative transcript level during embryogenesis. By applying conservative filtering criteria and requiring ‘sharp’ transcript changes, we identified 1534 maternal genes, 792 transient zygotic genes, and 1053 genes whose transcript levels increase during embryogenesis. Each of these three categories is dominated by groups of genes where all transcript levels increase and/or decrease at similar times, suggesting a common mode of regulation. For example, 34% of the transiently expressed genes fall into three groups, with increased transcript levels between 2.5–12, 11–20, and 15–20 h of development, respectively. We highlight common and distinctive functional features of these expression groups and identify a coupling between downregulation of transcript levels and targeted protein degradation. By mapping the groups to the protein network, we also predict and experimentally confirm new functional associations.
PMCID: PMC1800352  PMID: 17224916
Drosophila embryogenesis; Notch pathway; supervised clustering; transient expression
18.  Duplication is more common among laterally transferred genes than among indigenous genes 
Genome Biology  2003;4(8):R48.
Using both a compositional method and a gene-tree approach, a number of proposed laterally transferred genes have been identified and their nucleotide composition and frequency of duplication studied.
Recent developments in the understanding of paralogous evolution have prompted a focus not only on obviously advantageous genes, but also on genes that can be considered to have a weak or sporadic impact on the survival of the organism. Here we examine the duplicative behavior of a category of genes that can be considered to be mostly transient in the genome, namely laterally transferred genes. Using both a compositional method and a gene-tree approach, we identify a number of proposed laterally transferred genes and study their nucleotide composition and frequency of duplication.
It is found that duplications are significantly overrepresented among potential laterally transferred genes compared to the indigenous ones. Furthermore, the GC3 distribution of potential laterally transferred genes was found to be largely uniform in some genomes, suggesting an import from a broad range of donors.
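For readers unfamiliar with the GC3 measure, a minimal sketch of how it can be computed for a single coding sequence is given below. GC3 is the fraction of codons whose third position is G or C. The function name and the strict length check are choices made for this illustration, not details taken from the study.

```python
def gc3(coding_sequence: str) -> float:
    """Fraction of codons with G or C at the third position.

    Assumes the sequence is in frame and its length is a
    multiple of three (an assumption of this sketch).
    """
    sequence = coding_sequence.upper()
    if len(sequence) == 0 or len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    third_positions = sequence[2::3]  # every third base, starting at index 2
    gc_count = sum(base in "GC" for base in third_positions)
    return gc_count / len(third_positions)


if __name__ == "__main__":
    # Codons ATG, GCC, AAA have third bases G, C, A.
    print(f"{gc3('ATGGCCAAA'):.3f}")
```

Collecting this value for every gene in a genome gives the GC3 distribution that the abstract refers to.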
The results are discussed not in a context of strongly optimized established genes, but rather of genes with weak or ancillary functions. The importance of duplication may therefore depend on the variability and availability of weak genes for which novel functions may be discovered. Therefore, lateral transfer may accelerate the evolutionary process of duplication by bringing foreign genes that have mainly weak or no function into the genome.
PMCID: PMC193641  PMID: 12914657
19.  Gradients in nucleotide and codon usage along Escherichia coli genes 
Nucleic Acids Research  2000;28(18):3517-3523.
The usage of codons and nucleotide combinations varies along genes and systematic variation causes gradients in usage. We have studied such gradients of nucleotides and nucleotide combinations and their immediate context in Escherichia coli. To distinguish mutational and selectional effects, the genes were subdivided into three groups with different codon usage bias and the gradients of nucleotide usage were studied in each group. Some combinations that can be associated with a propensity for processivity errors show strong negative gradients that become weaker in genes with low codon bias, consistent with a selection on translational efficiency. One of the strongest gradients is for third position G, which shows a pervasive positive gradient in usage in most contexts of surrounding bases.
PMCID: PMC110745  PMID: 10982871
20.  Systematic Association of Genes to Phenotypes by Genome and Literature Mining 
PLoS Biology  2005;3(5):e134.
One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene–phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases.
The combination of text mining and comparative genomics is shown to be a powerful approach to predicting phenotypes that are associated with particular genes in bacterial genomes.
PMCID: PMC1073694  PMID: 15799710
21.  Structural Alterations from Multiple Displacement Amplification of a Human Genome Revealed by Mate-Pair Sequencing 
PLoS ONE  2011;6(7):e22250.
Comprehensive identification of the acquired mutations that cause common cancers will require genomic analyses of large sets of tumor samples. Typically, the tissue material available from tumor specimens is limited, which creates a demand for accurate template amplification. We therefore evaluated whether phi29-mediated whole genome amplification introduces false positive structural mutations by massive mate-pair sequencing of a normal human genome before and after such amplification. Multiple displacement amplification led to a decrease in clone coverage and an increase by two orders of magnitude in the prevalence of inversions, but did not increase the prevalence of translocations. While multiple strand displacement amplification may find uses in translocation analyses, it is likely that alternative amplification strategies need to be developed to meet the demands of cancer genomics.
PMCID: PMC3142133  PMID: 21799804

