1.  Mismatch repair deficiency endows tumors with a unique mutation signature and sensitivity to DNA double-strand breaks 
eLife  2014;3:e02725.
DNA replication errors that persist as mismatch mutations make up the molecular fingerprint of mismatch repair (MMR)-deficient tumors and convey them with resistance to standard therapy. Using whole-genome and whole-exome sequencing, we here confirm an MMR-deficient mutation signature that is distinct from other tumor genomes, but surprisingly similar to germ-line DNA, indicating that a substantial fraction of human genetic variation arises through mutations escaping MMR. Moreover, we identify a large set of recurrent indels that may serve to detect microsatellite instability (MSI). Indeed, using endometrial tumors with immunohistochemically proven MMR deficiency, we optimize a novel marker set capable of detecting MSI and show it to have greater specificity and selectivity than standard MSI tests. Additionally, we show that recurrent indels are enriched for the ‘DNA double-strand break repair by homologous recombination’ pathway. Consequently, DSB repair is reduced in MMR-deficient tumors, triggering a dose-dependent sensitivity of MMR-deficient tumor cultures to DSB inducers.
eLife digest
Before a cell divides, it must first copy all of its genetic material. Any mistakes that are made during this process are called mutations. Mutations can give rise to new traits but are mostly harmful to the cells, or cause cancer; therefore, cells have evolved tools that can efficiently spot these mistakes and repair them. One of the main tools is called mismatch repair (MMR).
Defects in the cell's mismatch repair tools can wreak havoc as this allows many mutations to accumulate. Zhao et al. looked at the genomes of tumors where mismatch repair was not working properly to see what makes these ‘MMR-deficient tumors’ different from other tumors. This revealed that MMR-deficient tumors have similar patterns of mutations to those seen in egg and sperm cells. This was unexpected and suggests that mutations that are not corrected by mismatch repair are an important source of the genetic differences found between different humans, and between humans and their ancestors.
Identifying cancerous tumors that are MMR-deficient is vital, as these tumors tend not to respond to commonly used cancer treatments. However, current clinical methods to identify MMR-deficient tumors often fail or produce results that are difficult to interpret. MMR-deficient tumors commonly contain mutations called indels, where short fragments of DNA are inserted into or deleted from longer DNA sequences. Zhao et al. have found 59 indels that can be used to detect MMR-deficient tumors, where each indel had been identified in several tumors taken from different tissues. This new approach allowed MMR-deficiency to be identified in several types of tumor, including colon and ovarian cancers, with greater sensitivity and accuracy than the existing methods.
Zhao et al. also found that the indels in MMR-deficient tumors reduce the ability of the tumors to repair a type of DNA damage called double-strand breaks. In these, both strands of DNA that make up the double helix are broken and the DNA chain is severed. As this kind of damage is very harmful to a cell, making more double-strand breaks could therefore form part of a more effective treatment against MMR-deficient tumors; further research is needed to investigate this possibility.
PMCID: PMC4141275  PMID: 25085081
whole-genome sequencing; mismatch repair deficiency; mutation pattern; MSI; DNA double-strand breaks; DSB inducers; human
2.  iRegulon: From a Gene List to a Gene Regulatory Network Using Large Motif and Track Collections 
PLoS Computational Biology  2014;10(7):e1003731.
Identifying master regulators of biological processes and mapping their downstream gene networks are key challenges in systems biology. We developed a computational method, called iRegulon, to reverse-engineer the transcriptional regulatory network underlying a co-expressed gene set using cis-regulatory sequence analysis. iRegulon implements a genome-wide ranking-and-recovery approach to detect enriched transcription factor motifs and their optimal sets of direct targets. We increase the accuracy of network inference by using very large motif collections of up to ten thousand position weight matrices collected from various species, and linking these to candidate human TFs via a motif2TF procedure. We validate iRegulon on gene sets derived from ENCODE ChIP-seq data with increasing levels of noise, and we compare iRegulon with existing motif discovery methods. Next, we use iRegulon on more challenging types of gene lists, including microRNA target sets, protein-protein interaction networks, and genetic perturbation data. In particular, we over-activate p53 in breast cancer cells, followed by RNA-seq and ChIP-seq, and could identify an extensive up-regulated network controlled directly by p53. Similarly we map a repressive network with no indication of direct p53 regulation but rather an indirect effect via E2F and NFY. Finally, we generalize our computational framework to include regulatory tracks such as ChIP-seq data and show how motif and track discovery can be combined to map functional regulatory interactions among co-expressed genes. iRegulon is available as a Cytoscape plugin from
Author Summary
Gene regulatory networks control developmental, homeostatic, and disease processes by governing precise levels and spatio-temporal patterns of gene expression. Determining their topology can provide mechanistic insight into these processes. Gene regulatory networks consist of interactions between transcription factors and their direct target genes. Each regulatory interaction represents the binding of the transcription factor to a specific DNA binding site near its target gene. Here we present a computational method, called iRegulon, to identify master regulators and direct target genes in a human gene signature, i.e. a set of co-expressed genes. iRegulon relies on the analysis of the regulatory sequences around each gene in the gene set to detect enriched TF motifs or ChIP-seq peaks, using databases of nearly 10,000 TF motifs and 1,000 ChIP-seq data sets or “tracks”. Next, it associates enriched motifs and tracks with candidate transcription factors and determines the optimal subset of direct target genes. We validate iRegulon on ENCODE data, and use it in combination with RNA-seq and ChIP-seq data to map a p53 downstream network with new predicted co-factors and targets. iRegulon is available as a Cytoscape plugin, supporting human, mouse, and Drosophila genes, and provides access to hundreds of cancer-related TF-target subnetworks or “regulons”.
PMCID: PMC4109854  PMID: 25058159
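The ranking-and-recovery idea named in the iRegulon abstract above can be illustrated with a small sketch. This is not iRegulon's code: the function names, the `top_n` cutoff, and the z-score style normalization across motifs are assumptions made for illustration only. Each motif is assumed to come with a genome-wide gene ranking (best-scoring gene first), and a motif counts as enriched when the input gene set is recovered unusually early in that ranking.

```python
from statistics import mean, pstdev


def recovery_auc(ranking, gene_set, top_n):
    """Area under the cumulative recovery curve for the top_n ranked genes.

    Scaled to [0, 1], where 1 means every position that could hold a
    gene-set member does (hypothetical scoring, for illustration).
    """
    top = ranking[:top_n]
    hits = 0
    area = 0
    for gene in top:
        if gene in gene_set:
            hits += 1
        area += hits  # height of the recovery curve at this rank
    # Best achievable area: all gene-set members ranked first.
    best = sum(min(i, len(gene_set)) for i in range(1, len(top) + 1))
    return area / best if best else 0.0


def normalized_enrichment(aucs):
    """Convert per-motif AUCs into z-scores relative to all motifs tested."""
    values = list(aucs.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return {motif: 0.0 for motif in aucs}
    return {motif: (auc - mu) / sd for motif, auc in aucs.items()}


if __name__ == "__main__":
    gene_set = {"g1", "g2"}
    rankings = {  # toy rankings, one per hypothetical motif
        "motif_a": ["g1", "g2", "g3", "g4", "g5", "g6"],
        "motif_b": ["g5", "g6", "g3", "g4", "g1", "g2"],
        "motif_c": ["g3", "g4", "g5", "g6", "g1", "g2"],
    }
    aucs = {m: recovery_auc(r, gene_set, top_n=4) for m, r in rankings.items()}
    print(normalized_enrichment(aucs))
```

Restricting the curve to the top of the ranking reflects the intuition that only highly ranked genes are plausible direct targets; the cutoff would need tuning on real data.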
3.  Identification of cis-regulatory modules encoding temporal dynamics during development 
BMC Genomics  2014;15(1):534.
Developmental transcriptional regulatory networks are circuits of transcription factors (TFs) and cis-acting DNA elements (Cis Regulatory Modules, CRMs) that dynamically control expression of downstream genes. Comprehensive knowledge of these networks is an essential step towards our understanding of developmental processes. However, this knowledge is mostly based on genome-wide mapping of transcription factor binding sites, and therefore requires prior knowledge regarding the TFs involved in the network.
Focusing on how temporal control of gene expression is integrated within a developmental network, we applied an in silico approach to discover regulatory motifs and CRMs of co-expressed genes, with no prior knowledge about the involved TFs. Our aim was to identify regulatory motifs and potential trans-acting factors which regulate the temporal expression of co-expressed gene sets during a particular process of organogenesis, namely adult heart formation in Drosophila. Starting from whole-genome tissue-specific expression dynamics, we used an in silico method, cisTargetX, to predict TF binding motifs and CRMs. Potential Nuclear Receptor (NR) binding motifs were predicted to control the temporal expression profile of a gene set with increased expression levels during mid metamorphosis. The predicted CRMs and NR motifs were validated in vivo by reporter gene assays. In addition, we provide evidence that three NRs modulate CRM activity and behave as temporal regulators of target enhancers.
Our approach was successful in identifying CRMs and potential TFs acting on the temporal regulation of target genes. In addition, our results suggest a modular architecture of the regulatory machinery, in which the temporal and spatial regulation can be uncoupled and encoded by distinct CRMs.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-534) contains supplementary material, which is available to authorized users.
PMCID: PMC4097164  PMID: 24972496
Cis-regulatory modules; Temporal control; Motif discovery; Transcription; Drosophila metamorphosis; Cardiogenesis
4.  Male-Specific Fruitless Isoforms Target Neurodevelopmental Genes to Specify a Sexually Dimorphic Nervous System 
Current Biology  2014;24(3):229-241.
In Drosophila, male courtship behavior is regulated in large part by the gene fruitless (fru). fru encodes a set of putative transcription factors that promote male sexual behavior by controlling the development of sexually dimorphic neuronal circuitry. Little is known about how Fru proteins function at the level of transcriptional regulation or the role that isoform diversity plays in the formation of a male-specific nervous system.
To characterize the roles of sex-specific Fru isoforms in specifying male behavior, we generated novel isoform-specific mutants and used a genomic approach to identify direct Fru isoform targets during development. We demonstrate that all Fru isoforms directly target genes involved in the development of the nervous system, with individual isoforms exhibiting unique binding specificities. We observe that fru behavioral phenotypes are specified by either a single isoform or a combination of isoforms. Finally, we illustrate the utility of these data for the identification of novel sexually dimorphic genomic enhancers and novel downstream regulators of male sexual behavior.
These findings suggest that Fru isoform diversity facilitates both redundancy and specificity in gene expression, and that the regulation of neuronal developmental genes may be the most ancient and conserved role of fru in the specification of a male-specific nervous system.
• Isoform-specific fru mutants reveal both functional redundancy and specificity
• Fru isoform-specific genomic occupancy is characterized in the Drosophila nervous system
• All Fru isoforms directly target neuronal morphogenesis genes
• Isoform-specific motifs are associated with specific Fru isoform occupancy
Neville et al. characterize the roles of sex-specific Fruitless isoforms in specifying male behavior in Drosophila by generating novel isoform-specific mutants, along with using a genomic approach to identify direct Fruitless isoform targets during development.
PMCID: PMC3969260  PMID: 24440396
5.  Comprehensive Analysis of Transcriptome Variation Uncovers Known and Novel Driver Events in T-Cell Acute Lymphoblastic Leukemia 
PLoS Genetics  2013;9(12):e1003997.
RNA-seq is a promising technology to re-sequence protein coding genes for the identification of single nucleotide variants (SNV), while simultaneously obtaining information on structural variations and gene expression perturbations. We asked whether RNA-seq is suitable for the detection of driver mutations in T-cell acute lymphoblastic leukemia (T-ALL). These leukemias are caused by a combination of gene fusions, over-expression of transcription factors and cooperative point mutations in oncogenes and tumor suppressor genes. We analyzed 31 T-ALL patient samples and 18 T-ALL cell lines by high-coverage paired-end RNA-seq. First, we optimized the detection of SNVs in RNA-seq data by comparing the results with exome re-sequencing data. We identified known driver genes with recurrent protein altering variations, as well as several new candidates including H3F3A, PTK2B, and STAT5B. Next, we determined accurate gene expression levels from the RNA-seq data through normalizations and batch effect removal, and used these to classify patients into T-ALL subtypes. Finally, we detected gene fusions, of which several can explain the over-expression of key driver genes such as TLX1, PLAG1, LMO1, or NKX2-1; and others result in novel fusion transcripts encoding activated kinases (SSBP2-FER and TPM3-JAK2) or involving MLLT10. In conclusion, we present novel analysis pipelines for variant calling, variant filtering, and expression normalization on RNA-seq data, and successfully applied these for the detection of translocations, point mutations, INDELs, exon-skipping events, and expression perturbations in T-ALL.
Author Summary
The quest for somatic mutations underlying oncogenic processes is a central theme in today's cancer research. High-throughput genomics approaches including amplicon re-sequencing, exome re-sequencing, full genome re-sequencing, and SNP arrays have contributed to cataloguing driver genes across cancer types. Thus far transcriptome sequencing by RNA-seq has been mainly used for the detection of fusion genes, while few studies have assessed its value for the combined detection of SNPs, INDELs, fusions, gene expression changes, and alternative transcript events. Here we apply RNA-seq to 49 T-ALL samples and perform a critical assessment of the bioinformatics pipelines and filters to identify each type of aberration. By comparing to exome re-sequencing, and by exploiting the catalogues of known cancer drivers, we identified many known and several novel driver genes in T-ALL. We also determined an optimal normalization strategy to obtain accurate gene expression levels and used these to identify over-expressed transcription factors that characterize different T-ALL subtypes. Finally, by PCR, cloning, and in vitro cellular assays we uncover new fusion genes that have consequences at the level of gene expression, oncogenic chimaeras, and tumor suppressor inactivation. In conclusion, we present the first RNA-seq data set across T-ALL patients and identify new driver events.
PMCID: PMC3868543  PMID: 24367274
6.  Alteration of the microRNA network during the progression of Alzheimer's disease 
EMBO Molecular Medicine  2013;5(10):1613-1634.
An overview of miRNAs altered in Alzheimer's disease (AD) was established by profiling the hippocampus of a cohort of 41 late-onset AD (LOAD) patients and 23 controls, showing deregulation of 35 miRNAs. Profiling of miRNAs in the prefrontal cortex of a second independent cohort of 49 patients grouped by Braak stages revealed 41 deregulated miRNAs. We focused on miR-132-3p which is strongly altered in both brain areas. Downregulation of this miRNA occurs already at Braak stages III and IV, before loss of neuron-specific miRNAs. Next-generation sequencing confirmed a strong decrease of miR-132-3p and of three family-related miRNAs encoded by the same miRNA cluster on chromosome 17. Deregulation of miR-132-3p in AD brain appears to occur mainly in neurons displaying Tau hyper-phosphorylation. We provide evidence that miR-132-3p may contribute to disease progression through aberrant regulation of mRNA targets in the Tau network. The transcription factor (TF) FOXO1a appears to be a key target of miR-132-3p in this pathway.
PMCID: PMC3799583  PMID: 24014289
Alzheimer's disease; hippocampus; prefrontal cortex; microRNA; miR-132-3p
7.  Genome-wide analyses of Shavenbaby target genes reveals distinct features of enhancer organization 
Genome Biology  2013;14(8):R86.
Developmental programs are implemented by regulatory interactions between Transcription Factors (TFs) and their target genes, which remain poorly understood. While recent studies have focused on regulatory cascades of TFs that govern early development, little is known about how the ultimate effectors of cell differentiation are selected and controlled. We addressed this question during late Drosophila embryogenesis, when the finely tuned expression of the TF Ovo/Shavenbaby (Svb) triggers the morphological differentiation of epidermal trichomes.
We defined a sizeable set of genes downstream of Svb and used in vivo assays to delineate 14 enhancers driving their specific expression in trichome cells. Coupling computational modeling to functional dissection, we investigated the regulatory logic of these enhancers. Extending the repertoire of epidermal effectors using genome-wide approaches showed that the regulatory models learned from this first sample are representative of the whole set of trichome enhancers. These enhancers harbor remarkable features with respect to their functional architectures, including a weak or non-existent clustering of Svb binding sites. The in vivo function of each site relies on its intimate context, notably the flanking nucleotides. Two additional cis-regulatory motifs, present in a broad diversity of composition and positioning among trichome enhancers, critically contribute to enhancer activity.
Our results show that Svb directly regulates a large set of terminal effectors of the remodeling of epidermal cells. Further, these data reveal that trichome formation is underpinned by unexpectedly diverse modes of regulation, providing fresh insights into the functional architecture of enhancers governing a terminal differentiation program.
PMCID: PMC4053989  PMID: 23972280
8.  i-cisTarget: an integrative genomics method for the prediction of regulatory features and cis-regulatory modules 
Nucleic Acids Research  2012;40(15):e114.
The field of regulatory genomics today is characterized by the generation of high-throughput data sets that capture genome-wide transcription factor (TF) binding, histone modifications, or DNAseI hypersensitive regions across many cell types and conditions. In this context, a critical question is how to make optimal use of these publicly available datasets when studying transcriptional regulation. Here, we address this question in Drosophila melanogaster for which a large number of high-throughput regulatory datasets are available. We developed i-cisTarget (where the ‘i’ stands for integrative), for the first time enabling the discovery of different types of enriched ‘regulatory features’ in a set of co-regulated sequences in one analysis, being either TF motifs or ‘in vivo’ chromatin features, or combinations thereof. We have validated our approach on 15 co-expressed gene sets, 21 ChIP data sets, 628 curated gene sets and multiple individual case studies, and show that meaningful regulatory features can be confidently discovered; that bona fide enhancers can be identified, both by in vivo events and by TF motifs; and that combinations of in vivo events and TF motifs further increase the performance of enhancer prediction.
PMCID: PMC3424583  PMID: 22718975
9.  High Accuracy Mutation Detection in Leukemia on a Selected Panel of Cancer Genes 
PLoS ONE  2012;7(6):e38463.
With the advent of whole-genome and whole-exome sequencing, high-quality catalogs of recurrently mutated cancer genes are becoming available for many cancer types. Increasing access to sequencing technology, including bench-top sequencers, provides the opportunity to re-sequence a limited set of cancer genes across a patient cohort with limited processing time. Here, we re-sequenced a set of cancer genes in T-cell acute lymphoblastic leukemia (T-ALL) using Nimblegen sequence capture coupled with Roche/454 technology. First, we investigated how a maximal sensitivity and specificity of mutation detection can be achieved through a benchmark study. We tested nine combinations of different mapping and variant-calling methods, varied the variant calling parameters, and compared the predicted mutations with a large independent validation set obtained by capillary re-sequencing. We found that the combination of two mapping algorithms, namely BWA-SW and SSAHA2, coupled with the variant calling algorithm Atlas-SNP2 yields the highest sensitivity (95%) and the highest specificity (93%). Next, we applied this analysis pipeline to identify mutations in a set of 58 cancer genes, in a panel of 18 T-ALL cell lines and 15 T-ALL patient samples. We confirmed mutations in known T-ALL drivers, including PHF6, NF1, FBXW7, NOTCH1, KRAS, NRAS, PIK3CA, and PTEN. Interestingly, we also found mutations in several cancer genes that had not been linked to T-ALL before, including JAK3. Finally, we re-sequenced a small set of 39 candidate genes and identified recurrent mutations in TET1, SPRY3 and SPRY4. In conclusion, we established an optimized analysis pipeline for Roche/454 data that can be applied to accurately detect gene mutations in cancer, which led to the identification of several new candidate T-ALL driver mutations.
PMCID: PMC3366948  PMID: 22675565
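The benchmark in the entry above scores variant callers by sensitivity and specificity against an independent validation set. A minimal sketch of that arithmetic follows, assuming the validation data can be split into positions confirmed as variant and positions confirmed as reference; the function name and the toy positions are hypothetical and do not come from the paper.

```python
def sensitivity_specificity(called, confirmed_variant, confirmed_reference):
    """Score a call set against validated positions.

    called              -- positions the pipeline reports as variant
    confirmed_variant   -- validated positions that carry a variant
    confirmed_reference -- validated positions that match the reference
    Returns (sensitivity, specificity) as fractions between 0 and 1.
    """
    if not confirmed_variant or not confirmed_reference:
        raise ValueError("both validation sets must be non-empty")
    true_pos = len(called & confirmed_variant)
    false_neg = len(confirmed_variant - called)
    false_pos = len(called & confirmed_reference)
    true_neg = len(confirmed_reference - called)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy genomic positions, purely illustrative.
    called = {101, 102, 103, 207}
    confirmed_variant = {101, 102, 103, 104}
    confirmed_reference = {205, 206, 207, 208}
    print(sensitivity_specificity(called, confirmed_variant, confirmed_reference))
```

Calls at positions outside both validation sets are ignored here, since their true status is unknown.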
10.  Robust Target Gene Discovery through Transcriptome Perturbations and Genome-Wide Enhancer Predictions in Drosophila Uncovers a Regulatory Basis for Sensory Specification 
PLoS Biology  2010;8(7):e1000435.
CisTarget X is a novel computational method that accurately predicts Atonal governed regulatory networks in the retina of the fruit fly.
A comprehensive systems-level understanding of developmental programs requires the mapping of the underlying gene regulatory networks. While significant progress has been made in mapping a few such networks, almost all gene regulatory networks underlying cell-fate specification remain unknown and their discovery is significantly hampered by the paucity of generalized, in vivo validated tools of target gene and functional enhancer discovery. We combined genetic transcriptome perturbations and comprehensive computational analyses to identify a large cohort of target genes of the proneural and tumor suppressor factor Atonal, which specifies the switch from undifferentiated pluripotent cells to R8 photoreceptor neurons during larval development. Extensive in vivo validations of the predicted targets for the proneural factor Atonal demonstrate a 50% success rate of bona fide targets. Furthermore we show that these enhancers are functionally conserved by cloning orthologous enhancers from Drosophila ananassae and D. virilis in D. melanogaster. Finally, to investigate cis-regulatory cross-talk between Ato and other retinal differentiation transcription factors (TFs), we performed motif analyses and independent target predictions for Eyeless, Senseless, Suppressor of Hairless, Rough, and Glass. Our analyses show that cisTargetX identifies the correct motif from a set of coexpressed genes and accurately predicts target genes of individual TFs. The validated set of novel Ato targets exhibit functional enrichment of signaling molecules and a subset is predicted to be coregulated by other TFs within the retinal gene regulatory network.
Author Summary
Tens of thousands of regulatory elements determine the spatiotemporal expression pattern of protein-coding genes in the metazoan genome. Each regulatory element, when bound by the appropriate transcription factors, can affect the temporal transcription of a nearby target gene in a particular cell type. Annotating the genome for regulatory elements, as well as determining the input transcription factors for each element, is a key challenge in genome biology. In this study, we introduce a computational method, cisTargetX, that predicts transcription factor binding motifs and their target genes through the integration of gene expression data and comparative genomics. We first validate this method in silico using public gene expression data and then apply cisTargetX to the developmental program governing photoreceptor neuron specification in the retina of Drosophila melanogaster. Specifically, we perturb predicted key transcription factors during the initial steps of neurogenesis; measure gene expression by microarrays; identify motifs and predict target genes; validate the predictions in vivo using transgenic animals; and study several functional and evolutionary aspects of the validated regulatory elements for the proneural factor Atonal. Overall, we show that cisTargetX efficiently predicts genetic regulatory interactions and provides mechanistic insight into gene regulatory networks of postembryonic developmental systems.
PMCID: PMC2910651  PMID: 20668662
11.  Sequencing the regulatory genome 
Genome Biology  2008;9(6):313.
A report on the Cold Spring Harbor Laboratory meeting 'Systems Biology: Global Regulation of Gene Expression', Cold Spring Harbor, USA, 27-30 March 2008.
PMCID: PMC2481419  PMID: 18598374
12.  The Atonal Proneural Transcription Factor Links Differentiation and Tumor Formation in Drosophila 
PLoS Biology  2009;7(2):e1000040.
The acquisition of terminal cell fate and onset of differentiation are instructed by cell type–specific master control genes. Loss of differentiation is frequently observed during cancer progression, but the underlying causes and mechanisms remain poorly understood. We tested the hypothesis that master regulators of differentiation may be key regulators of tumor formation. Using loss- and gain-of-function analyses in Drosophila, we describe a critical anti-oncogenic function for the atonal transcription factor in the fly retina, where atonal instructs tissue differentiation. In the tumor context, atonal acts by regulating cell proliferation and death via the JNK stress response pathway. Combined with evidence that atonal's mammalian homolog, ATOH1, is a tumor suppressor gene, our data support a critical, evolutionarily conserved, function for ato in oncogenesis.
Author Summary
During embryonic development, cells become more and more specialized, and this process is referred to as differentiation. In contrast to normal adult cells, cancer cells, like embryonic cells, display fewer differentiated properties. It has been postulated that the acquisition of terminal differentiation helps inhibit tumor formation; however, no direct evidence for this hypothesis was available. The development of the eye in the fruit fly, Drosophila melanogaster, has long been used as a model for studying genetic factors controlling differentiation. More recently, eye development has also been used to study how tumors can form and progress. In this study, we used this model to show that genes, such as atonal, that instruct the differentiation of specific tissues can act as tumor suppressors and inhibit the formation and progression of tumors in those tissues. Losing such genes can generate tumors, whereas activating them can strongly inhibit these tumors.
We establish a direct genetic link between cancer and the initiation of differentiation in the Drosophila eye.
PMCID: PMC2652389  PMID: 19243220
13.  Integrating Computational Biology and Forward Genetics in Drosophila 
PLoS Genetics  2009;5(1):e1000351.
Genetic screens are powerful methods for the discovery of gene–phenotype associations. However, a systems biology approach to genetics must leverage the massive amount of “omics” data to enhance the power and speed of functional gene discovery in vivo. Thus far, few computational methods for gene function prediction have been rigorously tested for their performance on a genome-wide scale in vivo. In this work, we demonstrate that integrating genome-wide computational gene prioritization with large-scale genetic screening is a powerful tool for functional gene discovery. To discover genes involved in neural development in Drosophila, we extend our strategy for the prioritization of human candidate disease genes to functional prioritization in Drosophila. We then integrate this prioritization strategy with a large-scale genetic screen for interactors of the proneural transcription factor Atonal using genomic deficiencies and mutant and RNAi collections. Using the prioritized genes validated in our genetic screen, we describe a novel genetic interaction network for Atonal. Lastly, we prioritize the whole Drosophila genome and identify candidate gene associations for ten receptor-signaling pathways. This novel database of prioritized pathway candidates, as well as a web application for functional prioritization in Drosophila, called Endeavour-HighFly, and the Atonal network, are publicly available resources. A systems genetics approach that combines the power of computational predictions with in vivo genetic screens strongly enhances the process of gene function and gene–gene association discovery.
Author Summary
Genome sequencing and annotation, combined with large-scale molecular experiments to query gene expression and molecular interactions, collectively known as Systems Biology, have resulted in an enormous wealth in biological databases. Yet, it remains a daunting task to use these data to decipher the rules that govern biological systems. One of the most trusted approaches in biology is genetic analysis because of its emphasis on gene function in living organisms. Genetics, however, proceeds slowly and unravels small-scale interactions. Turning genetics into an effective tool of Systems Biology requires harnessing the large-scale molecular data for the design and execution of genetic screens. In this work, we test the idea of exploiting a computational approach known as gene prioritization to pre-rank genes for the likelihood of their involvement in a process of interest. By carrying out a gene prioritization–supported genetic screen, we greatly enhance the speed and output of in vivo genetic screens without compromising their sensitivity. These results mean that future genetic screens can be custom-catered for any process of interest and carried out with a speed and efficiency that is comparable to other large-scale molecular experiments. We refer to this combined approach as Systems Genetics.
PMCID: PMC2628282  PMID: 19165344
14.  Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures 
Nature  2007;450(7167):219-232.
Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.
PMCID: PMC2474711  PMID: 17994088
15.  Endeavour update: a web resource for gene prioritization in multiple species 
Nucleic Acids Research  2008;36(Web Server issue):W377-W384.
Endeavour (; this web site is free and open to all users and there is no login requirement) is a web resource for the prioritization of candidate genes. Using a training set of genes known to be involved in a biological process of interest, our approach consists of (i) inferring several models (based on various genomic data sources), (ii) applying each model to the candidate genes to rank those candidates against the profile of the known genes and (iii) merging the several rankings into a global ranking of the candidate genes. In the present article, we describe the latest developments of Endeavour. First, we provide a web-based user interface, besides our Java client, to make Endeavour more universally accessible. Second, we support multiple species: in addition to Homo sapiens, we now provide gene prioritization for three major model organisms: Mus musculus, Rattus norvegicus and Caenorhabditis elegans. Third, Endeavour makes use of additional data sources and is now including numerous databases: ontologies and annotations, protein–protein interactions, cis-regulatory information, gene expression data sets, sequence information and text-mining data. We tested the novel version of Endeavour on 32 recent disease gene associations from the literature. Additionally, we describe a number of recent independent studies that made use of Endeavour to prioritize candidate genes for obesity and Type II diabetes, cleft lip and cleft palate, and pulmonary fibrosis.
PMCID: PMC2447805  PMID: 18508807
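Step (iii) of the Endeavour abstract above, merging several per-source rankings into one global ranking, can be illustrated with a minimal sketch. The abstract does not say how the rankings are merged, so the mean rank ratio used here is only a stand-in chosen for simplicity, not Endeavour's method. The function names, gene names and scores are all hypothetical.

```python
from typing import Dict, List


def rank_ratios(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn one data source's scores into rank ratios in (0, 1].

    A higher score means a better candidate; the best gene gets 1/n.
    """
    ordered = sorted(scores, key=lambda gene: scores[gene], reverse=True)
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}


def merge_rankings(per_source_scores: List[Dict[str, float]]) -> List[str]:
    """Merge per-source rankings by mean rank ratio (illustrative stand-in)."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for scores in per_source_scores:
        for gene, ratio in rank_ratios(scores).items():
            totals[gene] = totals.get(gene, 0.0) + ratio
            counts[gene] = counts.get(gene, 0) + 1
    mean_ratio = {gene: totals[gene] / counts[gene] for gene in totals}
    # Lower mean rank ratio = better; gene name breaks exact ties.
    return sorted(mean_ratio, key=lambda gene: (mean_ratio[gene], gene))


# Hypothetical similarity scores of three candidates against a training profile.
expression = {"geneA": 0.9, "geneB": 0.4, "geneC": 0.1}
interactions = {"geneA": 0.5, "geneB": 0.8, "geneC": 0.2}
text_mining = {"geneA": 0.7, "geneB": 0.6, "geneC": 0.1}

print(merge_rankings([expression, interactions, text_mining]))
```

Rank ratios are used rather than raw scores so that sources with different score scales contribute comparably.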
16.  ModuleMiner - improved computational detection of cis-regulatory modules: are there different modes of gene regulation in embryonic development and adult tissues? 
Genome Biology  2008;9(4):R66.
ModuleMiner detects cis-regulatory modules in a set of co-expressed genes in tissue-specific microarray clusters and in embryonic development datasets.
We present ModuleMiner, a novel algorithm for computationally detecting cis-regulatory modules (CRMs) in a set of co-expressed genes. ModuleMiner outperforms other methods for CRM detection on benchmark data, and successfully detects CRMs in tissue-specific microarray clusters and in embryonic development gene sets. Interestingly, CRM predictions for differentiated tissues exhibit strong enrichment close to the transcription start site, whereas CRM predictions for embryonic development gene sets are depleted in this region.
PMCID: PMC2643937  PMID: 18394174
17.  Text-mining assisted regulatory annotation 
Genome Biology  2008;9(2):R31.
Text-mining technologies can be integrated with genome annotation systems, increasing the availability of annotated cis-regulatory data.
Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature.
We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process.
Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.
PMCID: PMC2374703  PMID: 18271954
18.  ORegAnno: an open-access community-driven resource for regulatory annotation 
Nucleic Acids Research  2007;36(Database issue):D107-D113.
ORegAnno is an open-source, open-access database and literature curation system for community-based annotation of experimentally identified DNA regulatory regions, transcription factor binding sites and regulatory variants. The current release comprises 30 145 records curated from 922 publications and describing regulatory sequences for over 3853 genes and 465 transcription factors from 19 species. A new feature called the ‘publication queue’ allows users to input relevant papers from scientific literature as targets for annotation. The queue contains 4438 gene regulation papers entered by experts and another 54 351 identified by text-mining methods. Users can enter or ‘check out’ papers from the queue for manual curation using a series of user-friendly annotation pages. A typical record entry consists of species, sequence type, sequence, target gene, binding factor, experimental outcome and one or more lines of experimental evidence. An evidence ontology was developed to describe and categorize these experiments. Records are cross-referenced to Ensembl or Entrez gene identifiers, PubMed and dbSNP and can be visualized in the Ensembl or UCSC genome browsers. All data are freely available through search pages, XML data dumps or web services at:
PMCID: PMC2239002  PMID: 18006570
19.  Fine-Tuning Enhancer Models to Predict Transcriptional Targets across Multiple Genomes 
PLoS ONE  2007;2(11):e1115.
Networks of regulatory relations between transcription factors (TF) and their target genes (TG), implemented through TF binding sites (TFBS), are key features of biology. An idealized approach to solving such networks consists of starting from a consensus TFBS or a position weight matrix (PWM) to generate a high accuracy list of candidate TGs for biological validation. Developing and evaluating such approaches remains a formidable challenge in regulatory bioinformatics. We perform a benchmark study on 34 Drosophila TFs to assess existing TFBS and cis-regulatory module (CRM) detection methods, with a strong focus on the use of multiple genomes. In particular, for CRM modelling we investigate the addition of orthologous sites to a known PWM to construct phyloPWMs, and we assess the added value of phylogenetic footprinting to predict contextual motifs around known TFBSs. For CRM prediction, we compare motif conservation with network-level conservation approaches across multiple genomes. Choosing the optimal training and scoring strategies strongly enhances the performance of TG prediction for more than half of the tested TFs. Finally, we analyse a 35th TF, namely Eyeless, and find a significant overlap between predicted TGs and candidate TGs identified by microarray expression studies. In summary, we identify several ways to optimize TF-specific TG predictions, some of which can be applied to all TFs, and others that can be applied only to particular TFs. The ability to model known TF-TG relations, together with the use of multiple genomes, results in a significant step forward in solving the architecture of gene regulatory networks.
PMCID: PMC2047340  PMID: 17973026
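The starting point named in the abstract above, scoring sequences with a position weight matrix, can be sketched as a plain log-odds scan. This is a generic textbook formulation, not the benchmarked methods or the phyloPWM construction from the paper. The count matrix, pseudocount, uniform background and function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

BASES = "ACGT"


def log_odds_matrix(counts: List[Dict[str, int]],
                    background: float = 0.25,
                    pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence: str, matrix: List[Dict[str, float]]) -> List[Tuple[int, float]]:
    """Score every window of the sequence; assumes uppercase A/C/G/T only."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


def best_hit(sequence: str, matrix: List[Dict[str, float]]) -> Tuple[int, float]:
    """Return (start, score) of the highest-scoring window."""
    return max(scan(sequence, matrix), key=lambda hit: hit[1])


# Hypothetical counts for a 4-bp site whose consensus is TGCA.
toy_counts = [{"T": 8}, {"G": 8}, {"C": 8}, {"A": 8}]
toy_matrix = log_odds_matrix(toy_counts)
print(best_hit("AATGCAAA", toy_matrix))
```

In practice, candidate target genes would be ranked by the best or summed window score over their regulatory regions; which of these works best is the kind of scoring choice the benchmark evaluates.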
20.  Prediction of a key role of motifs binding E2F and NR2F in down-regulation of numerous genes during the development of the mouse hippocampus 
BMC Bioinformatics  2006;7:367.
We previously demonstrated that gene expression profiles during neuronal differentiation in vitro and hippocampal development in vivo were very similar, due to a conservation of the important second singular value decomposition (SVD) mode (Mode 2) of expression. The conservation of Mode 2 suggests that it reflects a regulatory mechanism conserved between the two systems. In either dataset, the expression vectors of all the genes form two large clusters that differ in the sign of the contribution of Mode 2, which for the majority of them reflects the difference between down- or up-regulation.
In the current work, we used a novel approach of analyzing cis-regulation of gene expression in a subspace of a single SVD mode of temporal expression profiles. In the putative upstream regulatory sequences identified by mouse-human homology for all the genes represented in either dataset, we searched for simple features (motifs and pairs of motifs) associated with either sign of the loading of Mode 2. Using a cross-system training-test set approach, we identified E2F binding sites as predictors of down-regulation of gene expression during hippocampal development. NR2F binding sites, for the transcription factors Nr2f/COUP and Hnf4, and also NR2F_SP1 pairs of binding sites, were predictors of down-regulation of expression both during hippocampal development and neuronal differentiation. Analysis of another dataset, from gene profiling of myoblast differentiation in vitro, shows that the conservation of Mode 2 extends to the differentiation of mesenchymal cells. This permitted the identification of two more pairs of motifs, one of which included the CDE/CHR tandem element, as features associated with down-regulation both in the differentiating myoblasts and in the developing hippocampus. Of the features we identified, the E2F and CDE/CHR motifs may be associated with the cycling progenitor cell status, while NR2F may be related to the entry into differentiation along the neuronal pathway.
Our results constitute the first prediction of an expression pattern from the genomic sequence for the developing mammalian brain, and demonstrate a potential for the analysis of gene regulation in a subspace of a single SVD mode of expression.
PMCID: PMC1560171  PMID: 16884529
21.  TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis 
Nucleic Acids Research  2005;33(Web Server issue):W393-W396.
We present the second and improved release of the TOUCAN workbench for cis-regulatory sequence analysis. TOUCAN implements and integrates fast state-of-the-art methods and strategies in gene regulation bioinformatics, including algorithms for comparative genomics and for the detection of cis-regulatory modules. This second release of TOUCAN has become open source and thereby carries the potential to evolve rapidly. The main goal of TOUCAN is to allow a user to come to testable hypotheses regarding the regulation of a gene or of a set of co-regulated genes. TOUCAN can be launched from this location: .
PMCID: PMC1160115  PMID: 15980497
22.  Comprehensive analysis of the base composition around the transcription start site in Metazoa 
BMC Genomics  2004;5:34.
The transcription start site of a metazoan gene remains poorly understood, mostly because there is no clear signal present in all genes. Now that several sequenced metazoan genomes have been annotated, we have been able to compare the base composition around the transcription start site for all annotated genes across multiple genomes.
The most prominent feature in the base compositions is a significant local variation in G+C content over a large region around the transcription start site. The change is present in all animal phyla but the extent of variation is different between distinct classes of vertebrates, and the shape of the variation is completely different between vertebrates and arthropods. Furthermore, the height of the variation correlates with CpG frequencies in vertebrates but not in invertebrates and it also correlates with gene expression, especially in mammals. We also detect GC and AT skews in all clades (where %G is not equal to %C or %A is not equal to %T respectively) but these occur in a more confined region around the transcription start site and in the coding region.
The dramatic changes in nucleotide composition in humans are a consequence of CpG nucleotide frequencies and of gene expression; the changes in Fugu could point to primordial CpG islands; and the changes in the fly are of a totally different kind and unrelated to dinucleotide frequencies.
PMCID: PMC436054  PMID: 15171795
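The quantities discussed in the abstract above, local G+C content and GC/AT skew, can be computed per window as in the sketch below. The abstract defines skew only as %G differing from %C (or %A from %T). The normalized forms (G-C)/(G+C) and (A-T)/(A+T) used here are a common convention and an assumption, as are the window sizes and the function name.

```python
from typing import List, Tuple


def window_composition(sequence: str,
                       window: int = 4,
                       step: int = 4) -> List[Tuple[int, float, float, float]]:
    """Return (start, GC content, GC skew, AT skew) for each window.

    GC skew = (G - C) / (G + C) and AT skew = (A - T) / (A + T);
    a skew is reported as 0.0 when its denominator is zero.
    """
    rows = []
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window].upper()
        g, c = chunk.count("G"), chunk.count("C")
        a, t = chunk.count("A"), chunk.count("T")
        gc_content = (g + c) / window
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        at_skew = (a - t) / (a + t) if (a + t) else 0.0
        rows.append((start, gc_content, gc_skew, at_skew))
    return rows


# Toy sequence: a G-rich window followed by an AT-only window.
for row in window_composition("GGGCAATT"):
    print(row)
```

To reproduce a profile around the transcription start site, each gene's sequence would be aligned on the annotated start and the per-window values averaged across genes.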
23.  INCLUSive: a web portal and service registry for microarray and regulatory sequence analysis 
Nucleic Acids Research  2003;31(13):3468-3470.
INCLUSive is a suite of algorithms and tools for the analysis of gene expression data and the discovery of cis-regulatory sequence elements. The tools allow normalization, filtering and clustering of microarray data, functional scoring of gene clusters, sequence retrieval, and detection of known and unknown regulatory elements using probabilistic sequence models and Gibbs sampling. All tools are available via different web pages and as web services. The web pages are connected and integrated to reflect a methodology and facilitate complex analysis using different tools. The web services can be invoked using standard SOAP messaging. Example clients are available for download to invoke the services from a remote computer or to be integrated with other applications. All services are catalogued and described in a web service registry. The INCLUSive web portal is available for academic purposes at
PMCID: PMC169021  PMID: 12824346
24.  Toucan: deciphering the cis-regulatory logic of coregulated genes 
Nucleic Acids Research  2003;31(6):1753-1764.
TOUCAN is a Java application for the rapid discovery of significant cis-regulatory elements from sets of coexpressed or coregulated genes. Biologists can automatically (i) retrieve genes and intergenic regions, (ii) identify putative regulatory regions, (iii) score sequences for known transcription factor binding sites, (iv) identify candidate motifs for unknown binding sites, and (v) detect those statistically over-represented sites that are characteristic for a gene set. Genes or intergenic regions are retrieved from Ensembl or EMBL, together with orthologs and supporting information. Orthologs are aligned and syntenic regions are selected as candidate regulatory regions. Putative sites for known transcription factors are detected using our MotifScanner, which scores position weight matrices using a probabilistic model. New motifs are detected using our MotifSampler based on Gibbs sampling. Binding sites characteristic for a gene set—and thus statistically over-represented with respect to a reference sequence set—are found using a binomial test. We have validated Toucan by analyzing muscle-specific genes, liver-specific genes and E2F target genes; we have easily detected many known binding sites within intergenic DNA and identified new biologically plausible sites for known and unknown transcription factors. Software available at∼dna/BioI/Software.html.
PMCID: PMC152870  PMID: 12626717
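The binomial test mentioned in the Toucan abstract above, for sites over-represented in a gene set relative to a reference set, can be sketched as a one-sided tail probability. The abstract does not state what is counted. This sketch assumes the unit is "sequences containing at least one site", with the background rate estimated from the reference set. The function names and numbers are hypothetical.

```python
import math


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p ** i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


def overrepresentation_p(hits_in_set: int, set_size: int,
                         hits_in_reference: int, reference_size: int) -> float:
    """One-sided p-value that a site occurs in the gene set more often than
    expected from the reference frequency (assumed counting scheme)."""
    p_background = hits_in_reference / reference_size
    return binomial_tail(hits_in_set, set_size, p_background)


# Hypothetical counts: site found in 8 of 10 set sequences,
# but in only 100 of 1000 reference sequences.
print(overrepresentation_p(8, 10, 100, 1000))
```

When many candidate sites are tested against the same gene set, the resulting p-values would normally need a multiple-testing correction.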
