1.  Transcription Factor Networks in Drosophila melanogaster 
Cell reports  2014;8(6):2031-2043.
Specific cellular fates and functions depend on differential gene expression, which occurs primarily at the transcriptional level, controlled by complex regulatory networks of transcription factors. Transcription factors act through combinatorial interactions with other transcription factors, co-factors and chromatin-remodelling proteins. We present a study of 459 Drosophila melanogaster transcription-related factors, representing approximately half of the established catalogue of transcription factors, in which we define protein-protein interactions using a co-affinity purification mass spectrometry methodology. We probe this network in vivo, demonstrating functional interactions for many interacting proteins and thereby testing the predictive value of our data set. Building on these analyses, we combine regulatory network inference models with physical interactions to define an integrated network, connecting combinatorial transcription factor protein interactions to the transcriptional regulatory network of the cell. We use this integrated network as a tool to connect the functional network of genetic modifiers related to mastermind, a transcriptional co-factor of the Notch pathway.
PMCID: PMC4403667  PMID: 25242320
Drosophila; Transcription Factor; Interactome
2.  Diversity and dynamics of the Drosophila transcriptome 
Nature  2014;512(7515):393-399.
Animal transcriptomes are dynamic, each cell type, tissue and organ system expressing an ensemble of transcript isoforms that give rise to substantial diversity. We identified new genes, transcripts, and proteins using poly(A)+ RNA sequence from Drosophila melanogaster cultured cell lines, dissected organ systems, and environmental perturbations. We found that a small set of mostly neural-specific genes has the potential to encode thousands of transcripts each through extensive alternative promoter usage and RNA splicing. The magnitudes of splicing changes are larger between tissues than between developmental stages, and most sex-specific splicing is gonad-specific. Gonads express hundreds of previously unknown coding and long noncoding RNAs (lncRNAs), some of which are antisense to protein-coding genes and produce short regulatory RNAs. Furthermore, previously identified pervasive intergenic transcription occurs primarily within newly identified introns. The fly transcriptome is substantially more complex than previously recognized, arising from combinatorial usage of promoters, splice sites, and polyadenylation sites.
PMCID: PMC4152413  PMID: 24670639
3.  Comparative Analysis of the Transcriptome across Distant Species 
Gerstein, Mark B. | Rozowsky, Joel | Yan, Koon-Kiu | Wang, Daifeng | Cheng, Chao | Brown, James B. | Davis, Carrie A | Hillier, LaDeana | Sisu, Cristina | Li, Jingyi Jessica | Pei, Baikang | Harmanci, Arif O. | Duff, Michael O. | Djebali, Sarah | Alexander, Roger P. | Alver, Burak H. | Auerbach, Raymond | Bell, Kimberly | Bickel, Peter J. | Boeck, Max E. | Boley, Nathan P. | Booth, Benjamin W. | Cherbas, Lucy | Cherbas, Peter | Di, Chao | Dobin, Alex | Drenkow, Jorg | Ewing, Brent | Fang, Gang | Fastuca, Megan | Feingold, Elise A. | Frankish, Adam | Gao, Guanjun | Good, Peter J. | Guigó, Roderic | Hammonds, Ann | Harrow, Jen | Hoskins, Roger A. | Howald, Cédric | Hu, Long | Huang, Haiyan | Hubbard, Tim J. P. | Huynh, Chau | Jha, Sonali | Kasper, Dionna | Kato, Masaomi | Kaufman, Thomas C. | Kitchen, Robert R. | Ladewig, Erik | Lagarde, Julien | Lai, Eric | Leng, Jing | Lu, Zhi | MacCoss, Michael | May, Gemma | McWhirter, Rebecca | Merrihew, Gennifer | Miller, David M. | Mortazavi, Ali | Murad, Rabi | Oliver, Brian | Olson, Sara | Park, Peter J. | Pazin, Michael J. | Perrimon, Norbert | Pervouchine, Dmitri | Reinke, Valerie | Reymond, Alexandre | Robinson, Garrett | Samsonova, Anastasia | Saunders, Gary I. | Schlesinger, Felix | Sethi, Anurag | Slack, Frank J. | Spencer, William C. | Stoiber, Marcus H. | Strasbourger, Pnina | Tanzer, Andrea | Thompson, Owen A. | Wan, Kenneth H. | Wang, Guilin | Wang, Huaien | Watkins, Kathie L. | Wen, Jiayu | Wen, Kejia | Xue, Chenghai | Yang, Li | Yip, Kevin | Zaleski, Chris | Zhang, Yan | Zheng, Henry | Brenner, Steven E. | Graveley, Brenton R. | Celniker, Susan E. | Gingeras, Thomas R | Waterston, Robert
Nature  2014;512(7515):445-448.
PMCID: PMC4155737  PMID: 25164755
4.  Long-read, whole-genome shotgun sequence data for five model organisms 
Scientific Data  2014;1:140045.
Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.
PMCID: PMC4365909  PMID: 25977796
5.  Genome-guided transcript assembly from integrative analysis of RNA sequence data 
Nature biotechnology  2014;32(4):341-346.
The identification of full-length transcripts entirely from short-read RNA sequencing data (RNA-seq) remains a challenge in genome annotation pipelines. Here we describe an automated pipeline for genome annotation that integrates RNA-seq and gene-boundary data sets, which we call generalized RNA integration tool, or GRIT. By applying GRIT to Drosophila melanogaster short-read RNA-seq, cap analysis of gene expression (CAGE) and poly(A)-site-seq data collected for the modENCODE project, we recover the vast majority of previously annotated transcripts and double the total number of transcripts cataloged. We find that 20% of protein-coding genes encode multiple protein-localization signals, and that, in 20-day-old adult fly heads, genes with multiple polyadenylation sites are more common than genes with alternate splicing or alternate promoters. When compared to the most widely used transcript assembly tools, GRIT recovers a larger fraction of annotated transcripts at higher precision. GRIT will enable the automated generation of high-quality genome annotations without necessitating extensive manual annotation.
PMCID: PMC4037530  PMID: 24633242
6.  Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures 
Nature  2007;450(7167):219-232.
Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.
PMCID: PMC2474711  PMID: 17994088
7.  Automated protein-DNA interaction screening of Drosophila regulatory elements 
Nature methods  2011;8(12):1065-1070.
Drosophila melanogaster has one of the best characterized metazoan genomes in terms of functionally annotated regulatory elements. To explore how these elements contribute to gene regulation in the context of gene regulatory networks, we need convenient tools to identify the proteins that bind to them. Here, we present the development and validation of a highly automated protein-DNA interaction detection method, enabling the high-throughput yeast one-hybrid-based screening of DNA elements versus an array of full-length, sequence-verified clones containing 647 (over 85%) of predicted Drosophila transcription factors (TFs). Using six well-characterized regulatory elements (82 bp – 1 kb), we identified 33 TF-DNA interactions, of which 27 are novel. To simultaneously validate these interactions and locate the binding sites of the involved TFs, we implemented a novel microfluidics-based approach that enables us to conduct hundreds of gel shift-like assays at once, thus allowing the retrieval of DNA occupancy data for each TF throughout the respective target DNA elements. Finally, we biologically validate several interactions and specifically identify two novel regulators of sine oculis gene expression and hence eye development.
PMCID: PMC3929264  PMID: 22037703
8.  An extracellular interactome of Immunoglobulin and LRR proteins reveals receptor-ligand networks 
Cell  2013;154(1):228-239.
Extracellular domains of cell-surface receptors and ligands mediate cell-cell communication, adhesion, and initiation of signaling events, but most existing protein-protein “interactome” datasets lack information for extracellular interactions. We probed interactions between receptor extracellular domains, focusing on the Immunoglobulin Superfamily (IgSF), Fibronectin type-III (FnIII) and Leucine-rich repeat (LRR) families of Drosophila, a set of 202 proteins, many of which are known to be important in neuronal and developmental functions. Out of 20503 candidate protein pairs tested, we observed 106 interactions, 83 of which were previously unknown. We ‘deorphanized’ the 20-member subfamily of defective in proboscis IgSF proteins, showing that they selectively interact with an 11-member subfamily of previously uncharacterized IgSF proteins. Both subfamilies interact with a single common ‘orphan’ LRR protein. We also observed new interactions between Hedgehog and EGFR pathway components. Several of these interactions could be visualized in live-dissected embryos, demonstrating that this approach can identify physiologically relevant receptor-ligand pairs.
PMCID: PMC3756661  PMID: 23827685
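The figure of 20503 candidate pairs in the abstract above is consistent with testing every unordered pair among the 202 proteins, each protein against itself included. That reading is an inference from the two numbers, not something the abstract states. A minimal sketch of the arithmetic, with the function name chosen here purely for illustration:

```python
from itertools import combinations_with_replacement


def count_candidate_pairs(n_proteins: int) -> int:
    """Number of unordered pairs with self-pairs included: n * (n + 1) / 2."""
    return n_proteins * (n_proteins + 1) // 2


# Assumption: the screen covers all unordered pairs, homophilic pairs included.
n = 202
pairs = count_candidate_pairs(n)

# Cross-check the closed form by explicit enumeration of the pairs.
enumerated = sum(1 for _ in combinations_with_replacement(range(n), 2))

# Fraction of candidate pairs reported as interacting (106 in the abstract).
hit_rate = 106 / pairs
```

Under this assumption the 106 observed interactions amount to roughly half a percent of the pairs tested.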
9.  Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome 
Genome Biology  2002;3(12):research0086.1-86.2.
It is widely accepted that comparative sequence data can aid the functional annotation of genome sequences; however, the most informative species and features of genome evolution for comparison remain to be determined.
We analyzed conservation in eight genomic regions (apterous, even-skipped, fushi tarazu, twist, and Rhodopsins 1, 2, 3 and 4) from four Drosophila species (D. erecta, D. pseudoobscura, D. willistoni, and D. littoralis) covering more than 500 kb of the D. melanogaster genome. All D. melanogaster genes (and 78-82% of coding exons) identified in divergent species such as D. pseudoobscura show evidence of functional constraint. Addition of a third species can reveal functional constraint in otherwise non-significant pairwise exon comparisons. Microsynteny is largely conserved, with rearrangement breakpoints, novel transposable element insertions, and gene transpositions occurring in similar numbers. Rates of amino-acid substitution are higher in uncharacterized genes relative to genes that have previously been studied. Conserved non-coding sequences (CNCSs) tend to be spatially clustered with conserved spacing between CNCSs, and clusters of CNCSs can be used to predict enhancer sequences.
Our results provide the basis for choosing species whose genome sequences would be most useful in aiding the functional annotation of coding and cis-regulatory sequences in Drosophila. Furthermore, this work shows how decoding the spatial organization of conserved sequences, such as the clustering of CNCSs, can complement efforts to annotate eukaryotic genomes on the basis of sequence conservation alone.
PMCID: PMC151188  PMID: 12537575
10.  Finishing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromatic genome sequence 
Genome Biology  2002;3(12):research0079.1-79.14.
The Drosophila melanogaster genome was the first metazoan genome to be sequenced by whole-genome shotgun. Now, the sequence has been finished in a process designed to close gaps, improve sequence quality and validate the assembly.
The Drosophila melanogaster genome was the first metazoan genome to have been sequenced by the whole-genome shotgun (WGS) method. Two issues relating to this achievement were widely debated in the genomics community: how correct is the sequence with respect to base-pair (bp) accuracy and frequency of assembly errors? And, how difficult is it to bring a WGS sequence to the accepted standard for finished sequence? We are now in a position to answer these questions.
Our finishing process was designed to close gaps, improve sequence quality and validate the assembly. Sequence traces derived from the WGS and draft sequencing of individual bacterial artificial chromosomes (BACs) were assembled into BAC-sized segments. These segments were brought to high quality, and then joined to constitute the sequence of each chromosome arm. Overall assembly was verified by comparison to a physical map of fingerprinted BAC clones. In the current version of the 116.9 Mb euchromatic genome, called Release 3, the six euchromatic chromosome arms are represented by 13 scaffolds with a total of 37 sequence gaps. We compared Release 3 to Release 2; in autosomal regions of unique sequence, the error rate of Release 2 was one in 20,000 bp.
The WGS strategy can efficiently produce a high-quality sequence of a metazoan genome while generating the reagents required for sequence finishing. However, the initial method of repeat assembly was flawed. The sequence we report here, Release 3, is a reliable resource for molecular genetic experimentation and computational analysis.
PMCID: PMC151181  PMID: 12537568
11.  Spatial expression of transcription factors in Drosophila embryonic organ development 
Genome Biology  2013;14(12):R140.
Site-specific transcription factors (TFs) bind DNA regulatory elements to control expression of target genes, forming the core of gene regulatory networks. Despite decades of research, most studies focus on only a small number of TFs and the roles of many remain unknown.
We present a systematic characterization of spatiotemporal gene expression patterns for all known or predicted Drosophila TFs throughout embryogenesis, the first such comprehensive study for any metazoan animal. We generated RNA expression patterns for all 708 TFs by in situ hybridization, annotated the patterns using an anatomical controlled vocabulary, and analyzed TF expression in the context of organ system development. Nearly all TFs are expressed during embryogenesis and more than half are specifically expressed in the central nervous system. Compared to other genes, TFs are enriched early in the development of most organ systems, and throughout the development of the nervous system. Of the 535 TFs with spatially restricted expression, 79% are dynamically expressed in multiple organ systems while 21% show single-organ specificity. Of those expressed in multiple organ systems, 77 TFs are restricted to a single organ system either early or late in development. Expression patterns for 354 TFs are characterized for the first time in this study.
We produced a reference TF dataset for the investigation of gene regulatory networks in embryogenesis, and gained insight into the expression dynamics of the full complement of TFs controlling the development of each organ system.
PMCID: PMC4053779  PMID: 24359758
13.  Computational Identification of Diverse Mechanisms Underlying Transcription Factor-DNA Occupancy 
PLoS Genetics  2013;9(8):e1003571.
ChIP-based genome-wide assays of transcription factor (TF) occupancy have emerged as a powerful, high-throughput method to understand transcriptional regulation, especially on a global scale. This has led to great interest in the underlying biochemical mechanisms that direct TF-DNA binding, with the ultimate goal of computationally predicting a TF's occupancy profile in any cellular condition. In this study, we examined the influence of various potential determinants of TF-DNA binding on a much larger scale than previously undertaken. We used a thermodynamics-based model of TF-DNA binding, called “STAP,” to analyze 45 TF-ChIP data sets from Drosophila embryonic development. We built a cross-validation framework that compares a baseline model, based on the ChIP'ed (“primary”) TF's motif, to more complex models where binding by secondary TFs is hypothesized to influence the primary TF's occupancy. Candidate interacting TFs were chosen based on RNA-SEQ expression data from the time point of the ChIP experiment. We found widespread evidence of both cooperative and antagonistic effects by secondary TFs, and explicitly quantified these effects. We were able to identify multiple classes of interactions, including (1) long-range interactions between primary and secondary motifs (separated by ≤150 bp), suggestive of indirect effects such as chromatin remodeling, (2) short-range interactions with specific inter-site spacing biases, suggestive of direct physical interactions, and (3) overlapping binding sites suggesting competitive binding. Furthermore, by factoring out the previously reported strong correlation between TF occupancy and DNA accessibility, we were able to categorize the effects into those that are likely to be mediated by the secondary TF's effect on local accessibility and those that utilize accessibility-independent mechanisms. Finally, we conducted in vitro pull-down assays to test model-based predictions of short-range cooperative interactions, and found that seven of the eight TF pairs tested physically interact and that some of these interactions mediate cooperative binding to DNA.
Author Summary
Chromatin Immunoprecipitation (ChIP)-based genome-wide assays of transcription factor (TF) occupancy have emerged as a powerful, high throughput method to understand transcriptional regulation, especially on a global scale. Here, we utilize 45 ChIP-chip and ChIP-SEQ data sets from Drosophila to explore the underlying mechanisms of TF-DNA binding. For this, we employ a biophysically motivated computational model, in conjunction with over 300 TF motifs (binding specificities) as well as gene expression and DNA accessibility data from different developmental stages in Drosophila embryos. Our findings provide robust statistical evidence of the role played by TF-TF interactions in shaping genome-wide TF-DNA binding profiles, and thus in directing gene regulation. Our method allows us to go beyond simply recognizing the existence of such interactions, to quantifying their effects on TF occupancy. We are able to categorize the probable mechanisms of these effects as involving direct physical interactions versus accessibility-mediated indirect interactions, long-range versus short-range interactions, and cooperative versus antagonistic interactions. Our analysis reveals widespread evidence of combinatorial regulation present in recently generated ChIP data sets, and sets the stage for rich integrative models of the future that will predict cell type-specific TF occupancy values from sequence and expression data.
PMCID: PMC3731213  PMID: 23935523
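To make the idea of cooperative versus antagonistic effects of a secondary TF concrete, the following is a minimal two-site equilibrium sketch. It is not the STAP model described above; it is a textbook statistical-thermodynamics toy in which the function name, the weights `q_primary` and `q_secondary`, and the interaction term `omega` are all illustrative assumptions.

```python
def primary_occupancy(q_primary: float, q_secondary: float, omega: float) -> float:
    """Probability that the primary site is bound in a two-site equilibrium model.

    q_primary, q_secondary: statistical weights (roughly concentration times
    affinity) of each TF bound on its own site.
    omega: interaction term applied when both sites are bound
    (> 1 cooperative, < 1 antagonistic, 1 independent).
    """
    both = q_primary * q_secondary * omega
    # Partition function over four states: empty, primary only,
    # secondary only, both bound.
    z = 1.0 + q_primary + q_secondary + both
    return (q_primary + both) / z


# Hypothetical weights, chosen only to show the direction of each effect.
independent = primary_occupancy(1.0, 1.0, 1.0)
cooperative = primary_occupancy(1.0, 1.0, 10.0)
antagonistic = primary_occupancy(1.0, 1.0, 0.1)
```

With `omega` equal to 1 the secondary TF drops out and the occupancy reduces to `q_primary / (1 + q_primary)`, which corresponds to a baseline model driven by the primary motif alone.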
14.  Global Patterns of Tissue-Specific Alternative Polyadenylation in Drosophila 
Cell reports  2012;1(3):277-289.
We analyzed the usage and consequences of alternative cleavage and polyadenylation (APA) in Drosophila melanogaster by using >1 billion reads of stranded mRNA-seq across a variety of dissected tissues. Beyond demonstrating that a majority of fly transcripts are subject to APA, we observed broad trends for 3′ untranslated region (UTR) shortening in the testis and lengthening in the central nervous system (CNS); the latter included hundreds of unannotated extensions ranging up to 18 kb. Extensive northern analyses validated the accumulation of full-length neural extended transcripts, and in situ hybridization indicated their spatial restriction to the CNS. Genes encoding RNA binding proteins (RBPs) and transcription factors were preferentially subject to 3′ UTR extensions. Motif analysis indicated enrichment of miRNA and RBP sites in the neural extensions, and their termini were enriched in canonical cis elements that promote cleavage and polyadenylation. Altogether, we reveal broad tissue-specific patterns of APA in Drosophila and transcripts with unprecedented 3′ UTR length in the nervous system.
PMCID: PMC3368434  PMID: 22685694
15.  Identification of Functional Elements and Regulatory Circuits by Drosophila modENCODE 
Roy, Sushmita | Ernst, Jason | Kharchenko, Peter V. | Kheradpour, Pouya | Negre, Nicolas | Eaton, Matthew L. | Landolin, Jane M. | Bristow, Christopher A. | Ma, Lijia | Lin, Michael F. | Washietl, Stefan | Arshinoff, Bradley I. | Ay, Ferhat | Meyer, Patrick E. | Robine, Nicolas | Washington, Nicole L. | Di Stefano, Luisa | Berezikov, Eugene | Brown, Christopher D. | Candeias, Rogerio | Carlson, Joseph W. | Carr, Adrian | Jungreis, Irwin | Marbach, Daniel | Sealfon, Rachel | Tolstorukov, Michael Y. | Will, Sebastian | Alekseyenko, Artyom A. | Artieri, Carlo | Booth, Benjamin W. | Brooks, Angela N. | Dai, Qi | Davis, Carrie A. | Duff, Michael O. | Feng, Xin | Gorchakov, Andrey A. | Gu, Tingting | Henikoff, Jorja G. | Kapranov, Philipp | Li, Renhua | MacAlpine, Heather K. | Malone, John | Minoda, Aki | Nordman, Jared | Okamura, Katsutomo | Perry, Marc | Powell, Sara K. | Riddle, Nicole C. | Sakai, Akiko | Samsonova, Anastasia | Sandler, Jeremy E. | Schwartz, Yuri B. | Sher, Noa | Spokony, Rebecca | Sturgill, David | van Baren, Marijke | Wan, Kenneth H. | Yang, Li | Yu, Charles | Feingold, Elise | Good, Peter | Guyer, Mark | Lowdon, Rebecca | Ahmad, Kami | Andrews, Justen | Berger, Bonnie | Brenner, Steven E. | Brent, Michael R. | Cherbas, Lucy | Elgin, Sarah C. R. | Gingeras, Thomas R. | Grossman, Robert | Hoskins, Roger A. | Kaufman, Thomas C. | Kent, William | Kuroda, Mitzi I. | Orr-Weaver, Terry | Perrimon, Norbert | Pirrotta, Vincenzo | Posakony, James W. | Ren, Bing | Russell, Steven | Cherbas, Peter | Graveley, Brenton R. | Lewis, Suzanna | Micklem, Gos | Oliver, Brian | Park, Peter J. | Celniker, Susan E. | Henikoff, Steven | Karpen, Gary H. | Lai, Eric C. | MacAlpine, David M. | Stein, Lincoln D. | White, Kevin P. | Kellis, Manolis
Science (New York, N.Y.)  2010;330(6012):1787-1797.
To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
PMCID: PMC3192495  PMID: 21177974
16.  The Developmental Transcriptome of Drosophila melanogaster 
Nature  2010;471(7339):473-479.
Drosophila melanogaster is one of the most well-studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons, and RNA editing sites. Full discovery and annotation are prerequisites for understanding how the regulation of transcription, splicing, and RNA editing directs development of this complex organism. We used RNA-Seq, tiling microarrays, and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. Together, these data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.
PMCID: PMC3075879  PMID: 21179090
17.  Dynamic reprogramming of chromatin accessibility during Drosophila embryo development 
Genome Biology  2011;12(5):R43.
The development of complex organisms is believed to involve progressive restrictions in cellular fate. Understanding the scope and features of chromatin dynamics during embryogenesis, and identifying regulatory elements important for directing developmental processes remain key goals of developmental biology.
We used in vivo DNaseI sensitivity to map the locations of regulatory elements, and explore the changing chromatin landscape during the first 11 hours of Drosophila embryonic development. We identified thousands of conserved, developmentally dynamic, distal DNaseI hypersensitive sites associated with spatial and temporal expression patterning of linked genes and with large regions of chromatin plasticity. We observed a nearly uniform balance between developmentally up- and down-regulated DNaseI hypersensitive sites. Analysis of promoter chromatin architecture revealed a novel role for classical core promoter sequence elements in directing temporally regulated chromatin remodeling. Another unexpected feature of the chromatin landscape was the presence of localized accessibility over many protein-coding regions, subsets of which were developmentally regulated or associated with the transcription of genes with prominent maternal RNA contributions in the blastoderm.
Our results provide a global view of the rich and dynamic chromatin landscape of early animal development, as well as novel insights into the organization of developmentally regulated chromatin features.
PMCID: PMC3219966  PMID: 21569360
18.  Quantitative Analysis of the Drosophila Segmentation Regulatory Network Using Pattern Generating Potentials 
PLoS Biology  2010;8(8):e1000456.
A new computational method uses gene expression databases and transcription factor binding specificities to describe regulatory elements in the Drosophila A/P patterning network in unprecedented detail.
Cis-regulatory modules that drive precise spatial-temporal patterns of gene expression are central to the process of metazoan development. We describe a new computational strategy to annotate genomic sequences based on their “pattern generating potential” and to produce quantitative descriptions of transcriptional regulatory networks at the level of individual protein-module interactions. We use this approach to convert the qualitative understanding of interactions that regulate Drosophila segmentation into a network model in which a confidence value is associated with each transcription factor-module interaction. Sequence information from multiple Drosophila species is integrated with transcription factor binding specificities to determine conserved binding site frequencies across the genome. These binding site profiles are combined with transcription factor expression information to create a model to predict module activity patterns. This model is used to scan genomic sequences for the potential to generate all or part of the expression pattern of a nearby gene, obtained from available gene expression databases. Interactions between individual transcription factors and modules are inferred by a statistical method to quantify a factor's contribution to the module's pattern generating potential. We use these pattern generating potentials to systematically describe the location and function of known and novel cis-regulatory modules in the segmentation network, identifying many examples of modules predicted to have overlapping expression activities. Surprisingly, conserved transcription factor binding site frequencies were as effective as experimental measurements of occupancy in predicting module expression patterns or factor-module interactions. Thus, unlike previous module prediction methods, this method predicts not only the location of modules but also their spatial activity pattern and the factors that directly determine this pattern. 
As databases of transcription factor specificities and in vivo gene expression patterns grow, analysis of pattern generating potentials provides a general method to decode transcriptional regulatory sequences and networks.
Author Summary
The developmental program specifying segmentation along the anterior-posterior axis of the Drosophila embryo is one of the best studied examples of transcriptional regulatory networks. Previous work has identified the location and function of dozens of DNA segments called cis-regulatory “modules” that regulate several genes in precise spatial patterns in the early embryo. In many cases, transcription factors that interact with such modules have also been identified. We present a novel computational framework that turns a qualitative and fragmented understanding of modules and factor-module interactions into a quantitative, systems-level view. The formalism utilizes experimentally characterized binding specificities of transcription factors and gene expression patterns to describe how multiple transcription factors (working as activators or repressors) act together in a module to determine its regulatory activity. This formalism can explain the expression patterns of known modules, infer factor-module interactions and quantify the potential of an arbitrary DNA segment to drive a gene's expression. We have also employed databases of gene expression patterns to find novel modules of the regulatory network. As databases of binding motifs and gene expression patterns grow, this new approach provides a general method to decode transcriptional regulatory sequences and networks.
PMCID: PMC2923081  PMID: 20808951
19.  Unlocking the secrets of the genome 
Nature  2009;459(7249):927-930.
Despite the successes of genomics, little is known about how genetic information produces complex organisms. A look at the crucial functional elements of fly and worm genomes could change that.
PMCID: PMC2843545  PMID: 19536255
20.  Sequence Finishing and Mapping of Drosophila melanogaster Heterochromatin 
Science (New York, N.Y.)  2007;316(5831):1625-1628.
Genome sequences for most metazoans and plants are incomplete because of the presence of repeated DNA in the heterochromatin. The heterochromatic regions of Drosophila melanogaster contain 20 million bases (Mb) of sequence amenable to mapping, sequence assembly, and finishing. We describe the generation of 15 Mb of finished or improved heterochromatic sequence with the use of available clone resources and assembly methods. We also constructed a bacterial artificial chromosome–based physical map that spans 13 Mb of the pericentromeric heterochromatin and a cytogenetic map that positions 11 Mb in specific chromosomal locations. We have approached a complete assembly and mapping of the nonsatellite component of Drosophila heterochromatin. The strategy we describe is also applicable to generating substantially more information about heterochromatin in other species, including humans.
PMCID: PMC2825053  PMID: 17569867
21.  Systematic image-driven analysis of the spatial Drosophila embryonic expression landscape 
We created an innovative virtual representation for our large-scale Drosophila in situ expression dataset. We aligned an elliptically shaped mesh composed of small triangular regions to the outline of each embryo. Each triangle defines a unique location in the embryo, and comparing corresponding triangles allows easy identification of similar expression patterns. The virtual representation was used to organize the expression landscape at stages 4-6. We identified regions with similar expression in the embryo and clustered genes with similar expression patterns. We created algorithms to mine the dataset for adjacent non-overlapping patterns and anti-correlated patterns. We were able to mine the dataset to identify co-expressed and putative interacting genes. Using co-expression, we were able to assign putative functions to unknown genes.
Analyzing both temporal and spatial gene expression is essential for understanding development and regulatory networks of multicellular organisms. Interacting genes are commonly expressed in overlapping or adjacent domains. Thus, gene expression patterns can be used to assign putative gene functions and mined to infer candidates for networks.
We have generated a systematic two-dimensional mRNA expression atlas profiling embryonic development of Drosophila melanogaster (Tomancak et al, 2002, 2007). To date, we have collected over 70 000 images for over 6000 genes. To explore spatial relationships between gene expression patterns, we used a novel computational image-processing approach that converts expression patterns from the images into virtual representations (Figure 1). Using a custom-designed automated pipeline, for each image, we segmented and aligned the outline of the embryo to an elliptically shaped mesh composed of 311 small triangular regions, each defining a unique location within the embryo. By comparing corresponding triangles, we produced a distance score to identify similar patterns. We generated these triangulated images (TIs) for our entire data set at all developmental stages and demonstrated that this representation can be used as an objective, computationally defined description of expression in in situ hybridization images from various sources, including images from the literature.
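The comparison of corresponding triangles can be pictured with a minimal sketch in which each TI is a list of 311 expression intensities, one per triangle, and the distance score is the mean absolute difference across triangles. The text does not specify the metric, so the mean absolute difference, the function names, and the toy patterns below are all assumptions made for illustration.

```python
N_TRIANGLES = 311  # mesh size reported in the text

def ti_distance(ti_a, ti_b):
    """Mean absolute difference over corresponding triangles (assumed metric)."""
    if len(ti_a) != len(ti_b):
        raise ValueError("triangulated images must use the same mesh")
    return sum(abs(a - b) for a, b in zip(ti_a, ti_b)) / len(ti_a)

def most_similar(query, library):
    """Return (name, distance) pairs sorted from most to least similar."""
    scored = [(name, ti_distance(query, ti)) for name, ti in library.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Toy TIs (made-up values): an anterior domain, the same domain stained
# more weakly, and a posterior domain.
anterior = [1.0 if i < 100 else 0.0 for i in range(N_TRIANGLES)]
anterior_weak = [0.8 if i < 100 else 0.0 for i in range(N_TRIANGLES)]
posterior = [1.0 if i >= 211 else 0.0 for i in range(N_TRIANGLES)]

ranking = most_similar(
    anterior, {"anterior_weak": anterior_weak, "posterior": posterior})
```

Because every TI shares the same mesh, triangle `i` refers to the same embryo location in every image, which is what makes an element-wise comparison meaningful.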
We used the TIs to conduct a comprehensive analysis of the expression landscape. To this end, we created a novel approach to temporally sort and compact TIs to a non-redundant data set suitable for further computational processing. Although generally applicable for all developmental stages, for this study, we focused on developmental stages 4–6. For this stage range, we reduced the initial set of about 5800 TIs to 553 TIs containing 364 genes. Using this filtered data set, to discover how expression subdivides the embryo into regions, we clustered areas with similar expression and demonstrated that expression patterns divide the early embryo into distinct spatial regions resembling a fate map (Figure 3). To discover the range of unique expression patterns, we used affinity propagation clustering (Frey and Dueck, 2007) to group TIs with similar patterns and identified 39 clusters each representing a distinct pattern class. We integrated the remaining genes into the 39 clusters and studied the distribution of expression patterns and the relationships between the clusters.
The clustered expression patterns were used to identify putative positive and negative regulatory interactions. The similar TIs in each cluster grouped not only already known genes with related functions, but also previously undescribed genes. A comparative analysis identified subtle differences between the genes within each expression cluster. To investigate these differences, we developed a novel Markov Random Field (MRF) segmentation algorithm to extract patterns. We then extended the MRF algorithm to detect shared expression boundaries, generate similarity measurements, and discriminate even faint/uncertain patterns between two TIs. This enabled us to identify more subtle partial expression pattern overlaps and adjacent non-overlapping patterns. For example, by conducting this analysis on the cluster containing the gene snail, we identified the previously known huckebein, which restricts snail expression (Reuter and Leptin, 1994), and zfh1, which interacts with tinman (Broihier et al, 1998; Su et al, 1999).
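To make the notions of anti-correlated and adjacent non-overlapping patterns concrete, the sketch below uses a much simpler stand-in for the MRF segmentation: a fixed intensity threshold decides which triangles count as expressed, Pearson correlation flags anti-correlated pairs, and a neighbor map flags pairs whose expressed regions touch without overlapping. The threshold, the `neighbors` map, and the eight-region toy mesh are assumptions; this is not the MRF algorithm described above.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length intensity lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat pattern carries no correlation signal
    return cov / (sx * sy)

def expressed(ti, threshold=0.5):
    """Indices of triangles whose intensity reaches the (assumed) threshold."""
    return {i for i, v in enumerate(ti) if v >= threshold}

def adjacent_non_overlapping(ti_a, ti_b, neighbors, threshold=0.5):
    """True if the expressed regions share no triangle but touch each other."""
    on_a, on_b = expressed(ti_a, threshold), expressed(ti_b, threshold)
    if on_a & on_b:
        return False
    return any(n in on_b for i in on_a for n in neighbors[i])

# Toy mesh: eight regions in a row, each adjacent to its left/right neighbor.
SIZE = 8
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < SIZE]
             for i in range(SIZE)}

gene_a = [1, 1, 1, 1, 0, 0, 0, 0]   # anterior half
gene_b = [0, 0, 0, 0, 1, 1, 1, 1]   # posterior half, abuts gene_a
gene_c = [0, 0, 0, 0, 0, 0, 1, 1]   # posterior tip, separated from gene_a

r_ab = pearson(gene_a, gene_b)
touching_ab = adjacent_non_overlapping(gene_a, gene_b, neighbors)
touching_ac = adjacent_non_overlapping(gene_a, gene_c, neighbors)
```

On a real mesh the neighbor map would be derived from triangles that share an edge, which this toy row of regions only approximates.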
By studying the functions of known genes, we assigned putative developmental roles to each of the 39 clusters. Of the 1800 genes investigated, only half had previously assigned functions.
Representing expression patterns with geometric meshes facilitates the analysis of a complex process involving thousands of genes. This approach is complementary to the cellular resolution 3D atlas for the Drosophila embryo (Fowlkes et al, 2008). Our method can be used as a rapid, fully automated, high-throughput approach to obtain a map of co-expression, which will serve to select specific genes for detailed multiplex in situ hybridization and confocal analysis for a fine-grain atlas. Our data are similar to the data in the literature, and research groups studying reporter constructs, mutant animals, or orthologs can easily produce in situ hybridizations. TIs can be readily created and provide representations that are comparable both to each other and to our data set. We have demonstrated that our approach can be used for predicting relationships in regulatory and developmental pathways.
Discovery of temporal and spatial patterns of gene expression is essential for understanding the regulatory networks and development in multicellular organisms. We analyzed the images from our large-scale spatial expression data set of early Drosophila embryonic development and present a comprehensive computational image analysis of the expression landscape. For this study, we created an innovative virtual representation of embryonic expression patterns using an elliptically shaped mesh grid that allows us to make quantitative comparisons of gene expression using a common frame of reference. Demonstrating the power of our approach, we used gene co-expression to identify distinct expression domains in the early embryo; the result is surprisingly similar to the fate map determined using laser ablation. We also used a clustering strategy to find genes with similar patterns and developed new analysis tools to detect variation within consensus patterns, adjacent non-overlapping patterns, and anti-correlated patterns. Of the 1800 genes investigated, only half had previously assigned functions. The known genes suggest developmental roles for the clusters, and identification of related patterns predicts requirements for co-occurring biological functions.
PMCID: PMC2824522  PMID: 20087342
biological function; embryo; gene expression; in situ hybridization; Markov Random Field
22.  Determination of gene expression patterns using high-throughput RNA in situ hybridization to whole-mount Drosophila embryos 
Nature protocols  2009;4(5):605-618.
We describe a high-throughput protocol for RNA in situ hybridization (ISH) to Drosophila embryos in 96-well format. cDNA or genomic DNA templates are amplified by PCR and then digoxigenin-labeled ribonucleotides are incorporated into anti-sense RNA probes by in vitro transcription. The quality of each probe is evaluated prior to in situ hybridization using an RNA Probe Quantification (dot blot) assay. RNA probes are hybridized to fixed, mixed-staged Drosophila embryos in 96-well plates. The resulting stained embryos can be examined and photographed immediately or stored at 4°C for later analysis. Starting with fixed, staged embryos, the protocol takes 6 days from probe template production through hybridization. Preparation of fixed embryos requires a minimum of two weeks to collect embryos representing all stages. The method has been used to determine the expression patterns of over 6000 genes throughout embryogenesis.
PMCID: PMC2780369  PMID: 19360017
23.  Functional Evolution of cis-Regulatory Modules at a Homeotic Gene in Drosophila 
PLoS Genetics  2009;5(11):e1000709.
It is a long-held belief in evolutionary biology that the rate of molecular evolution for a given DNA sequence is inversely related to the level of functional constraint. This belief holds true for the protein-coding homeotic (Hox) genes originally discovered in Drosophila melanogaster. Expression of the Hox genes in Drosophila embryos is essential for body patterning and is controlled by an extensive array of cis-regulatory modules (CRMs). How the regulatory modules functionally evolve in different species is not clear. A comparison of the CRMs for the Abdominal-B gene from different Drosophila species reveals relatively low levels of overall sequence conservation. However, embryonic enhancer CRMs from other Drosophila species direct transgenic reporter gene expression in the same spatial and temporal patterns during development as their D. melanogaster orthologs. Bioinformatic analysis reveals the presence of short conserved sequences within defined CRMs, representing gap and pair-rule transcription factor binding sites. One predicted binding site for the gap transcription factor KRUPPEL in the IAB5 CRM was found to be altered in Superabdominal (Sab) mutations. In Sab mutant flies, the third abdominal segment is transformed into a copy of the fifth abdominal segment. A model for KRUPPEL-mediated repression at this binding site is presented. These findings challenge our current understanding of the relationship between sequence evolution at the molecular level and functional activity of a CRM. While the overall sequence conservation at Drosophila CRMs is not distinctive from neighboring genomic regions, functionally critical transcription factor binding sites within embryonic enhancer CRMs are highly conserved. These results have implications for understanding mechanisms of gene expression during embryonic development, enhancer function, and the molecular evolution of eukaryotic regulatory modules.
Author Summary
The fertilized animal embryo is a mass of uniform cells that becomes a complex, segmented, and highly organized structure of differentiated cells through the process of development. This vital process is controlled by networks of developmental genes interacting with each other on the molecular level. Because these genes are crucial for animal development, they are conserved both in function and at the DNA sequence level in related species. We have examined critical DNA sequence modules which regulate genes that pattern the early embryo in different species of the fruit fly. We found that despite rapid evolution of the DNA sequences, the regulatory sequences from one fruit fly species are able to operate when tested in another fruit fly species. Further analysis reveals that there are sequences within these regulatory DNA modules which are conserved across different species and which are critical for regulatory function. These conserved sequences represent critical binding sites for protein transcription factors. These findings have important implications for our understanding of gene regulation during development and evolution across diverse animal species ranging from the fruit fly to humans.
PMCID: PMC2763271  PMID: 19893611
24.  Comparative Genomics of the Eukaryotes 
Science (New York, N.Y.)  2000;287(5461):2204-2215.
A comparative analysis of the genomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae—and the proteins they are predicted to encode—was undertaken in the context of cellular, developmental, and evolutionary processes. The nonredundant protein sets of flies and worms are similar in size and are only twice that of yeast, but different gene families are expanded in each genome, and the multidomain proteins and signaling pathways of the fly and worm are far more complex than those of yeast. The fly has orthologs to 177 of the 289 human disease genes examined and provides the foundation for rapid analysis of some of the basic processes involved in human disease.
PMCID: PMC2754258  PMID: 10731134
25.  Regulation of Early Endosomal Entry by the Drosophila Tumor Suppressors Rabenosyn and Vps45 
Molecular Biology of the Cell  2008;19(10):4167-4176.
The small GTPase Rab5 has emerged as an important regulator of animal development, and it is essential for endocytic trafficking. However, the mechanisms that link Rab5 activation to cargo entry into early endosomes remain unclear. We show here that Drosophila Rabenosyn (Rbsn) is a Rab5 effector that bridges an interaction between Rab5 and the Sec1/Munc18-family protein Vps45, and we further identify the syntaxin Avalanche (Avl) as a target for Vps45 activity. Rbsn and Vps45, like Avl and Rab5, are specifically localized to early endosomes and are required for endocytosis. Ultrastructural analysis of rbsn, Vps45, avl, and Rab5 null mutant cells, which show identical defects, demonstrates that all four proteins are required for vesicle fusion to form early endosomes. These defects lead to loss of epithelial polarity in mutant tissues, which overproliferate to form neoplastic tumors. This work represents the first characterization of a Rab5 effector as a tumor suppressor, and it provides in vivo evidence for a Rbsn–Vps45 complex on early endosomes that links Rab5 to the SNARE fusion machinery.
PMCID: PMC2555928  PMID: 18685079

