1.  An Age-of-Allele Test of Neutrality for Transposable Element Insertions 
Genetics  2013;196(2):523-538.
How natural selection acts to limit the proliferation of transposable elements (TEs) in genomes has been of interest to evolutionary biologists for many years. To describe TE dynamics in populations, previous studies have used models of transposition–selection equilibrium that assume a constant rate of transposition. However, since TE invasions are known to happen in bursts through time, this assumption may not be reasonable. Here we propose a test of neutrality for TE insertions that does not rely on the assumption of a constant transposition rate. We consider the case of TE insertions that have been ascertained from a single haploid reference genome sequence. By conditioning on the age of an individual TE insertion allele (inferred by the number of unique substitutions that have occurred within the particular TE sequence since insertion), we determine the probability distribution of the insertion allele frequency in a population sample under neutrality. Taking models of varying population size into account, we then evaluate predictions of our model against allele frequency data from 190 retrotransposon insertions sampled from North American and African populations of Drosophila melanogaster. Using this nonequilibrium neutral model, we are able to explain ∼80% of the variance in TE insertion allele frequencies based on age alone. Controlling for both nonequilibrium dynamics of transposition and host demography, we provide evidence for negative selection acting against most TEs as well as for positive selection acting on a small subset of TEs. Our work establishes a new framework for the analysis of the evolutionary forces governing large insertion mutations like TEs, gene duplications, or other copy number variants.
doi:10.1534/genetics.113.158147
PMCID: PMC3914624  PMID: 24336751
transposable elements (TEs); test of neutrality; Drosophila melanogaster; genome evolution; population genomics
2.  A proposal for the reference-based annotation of de novo transposable element insertions 
Mobile Genetic Elements  2012;2(1):51-54.
Understanding the causes and consequences of transposable element (TE) activity in the genomic era requires sophisticated bioinformatics approaches to accurately identify individual insertion sites. Next-generation sequencing technology now makes it possible to rapidly identify new TE insertions using resequencing data, opening up new possibilities to study the nature of TE-induced mutation and the target site preferences of different TE families. While the identification of new TE insertion sites is seemingly a simple task, the mechanisms of transposition present unique challenges for the annotation of de novo transposable element insertions mapped to a reference genome. Here I discuss these challenges and propose a framework for the annotation of de novo TE insertions that accommodates known mechanisms of TE insertion and established coordinate systems for genome annotation.
doi:10.4161/mge.19479
PMCID: PMC3383450  PMID: 22754753
coordinate systems; genome bioinformatics; next generation sequencing; target site duplications; transposable elements
4.  The Drosophila melanogaster Genetic Reference Panel 
Nature  2012;482(7384):173-178.
A major challenge of biology is understanding the relationship between molecular genetic variation and variation in quantitative traits, including fitness. This relationship determines our ability to predict phenotypes from genotypes and to understand how evolutionary forces shape variation within and between species. Previous efforts to dissect the genotype-phenotype map were based on incomplete genotypic information. Here, we describe the Drosophila melanogaster Genetic Reference Panel (DGRP), a community resource for analysis of population genomics and quantitative traits. The DGRP consists of fully sequenced inbred lines derived from a natural population. Population genomic analyses reveal reduced polymorphism in centromeric autosomal regions and the X chromosome, evidence for positive and negative selection, and rapid evolution of the X chromosome. Many variants in novel genes, most at low frequency, are associated with quantitative traits and explain a large fraction of the phenotypic variance. The DGRP facilitates genotype-phenotype mapping using the power of Drosophila genetics.
doi:10.1038/nature10811
PMCID: PMC3683990  PMID: 22318601
5.  REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila 
Nucleic Acids Research  2007;36(Database issue):D594-D598.
The identification and study of the cis-regulatory elements that control gene expression are important areas of biological research, but few resources exist to facilitate large-scale bioinformatics studies of cis-regulation in metazoan species. Drosophila melanogaster, with its well-annotated genome, exceptional resources for comparative genomics and long history of experimental studies of transcriptional regulation, represents the ideal system for regulatory bioinformatics. We have merged two existing Drosophila resources, the REDfly database of cis-regulatory modules and the FlyReg database of transcription factor binding sites (TFBSs), into a single integrated database containing extensive annotation of empirically validated cis-regulatory modules and their constituent binding sites. With the enhanced functionality made possible through this integration of TFBS data into REDfly, together with additional improvements to the REDfly infrastructure, we have constructed a one-stop portal for Drosophila cis-regulatory data that will serve as a powerful resource for both computational and experimental studies of transcriptional regulation. REDfly is freely accessible at http://redfly.ccr.buffalo.edu.
doi:10.1093/nar/gkm876
PMCID: PMC2238825  PMID: 18039705
6.  Population Genomics of the Wolbachia Endosymbiont in Drosophila melanogaster 
PLoS Genetics  2012;8(12):e1003129.
Wolbachia are maternally inherited symbiotic bacteria, commonly found in arthropods, which are able to manipulate the reproduction of their host in order to maximise their transmission. The evolutionary history of endosymbionts like Wolbachia can be revealed by integrating information on infection status in natural populations with patterns of sequence variation in Wolbachia and host mitochondrial genomes. Here we use whole-genome resequencing data from 290 lines of Drosophila melanogaster from North America, Europe, and Africa to predict Wolbachia infection status, estimate relative cytoplasmic genome copy number, and reconstruct Wolbachia and mitochondrial genome sequences. Overall, 63% of Drosophila strains were predicted to be infected with Wolbachia by our in silico analysis pipeline, which shows 99% concordance with infection status determined by diagnostic PCR. Complete Wolbachia and mitochondrial genomes show congruent phylogenies, consistent with strict vertical transmission through the maternal cytoplasm and imperfect transmission of Wolbachia. Bayesian phylogenetic analysis reveals that the most recent common ancestor of all Wolbachia and mitochondrial genomes in D. melanogaster dates to around 8,000 years ago. We find evidence for a recent global replacement of ancestral Wolbachia and mtDNA lineages, but our data suggest that the derived wMel lineage arose several thousand years ago, not in the 20th century as previously proposed. Our data also provide evidence that this global replacement event is incomplete and is likely to be one of several similar incomplete replacement events that have occurred since the out-of-Africa migration that allowed D. melanogaster to colonize worldwide habitats. This study provides a complete genomic analysis of the evolutionary mode and temporal dynamics of the D. melanogaster–Wolbachia symbiosis, as well as important resources for further analyses of the impact of Wolbachia on host biology.
Author Summary
Host–microbe interactions play important roles in the physiology, development, and ecology of many organisms. Studying how hosts and their microbial symbionts evolve together over time is crucial for understanding the impact that microbes have on host biology. With the advent of high-throughput sequencing technologies, it is now possible to obtain complete genomic information for hosts and their associated microbes. Here we use whole-genome sequences from ∼300 strains of the fruitfly Drosophila melanogaster to reveal the evolutionary history of this model species and its intracellular bacterial symbiont Wolbachia. The major findings of this study are that Wolbachia in D. melanogaster is inherited strictly through the egg with no evidence of horizontal transfer from other species, that the genealogies of Wolbachia and mitochondrial genomes are virtually the same, and that both Wolbachia and mitochondrial genomes show evidence for a recent incomplete global replacement event, which has left remnant lineages in North America, Europe, and Africa. We also use the fact that Wolbachia and mitochondrial genomes have the same genealogy to estimate the rate of molecular evolution for Wolbachia, which allows us to put dates on key events in the history of this important host–microbe model system.
doi:10.1371/journal.pgen.1003129
PMCID: PMC3527207  PMID: 23284297
7.  Evolutionary Genomics of Transposable Elements in Saccharomyces cerevisiae 
PLoS ONE  2012;7(11):e50978.
Saccharomyces cerevisiae is one of the premier model systems for studying the genomics and evolution of transposable elements. The availability of the S. cerevisiae genome led to unprecedented insights into its five known transposable element families (the LTR retrotransposons Ty1-Ty5) in the years shortly after its completion. However, subsequent advances in bioinformatics tools for analysing transposable elements and the recent availability of genome sequences for multiple strains and species of yeast motivates new investigations into Ty evolution in S. cerevisiae. Here we provide a comprehensive phylogenetic and population genetic analysis of all Ty families in S. cerevisiae based on a systematic re-annotation of Ty elements in the S288c reference genome. We show that previous annotation efforts have underestimated the total copy number of Ty elements for all known families. In addition, we identify a new family of Ty3-like elements related to the S. paradoxus Ty3p which is composed entirely of degenerate solo LTRs. Phylogenetic analyses of LTR sequences identified three families with short-branch, recently active clades nested among long branch, inactive insertions (Ty1, Ty3, Ty4), one family with essentially all recently active elements (Ty2) and two families with only inactive elements (Ty3p and Ty5). Population genomic data from 38 additional strains of S. cerevisiae show that the majority of Ty insertions in the S288c reference genome are fixed in the species, with insertions in active clades being predominantly polymorphic and insertions in inactive clades being predominantly fixed. Finally, we use comparative genomic data to provide evidence that the Ty2 and Ty3p families have arisen in the S. cerevisiae genome by horizontal transfer. 
Our results demonstrate that the genome of a single individual contains important information about the state of TE population dynamics within a species and suggest that horizontal transfer may play an important role in shaping the genomic diversity of transposable elements in unicellular eukaryotes.
doi:10.1371/journal.pone.0050978
PMCID: PMC3511429  PMID: 23226439
8.  BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events 
Bioinformatics  2012;28(16):2154-2161.
Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research.
Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative.
Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing.
Contact: martin.gerner@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts332
PMCID: PMC3413385  PMID: 22711795
9.  Whole Genome Resequencing Reveals Natural Target Site Preferences of Transposable Elements in Drosophila melanogaster 
PLoS ONE  2012;7(2):e30008.
Transposable elements are mobile DNA sequences that integrate into host genomes using diverse mechanisms with varying degrees of target site specificity. While the target site preferences of some engineered transposable elements are well studied, the natural target preferences of most transposable elements are poorly characterized. Using population genomic resequencing data from 166 strains of Drosophila melanogaster, we identified over 8,000 new insertion sites not present in the reference genome sequence that we used to decode the natural target preferences of 22 families of transposable element in this species. We found that terminal inverted repeat transposon and long terminal repeat retrotransposon families present clade-specific target site duplications and target site sequence motifs. Additionally, we found that the sequence motifs at transposable element target sites are always palindromes that extend beyond the target site duplication. Our results demonstrate the utility of population genomics data for high-throughput inference of transposable element targeting preferences in the wild and establish general rules for terminal inverted repeat transposon and long terminal repeat retrotransposon target site selection in eukaryotic genomes.
doi:10.1371/journal.pone.0030008
PMCID: PMC3276498  PMID: 22347367
10.  pubmed2ensembl: A Resource for Mining the Biological Literature on Genes 
PLoS ONE  2011;6(9):e24716.
Background
The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms dedicated teams manually curate publications about genes; however for species with no such dedicated staff many thousands of articles are never mapped to genes or genomic regions.
Methodology/Principal Findings
To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data.
Conclusion/Significance
By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.
doi:10.1371/journal.pone.0024716
PMCID: PMC3183000  PMID: 21980353
11.  The GNAT library for local and remote gene mention normalization 
Bioinformatics  2011;27(19):2769-2771.
Summary: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987.
Availability: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net.
Contact: jorg.hakenberg@roche.com
doi:10.1093/bioinformatics/btr455
PMCID: PMC3179658  PMID: 21813477
12.  Annotating genes and genomes with DNA sequences extracted from biomedical articles 
Bioinformatics  2011;27(7):980-986.
Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study.
Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments.
Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data.
Availability and implementation: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org.
Contact: maximilianh@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr043
PMCID: PMC3065681  PMID: 21325301
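The sequence-extraction idea described in the abstract above can be illustrated with a minimal sketch. This is not the text2genome implementation: the regular expression, the `MIN_LENGTH` threshold and the function name `extract_dna_sequences` are all assumptions made for illustration, and the mapping of extracted sequences to genomes is left out entirely.

```python
import re

# Hypothetical threshold: shorter runs are too likely to be ordinary English words.
MIN_LENGTH = 20

# A run of nucleotide letters, optionally interrupted by spaces or hyphens
# (as happens when primers are typeset in blocks). Must start and end on a base.
_CANDIDATE = re.compile(r"[ACGTacgt][ACGTacgt\s\-]*[ACGTacgt]")


def extract_dna_sequences(text, min_length=MIN_LENGTH):
    """Return upper-cased nucleotide runs with at least min_length bases."""
    sequences = []
    for match in _CANDIDATE.finditer(text):
        # Drop the separators that were tolerated inside the run.
        seq = re.sub(r"[\s\-]", "", match.group(0)).upper()
        if len(seq) >= min_length:
            sequences.append(seq)
    return sequences


if __name__ == "__main__":
    sentence = ("The forward primer 5'-ACGTACGTAC GTACGTACGTAC-3' "
                "was used to amplify the cat gene.")
    print(extract_dna_sequences(sentence))
```

The length filter is what keeps short English words built from the letters a, c, g and t (such as "cat") out of the result; a real system would presumably need further filtering before attempting any genome mapping.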
13.  REDfly v3.0: toward a comprehensive database of transcriptional regulatory elements in Drosophila 
Nucleic Acids Research  2010;39(Database issue):D118-D123.
The REDfly database of Drosophila transcriptional cis-regulatory elements provides the broadest and most comprehensive available resource for experimentally validated cis-regulatory modules and transcription factor binding sites among the metazoa. The third major release of the database extends the utility of REDfly as a powerful tool for both computational and experimental studies of transcription regulation. REDfly v3.0 includes the introduction of new data classes to expand the types of regulatory elements annotated in the database along with a roughly 40% increase in the number of records. A completely redesigned interface improves access for casual and power users alike; among other features it now automatically provides graphical views of the genome, displays images of reporter gene expression and implements improved capabilities for database searching and results filtering. REDfly is freely accessible at http://redfly.ccr.buffalo.edu.
doi:10.1093/nar/gkq999
PMCID: PMC3013816  PMID: 20965965
14.  Correction: Evolutionary Systems Biology of Amino Acid Biosynthetic Cost in Yeast 
PLoS ONE  2010;5(10):10.1371/annotation/b60feca4-9a4f-4311-8a07-4bf16a2f9316.
doi:10.1371/annotation/b60feca4-9a4f-4311-8a07-4bf16a2f9316
PMCID: PMC2951983
15.  Evolutionary Systems Biology of Amino Acid Biosynthetic Cost in Yeast 
PLoS ONE  2010;5(8):e11935.
Every protein has a biosynthetic cost to the cell based on the synthesis of its constituent amino acids. In order to optimise growth and reproduction, natural selection is expected, where possible, to favour the use of proteins whose constituents are cheaper to produce, as reduced biosynthetic cost may confer a fitness advantage to the organism. Quantifying the cost of amino acid biosynthesis presents challenges, since energetic requirements may change across different cellular and environmental conditions. We developed a systems biology approach to estimate the cost of amino acid synthesis based on genome-scale metabolic models and investigated the effects of the cost of amino acid synthesis on Saccharomyces cerevisiae gene expression and protein evolution. First, we used our two new and six previously reported measures of amino acid cost in conjunction with codon usage bias, tRNA gene number and atomic composition to identify which of these factors best predict transcript and protein levels. Second, we compared amino acid cost with rates of amino acid substitution across four species in the genus Saccharomyces. Regardless of which cost measure is used, amino acid biosynthetic cost is weakly associated with transcript and protein levels. In contrast, we find that biosynthetic cost and amino acid substitution rates show a negative correlation, but for only a subset of cost measures. In the economy of the yeast cell, we find that the cost of amino acid synthesis plays a limited role in shaping transcript and protein expression levels compared to that of translational optimisation. Biosynthetic cost does, however, appear to affect rates of amino acid evolution in Saccharomyces, suggesting that expensive amino acids may only be used when they have specific structural or functional roles in protein sequences. 
However, as there appears to be no single currency to compute the cost of amino acid synthesis across all cellular and environmental conditions, we conclude that a systems approach is necessary to unravel the full effects of amino acid biosynthetic cost in complex biological systems.
doi:10.1371/journal.pone.0011935
PMCID: PMC2923148  PMID: 20808905
16.  The Evolution of tRNA Genes in Drosophila 
The structure and function of transfer RNA (tRNA) genes have been extensively studied for several decades, yet the general mechanisms controlling tRNA gene family evolution remain unclear, primarily because previous phylogenetics-based methods fail to distinguish between paralogs and orthologs that are highly similar in sequence. We have developed a system for identifying orthologs of tRNAs using flanking sequences to identify regions of conserved synteny and used it to annotate sets of orthologous tRNA genes across the 12 sequenced species of Drosophila. These data have allowed us to place the gains and losses of individual tRNA genes on each branch of the Drosophila tree and estimate rates of tRNA gene turnover. Our results show extensive rearrangement of the Drosophila tRNA gene complement over the last 60 My. We estimate a combined average rate of 2.18 ± 0.10 tRNA gene gains and losses per million years across the Drosophila lineage. We have identified 192 tRNAs that are ancestral to the genus, of which 157 are “core” tRNAs conserved in at least 11 of 12 extant species. We provide evidence that the core set of tRNA genes encode a nearly complete set of anticodons and have different properties from other “peripheral” tRNA genes, such as preferential location outside large tRNA clusters and higher sequence conservation. We also demonstrate that tRNA isoacceptor and alloacceptor changes by anticodon shifts have occurred several times in Drosophila, annotating 16 such events in functional tRNAs during the evolution of the genus.
doi:10.1093/gbe/evq034
PMCID: PMC2997554  PMID: 20624748
transfer RNA; genome evolution; noncoding RNA; tRNA identity; synteny map; gene duplication
17.  LINNAEUS: A species name identification system for biomedical literature 
BMC Bioinformatics  2010;11:85.
Background
The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles.
Results
In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers.
Conclusions
LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
doi:10.1186/1471-2105-11-85
PMCID: PMC2836304  PMID: 20149233
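A toy version of the dictionary-based tagging and the precision/recall evaluation described in the LINNAEUS abstract might look as follows. It is only a sketch under stated assumptions: the five-entry dictionary and the function names are hypothetical, Python's `re` module is a backtracking matcher rather than the deterministic finite-state automaton the abstract mentions, and no disambiguation heuristics are included. The NCBI Taxonomy identifiers shown (7227, 4932, 9606) are the standard ones for these species.

```python
import re

# Toy dictionary (illustrative only): surface form -> NCBI Taxonomy ID.
SPECIES = {
    "Drosophila melanogaster": 7227,
    "D. melanogaster": 7227,
    "Saccharomyces cerevisiae": 4932,
    "S. cerevisiae": 4932,
    "Homo sapiens": 9606,
}

# Longest names first so a longer name is preferred over any shorter prefix;
# the lookarounds stop matches that begin or end inside another word.
_ALTERNATIVES = "|".join(
    re.escape(name) for name in sorted(SPECIES, key=len, reverse=True)
)
_PATTERN = re.compile(r"(?<!\w)(?:" + _ALTERNATIVES + r")(?!\w)")


def tag_species(text):
    """Return (start, end, surface form, taxonomy id) for each dictionary hit."""
    return [
        (m.start(), m.end(), m.group(0), SPECIES[m.group(0)])
        for m in _PATTERN.finditer(text)
    ]


def precision_recall(predicted, gold):
    """Precision and recall of a predicted set against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    for hit in tag_species("Drosophila melanogaster and S. cerevisiae are model organisms."):
        print(hit)
```

Mention-level scores like those quoted in the abstract would compare sets of (document, start, end, identifier) tuples, whereas document-level scores would compare sets of (document, identifier) pairs; `precision_recall` accepts either.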
18.  Population Genomic Inferences from Sparse High-Throughput Sequencing of Two Populations of Drosophila melanogaster 
Short-read sequencing techniques provide the opportunity to capture genome-wide sequence data in a single experiment. A current challenge is to identify questions that shallow-depth genomic data can address successfully and to develop corresponding analytical methods that are statistically sound. Here, we apply the Roche/454 platform to survey natural variation in strains of Drosophila melanogaster from an African (n = 3) and a North American (n = 6) population. Reads were aligned to the reference D. melanogaster genomic assembly, single nucleotide polymorphisms were identified, and nucleotide variation was quantified genome wide. Simulations and empirical results suggest that nucleotide diversity can be accurately estimated from sparse data with as little as 0.2× coverage per line. The unbiased genomic sampling provided by random short-read sequencing also allows insight into distributions of transposable elements and copy number polymorphisms found within populations and demonstrates that short-read sequencing methods provide an efficient means to quantify variation in genome organization and content. Continued development of methods for statistical inference of shallow-depth genome-wide sequencing data will allow such sparse, partial data sets to become the norm in the emerging field of population genomics.
doi:10.1093/gbe/evp048
PMCID: PMC2839279  PMID: 20333214
Drosophila; population genomics; next-gen sequencing; transposable elements; copy number polymorphism; nucleotide diversity
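As a rough illustration of the quantity estimated in the abstract above, nucleotide diversity can be computed as the average proportion of pairwise differences per site, skipping lines with no call at a site. This is a generic textbook-style sketch, not the estimator or simulation procedure used in the paper; the function name and the use of `N` for missing calls are assumptions.

```python
from itertools import combinations


def nucleotide_diversity(alignment, missing="N"):
    """Mean per-site proportion of pairwise differences among called bases.

    alignment: list of equal-length strings, one per sampled line.
    Sites with fewer than two called bases carry no information and are skipped.
    """
    total = 0.0
    sites_used = 0
    for column in zip(*alignment):
        calls = [base for base in column if base != missing]
        if len(calls) < 2:
            continue
        pairs = list(combinations(calls, 2))
        total += sum(a != b for a, b in pairs) / len(pairs)
        sites_used += 1
    return total / sites_used if sites_used else float("nan")


if __name__ == "__main__":
    # Three lines, four sites; the third line has no call at the first site.
    print(nucleotide_diversity(["ACGT", "ACGA", "NCGT"]))
```

With sparse coverage most sites have only a few called lines, which is why the sketch works site by site instead of comparing whole sequences.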
19.  Population genomics of domestic and wild yeasts 
Nature  2009;458(7236):337-341.
Since the completion of the genome sequence of Saccharomyces cerevisiae in 1996 [1,2], there has been an exponential increase in complete genome sequences accompanied by great advances in our understanding of genome evolution. Although little is known about the natural and life histories of yeasts in the wild, there are an increasing number of studies looking at ecological and geographic distributions [3,4], population structure [5-8], and sexual versus asexual reproduction [9,10]. Less well understood at the whole genome level are the evolutionary processes acting within populations and species leading to adaptation to different environments, phenotypic differences and reproductive isolation. Here we present one- to four-fold or more coverage of the genome sequences of over seventy isolates of the baker's yeast, S. cerevisiae, and its closest relative, S. paradoxus. We examine variation in gene content, SNPs, indels, copy numbers and transposable elements. We find that phenotypic variation broadly correlates with global genome-wide phylogenetic relationships. Interestingly, S. paradoxus populations are well delineated along geographic boundaries while the variation among worldwide S. cerevisiae isolates shows less differentiation and is comparable to a single S. paradoxus population. Rather than one or two domestication events leading to the extant baker's yeasts, the population structure of S. cerevisiae consists of a few well-defined geographically isolated lineages and many different mosaics of these lineages, supporting the idea that human influence provided the opportunity for cross-breeding and production of new combinations of pre-existing variation.
doi:10.1038/nature07743
PMCID: PMC2659681  PMID: 19212322
20.  Testing the palindromic target site model for DNA transposon insertion using the Drosophila melanogaster P-element 
Nucleic Acids Research  2008;36(19):6199-6208.
Understanding the molecular mechanisms that influence transposable element target site preferences is a fundamental challenge in functional and evolutionary genomics. Large-scale transposon insertion projects provide excellent material to study target site preferences in the absence of confounding effects of post-insertion evolutionary change. Growing evidence from a wide variety of prokaryotes and eukaryotes indicates that DNA transposons recognize staggered-cut palindromic target site motifs (TSMs). Here, we use over 10 000 accurately mapped P-element insertions in the Drosophila melanogaster genome to test predictions of the staggered-cut palindromic target site model for DNA transposon insertion. We provide evidence that the P-element targets a 14-bp palindromic motif that can be identified at the primary sequence level, which predicts the local spacing, hotspots and strand orientation of P-element insertions. Intriguingly, we find that although the P-element destroys the complete 14-bp target site upon insertion, the terminal three nucleotides of the P-element inverted repeats complement and restore the original TSM, suggesting a mechanistic link between transposon target sites and their terminal inverted repeats. Finally, we discuss how the staggered-cut palindromic target site model can be used to assess the accuracy of genome mappings for annotated P-element insertions.
doi:10.1093/nar/gkn563
PMCID: PMC2577343  PMID: 18829720
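To illustrate what "palindromic" means for the target site motifs in entry 20, the sketch below tests whether a DNA sequence equals its own reverse complement and scores partial matches. It is a minimal illustration only, not the method used in the article; the function names and the 6-bp toy sequence are assumptions introduced here, and the actual 14-bp P-element motif is not reproduced.

```python
# Watson-Crick complements for the four unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True if the sequence reads the same on both strands (5'->3')."""
    return seq.upper() == reverse_complement(seq)


def palindrome_score(seq: str) -> float:
    """Fraction of positions where seq matches its reverse complement.

    Hypothetical helper: a score of 1.0 is a perfect palindrome, and
    lower values indicate a degenerate (imperfect) palindrome.
    """
    s = seq.upper()
    if not s:
        raise ValueError("sequence must not be empty")
    rc = reverse_complement(s)
    return sum(a == b for a, b in zip(s, rc)) / len(s)


if __name__ == "__main__":
    toy_site = "GAATTC"  # toy example, not the P-element motif
    print(is_palindromic(toy_site), palindrome_score(toy_site))
```

A position-by-position score of this kind is one simple way to quantify imperfect palindromes; real motif analyses would typically work from aligned insertion sites and base frequencies rather than single sequences.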
21.  Text mining for biology - the way forward: opinions from leading scientists 
Genome Biology  2008;9(Suppl 2):S7.
This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress.
doi:10.1186/gb-2008-9-s2-s7
PMCID: PMC2559991  PMID: 18834498
22.  Text-mining assisted regulatory annotation 
Genome Biology  2008;9(2):R31.
Text-mining technologies can be integrated with genome annotation systems, increasing the availability of annotated cis-regulatory data.
Background
Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature.
Results
We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process.
Conclusion
Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.
doi:10.1186/gb-2008-9-2-r31
PMCID: PMC2374703  PMID: 18271954
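The document relevance ranking mentioned in entry 22 can be pictured with a small vector space model. The sketch below is an assumption-laden illustration, not the authors' pipeline: it uses TF-IDF weights and cosine similarity (neither is specified in the abstract), and the tokenizer, function names and example texts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed weighting choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the collection get zero weight.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scores = [cosine(q, v) for v in vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    abstracts = [
        "enhancer binding site drives expression of the target gene",
        "crystal structure of a membrane protein",
        "weather patterns in coastal regions",
    ]
    print(rank("transcription factor binding site enhancer", abstracts))
```

In a curation setting, the query vector would more plausibly be built from abstracts already known to contain cis-regulatory data, so that unseen abstracts are ranked by similarity to that positive set.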
23.  ORegAnno: an open-access community-driven resource for regulatory annotation 
Nucleic Acids Research  2007;36(Database issue):D107-D113.
ORegAnno is an open-source, open-access database and literature curation system for community-based annotation of experimentally identified DNA regulatory regions, transcription factor binding sites and regulatory variants. The current release comprises 30 145 records curated from 922 publications and describing regulatory sequences for 3853 genes and 465 transcription factors from 19 species. A new feature called the ‘publication queue’ allows users to input relevant papers from scientific literature as targets for annotation. The queue contains 4438 gene regulation papers entered by experts and another 54 351 identified by text-mining methods. Users can enter or ‘check out’ papers from the queue for manual curation using a series of user-friendly annotation pages. A typical record entry consists of species, sequence type, sequence, target gene, binding factor, experimental outcome and one or more lines of experimental evidence. An evidence ontology was developed to describe and categorize these experiments. Records are cross-referenced to Ensembl or Entrez gene identifiers, PubMed and dbSNP and can be visualized in the Ensembl or UCSC genome browsers. All data are freely available through search pages, XML data dumps or web services at: http://www.oreganno.org.
doi:10.1093/nar/gkm967
PMCID: PMC2239002  PMID: 18006570
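The fields of a typical record listed in entry 23 map naturally onto a small data structure. The sketch below is a hypothetical model for illustration only: the class name, attribute names, validation rule and example values are assumptions and do not reflect the actual ORegAnno schema, XML format or web service.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegulatoryRecord:
    """Hypothetical record with the fields named in the abstract."""
    species: str
    sequence_type: str
    sequence: str
    target_gene: str
    binding_factor: str
    experimental_outcome: str
    # One or more lines of experimental evidence (free text here).
    evidence: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True if every text field is non-empty and evidence is present."""
        text_fields = [
            self.species, self.sequence_type, self.sequence,
            self.target_gene, self.binding_factor, self.experimental_outcome,
        ]
        return all(text_fields) and len(self.evidence) >= 1


if __name__ == "__main__":
    # Placeholder values, not taken from the database.
    record = RegulatoryRecord(
        species="Drosophila melanogaster",
        sequence_type="transcription factor binding site",
        sequence="ACGTACGT",
        target_gene="example_gene",
        binding_factor="example_factor",
        experimental_outcome="positive",
        evidence=["reporter gene assay (placeholder)"],
    )
    print(record.is_complete())
```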
24.  Principles of Genome Evolution in the Drosophila melanogaster Species Group  
PLoS Biology  2007;5(6):e152.
That closely related species often differ by chromosomal inversions was discovered by Sturtevant and Plunkett in 1926. Our knowledge of how these inversions originate is still very limited, although a prevailing view is that they are facilitated by ectopic recombination events between inverted repetitive sequences. The availability of genome sequences of related species now allows us to study in detail the mechanisms that generate interspecific inversions. We have analyzed the breakpoint regions of the 29 inversions that differentiate the chromosomes of Drosophila melanogaster and two closely related species, D. simulans and D. yakuba, and reconstructed the molecular events that underlie their origin. Experimental and computational analysis revealed that the breakpoint regions of 59% of the inversions (17/29) are associated with inverted duplications of genes or other nonrepetitive sequences. In only two cases do we find evidence for inverted repetitive sequences in inversion breakpoints. We propose that the presence of inverted duplications associated with inversion breakpoint regions is the result of staggered breaks, either isochromatid or chromatid, and that this, rather than ectopic exchange between inverted repetitive sequences, is the prevalent mechanism for the generation of inversions in the melanogaster species group. Outgroup analysis also revealed evidence for widespread breakpoint recycling. Lastly, we have found that expression domains in D. melanogaster may be disrupted in D. yakuba, bringing into question their potential adaptive significance.
Author Summary
The organization of genes on chromosomes changes over evolutionary time. In some organisms, such as fruit flies and mosquitoes, inversions of chromosome regions are widespread. This has been associated with adaptation to environmental pressures and speciation. However, the mechanisms by which inversions are generated at the molecular level are poorly understood. The prevailing view involves the interactions of sequences that are moderately repeated in the genome. Here, we use molecular and computational methods to study 29 inversions that differentiate the chromosomes of three closely related fruit fly species. We find little support for a causal role of repetitive sequences in the origin of inversions and, instead, detect the presence of inverted duplications of ancestrally unique sequences (generally protein-coding genes) in the breakpoint regions of many inversions. This leads us to propose an alternative model in which the generation of inversions is coupled with the generation of duplications of flanking sequences. Additionally, we find evidence for genomic regions that are prone to breakage, being associated with inversions generated independently during the evolution of the ancestors of existing species.
Chromosomal inversion breakpoints were compared between three closely related Drosophila species. Many are associated with inverted gene duplications, suggesting that the prevalent mechanism for their generation involves staggered breakpoints.
doi:10.1371/journal.pbio.0050152
PMCID: PMC1885836  PMID: 17550304
25.  Large-Scale Discovery of Promoter Motifs in Drosophila melanogaster 
PLoS Computational Biology  2007;3(1):e7.
A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes.
Author Summary
In contrast to the genomic sequences that encode proteins, little is known about the regulatory elements that instruct the cell as to when and where a given gene should be active. Regulatory elements are thought to consist of clusters of short DNA words (motifs), each of which acts as a binding site for a sequence-specific DNA-binding protein. Thus, building a comprehensive dictionary of such motifs is an important step towards a broader understanding of gene regulation. Using the recently published NestedMICA method for detecting overrepresented motifs in a set of sequences, we build a dictionary of 120 motifs from regulatory sequences in the fruitfly genome, 87 of which are novel. Analysis of positional biases, conservation across species, and association with specific patterns of gene expression in fruitfly embryos suggests that the great majority of these newly discovered motifs represent functional regulatory elements. In addition to providing an initial motif dictionary for one of the most intensively studied model organisms, this work provides an analytical framework for the comprehensive discovery of regulatory motifs in complex animal genomes.
doi:10.1371/journal.pcbi.0030007
PMCID: PMC1779301  PMID: 17238282
