Imagine if we could compute across phenotype data as easily as genomic data; this article calls for efforts to realize this vision and discusses the potential benefits.
Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bottleneck to integration across many key fields in biology, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. Here we survey the current phenomics landscape, including data resources and handling, and the progress that has been made to accurately capture relevant data descriptions for phenotypes. We present an example of the kind of integration across domains that computable phenotypes would enable, and we call upon the broader biology community, publishers, and relevant funding agencies to support efforts to surmount today's data barriers and facilitate analytical reproducibility.
This paper describes an approach to providing computer-interpretable logical definitions for the terms of the Human Phenotype Ontology (HPO) using PATO, the ontology of phenotypic qualities, to link terms of the HPO to the anatomic and other entities that are affected by abnormal phenotypic qualities. This approach will allow improved computerized reasoning as well as a facility to compare phenotypes between different species. The PATO mapping will also provide direct links from phenotypic abnormalities and underlying anatomic structures encoded using the Foundational Model of Anatomy, which will be a valuable resource for computational investigations of the links between anatomical components and concepts representing diseases with abnormal phenotypes and associated genes.
The identification of orthologs (gene pairs descended from a common ancestor through speciation, rather than duplication) has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet, the development and application of orthology inference methods is hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second ‘Quest for Orthologs’ meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications.
Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.
The value of any kind of data is greatly enhanced when it exists in a form that allows it to be integrated with other data. One approach to integration is through the annotation of multiple bodies of data using common controlled vocabularies or ‘ontologies’. Unfortunately, the very success of this approach has led to a proliferation of ontologies, which itself creates obstacles to integration. The Open Biomedical Ontologies (OBO) consortium is pursuing a strategy to overcome this problem. Existing OBO ontologies, including the Gene Ontology, are undergoing coordinated reform, and new ontologies are being created on the basis of an evolving set of shared principles governing ontology development. The result is an expanding family of ontologies designed to be interoperable and logically well formed and to incorporate accurate representations of biological reality. We describe this OBO Foundry initiative and provide guidelines for those who might wish to become involved.
A phenotypic ontology that can be used for the analysis of phenotype-genotype data across multiple species, paving the way for truly cross-species translational research.
Phenotype ontologies are typically constructed to serve the needs of a particular community, such as annotation of genotype-phenotype associations in mouse or human. Here we demonstrate how these ontologies can be improved through assignment of logical definitions using a core ontology of phenotypic qualities and multiple additional ontologies from the Open Biological Ontologies library. We also show how these logical definitions can be used for data integration when combined with a unified multi-species anatomy ontology.
A novel method for quantifying the similarity between phenotypes by the use of ontologies can be used to search for candidate genes, pathway members, and human disease models on the basis of phenotypes alone.
Scientists and clinicians who study genetic alterations and disease have traditionally described phenotypes in natural language. The considerable variation in these free-text descriptions has posed a hindrance to the important task of identifying candidate genes and models for human diseases and indicates the need for a computationally tractable method to mine data resources for mutant phenotypes. In this study, we tested the hypothesis that ontological annotation of disease phenotypes will facilitate the discovery of new genotype-phenotype relationships within and across species. To describe phenotypes using ontologies, we used an Entity-Quality (EQ) methodology, wherein the affected entity (E) and how it is affected (Q) are recorded using terms from a variety of ontologies. Using this EQ method, we annotated the phenotypes of 11 gene-linked human diseases described in Online Mendelian Inheritance in Man (OMIM). These human annotations were loaded into our Ontology-Based Database (OBD) along with other ontology-based phenotype descriptions of mutants from various model organism databases. Phenotypes recorded with this EQ method can be computationally compared based on the hierarchy of terms in the ontologies and the frequency of annotation. We utilized four similarity metrics to compare phenotypes and developed an ontology of homologous and analogous anatomical structures to compare phenotypes between species. Using these tools, we demonstrate that we can identify, through the similarity of the recorded phenotypes, other alleles of the same gene, other members of a signaling pathway, and orthologous genes and pathway members across species. We conclude that EQ-based annotation of phenotypes, in conjunction with a cross-species ontology, and a variety of similarity metrics can identify biologically meaningful similarities between genes by comparing phenotypes alone. 
This annotation and search method provides a novel and efficient means to identify gene candidates and animal models of human disease, which may shorten the lengthy path to identification and understanding of the genetic basis of human disease.
Model organisms such as fruit flies, mice, and zebrafish are useful for investigating gene function because they are easy to grow, dissect, and genetically manipulate in the laboratory. By examining mutations in these organisms, one can identify candidate genes that cause disease in humans, and develop models to better understand human disease and gene function. A fundamental roadblock for analysis is, however, the lack of a computational method for describing and comparing phenotypes of mutant animals and of human diseases when the genetic basis is unknown. We describe here a novel method using ontologies to record and quantify the similarity between phenotypes. We tested our method by using the annotated mutant phenotype of one member of the Hedgehog signaling pathway in zebrafish to identify other pathway members with similar recorded phenotypes. We also compared human disease phenotypes to those produced by mutation in model organisms, and show that orthologous and biologically relevant genes can be identified by this method. Given that the genetic basis of human disease is often unknown, this method provides a means for identifying candidate genes, pathway members, and disease models by computationally identifying similar phenotypes within and across species.
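As a rough illustration of how Entity-Quality annotations can be compared using the term hierarchy and the frequency of annotation, the sketch below scores two genes with an information-content-weighted Jaccard measure. The toy anatomy hierarchy, the gene names, and the choice of this particular measure are all assumptions made for illustration; they are not the ontologies, data, or the four metrics used in the study.

```python
import math

# Hypothetical toy is_a hierarchy for entities (child -> parents); not a real ontology.
ENTITY_PARENTS = {
    "eye": ["sense organ"],
    "ear": ["sense organ"],
    "sense organ": ["organ"],
    "heart": ["organ"],
    "organ": [],
}

def entity_ancestors(term):
    """Return the entity term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(ENTITY_PARENTS[t])
    return seen

def eq_closure(eq_pairs):
    """Expand each (entity, quality) pair to every ancestor entity, keeping the quality."""
    return {(a, q) for e, q in eq_pairs for a in entity_ancestors(e)}

def information_content(annotations):
    """IC(x) = -log(fraction of annotated genes whose closure contains x)."""
    n = len(annotations)
    counts = {}
    for eq_pairs in annotations.values():
        for x in eq_closure(eq_pairs):
            counts[x] = counts.get(x, 0) + 1
    return {x: -math.log(c / n) for x, c in counts.items()}

def ic_jaccard(eq_a, eq_b, ic):
    """Sum of IC over shared closure terms divided by sum of IC over all closure terms."""
    ca, cb = eq_closure(eq_a), eq_closure(eq_b)
    total = sum(ic[x] for x in ca | cb)
    if total == 0:
        return 0.0
    return sum(ic[x] for x in ca & cb) / total

# Hypothetical EQ annotations for three genes.
ANNOTATIONS = {
    "gene_a": {("eye", "small")},
    "gene_b": {("ear", "small")},
    "gene_c": {("heart", "enlarged")},
}
IC = information_content(ANNOTATIONS)
sim_ab = ic_jaccard(ANNOTATIONS["gene_a"], ANNOTATIONS["gene_b"], IC)
sim_ac = ic_jaccard(ANNOTATIONS["gene_a"], ANNOTATIONS["gene_c"], IC)
```

In this toy setting gene_a and gene_b share the more general annotations (small sense organ, small organ), so they score above gene_a and gene_c, which share nothing. Rarely used annotations carry more weight than common ones because their information content is higher.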
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.
A comparative analysis of the genomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae—and the proteins they are predicted to encode—was undertaken in the context of cellular, developmental, and evolutionary processes. The nonredundant protein sets of flies and worms are similar in size and are only twice that of yeast, but different gene families are expanded in each genome, and the multidomain proteins and signaling pathways of the fly and worm are far more complex than those of yeast. The fly has orthologs to 177 of the 289 human disease genes examined and provides the foundation for rapid analysis of some of the basic processes involved in human disease.
Wolbachia are vertically transmitted, obligatory intracellular bacteria that infect a great number of species of arthropods and nematodes. In insects, they are mainly known for disrupting the reproductive biology of their hosts in order to increase their transmission through the female germline. In Drosophila melanogaster, however, a strong and consistent effect of Wolbachia infection has not been found. Here we report that a bacterial infection renders D. melanogaster more resistant to Drosophila C virus, reducing the load of viruses in infected flies. We identify these resistance-inducing bacteria as Wolbachia. Furthermore, we show that Wolbachia also increases resistance of Drosophila to two other RNA virus infections (Nora virus and Flock House virus) but not to a DNA virus infection (Insect Iridescent Virus 6). These results identify a new major factor regulating D. melanogaster resistance to infection by RNA viruses and contribute to the idea that the response of a host to a particular pathogen also depends on its interactions with other microorganisms. This is also, to our knowledge, the first report of a strong beneficial effect of Wolbachia infection in D. melanogaster. The induced resistance to natural viral pathogens may explain Wolbachia prevalence in natural populations and represents a novel Wolbachia–host interaction.
Many symbiotic bacteria confer fitness benefits to the organisms that they infect. Wolbachia are one of the most widespread intracellular bacteria, infecting a great number of species of insects. Here we show that in the fruit fly Drosophila melanogaster, infection with Wolbachia increases resistance to a natural pathogen of Drosophila, an RNA virus called Drosophila C virus. Furthermore, we show that Wolbachia also increases resistance of Drosophila to two other RNA viruses (Nora and Flock House virus) but not to a DNA virus (Insect Iridescent Virus 6). These results identify a significant new factor that regulates D. melanogaster resistance to infection by RNA viruses. Our results add to a growing body of literature showing that the response of an organism to a particular pathogen is modulated by prior or contemporaneous interactions with other microorganisms. That the fruit fly clearly benefits from increased resistance to viruses may provide a solution to the longstanding puzzle as to why Wolbachia is so common in natural populations of D. melanogaster.
Wolbachia are one of the most widespread intracellular bacteria. In Drosophila melanogaster, the presence of Wolbachia increases resistance to infection by RNA viruses.
FlyBase (http://flybase.org) is a database of Drosophila genetic and genomic information. Gene Ontology (GO) terms are used to describe three attributes of wild-type gene products: their molecular function, the biological processes in which they play a role, and their subcellular location. This article describes recent changes to the FlyBase GO annotation strategy that are improving the quality of the GO annotation data. Many of these changes stem from our participation in the GO Reference Genome Annotation Project—a multi-database collaboration producing comprehensive GO annotation sets for 12 diverse species.
With the quantity of genomic data increasing at an exponential rate, it is imperative that these data be captured electronically, in a standard format. Standardization activities must proceed within the auspices of open-access and international working bodies. To tackle the issues surrounding the development of better descriptions of genomic investigations, we have formed the Genomic Standards Consortium (GSC). Here, we introduce the minimum information about a genome sequence (MIGS) specification with the intent of promoting participation in its development and discussing the resources that will be required to develop improved mechanisms of metadata capture and exchange. As part of its wider goals, the GSC also supports improving the ‘transparency’ of the information contained in existing genomic databases.
WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery. The system is available for beta testing at .
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds. The molecular entities in question are either natural products or synthetic products used to intervene in the processes of living organisms. Genome-encoded macromolecules (nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included in ChEBI. In addition to molecular entities, ChEBI contains groups (parts of molecular entities) and classes of entities. ChEBI includes an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. ChEBI is available online at http://www.ebi.ac.uk/chebi/
A combined bioinformatic and genetic approach was used to conduct a systematic analysis of the relationship between ribosomal protein genes and Minute loci in Drosophila melanogaster, allowing the identification of 64 Minute loci corresponding to ribosomal genes.
Mutations in genes encoding ribosomal proteins (RPs) have been shown to cause an array of cellular and developmental defects in a variety of organisms. In Drosophila melanogaster, disruption of RP genes can result in the 'Minute' syndrome of dominant, haploinsufficient phenotypes, which include prolonged development, short and thin bristles, and poor fertility and viability. While more than 50 Minute loci have been defined genetically, only 15 have so far been characterized molecularly and shown to correspond to RP genes.
We combined bioinformatic and genetic approaches to conduct a systematic analysis of the relationship between RP genes and Minute loci. First, we identified 88 genes encoding 79 different cytoplasmic RPs (CRPs) and 75 genes encoding distinct mitochondrial RPs (MRPs). Interestingly, nine CRP genes are present as duplicates and, while all appear to be functional, one member of each gene pair has relatively limited expression. Next, we defined 65 discrete Minute loci by genetic criteria. Of these, 64 correspond to, or very likely correspond to, CRP genes; the single non-CRP-encoding Minute gene encodes a translation initiation factor subunit. Significantly, MRP genes and more than 20 CRP genes do not correspond to Minute loci.
This work answers a longstanding question about the molecular nature of Minute loci and suggests that Minute phenotypes arise from suboptimal protein synthesis resulting from reduced levels of cytoribosomes. Furthermore, by identifying the majority of haplolethal and haplosterile loci at the molecular level, our data will directly benefit efforts to attain complete deletion coverage of the D. melanogaster genome.
This novel web-based database provides unique accessibility and querying of integrated genomic and proteomic data for Drosophila and Anopheles.
FlyMine is a data warehouse that addresses one of the important challenges of modern biology: how to integrate and make use of the diversity and volume of current biological data. Its main focus is genomic and proteomics data for Drosophila and other insects. It provides web access to integrated data at a number of different levels, from simple browsing to construction of complex queries, which can be executed on either single items or lists.
That closely related species often differ by chromosomal inversions was discovered by Sturtevant and Plunkett in 1926. Our knowledge of how these inversions originate is still very limited, although a prevailing view is that they are facilitated by ectopic recombination events between inverted repetitive sequences. The availability of genome sequences of related species now allows us to study in detail the mechanisms that generate interspecific inversions. We have analyzed the breakpoint regions of the 29 inversions that differentiate the chromosomes of Drosophila melanogaster and two closely related species, D. simulans and D. yakuba, and reconstructed the molecular events that underlie their origin. Experimental and computational analysis revealed that the breakpoint regions of 59% of the inversions (17/29) are associated with inverted duplications of genes or other nonrepetitive sequences. In only two cases do we find evidence for inverted repetitive sequences in inversion breakpoints. We propose that the presence of inverted duplications associated with inversion breakpoint regions is the result of staggered breaks, either isochromatid or chromatid, and that this, rather than ectopic exchange between inverted repetitive sequences, is the prevalent mechanism for the generation of inversions in the melanogaster species group. Outgroup analysis also revealed evidence for widespread breakpoint recycling. Lastly, we have found that expression domains in D. melanogaster may be disrupted in D. yakuba, bringing into question their potential adaptive significance.
The organization of genes on chromosomes changes over evolutionary time. In some organisms, such as fruit flies and mosquitoes, inversions of chromosome regions are widespread. This has been associated with adaptation to environmental pressures and speciation. However, the mechanisms by which inversions are generated at the molecular level are poorly understood. The prevailing view involves the interactions of sequences that are moderately repeated in the genome. Here, we use molecular and computational methods to study 29 inversions that differentiate the chromosomes of three closely related fruit fly species. We find little support for a causal role of repetitive sequences in the origin of inversions and, instead, detect the presence of inverted duplications of ancestrally unique sequences (generally protein-coding genes) in the breakpoint regions of many inversions. This leads us to propose an alternative model in which the generation of inversions is coupled with the generation of duplications of flanking sequences. Additionally, we find evidence for genomic regions that are prone to breakage, being associated with inversions generated independently during the evolution of the ancestors of existing species.
Chromosomal inversion breakpoints were compared between three closely related Drosophila species. Many are associated with inverted gene duplications, suggesting that the prevalent mechanism for their generation involves staggered breakpoints.
An analysis of high-resolution transposable element annotations in Drosophila melanogaster suggests the existence of a global surveillance system against the majority of transposable element families in the fly.
The recent availability of genome sequences has provided unparalleled insights into the broad-scale patterns of transposable element (TE) sequences in eukaryotic genomes. Nevertheless, the difficulties that TEs pose for genome assembly and annotation have prevented detailed, quantitative inferences about the contribution of TEs to genome sequences.
Using a high-resolution annotation of TEs in the Release 4 genome sequence, we revise estimates of TE abundance in Drosophila melanogaster. We show that TEs are non-randomly distributed within regions of high and low TE abundance, and that pericentromeric regions with high TE abundance are mosaics of distinct regions of extreme and normal TE density. Comparative analysis revealed that this punctate pattern evolves jointly by transposition and duplication, but not by inversion of TE-rich regions from unsequenced heterochromatin. Analysis of genome-wide patterns of TE nesting revealed a 'nesting network' that includes virtually all of the known TE families in the genome. Numerous directed cycles exist among TE families in the nesting network, implying concurrent or overlapping periods of transpositional activity.
Rapid restructuring of the genomic landscape by transposition and duplication has recently added hundreds of kilobases of TE sequence to pericentromeric regions in D. melanogaster. These events create ragged transitions between unique and repetitive sequences in the zone between euchromatic and beta-heterochromatic regions. Complex relationships of TE nesting in beta-heterochromatic regions raise the possibility of a co-suppression network that may act as a global surveillance system against the majority of TE families in D. melanogaster.
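The idea of directed cycles in a nesting network can be made concrete with a small graph sketch. Here an edge from family X to family Y is assumed to mean that a copy of X is found inserted into a copy of Y; this edge convention and the family names are hypothetical, and the depth-first search shown is a generic cycle finder rather than the analysis used in the study.

```python
def find_cycle(graph):
    """Return one directed cycle as a list of nodes (first == last), or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    color = {n: WHITE for n in nodes}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, ()):
            if color[nxt] == GRAY:
                # Back edge to a node on the current path closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for n in sorted(nodes):
        if color[n] == WHITE:
            found = visit(n)
            if found:
                return found
    return None

# Hypothetical nesting relations: famA into famB, famB into famC, famC into famA.
nesting = {"famA": ["famB"], "famB": ["famC"], "famC": ["famA"], "famD": ["famA"]}
acyclic = {"famA": ["famB"], "famB": ["famC"]}

cycle = find_cycle(nesting)
no_cycle = find_cycle(acyclic)
```

A cycle such as famA, famB, famC, famA cannot arise if each family was active in a strictly separate period, which is why cycles are read as evidence of overlapping activity.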
The honey bee Apis mellifera displays the most complex behavior of any insect. This, and its utility to humans, makes it a fascinating object of study for biologists. Such studies are now further enabled by the release of the honey-bee genome sequence.
We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.
The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.
This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
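The nucleotide-level accuracy figures quoted above can be illustrated with a short calculation. The sketch assumes the convention common in gene-prediction benchmarks, where sensitivity is the fraction of annotated coding nucleotides that are predicted and specificity is the fraction of predicted coding nucleotides that are annotated; the coordinates are invented, and the function name is hypothetical rather than part of any EGASP tooling.

```python
def nucleotide_accuracy(annotated, predicted):
    """Compute (sensitivity, specificity) at the coding-nucleotide level.

    Both arguments are iterables of half-open (start, end) coding intervals
    on the same sequence. Specificity here is TP / (TP + FP), an assumed
    convention borrowed from gene-prediction benchmarking.
    """
    ann = {pos for start, end in annotated for pos in range(start, end)}
    pred = {pos for start, end in predicted for pos in range(start, end)}
    true_positives = len(ann & pred)
    sensitivity = true_positives / len(ann) if ann else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity

# Invented coordinates: 200 annotated and 200 predicted coding nucleotides, 180 shared.
reference_exons = [(100, 200), (300, 400)]
predicted_exons = [(120, 200), (300, 420)]
sn, sp = nucleotide_accuracy(reference_exons, predicted_exons)
```

Expanding intervals into position sets keeps the sketch short; for whole chromosomes an interval-based overlap calculation would be more economical.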
Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated “TE models” in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n = 6,013) than previously identified (n = 1,572). Most of the new TEs derive from small fragments a few hundred nucleotides long and highly abundant families not previously annotated (e.g., INE-1). We also estimated that 518 TE copies (8.6%) are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other species in the genus Drosophila.
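One simple way to picture combining evidence from several programs is to merge overlapping hits into candidate regions while recording which sources support each region. The sketch below does only that; it is an assumption-laden illustration, not the published pipeline, and the function name and coordinates are hypothetical (only the program names come from the abstract).

```python
def merge_evidence(hits):
    """Merge overlapping or adjacent hits into candidate regions.

    `hits` is a list of (start, end, source) tuples with half-open
    coordinates on a single sequence. Returns a list of
    (start, end, sorted list of supporting sources).
    """
    merged = []
    for start, end, source in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Hit touches or overlaps the current region: extend it.
            region = merged[-1]
            region[1] = max(region[1], end)
            region[2].add(source)
        else:
            merged.append([start, end, {source}])
    return [(s, e, sorted(sources)) for s, e, sources in merged]

# Invented hits from three of the programs named above.
hits = [
    (450, 900, "BLASTER"),
    (100, 500, "RepeatMasker"),
    (2000, 2300, "RECON"),
]
candidate_regions = merge_evidence(hits)
```

Regions supported by several independent sources could then be prioritized for manual curation, while single-source regions might be flagged for closer inspection.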
A first step in adding value to the large-scale DNA sequences generated by genome projects is the process of annotation—marking biological features on the raw string of adenines, cytosines, guanines, and thymines. The predominant goal in genome annotation thus far has been to identify gene sequences that encode proteins; however, many functional sequences exist in non-protein-coding regions and their annotation remains incomplete. Mobile, repetitive DNA segments known as transposable elements (TEs) are one class of functional sequence in non-protein-coding regions, which can make up large fractions of genome sequences (e.g., about 45% in the human) and can play important roles in gene and chromosome structure and regulation. As a consequence, there has been increasing interest in the computational identification of TEs in genome sequences. Borrowing current ideas from the field of gene annotation, the authors have developed a pipeline to predict TEs in genome sequences that combines multiple sources of evidence from different computational methods. The authors' combined-evidence pipeline represents an important step towards raising the standards of TE annotation to the same quality as that of genes, and should help catalyze their understanding of the biological role of these fascinating sequences.
The goal of the Sequence Ontology (SO) project is to produce a structured controlled vocabulary with a common set of terms and definitions for parts of a genomic annotation, and to describe the relationships among them. Details of SO construction, design and use, particularly with regard to part-whole relationships, are discussed and the practical utility of SO is demonstrated for a set of genome annotations from Drosophila melanogaster.
The Sequence Ontology (SO) is a structured controlled vocabulary for the parts of a genomic annotation. SO provides a common set of terms and definitions that will facilitate the exchange, analysis and management of genomic data. Because SO treats part-whole relationships rigorously, data described with it can become substrates for automated reasoning, and instances of sequence features described by the SO can be subjected to a group of logical operations termed extensional mereology operators.
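To show why rigorous part-whole relationships make annotations amenable to automated reasoning, the sketch below follows part_of links transitively and computes the parts shared by two features. The feature identifiers and helper names are hypothetical, and the `overlap` function is only loosely inspired by the idea of mereological operators; it is not the formal definition used by SO.

```python
# Hypothetical feature instances with direct part_of edges (part -> wholes).
PART_OF = {
    "exon1": ["tx1", "tx2"],
    "exon2": ["tx1"],
    "exon3": ["tx2"],
    "tx1": ["geneX"],
    "tx2": ["geneX"],
    "geneX": [],
}

def wholes(feature):
    """All features that `feature` is part of, following part_of transitively."""
    found = set()
    stack = list(PART_OF.get(feature, []))
    while stack:
        whole = stack.pop()
        if whole not in found:
            found.add(whole)
            stack.extend(PART_OF.get(whole, []))
    return found

def parts(feature):
    """All features that are (directly or indirectly) part of `feature`."""
    return {f for f in PART_OF if feature in wholes(f)}

def overlap(a, b):
    """Parts shared by two features (a simplified overlap operation)."""
    return parts(a) & parts(b)

shared = overlap("tx1", "tx2")
gene_parts = parts("geneX")
```

Because part_of is treated as transitive, the exons are inferred to be parts of geneX even though only the exon-to-transcript and transcript-to-gene links were stated.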
An ontology for cell types that covers the prokaryotic, fungal, animal and plant worlds is described. It includes over 680 cell types. These cell types are classified under several generic categories and are organized as a directed acyclic graph.
We describe an ontology for cell types that covers the prokaryotic, fungal, animal and plant worlds. It includes over 680 cell types. These cell types are classified under several generic categories and are organized as a directed acyclic graph. The ontology is available in the formats adopted by the Open Biological Ontologies umbrella and is designed to be used in the context of model organism genome and other biological databases. The ontology is freely available at and can be viewed using standard ontology visualization tools such as OBO-Edit and COBrA.
Hormones frequently guide animal development via the induction of cascades of gene activities, whose products further amplify an initial hormonal stimulus. In Drosophila the transformation of the larva into the pupa and the subsequent metamorphosis to the adult stage are triggered by changes in the titer of the steroid hormone 20-hydroxyecdysone. singed wings (swi) is the only gene known in Drosophila melanogaster for which mutations specifically interrupt the transmission of the regulatory signal from early to late ecdysone-inducible genes.
We have characterized the singed wings locus, showing it to correspond to EG:171E4.2 (CG3095). swi encodes a predicted 68.5-kDa protein that contains N-terminal histidine-rich and threonine-rich domains, a cysteine-rich C-terminal region and two leucine-rich repeats. The SWI protein has a close homolog in D. melanogaster, defining a new family of SWI-like proteins, and is conserved in D. pseudoobscura. A lethal mutation, swit476, shows a severe disruption of the ecdysone pathway and is a C>Y substitution in one of the two conserved CysXCys motifs that are common to SWI and the Drosophila Toll-4 protein.
It is not entirely clear from the present molecular analysis how the SWI protein may function in the ecdysone-induced cascade. Currently all predictions agree that SWI is very unlikely to be a nuclear protein. Thus it probably exercises its control of "late" ecdysone genes indirectly. Apparently the genetic regulation of ecdysone signaling is much more complex than was previously anticipated.
The recent completion of the Drosophila melanogaster genomic sequence to high quality, and the availability of a greatly expanded set of Drosophila cDNA sequences, afforded FlyBase the opportunity to significantly improve genomic annotations.
The recent completion of the Drosophila melanogaster genomic sequence to high quality and the availability of a greatly expanded set of Drosophila cDNA sequences, aligning to 78% of the predicted euchromatic genes, afforded FlyBase the opportunity to significantly improve genomic annotations. We made the annotation process more rigorous by inspecting each gene visually, utilizing a comprehensive set of curation rules, requiring traceable evidence for each gene model, and comparing each predicted peptide to SWISS-PROT and TrEMBL sequences.
Although the number of predicted protein-coding genes in Drosophila remains essentially unchanged, the revised annotation significantly improves gene models, resulting in structural changes to 85% of the transcripts and 45% of the predicted proteins. We annotated transposable elements and non-protein-coding RNAs as new features, and extended the annotation of untranslated (UTR) sequences and alternative transcripts to include more than 70% and 20% of genes, respectively. Finally, cDNA sequence provided evidence for dicistronic transcripts, neighboring genes with overlapping UTRs on the same DNA sequence strand, alternatively spliced genes that encode distinct, non-overlapping peptides, and numerous nested genes.
Identification of so many unusual gene models not only suggests that some mechanisms for gene regulation are more prevalent than previously believed, but also underscores the complex challenges of eukaryotic gene prediction. At present, experimental data and human curation remain essential to generate high-quality genome annotations.