1.  Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data 
G3: Genes|Genomes|Genetics  2015;5(8):1721-1736.
We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase. Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3′ UTRs (up to 15–18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.
PMCID: PMC4528329  PMID: 26109357
transcriptome; alternative splice; lncRNA; transcription start site; exon junction
2.  Gene Model Annotations for Drosophila melanogaster: The Rule-Benders 
G3: Genes|Genomes|Genetics  2015;5(8):1737-1749.
In the context of the FlyBase annotated gene models in Drosophila melanogaster, we describe the many exceptional cases we have curated from the literature or identified in the course of FlyBase analysis. These range from atypical but common examples such as dicistronic and polycistronic transcripts, noncanonical splices, trans-spliced transcripts, noncanonical translation starts, and stop-codon readthroughs, to single exceptional cases such as ribosomal frameshifting and HAC1-type intron processing. In FlyBase, exceptional genes and transcripts are flagged with Sequence Ontology terms and/or standardized comments. Because some of the rule-benders create problems for handlers of high-throughput data, we discuss plans for flagging these cases in bulk data downloads.
PMCID: PMC4528330  PMID: 26109356
bicistronic; stop-codon suppression; multiphasic exon; shared promoter; non-AUG translation start
3.  Role of Electrostatics in the Assembly Pathway of a Single-Stranded RNA Virus 
Journal of Virology  2014;88(18):10472-10479.
We have recently discovered (R. D. Cadena-Nava et al., J. Virol. 86:3318–3326, 2012, doi:10.1128/JVI.06566-11) that the in vitro packaging of RNA by the capsid protein (CP) of cowpea chlorotic mottle virus is optimal when there is a significant excess of CP, specifically that complete packaging of all of the RNA in solution requires sufficient CP to provide charge matching of the N-terminal positively charged arginine-rich motifs (ARMs) of the CPs with the negatively charged phosphate backbone of the RNA. We show here that packaging results from the initial formation of a charge-matched protocapsid consisting of RNA decorated by a disordered arrangement of CPs. This protocapsid reorganizes into the final, icosahedrally symmetric nucleocapsid by displacing the excess CPs from the RNA to the exterior surface of the emerging capsid through electrostatic attraction between the ARMs of the excess CP and the negative charge density of the capsid exterior. As a test of this scenario, we prepare CP mutants with extra and missing (relative to the wild type) cationic residues and show that a correspondingly smaller and larger excess, respectively, of CP is needed for complete packaging of RNA.
IMPORTANCE Cowpea chlorotic mottle virus (CCMV) has long been studied as a model system for the assembly of single-stranded RNA viruses. While much is known about the electrostatic interactions within the CCMV virion, relatively little is known about these interactions during assembly, i.e., within intermediate states preceding the final nucleocapsid structure. Theoretical models and coarse-grained molecular dynamics simulations suggest that viruses like CCMV assemble by the bulk adsorption of CPs onto the RNA driven by electrostatic attraction, followed by structural reorganization into the final capsid. Such a mechanism facilitates assembly by condensing the RNA for packaging while simultaneously concentrating the local density of CP for capsid nucleation. We provide experimental evidence of such a mechanism by demonstrating that efficient assembly is initiated by the formation of a disordered protocapsid complex whose stoichiometry is governed by electrostatics (charge matching of the anionic RNA and the cationic N termini of the CP).
PMCID: PMC4178897  PMID: 24965458
4.  FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations 
Nucleic Acids Research  2014;43(Database issue):D690-D697.
Release 6, the latest reference genome assembly of the fruit fly Drosophila melanogaster, was released by the Berkeley Drosophila Genome Project in 2014; it replaces their previous Release 5 genome assembly, which had been the reference genome assembly for over 7 years. With the enormous amount of information now attached to the D. melanogaster genome in public repositories and individual laboratories, the replacement of the previous assembly by the new one is a major event requiring careful migration of annotations and genome-anchored data to the new, improved assembly. In this report, we describe the attributes of the new Release 6 reference genome assembly, the migration of FlyBase genome annotations to this new assembly, how genome features on this new assembly can be viewed in FlyBase, and how users can convert coordinates for their own data to the corresponding Release 6 coordinates.
PMCID: PMC4383921  PMID: 25398896
5.  In Vitro Quantification of the Relative Packaging Efficiencies of Single-Stranded RNA Molecules by Viral Capsid Protein 
Journal of Virology  2012;86(22):12271-12282.
While most T=3 single-stranded RNA (ssRNA) viruses package in vivo about 3,000 nucleotides (nt), in vitro experiments have demonstrated that a broad range of RNA lengths can be packaged. Under the right solution conditions, for example, cowpea chlorotic mottle virus (CCMV) capsid protein (CP) has been shown to package RNA molecules whose lengths range from 100 to 10,000 nt. Furthermore, in each case it can package the RNA completely, as long as the mass ratio of CP to nucleic acid in the assembly mixture is 6:1 or higher. Yet the packaging efficiencies of the RNAs can differ widely, as we demonstrate by measurements in which two RNAs compete head-to-head for a limited amount of CP. We show that the relative efficiency depends nonmonotonically on the RNA length, with 3,200 nt being optimum for packaging by the T=3 capsids preferred by CCMV CP. When two RNAs of the same length—and hence the same charge—compete for CP, differences in packaging efficiency are necessarily due to differences in their secondary structures and/or three-dimensional (3D) sizes. For example, the heterologous RNA1 of brome mosaic virus (BMV) is packaged three times more efficiently by CCMV CP than is RNA1 of CCMV, even though the two RNAs have virtually identical lengths. Finally, we show that in an assembly mixture at neutral pH, CP binds reversibly to the RNA and there is a reversible equilibrium between all the various RNA/CP complexes. At acidic pH, excess protein unbinds from RNA/CP complexes and nucleocapsids form irreversibly.
PMCID: PMC3486494  PMID: 22951822
6.  Viral RNAs Are Unusually Compact 
PLoS ONE  2014;9(9):e105875.
A majority of viruses are composed of long single-stranded genomic RNA molecules encapsulated by protein shells with diameters of just a few tens of nanometers. We examine the extent to which these viral RNAs have evolved to be physically compact molecules to facilitate encapsulation. Measurements of equal-length viral, non-viral, coding and non-coding RNAs show viral RNAs to have among the smallest sizes in solution, i.e., the highest gel-electrophoretic mobilities and the smallest hydrodynamic radii. Using graph-theoretical analyses we demonstrate that their sizes correlate with the compactness of branching patterns in predicted secondary structure ensembles. The density of branching is determined by the number and relative positions of 3-helix junctions, and is highly sensitive to the presence of rare higher-order junctions with 4 or more helices. Compact branching arises from a preponderance of base pairing between nucleotides close to each other in the primary sequence. The density of branching represents a degree of freedom optimized by viral RNA genomes in response to the evolutionary pressure to be packaged reliably. Several families of viruses are analyzed to delineate the effects of capsid geometry, size and charge stabilization on the selective pressure for RNA compactness. Compact branching has important implications for RNA folding and viral assembly.
PMCID: PMC4154850  PMID: 25188030
7.  Self-Assembly of Viral Capsid Protein and RNA Molecules of Different Sizes: Requirement for a Specific High Protein/RNA Mass Ratio 
Journal of Virology  2012;86(6):3318-3326.
Virus-like particles can be formed by self-assembly of capsid protein (CP) with RNA molecules of increasing length. If the protein “insisted” on a single radius of curvature, the capsids would be identical in size, independent of RNA length. However, there would be a limit to length of the RNA, and one would not expect RNA much shorter than native viral RNA to be packaged unless multiple copies were packaged. On the other hand, if the protein did not favor predetermined capsid size, one would expect the capsid diameter to increase with increase in RNA length. Here we examine the self-assembly of CP from cowpea chlorotic mottle virus with RNA molecules ranging in length from 140 to 12,000 nucleotides (nt). Each of these RNAs is completely packaged if and only if the protein/RNA mass ratio is sufficiently high; this critical value is the same for all of the RNAs and corresponds to equal RNA and N-terminal-protein charges in the assembly mix. For RNAs much shorter in length than the 3,000 nt of the viral RNA, two or more molecules are assembled into 24- and 26-nm-diameter capsids, whereas for much longer RNAs (>4,500 nt), a single RNA molecule is shared/packaged by two or more capsids with diameters as large as 30 nm. For intermediate lengths, a single RNA is assembled into 26-nm-diameter capsids, the size associated with T=3 wild-type virus. The significance of these assembly results is discussed in relation to likely factors that maintain T=3 symmetry in vivo.
PMCID: PMC3302347  PMID: 22205731
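The charge-matching criterion in the entry above can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: the average nucleotide mass, the CP mass, and the number of cationic residues per CP N terminus are approximate values assumed here, not figures stated in the abstract, and the function names are hypothetical.

```python
# Assumed, approximate values (not taken from the abstract):
NT_MASS_DA = 330.0       # average mass of one RNA nucleotide; one negative charge each
CP_MASS_DA = 20_300.0    # approximate mass of one CCMV capsid protein (CP)
CP_N_TERM_CHARGE = 10    # assumed cationic residues per CP N terminus


def charge_matched_mass_ratio(cp_charge=CP_N_TERM_CHARGE,
                              cp_mass=CP_MASS_DA,
                              nt_mass=NT_MASS_DA):
    """CP:RNA mass ratio at which N-terminal positive charge equals RNA charge.

    Each CP neutralizes `cp_charge` nucleotides, so every nucleotide needs
    cp_mass / cp_charge daltons of protein. RNA length cancels out.
    """
    return (cp_mass / cp_charge) / nt_mass


def cps_for_charge_matching(rna_length_nt, cp_charge=CP_N_TERM_CHARGE):
    """Number of CPs whose N-terminal charge matches an RNA of the given length."""
    return rna_length_nt / cp_charge


ratio = charge_matched_mass_ratio()
cps_3000 = cps_for_charge_matching(3000)
```

Under these assumptions the ratio comes out close to 6:1 and does not depend on RNA length, in line with the abstract's statement that the critical value is the same for all of the RNAs. A 3,000-nt RNA would need about 300 CPs, more than the 180 subunits of a T=3 capsid, which is one way to read the "excess" CP described in entry 3. Passing a larger `cp_charge` lowers the required ratio, matching the direction of the mutant results reported there.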
8.  Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures 
Nature  2007;450(7167):219-232.
Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.
PMCID: PMC2474711  PMID: 17994088
9.  Salt-Dependent DNA-DNA Spacings in Intact Bacteriophage λ Reflect Relative Importance of DNA Self-Repulsion and Bending Energies 
Physical review letters  2011;106(2):028102.
Using solution synchrotron X-ray scattering, we measure the variation of DNA-DNA d-spacings in bacteriophage λ with mono-, di- and poly-valent salt concentrations, for wild-type (48.5 kbp) and short-genome-mutant (37.8 kbp) strains. From the decrease in d-spacings with increasing salt, we deduce the relative contributions of DNA self-repulsion and bending to the energetics of packaged phage genomes. We quantify the DNA-DNA interaction energies within the intact phage by combining the measured d-spacings in the capsid with measurements of osmotic pressure in DNA assemblies under the same salt conditions in bulk solution. In the commonly used Tris-Mg buffer, the DNA-DNA interaction energies inside the phage capsids are shown to be about 1 kT/base pair, an order of magnitude larger than the bending energies.
PMCID: PMC3420006  PMID: 21405253
10.  Automatic categorization of diverse experimental information in the bioscience literature 
BMC Bioinformatics  2012;13:16.
Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and is thus usually time-consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. The approach is completely automatic and thus can be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
PMCID: PMC3305665  PMID: 22280404
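To make the idea in the entry above concrete, here is a minimal sketch of a linear SVM that sorts papers by data type from bag-of-words features. It is not the authors' implementation: the training routine (sub-gradient descent on the hinge loss, no bias term), the function names, and the four toy "papers" are all assumptions made for illustration, and a production system would train on hundreds to thousands of full-text documents with a dedicated SVM library.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def train_linear_svm(docs, labels, lr=0.1, lam=0.01, epochs=20):
    """Sub-gradient descent on the L2-regularized hinge loss (no bias term).

    `labels` are +1 (paper contains the data type) or -1 (it does not).
    Returns a dict mapping each term to its learned weight.
    """
    w = {}
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            x = featurize(text)
            margin = y * sum(w.get(t, 0.0) * c for t, c in x.items())
            # Regularization shrinks every weight slightly on each step.
            for t in w:
                w[t] *= 1.0 - lr * lam
            # The hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                for t, c in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * c
    return w


def predict(w, text):
    """Return +1 if the weighted term score is non-negative, else -1."""
    score = sum(w.get(t, 0.0) * c for t, c in featurize(text).items())
    return 1 if score >= 0 else -1


# Hypothetical toy corpus: +1 = "RNAi" data type, -1 = anything else.
TOY_DOCS = [
    "rnai knockdown phenotype",
    "rnai feeding screen",
    "crystal structure protein",
    "protein structure nmr",
]
TOY_LABELS = [1, 1, -1, -1]

weights = train_linear_svm(TOY_DOCS, TOY_LABELS)
```

Calling `predict(weights, "rnai screen")` then flags a new paper for the RNAi data type. The cross-corpus procedure described in the abstract would correspond, in this sketch, to concatenating training documents from several bodies of literature before calling `train_linear_svm`.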
11.  Rescue of Infectious Particles from Preassembled Alphavirus Nucleocapsid Cores
Journal of Virology  2011;85(12):5773-5781.
Alphaviruses are small, spherical, enveloped, positive-sense, single-stranded, RNA viruses responsible for considerable human and animal disease. Using microinjection of preassembled cores as a tool, a system has been established to study the assembly and budding process of Sindbis virus, the type member of the alphaviruses. We demonstrate the release of infectious virus-like particles from cells expressing Sindbis virus envelope glycoproteins following microinjection of Sindbis virus nucleocapsids purified from the cytoplasm of infected cells. Furthermore, it is shown that nucleocapsids assembled in vitro mimic those isolated in the cytoplasm of infected cells with respect to their ability to be incorporated into enveloped virions following microinjection. This system allows for the study of the alphavirus budding process independent of an authentic infection and provides a platform to study viral and host requirements for budding.
PMCID: PMC3126313  PMID: 21471237
12.  The effect of genome length on ejection forces in bacteriophage lambda 
Virology  2006;348(2):430-436.
A variety of viruses tightly pack their genetic material into protein capsids that are barely large enough to enclose the genome. In particular, in bacteriophages, forces as high as 60 pN are encountered during packaging and ejection, produced by DNA bending elasticity and self-interactions. The high forces are believed to be important for the ejection process, though the extent of their involvement is not yet clear. As a result, there is a need for quantitative models and experiments that reveal the nature of the forces relevant to DNA ejection. Here we report measurements of the ejection forces for two different mutants of bacteriophage λ, λb221cI26 and λcI60, which differ in genome length by ~30%. As expected for a force-driven ejection mechanism, the osmotic pressure at which DNA release is completely inhibited varies with the genome length: we find inhibition pressures of 15 atm and 25 atm, for the short and long genomes, respectively, values that are in agreement with our theoretical calculations.
PMCID: PMC3178461  PMID: 16469346
bacteriophage; lambda; LamB; maltoporin; genome delivery; DNA ejection; pressure
13.  The ends of a large RNA molecule are necessarily close 
Nucleic Acids Research  2010;39(1):292-299.
We show on general theoretical grounds that the two ends of single-stranded (ss) RNA molecules (consisting of roughly equal proportions of A, C, G and U) are necessarily close together, largely independent of their length and sequence. This is demonstrated to be a direct consequence of two generic properties of the equilibrium secondary structures, namely that the average proportion of bases in pairs is ∼60% and that the average duplex length is ∼4. Based on mfold and Vienna computations on large numbers of ssRNAs of various lengths (1000–10 000 nt) and sequences (both random and biological), we find that the 5′–3′ distance—defined as the sum of H-bond and covalent (ss) links separating the ends of the RNA chain—is small, averaging 15–20 for each set of viral sequences tested. For random sequences this distance is ∼12, consistent with the theory. We discuss the relevance of these results to evolved sequence complementarity and specific protein binding effects that are known to be important for keeping the two ends of viral and messenger RNAs in close proximity. Finally we speculate on how our conclusions imply indistinguishability in size and shape of equilibrated forms of linear and covalently circularized ssRNA molecules.
PMCID: PMC3017586  PMID: 20810537
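The 5′–3′ distance defined in the entry above can be computed for any predicted secondary structure. The sketch below takes one plausible reading of that definition: treat each nucleotide as a graph node, connect backbone neighbors (covalent links) and base-paired partners (H-bond links), and take the shortest path between the two ends. The dot-bracket input format and the function names are assumptions for illustration; the abstract does not describe the authors' code.

```python
from collections import deque


def parse_pairs(dot_bracket):
    """Map each paired position to its partner in a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def end_to_end_distance(dot_bracket):
    """Fewest covalent plus H-bond links between the 5' and 3' ends.

    Breadth-first search over a graph whose edges are backbone links
    (i, i+1) and base pairs (i, partner of i), each counted as one link.
    """
    n = len(dot_bracket)
    if n == 0:
        return 0
    pairs = parse_pairs(dot_bracket)
    dist = {0: 0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        if i == n - 1:
            return dist[i]
        neighbors = [i - 1, i + 1]
        if i in pairs:
            neighbors.append(pairs[i])
        for j in neighbors:
            if 0 <= j < n and j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist[n - 1]
```

For a fully unpaired chain the distance grows with length, while a structure whose exterior loop contains only a few helices and short single-stranded stretches gives a small distance regardless of how many nucleotides sit inside those helices, which is the qualitative point of the abstract.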
14.  Comparative Genomics of the Eukaryotes 
Science (New York, N.Y.)  2000;287(5461):2204-2215.
A comparative analysis of the genomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae—and the proteins they are predicted to encode—was undertaken in the context of cellular, developmental, and evolutionary processes. The nonredundant protein sets of flies and worms are similar in size and are only twice that of yeast, but different gene families are expanded in each genome, and the multidomain proteins and signaling pathways of the fly and worm are far more complex than those of yeast. The fly has orthologs to 177 of the 289 human disease genes examined and provides the foundation for rapid analysis of some of the basic processes involved in human disease.
PMCID: PMC2754258  PMID: 10731134
15.  VectorBase: a data resource for invertebrate vector genomics 
Nucleic Acids Research  2008;37(Database issue):D583-D587.
VectorBase is an NIAID-funded Bioinformatic Resource Center focused on invertebrate vectors of human pathogens. VectorBase annotates and curates vector genomes providing a web accessible integrated resource for the research community. Currently, VectorBase contains genome information for three mosquito species: Aedes aegypti, Anopheles gambiae and Culex quinquefasciatus, a body louse Pediculus humanus and a tick species Ixodes scapularis. Since our last report VectorBase has initiated a community annotation system, a microarray and gene expression repository and controlled vocabularies for anatomy and insecticide resistance. We have continued to develop both the software infrastructure and tools for interrogating the stored data.
PMCID: PMC2686483  PMID: 19028744
16.  Inferring genome-scale rearrangement phylogeny and ancestral gene order: a Drosophila case study 
Genome Biology  2007;8(11):R236.
A simple, fast, and biologically-inspired computational approach to infer genome-scale rearrangement phylogeny and ancestral gene order has been developed and applied to eight Drosophila genomes, providing insights into evolutionary chromosomal dynamics.
A simple, fast, and biologically inspired computational approach for inferring genome-scale rearrangement phylogeny and ancestral gene order has been developed. This has been applied to eight Drosophila genomes. Existing techniques are either limited to a few hundred markers or a small number of taxa. This analysis uses over 14,000 genomic loci and employs discrete elements consisting of pairs of homologous genetic elements. The results provide insight into evolutionary chromosomal dynamics and synteny analysis, and inform speciation studies.
PMCID: PMC2258185  PMID: 17996033
17.  Analysis of 14 BAC sequences from the Aedes aegypti genome: a benchmark for genome annotation and assembly 
Genome Biology  2007;8(5):R88.
In order to provide a set of manually curated and annotated sequences from the Aedes aegypti genome, mapped BAC clones encompassing 1.57 Mb were sequenced, assembled and manually annotated using computational gene-finding, EST matches as well as comparative protein homology.
Aedes aegypti is the principal vector of yellow fever and dengue viruses throughout the tropical world. To provide a set of manually curated and annotated sequences from the Ae. aegypti genome, 14 mapped bacterial artificial chromosome (BAC) clones encompassing 1.57 Mb were sequenced, assembled and manually annotated using a combination of computational gene-finding, expressed sequence tag (EST) matches and comparative protein homology. PCR and sequencing were used to experimentally confirm expression and sequence of a subset of these transcripts.
Of the 51 manual annotations, 50 and 43 demonstrated a high level of similarity to Anopheles gambiae and Drosophila melanogaster genes, respectively. Ten of the 12 BAC sequences with more than one annotated gene exhibited synteny with the A. gambiae genome. Putative transcripts from eight BAC clones were found in multiple copies (two copies in most cases) in the Aedes genome assembly, which points to the probable presence of haplotype polymorphisms and/or misassemblies.
This study not only provides a benchmark set of manually annotated transcripts for this genome that can be used to assess the quality of the auto-annotation pipeline and the assembly, but it also looks at the effect of a high repeat content on the genome assembly and annotation pipeline.
PMCID: PMC1929151  PMID: 17519023
18.  VectorBase: a home for invertebrate vectors of human pathogens 
Nucleic Acids Research  2006;35(Database issue):D503-D505.
VectorBase is a web-accessible data repository for information about invertebrate vectors of human pathogens. VectorBase annotates and maintains vector genomes providing an integrated resource for the research community. Currently, VectorBase contains genome information for two organisms: Anopheles gambiae, a vector for the Plasmodium protozoan agent causing malaria, and Aedes aegypti, a vector for the flaviviral agents causing Yellow fever and Dengue fever.
PMCID: PMC1751530  PMID: 17145709
19.  FlyBase: genomes by the dozen 
Nucleic Acids Research  2006;35(Database issue):D486-D491.
FlyBase is the primary database of genetic and genomic data for the insect family Drosophilidae. Historically, Drosophila melanogaster has been the most extensively studied species in this family, but recent determination of the genomic sequences of an additional 11 Drosophila species opens up new avenues of research for other Drosophila species. This extensive sequence resource, encompassing species with well-defined phylogenetic relationships, provides a model system for comparative genomic analyses. FlyBase has developed tools to facilitate access to and navigation through this invaluable new data collection.
PMCID: PMC1669768  PMID: 17099233
20.  Annotation of the Drosophila melanogaster euchromatic genome: a systematic review 
Genome Biology  2002;3(12):research0083.1-83.22.
The recent completion of the Drosophila melanogaster genomic sequence to high quality, and the availability of a greatly expanded set of Drosophila cDNA sequences, afforded FlyBase the opportunity to significantly improve genomic annotations.
The recent completion of the Drosophila melanogaster genomic sequence to high quality and the availability of a greatly expanded set of Drosophila cDNA sequences, aligning to 78% of the predicted euchromatic genes, afforded FlyBase the opportunity to significantly improve genomic annotations. We made the annotation process more rigorous by inspecting each gene visually, utilizing a comprehensive set of curation rules, requiring traceable evidence for each gene model, and comparing each predicted peptide to SWISS-PROT and TrEMBL sequences.
Although the number of predicted protein-coding genes in Drosophila remains essentially unchanged, the revised annotation significantly improves gene models, resulting in structural changes to 85% of the transcripts and 45% of the predicted proteins. We annotated transposable elements and non-protein-coding RNAs as new features, and extended the annotation of untranslated (UTR) sequences and alternative transcripts to include more than 70% and 20% of genes, respectively. Finally, cDNA sequence provided evidence for dicistronic transcripts, neighboring genes with overlapping UTRs on the same DNA sequence strand, alternatively spliced genes that encode distinct, non-overlapping peptides, and numerous nested genes.
Identification of so many unusual gene models not only suggests that some mechanisms for gene regulation are more prevalent than previously believed, but also underscores the complex challenges of eukaryotic gene prediction. At present, experimental data and human curation remain essential to generate high-quality genome annotations.
PMCID: PMC151185  PMID: 12537572

