1.  TBX3 Regulates Splicing In Vivo: A Novel Molecular Mechanism for Ulnar-Mammary Syndrome 
PLoS Genetics  2014;10(3):e1004247.
TBX3 is a member of the T-box family of transcription factors with critical roles in development, oncogenesis, cell fate, and tissue homeostasis. TBX3 mutations in humans cause complex congenital malformations and Ulnar-mammary syndrome. Previous investigations into TBX3 function focused on its activity as a transcriptional repressor. We used an unbiased proteomic approach to identify TBX3 interacting proteins in vivo and discovered that TBX3 interacts with multiple mRNA splicing factors and RNA metabolic proteins. We discovered that TBX3 regulates alternative splicing in vivo and can promote or inhibit splicing depending on context and transcript. TBX3 associates with alternatively spliced mRNAs and binds RNA directly. TBX3 binds RNAs containing TBX binding motifs, and these motifs are required for regulation of splicing. Our study reveals that TBX3 mutations seen in humans with UMS disrupt its splicing regulatory function. The pleiotropic effects of TBX3 mutations in humans and mice likely result from disrupting at least two molecular functions of this protein: transcriptional regulation and pre-mRNA splicing.
Author Summary
TBX3 is a protein with essential roles in development and tissue homeostasis, and is implicated in cancer pathogenesis. TBX3 mutations in humans cause a complex of birth defects called Ulnar-mammary syndrome (UMS). Despite the importance of TBX3 and decades of investigation, few TBX3 partner proteins have been identified and little is known about how it functions in cells. Unlike previous investigations focused on TBX3 as a DNA binding factor that represses transcription, we took an unbiased approach to identify TBX3 partner proteins in mouse embryos and human cells. We discovered that TBX3 interacts with RNA binding proteins and binds mRNAs to regulate how they are spliced. The different mutations seen in human UMS patients produce mutant proteins that interact with different partners and have different splicing activities. TBX3 promotes or inhibits splicing depending on cellular context, its partner proteins, and the target mRNA. Eukaryotic cells have many more proteins than genes: alternative splicing is critical to generate the different mRNAs needed for production of the specific and vast repertoire of proteins a cell produces. Our finding that TBX3 regulates this process provides fundamental new insights into how altered quantity and molecular function of TBX3 contribute to human developmental disorders and cancer.
doi:10.1371/journal.pgen.1004247
PMCID: PMC3967948  PMID: 24675841
2.  In Vivo Determination of Direct Targets of the Nonsense-Mediated Decay Pathway in Drosophila 
G3: Genes|Genomes|Genetics  2014;4(3):485-496.
Nonsense-mediated messenger RNA (mRNA) decay (NMD) is an mRNA degradation pathway that regulates a significant portion of the transcriptome. The expression levels of numerous genes are known to be altered in NMD mutants, but it is not known which of these transcripts are direct pathway targets. Here, we present the first genome-wide analysis of direct NMD targeting in an intact animal. By using rapid reactivation of the NMD pathway in a Drosophila melanogaster NMD mutant and globally monitoring changes in mRNA expression levels, we can distinguish between primary and secondary effects of NMD on gene expression. Using this procedure, we identified 168 candidate direct NMD targets in vivo. Remarkably, we found that 81% of direct target genes do not show increased expression levels in an NMD mutant, presumably due to feedback regulation. Because most previous studies have used up-regulation of mRNA expression as the only means to identify NMD-regulated transcripts, our results provide new directions for understanding the roles of the NMD pathway in endogenous gene regulation during animal development and physiology. For instance, we show clearly that direct target genes have longer 3′ untranslated regions compared with nontargets, suggesting long 3′ untranslated regions target mRNAs for NMD in vivo. In addition, we investigated the role of NMD in suppressing transcriptional noise and found that although the transposable element Copia is up-regulated in NMD mutants, this effect appears to be indirect.
doi:10.1534/g3.113.009357
PMCID: PMC3962487  PMID: 24429422
Upf2; reactivation; NMD; Drosophila; RNA-seq
3.  Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution 
Nature genetics  2013;45(4):415-421e2.
Lampreys are representatives of an ancient vertebrate lineage that diverged from our own ~500 million years ago. By virtue of this deeply shared ancestry, the sea lamprey (P. marinus) genome is uniquely poised to provide insight into the ancestry of vertebrate genomes and the underlying principles of vertebrate biology. Here, we present the first lamprey whole-genome sequence and assembly. We note challenges faced owing to its high content of repetitive elements and GC bases, as well as the absence of broad-scale sequence information from closely related species. Analyses of the assembly indicate that two whole-genome duplications likely occurred before the divergence of ancestral lamprey and gnathostome lineages. Moreover, the results help define key evolutionary events within vertebrate lineages, including the origin of myelin-associated proteins and the development of appendages. The lamprey genome provides an important resource for reconstructing vertebrate origins and the evolutionary events that have shaped the genomes of extant organisms.
doi:10.1038/ng.2568
PMCID: PMC3709584  PMID: 23435085
4.  Genomic diversity and evolution of the head crest in the rock pigeon 
Science (New York, N.Y.)  2013;339(6123):1063-1067.
The geographic origins of breeds and genetic basis of variation within the widely distributed and phenotypically diverse domestic rock pigeon (Columba livia) remain largely unknown. We generated a rock pigeon reference genome and additional genome sequences representing domestic and feral populations. We find evidence for the origins of major breed groups in the Middle East, and contributions from a racing breed to North American feral populations. We identify EphB2 as a strong candidate for the derived head crest phenotype shared by numerous breeds, an important trait in mate selection in many avian species. We also find evidence that this trait evolved just once and spread throughout the species, and that the crest originates early in development by the localized molecular reversal of feather bud polarity.
doi:10.1126/science.1230422
PMCID: PMC3778192  PMID: 23371554
5.  ImagePlane: An Automated Image Analysis Pipeline for High-Throughput Screens Using the Planarian Schmidtea mediterranea 
Journal of Computational Biology  2013;20(8):583-592.
Abstract
ImagePlane is a modular pipeline for automated, high-throughput image analysis and information extraction. Designed to support planarian research, ImagePlane offers a self-parameterizing adaptive thresholding algorithm; an algorithm that can automatically segment animals into anterior–posterior/left–right quadrants for automated identification of region-specific differences in gene and protein expression; and a novel algorithm for quantification of morphology of animals, independent of their orientations and sizes. ImagePlane also provides methods for automatic report generation, and its outputs can be easily imported into third-party tools such as R and Excel. Here we demonstrate the pipeline's utility for identification of genes involved in stem cell proliferation in the planarian Schmidtea mediterranea. Although designed to support planarian studies, ImagePlane will prove useful for cell-based studies as well.
doi:10.1089/cmb.2013.0025
PMCID: PMC3728726  PMID: 23822514
biology; functional genomics; genomics
6.  Transposable Elements Are Major Contributors to the Origin, Diversification, and Regulation of Vertebrate Long Noncoding RNAs 
PLoS Genetics  2013;9(4):e1003470.
Advances in vertebrate genomics have uncovered thousands of loci encoding long noncoding RNAs (lncRNAs). While progress has been made in elucidating the regulatory functions of lncRNAs, little is known about their origins and evolution. Here we explore the contribution of transposable elements (TEs) to the makeup and regulation of lncRNAs in human, mouse, and zebrafish. Surprisingly, TEs occur in more than two thirds of mature lncRNA transcripts and account for a substantial portion of total lncRNA sequence (∼30% in human), whereas they seldom occur in protein-coding transcripts. While TEs contribute less to lncRNA exons than expected, several TE families are strongly enriched in lncRNAs. There is also substantial interspecific variation in the coverage and types of TEs embedded in lncRNAs, partially reflecting differences in the TE landscapes of the genomes surveyed. In human, TE sequences in lncRNAs evolve under greater evolutionary constraint than their non–TE sequences, than their intronic TEs, or than random DNA. Consistent with functional constraint, we found that TEs contribute signals essential for the biogenesis of many lncRNAs, including ∼30,000 unique sites for transcription initiation, splicing, or polyadenylation in human. In addition, we identified ∼35,000 TEs marked as open chromatin located within 10 kb upstream of lncRNA genes. The density of these marks in one cell type correlates with elevated expression of the downstream lncRNA in the same cell type, suggesting that these TEs contribute to cis-regulation. These global trends are recapitulated in several lncRNAs with established functions. Finally, a subset of TEs embedded in lncRNAs are subject to RNA editing and predicted to form secondary structures likely important for function. In conclusion, TEs are nearly ubiquitous in lncRNAs and have played an important role in the lineage-specific diversification of vertebrate lncRNA repertoires.
Author Summary
An unexpected layer of complexity in the genomes of humans and other vertebrates lies in the abundance of genes that do not appear to encode proteins but produce a variety of non-coding RNAs. In particular, the human genome is currently predicted to contain 5,000–10,000 independent gene units generating long (>200 nucleotides) noncoding RNAs (lncRNAs). While there is growing evidence that a large fraction of these lncRNAs have cellular functions, notably to regulate protein-coding gene expression, almost nothing is known about the processes underlying the evolutionary origins and diversification of lncRNA genes. Here we show that transposable elements, through their capacity to move and spread in genomes in a lineage-specific fashion, as well as their ability to introduce regulatory sequences upon chromosomal insertion, represent a major force shaping the lncRNA repertoire of humans, mice, and zebrafish. Not only do TEs make up a substantial fraction of mature lncRNA transcripts, they are also enriched in the vicinity of lncRNA genes, where they frequently contribute to their transcriptional regulation. Through specific examples we provide evidence that some TE sequences embedded in lncRNAs are critical for the biogenesis of lncRNAs and likely important for their function.
doi:10.1371/journal.pgen.1003470
PMCID: PMC3636048  PMID: 23637635
7.  Global analysis of disease-related DNA sequence variation in 10 healthy individuals: Implications for whole genome-based clinical diagnostics 
Background
Understanding how sequence variants within healthy genomes are distributed with respect to ethnicity and disease-implicated genes is an essential first step toward establishing baselines for personalized genomic medicine.
Methods
In this study, we present an analysis of 10 genomes from healthy individuals of various ethnicities, produced using six different sequencing technologies. In total, these genomes contain more than 34 million single-nucleotide variants.
Results
We have analyzed these variants from a clinical perspective, assaying the influence of sequencing technology and ethnicity on prognosis. We have also examined the utility of OMIM and the disease-gene literature for determining the impact of rare, personal variants on an individual’s health.
Conclusions
Our analyses demonstrate that clinical prognoses are complicated by sequencing platform-specific errors and ethnicity. We show that disease-causing alleles are globally distributed along ethnic lines, with alleles known to be disease causing in Eurasians being significantly more likely to be homozygous in Africans.
doi:10.1097/GIM.0b013e31820ed321
PMCID: PMC3558030  PMID: 21325948
personal genomes; genome analysis; personalized genomics
8.  Exome Sequencing and Unrelated Findings in the Context of Complex Disease Research: Ethical and Clinical Implications 
Discovery medicine  2011;12(62):41-55.
Exome sequencing has identified the causes of several Mendelian diseases, although it has rarely been used in a clinical setting to diagnose the genetic cause of an idiopathic disorder in a single patient. We performed exome sequencing on a pedigree with several members affected with attention deficit/hyperactivity disorder (ADHD), in an effort to identify candidate variants predisposing to this complex disease. While we did identify some rare variants that might predispose to ADHD, we have not yet proven causality for any of them. However, over the course of the study, one subject was discovered to have idiopathic hemolytic anemia (IHA), which was suspected to be genetic in origin. Analysis of this subject’s exome readily identified two rare non-synonymous mutations in the PKLR gene as the most likely cause of the IHA, although these two mutations had not been documented before in a single individual. We further confirmed the deficiency by functional biochemical testing, consistent with a diagnosis of red blood cell pyruvate kinase deficiency. Our study implies that exome and genome sequencing will certainly reveal additional rare variation causative for even well-studied classical Mendelian diseases, while also revealing variants that might play a role in complex diseases. Furthermore, our study has clinical and ethical implications for exome and genome sequencing in a research setting; how to handle unrelated findings of clinical significance, in the context of originally planned complex disease research, remains a largely uncharted area for clinicians and researchers.
PMCID: PMC3544941  PMID: 21794208
9.  Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779 
PLoS Genetics  2012;8(11):e1003064.
Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. 
The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing academic community focused on this genus.
Author Summary
Algae are a highly diverse group of organisms that have become the focus of renewed interest due to their potential for producing biofuel feedstocks, nutraceuticals, and biomaterials. Their high photosynthetic yields and ability to grow in areas unsuitable for agriculture provide a potential sustainable alternative to using traditional agricultural crops for biofuels. Because none of the algae currently in use have a history of domestication, and bioengineering of algae is still in its infancy, there is a need to develop algal strains adapted to cultivation for industrial large-scale production of desired compounds. Model organisms ranging from mice to baker's yeast have been instrumental in providing insights into fundamental biological structures and functions. The algal field needs versatile models to develop a fundamental understanding of photosynthetic production of biomass and valuable compounds in unicellular, marine, oleaginous algal species. To contribute to the development of such an algal model system for basic discovery, we sequenced the genome and two sets of transcriptomes of N. oceanica CCMP1779, assembled the genomic sequence, identified putative genes, and began to interpret the function of selected genes. This species was chosen because it is readily transformable with foreign DNA and grows well in culture.
doi:10.1371/journal.pgen.1003064
PMCID: PMC3499364  PMID: 23166516
10.  Elucidation of the molecular envenomation strategy of the cone snail Conus geographus through transcriptome sequencing of its venom duct 
BMC Genomics  2012;13:284.
Background
The fish-hunting cone snail, Conus geographus, is the deadliest snail on earth. In the absence of medical intervention, 70% of human stinging cases are fatal. Although its venom is known to consist of a cocktail of small peptides targeting different ion-channels and receptors, the bulk of its venom constituents, their sites of manufacture, relative abundances and how they function collectively in envenomation have remained unknown.
Results
We have used transcriptome sequencing to systematically elucidate the contents of the C. geographus venom duct, dividing it into four segments in order to investigate each segment’s mRNA contents. Three different types of calcium channel (each targeted by unrelated, entirely distinct venom peptides) and at least two different nicotinic receptors appear to be targeted by the venom. Moreover, the most highly expressed venom component is not paralytic, but causes sensory disorientation and is expressed in a different segment of the venom duct from venoms believed to cause sensory disruption. We have also identified several new toxins of interest for pharmaceutical and neuroscience research.
Conclusions
Conus geographus is believed to prey on fish hiding in reef crevices at night. Our data suggest that disorientation of prey is central to its envenomation strategy. Furthermore, venom expression profiles also suggest a sophisticated layering of venom-expression patterns within the venom duct, with disorientating and paralytic venoms expressed in different regions. Thus, our transcriptome analysis provides a new physiological framework for understanding the molecular envenomation strategy of this deadly snail.
doi:10.1186/1471-2164-13-284
PMCID: PMC3441800  PMID: 22742208
Conus geographus; Conotoxins; RNA-seq; Venom duct compartmentalization
11.  MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects 
BMC Bioinformatics  2011;12:491.
Background
Second-generation sequencing technologies are precipitating major shifts with regard to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies.
Results
We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review.
Conclusions
MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.
doi:10.1186/1471-2105-12-491
PMCID: PMC3280279  PMID: 22192575
13.  The Genome Sequence of the Leaf-Cutter Ant Atta cephalotes Reveals Insights into Its Obligate Symbiotic Lifestyle 
PLoS Genetics  2011;7(2):e1002007.
Leaf-cutter ants are one of the most important herbivorous insects in the Neotropics, harvesting vast quantities of fresh leaf material. The ants use leaves to cultivate a fungus that serves as the colony's primary food source. This obligate ant-fungus mutualism is one of the few occurrences of farming by non-humans and likely facilitated the formation of their massive colonies. Mature leaf-cutter ant colonies contain millions of workers ranging in size from small garden tenders to large soldiers, resulting in one of the most complex polymorphic caste systems within ants. To begin uncovering the genomic underpinnings of this system, we sequenced the genome of Atta cephalotes using 454 pyrosequencing. One prediction from this ant's lifestyle is that it has undergone genetic modifications that reflect its obligate dependence on the fungus for nutrients. Analysis of this genome sequence is consistent with this hypothesis, as we find evidence for reductions in genes related to nutrient acquisition. These include extensive reductions in serine proteases (which are likely unnecessary because proteolysis is not a primary mechanism used to process nutrients obtained from the fungus), a loss of genes involved in arginine biosynthesis (suggesting that this amino acid is obtained from the fungus), and the absence of a hexamerin (which sequesters amino acids during larval development in other insects). Following recent reports of genome sequences from other insects that engage in symbioses with beneficial microbes, the A. cephalotes genome provides new insights into the symbiotic lifestyle of this ant and advances our understanding of host–microbe symbioses.
Author Summary
Leaf-cutter ant workers forage for and cut leaves that they use to support the growth of a specialized fungus, which serves as the colony's primary food source. The ability of these ants to grow their own food likely facilitated their emergence as one of the most dominant herbivores in New World tropical ecosystems, where leaf-cutter ants harvest more plant biomass than any other herbivore species. These ants have also evolved one of the most complex forms of division of labor, with colonies composed of different-sized workers specialized for different tasks. To gain insight into the biology of these ants, we sequenced the first genome of a leaf-cutter ant, Atta cephalotes. Our analysis of this genome reveals characteristics reflecting the obligate nutritional dependency of these ants on their fungus. These findings represent the first genetic evidence of a reduced capacity for nutrient acquisition in leaf-cutter ants, which is likely compensated for by their fungal symbiont. These findings parallel other nutritional host–microbe symbioses, suggesting convergent genomic modifications in these types of associations.
doi:10.1371/journal.pgen.1002007
PMCID: PMC3037820  PMID: 21347285
14.  Characterization of the Conus bullatus genome and its venom-duct transcriptome 
BMC Genomics  2011;12:60.
Background
The venomous marine gastropods, cone snails (genus Conus), inject prey with a lethal cocktail of conopeptides, small cysteine-rich peptides, each with a high affinity for its molecular target, generally an ion channel, receptor or transporter. Over the last decade, conopeptides have proven indispensable reagents for the study of vertebrate neurotransmission. Conus bullatus belongs to a clade of Conus species called Textilia, whose pharmacology is still poorly characterized. Thus the genomics analyses presented here provide the first step toward a better understanding of the enigmatic Textilia clade.
Results
We have carried out a sequencing survey of the Conus bullatus genome and venom-duct transcriptome. We find that conopeptides are highly expressed within the venom-duct, and describe an in silico pipeline for their discovery and characterization using RNA-seq data. We have also carried out low-coverage shotgun sequencing of the genome, and have used these data to determine its size, genome-wide base composition, simple repeat, and mobile element densities.
Conclusions
Our results provide the first global view of venom-duct transcription in any cone snail. A notable feature of Conus bullatus venoms is the breadth of A-superfamily peptides expressed in the venom duct, which are unprecedented in their structural diversity. We also find SNP rates within conopeptides are higher compared to the remainder of C. bullatus transcriptome, consistent with the hypothesis that conopeptides are under diversifying selection.
doi:10.1186/1471-2164-12-60
PMCID: PMC3040727  PMID: 21266071
15.  A standard variation file format for human genome sequences 
Genome Biology  2010;11(8):R88.
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
doi:10.1186/gb-2010-11-8-r88
PMCID: PMC2945790  PMID: 20796305
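As a rough illustration of the tab-delimited layout described in the abstract above, the following minimal sketch parses a single GFF3-style feature line with nine columns and a `key=value` attribute column. The example record, its `example_source` value, and the helper name `parse_gvf_line` are hypothetical and chosen for illustration only; the sketch is not taken from the GVF paper or its tooling, and it does not validate Sequence Ontology terms.

```python
from urllib.parse import unquote

# The eight fixed GFF3 columns that precede the attributes column.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gvf_line(line):
    """Parse one tab-delimited feature line into a dict.

    Returns None for blank lines and for pragma/comment lines (starting with '#').
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs; values may be
    # comma-separated lists and may contain URL-style percent escapes.
    attributes = {}
    for pair in fields[8].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Hypothetical record, for illustration only.
example = "chr1\texample_source\tSNV\t12345\t12345\t.\t+\t.\tID=var_1;Variant_seq=A,G;Reference_seq=G"
record = parse_gvf_line(example)
print(record["type"], record["start"], record["attributes"]["Variant_seq"])
# prints: SNV 12345 ['A', 'G']
```

Keeping attribute values as lists reflects that a single attribute may carry several comma-separated values, such as the two alleles in the example.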
16.  The Pinus taeda genome is characterized by diverse and highly diverged repetitive sequences 
BMC Genomics  2010;11:420.
Background
In today's age of genomic discovery, no attempt has been made to comprehensively sequence a gymnosperm genome. The largest genus in the coniferous family Pinaceae is Pinus, whose 110-120 species have extremely large genomes (c. 20-40 Gb, 2N = 24). The size and complexity of these genomes have prompted much speculation as to the feasibility of completing a conifer genome sequence. Conifer genomes are reputed to be highly repetitive, but there is little information available on the nature and identity of repetitive units in gymnosperms. The pines have extensive genetic resources, with approximately 329000 ESTs from eleven species and genetic maps in eight species, including a dense genetic map of the twelve linkage groups in Pinus taeda.
Results
We present here the Sanger sequence and annotation of ten P. taeda BAC clones and Genome Analyzer II whole genome shotgun (WGS) sequences representing 7.5% of the genome. Computational annotation of ten BACs predicts three putative protein-coding genes and at least fifteen likely pseudogenes in nearly one megabase of sequence. We found three conifer-specific LTR retroelements in the BACs, and tentatively identified at least 15 others based on evidence from the distantly related angiosperms. Alignment of WGS sequences to the BACs indicates that 80% of BAC sequences have similar copies (≥ 75% nucleotide identity) elsewhere in the genome, but only 23% have identical copies (99% identity). The three most common repetitive elements in the genome were identified and, when combined, represent less than 5% of the genome.
Conclusions
This study indicates that the majority of repeats in the P. taeda genome are 'novel' and will therefore require additional BAC or genomic sequencing for accurate characterization. The pine genome contains a very large number of diverged and probably defunct repetitive elements. This study also provides new evidence that sequencing a pine genome using a WGS approach is a feasible goal.
doi:10.1186/1471-2164-11-420
PMCID: PMC2996948  PMID: 20609256
17.  Comparative Genomics of the Eukaryotes 
Science (New York, N.Y.)  2000;287(5461):2204-2215.
A comparative analysis of the genomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae—and the proteins they are predicted to encode—was undertaken in the context of cellular, developmental, and evolutionary processes. The nonredundant protein sets of flies and worms are similar in size and are only twice that of yeast, but different gene families are expanded in each genome, and the multidomain proteins and signaling pathways of the fly and worm are far more complex than those of yeast. The fly has orthologs to 177 of the 289 human disease genes examined and provides the foundation for rapid analysis of some of the basic processes involved in human disease.
PMCID: PMC2754258  PMID: 10731134
18.  Quantitative measures for the management and comparison of annotated genomes 
BMC Bioinformatics  2009;10:67.
Background
The ever-increasing number of sequenced and annotated genomes has made management of their annotations a significant undertaking, especially for large eukaryotic genomes containing many thousands of genes. Typically, changes in gene and transcript numbers are used to summarize changes from release to release, but these measures say nothing about changes to individual annotations, nor do they provide any means to identify annotations in need of manual review.
Results
In response, we have developed a suite of quantitative measures to better characterize changes to a genome's annotations between releases, and to prioritize problematic annotations for manual review. We have applied these measures to the annotations of five eukaryotic genomes over multiple releases – H. sapiens, M. musculus, D. melanogaster, A. gambiae, and C. elegans.
Conclusion
Our results provide the first detailed, historical overview of how these genomes' annotations have changed over the years, and demonstrate the usefulness of these measures for genome annotation management.
doi:10.1186/1471-2105-10-67
PMCID: PMC2653490  PMID: 19236712
19.  Genome-Wide Analysis of Human Disease Alleles Reveals That Their Locations Are Correlated in Paralogous Proteins 
PLoS Computational Biology  2008;4(11):e1000218.
The millions of mutations and polymorphisms that occur in human populations are potential predictors of disease, of our reactions to drugs, of predisposition to microbial infections, and of age-related conditions such as impaired brain and cardiovascular functions. However, predicting the phenotypic consequences and eventual clinical significance of a sequence variant is not an easy task. Computational approaches have found perturbation of conserved amino acids to be a useful criterion for identifying variants likely to have phenotypic consequences. To our knowledge, however, no study to date has explored the potential of variants that occur at homologous positions within paralogous human proteins as a means of identifying polymorphisms with likely phenotypic consequences. In order to investigate the potential of this approach, we have assembled a unique collection of known disease-causing variants from OMIM and the Human Genome Mutation Database (HGMD) and used them to identify and characterize pairs of sequence variants that occur at homologous positions within paralogous human proteins. Our analyses demonstrate that the locations of variants are correlated in paralogous proteins. Moreover, if one member of a variant-pair is disease-causing, its partner is likely to be disease-causing as well. Thus, information about variant-pairs can be used to identify potentially disease-causing variants, extend existing procedures for polymorphism prioritization, and provide a suite of candidates for further diagnostic and therapeutic purposes.
Author Summary
There exists a superabundance of human sequence variations. Testing every sequence variant for association with human disease is often infeasible, as studies must be very large—and hence expensive—to overcome the statistical penalties used to control for multiple tests. A common alternative is to assay only a subset of sequence variants for which there are prior reasons to believe they may be disease-causing. Sequence variants that change conserved amino acids, for example, are often disease-causing. As an adjunct to this approach, we have explored the potential of variants that occur at homologous positions within paralogous human proteins as a means of identifying disease-causing DNA sequence variations. We find that DNA sequence variants co-occur at aligned amino acid pairs more frequently than expected by chance, suggesting that similar functional constraints on paralogous proteins result in coordinated distributions of variants along their lengths. Moreover, if one member of a variant-pair is disease-causing, its partner is likely to be disease-causing as well. These facts provide new avenues for the identification of disease-causing sequence variations.
doi:10.1371/journal.pcbi.1000218
PMCID: PMC2565504  PMID: 18989397
20.  Improved repeat identification and masking in Dipterans 
Gene  2006;389(1):1-9.
doi:10.1016/j.gene.2006.09.011
PMCID: PMC1945102  PMID: 17137733
Heterochromatin; Drosophila; A. gambiae; PILER; transposable element; RepeatRunner
21.  Large-Scale Trends in the Evolution of Gene Structures within 11 Animal Genomes 
PLoS Computational Biology  2006;2(3):e15.
We have used the annotations of six animal genomes (Homo sapiens, Mus musculus, Ciona intestinalis, Drosophila melanogaster, Anopheles gambiae, and Caenorhabditis elegans) together with the sequences of five unannotated Drosophila genomes to survey changes in protein sequence and gene structure over a variety of timescales—from the less than 5 million years since the divergence of D. simulans and D. melanogaster to the more than 500 million years that have elapsed since the Cambrian explosion. To do so, we have developed a new open-source software library called CGL (for “Comparative Genomics Library”). Our results demonstrate that change in intron–exon structure is gradual, clock-like, and largely independent of coding-sequence evolution. This means that genome annotations can be used in new ways to inform, corroborate, and test conclusions drawn from comparative genomics analyses that are based upon protein and nucleotide sequence similarities.
Synopsis
Just as protein sequences change over time, so do gene structures. Over comparatively short evolutionary timescales, introns lengthen and shorten; and over longer timescales the number and positions of introns in homologous genes can change. These facts suggest that the intron–exon structures of genes may provide a source of evolutionary information. The utility of gene structures as materials for phylogenetic analyses, however, depends upon their independence from the forces driving protein evolution. If, for example, intron–exon structures are strongly influenced by selection at the amino acid level, then using them for phylogenetic investigations is largely pointless, as the same information could have been more easily gained from protein analyses. Using 11 animal genomes, Yandell et al. show that evolution of intron lengths and positions is largely—though not completely—independent of protein sequence evolution. This means that gene structures provide a source of information about the evolutionary past independent of protein sequence similarities—a finding the authors employ to investigate the accuracy of the protein clock and to explore the utility of gene structures as a means to resolve deep phylogenetic relationships within the animals.
doi:10.1371/journal.pcbi.0020015
PMCID: PMC1386723  PMID: 16518452
23.  The Sequence Ontology: a tool for the unification of genome annotations 
Genome Biology  2005;6(5):R44.
The goal of the Sequence Ontology (SO) project is to produce a structured controlled vocabulary with a common set of terms and definitions for parts of a genomic annotation, and to describe the relationships among them. Details of SO construction, design and use, particularly with regard to part-whole relationships, are discussed, and the practical utility of SO is demonstrated for a set of genome annotations from Drosophila melanogaster.
The Sequence Ontology (SO) is a structured controlled vocabulary for the parts of a genomic annotation. SO provides a common set of terms and definitions that will facilitate the exchange, analysis and management of genomic data. Because SO treats part-whole relationships rigorously, data described with it can become substrates for automated reasoning, and instances of sequence features described by the SO can be subjected to a group of logical operations termed extensional mereology operators.
doi:10.1186/gb-2005-6-5-r44
PMCID: PMC1175956  PMID: 15892872
24.  The Celera Discovery System™ 
Nucleic Acids Research  2002;30(1):129-136.
The Celera Discovery System™ (CDS) is a web-accessible research workbench for mining genomic and related biological information. Users have access to the human and mouse genome sequences with annotation presented in summary form in BioMolecule Reports for genes, transcripts and proteins. Over 40 additional databases are available, including sequence, mapping, mutation, genetic variation, mRNA expression, protein structure, motif and classification data. Data are accessible by browsing reports, through a variety of interactive graphical viewers, and by advanced query capability provided by the LION SRS™ search engine. A growing number of sequence analysis tools are available, including sequence similarity, pattern searching, multiple sequence alignment and Hidden Markov Model search. A user workspace keeps track of queries and analyses. CDS is widely used by the academic research community and requires a subscription for access. The system and academic pricing information are available at http://cds.celera.com.
PMCID: PMC99167  PMID: 11752274
25.  VAAST 2.0: Improved Variant Classification and Disease-Gene Identification Using a Conservation-Controlled Amino Acid Substitution Matrix 
Genetic Epidemiology  2013;37(6):622-634.
The need for improved algorithmic support for variant prioritization and disease-gene identification in personal genomes data is widely acknowledged. We previously presented the Variant Annotation, Analysis, and Search Tool (VAAST), which employs an aggregative variant association test that combines both amino acid substitution (AAS) and allele frequencies. Here we describe and benchmark VAAST 2.0, which uses a novel conservation-controlled AAS matrix (CASM), to incorporate information about phylogenetic conservation. We show that the CASM approach improves VAAST’s variant prioritization accuracy compared to its previous implementation, and compared to SIFT, PolyPhen-2, and MutationTaster. We also show that VAAST 2.0 outperforms KBAC, WSS, SKAT, and variable threshold (VT) using published case-control datasets for Crohn disease (NOD2), hypertriglyceridemia (LPL), and breast cancer (CHEK2). VAAST 2.0 also improves search accuracy on simulated datasets across a wide range of allele frequencies, population-attributable disease risks, and allelic heterogeneity, factors that compromise the accuracies of other aggregative variant association tests. We also demonstrate that, although most aggregative variant association tests are designed for common genetic diseases, these tests can be easily adopted as rare Mendelian disease-gene finders with a simple ranking-by-statistical-significance protocol, and the performance compares very favorably to state-of-the-art filtering approaches. The latter, despite their popularity, perform suboptimally, especially as case sample size increases.
doi:10.1002/gepi.21743
PMCID: PMC3791556  PMID: 23836555
disease-gene finder; variant classifier; aggregative association test; rare Mendelian disease; complex disease
