Understanding the molecular basis of phenotypic diversity is a critical challenge in biology, yet we know little about the mechanistic effects of different mutations and epistatic relationships among loci that contribute to complex traits. Pigmentation genetics offers a powerful model for identifying mutations underlying diversity, and for determining how additional complexity emerges from interactions among loci. Centuries of artificial selection in domestic rock pigeons have cultivated tremendous variation in plumage pigmentation through the combined effects of dozens of loci. The dominance and epistatic hierarchies of key loci governing this diversity are known through classical genetic studies [1-6], but their molecular identities and the mechanisms of their genetic interactions remain unknown. Here we identify protein-coding and cis-regulatory mutations in Tyrp1, Sox10, and Slc45a2 that underlie classical color phenotypes of pigeons, and present a mechanistic explanation of their dominance and epistatic relationships. We also find unanticipated allelic heterogeneity at Tyrp1 and Sox10, indicating that color variants evolved repeatedly through mutations in the same genes. These results demonstrate how a spectrum of coding and regulatory mutations in a small number of genes can interact to generate substantial phenotypic diversity in a classic Darwinian model of evolution.
Cone snails, genus Conus, are predatory marine snails that use venom to capture their prey. This venom contains a diverse array of peptide toxins, known as conotoxins, which undergo a diverse set of posttranslational modifications. Amidating enzymes modify peptides and proteins containing a C-terminal glycine residue, resulting in loss of the glycine residue and amidation of the preceding residue. A significant fraction of peptides present in the venom of cone snails contain C-terminal amidated residues, which are important for optimizing biological activity. This study describes the characterization of the amidating enzyme, peptidylglycine α-amidating monooxygenase (PAM), present in the venom duct of cone snails, Conus bullatus and Conus geographus.
PAM is known to carry out two functions, peptidyl α-hydroxylating monooxygenase (PHM) and peptidylamido-glycolate lyase (PAL). In some animals, such as Drosophila melanogaster, these two functions are present in separate polypeptides, working as individual enzymes. In other animals, such as mammals and Aplysia californica, PAM activity resides in a single, bifunctional polypeptide. Using specific oligonucleotide primers and reverse transcription-polymerase chain reaction, we have identified and cloned from the venom duct cDNA library a cDNA with 49% homology to PAM from A. californica. We have determined that both the PHM and PAL activities are encoded in one mRNA polynucleotide in both C. bullatus and C. geographus. We have directly demonstrated enzymatic activity catalyzing the conversion of dansyl-YVG-COOH to dansyl-YV-NH2 in cloned cDNA expressed in Drosophila S2 cells.
Posttranslational modification; Conotoxins; Peptidylglycine α-amidating monooxygenase
We hypothesized that genetic variation affects responsiveness to 17-alpha hydroxyprogesterone caproate (17P) for recurrent preterm birth prevention.
Women of European ancestry with ≥1 spontaneous singleton preterm birth at <34 weeks’ gestation who received 17P were recruited prospectively and classified as 17P responders or nonresponders by the difference in delivery gestational age between 17P-treated and -untreated pregnancies. Samples underwent whole exome sequencing. Coding variants were compared between responders and nonresponders with the use of the Variant Annotation, Analysis, and Search Tool (VAAST), which is a probabilistic search tool for the identification of disease-causing variants, and were compared with a Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway candidate gene list. Genes with the highest VAAST scores were then classified by the online Protein ANalysis THrough Evolutionary Relationships (PANTHER) system into known gene ontology molecular functions and biologic processes. Gene distributions within these classifications were compared with an online reference population to identify over- and under-represented gene sets.
Fifty women (9 nonresponders) were included. Responders delivered 9.2 weeks longer with 17P vs 1.3 weeks’ gestation for nonresponders (P < .001). A genome-wide search for genetic differences implicated NOS1 as the most likely associated gene among those on the KEGG candidate gene list (P < .00095). PANTHER analysis revealed several over-represented gene ontology categories that included cell adhesion, cell communication, signal transduction, nitric oxide signal transduction, and receptor activity (all with significant Bonferroni-corrected probability values).
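The Bonferroni correction mentioned for the PANTHER categories can be illustrated with a minimal sketch. This is a generic implementation of the correction, not the PANTHER code; the function names, the significance level of 0.05, and the p-values in the usage note are illustrative assumptions and are not data from the study.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag tests whose Bonferroni-adjusted p-value is at or below alpha.

    alpha=0.05 is an assumed default, not a value taken from the study.
    """
    return [p_adj <= alpha for p_adj in bonferroni_adjust(p_values)]


if __name__ == "__main__":
    # Hypothetical raw p-values for three gene ontology categories.
    raw = [0.001, 0.02, 0.2]
    print(bonferroni_adjust(raw))
    print(significant_after_bonferroni(raw))
```

With three tests, a raw p-value must be at or below 0.05 / 3 to remain significant, so only the first hypothetical category would pass.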
We identified sets of over-represented genes in key processes among responders to 17P, which is the first step in the application of pharmacogenomics to preterm birth prevention.
pharmacogenomics; progesterone; spontaneous preterm birth
High-throughput sequencing of related individuals has become an important tool for studying human disease. However, owing to technical complexity and lack of available tools, most pedigree-based sequencing studies rely on an ad hoc combination of suboptimal analyses. Here we present pedigree-VAAST (pVAAST), a disease-gene identification tool designed for high-throughput sequence data in pedigrees. pVAAST uses a sequence-based model to perform variant and gene-based linkage analysis. Linkage information is then combined with functional prediction and rare variant case-control association information in a unified statistical framework. pVAAST outperformed linkage and rare-variant association tests in simulations and identified disease-causing genes from whole-genome sequence data in three human pedigrees with dominant, recessive and de novo inheritance patterns. The approach is robust to incomplete penetrance and locus heterogeneity and is applicable to a wide variety of genetic traits. pVAAST maintains high power across studies of monogenic, high-penetrance phenotypes in a single pedigree to highly polygenic, common phenotypes involving hundreds of pedigrees.
The VAAST pipeline is specifically designed to identify disease-associated alleles in next-generation sequencing data. In the protocols presented in this paper, we outline the best practices for variant prioritization using VAAST. Examples and test data are provided for case-control, small pedigree, and large pedigree analyses. These protocols will teach users the fundamentals of VAAST, VAAST 2.0, and pVAAST analyses.
VAAST; rare-variant association test; variant classification; disease-gene identification; next-generation sequencing; genome-wide association studies; human disease; genomics; computational genomics; bioinformatics
Adult muscle’s exceptional capacity for regeneration is mediated by muscle stem cells, termed satellite cells. As with many stem cells, Wnt/β-catenin signaling has been proposed to be critical in satellite cells during regeneration. Using new genetic reagents, we explicitly test in vivo whether Wnt/β-catenin signaling is necessary and sufficient within satellite cells and their derivatives for regeneration. We find that signaling is transiently active in transit-amplifying myoblasts, but is not required for regeneration or satellite cell self-renewal. Instead, downregulation of transiently activated β-catenin is important to limit the regenerative response, as continuous regeneration is deleterious. Wnt/β-catenin activation in adult satellite cells may simply be a vestige of their developmental lineage, in which β-catenin signaling is critical for fetal myogenesis. In the adult, surprisingly, we show that it is not activation but rather silencing of Wnt/β-catenin signaling that is important for muscle regeneration.
•Wnt/β-catenin signaling is transiently active in myoblasts during muscle regeneration
•β-catenin is not required in myogenic cells for muscle regeneration
•β-catenin signaling in myoblasts must be silenced to limit the regenerative response
•β-catenin requirement and sensitivity differ in fetal and adult muscle stem cells
In this article, Kardon and colleagues show that Wnt/β-catenin signaling is transiently active in transit-amplifying myoblasts but nevertheless is not required for regeneration or satellite cell self-renewal. Instead, downregulation of transiently activated β-catenin is important to limit the regenerative response, as continuous regeneration is deleterious. Wnt/β-catenin activation in adult satellite cells may simply be a vestige of their developmental lineage.
Cellular senescence is a crucial tumor suppressor mechanism. We discovered a CAPERα/TBX3 repressor complex required to prevent senescence in primary cells and mouse embryos. Critical, previously unknown roles for CAPERα in controlling cell proliferation are manifest in an obligatory interaction with TBX3 to regulate chromatin structure and repress transcription of CDKN2A-p16INK and the RB pathway. The lncRNA UCA1 is a direct target of CAPERα/TBX3 repression whose overexpression is sufficient to induce senescence. In proliferating cells, we found that hnRNPA1 binds and destabilizes CDKN2A-p16INK mRNA whereas during senescence, UCA1 sequesters hnRNPA1 and thus stabilizes CDKN2A-p16INK. Thus CAPERα/TBX3 and UCA1 constitute a coordinated, reinforcing mechanism to regulate both CDKN2A-p16INK transcription and mRNA stability. Dissociation of the CAPERα/TBX3 co-repressor during oncogenic stress activates UCA1, revealing a novel mechanism for oncogene-induced senescence. Our elucidation of CAPERα and UCA1 functions in vivo provides new insights into senescence induction, and the oncogenic and developmental properties of TBX3.
Cell division and growth are essential for survival. But it is equally important that cells can stop dividing, because failing to do so can lead to the uncontrolled tumor growth seen in cancer. One such quality control mechanism is called senescence, which stops the growth and multiplication of cells that are old, damaged or behaving in ways that may harm the organism. All cells eventually stop dividing and undergo senescence, but a number of factors may trigger the process early, such as DNA damage, stress or the appearance of cancer-causing proteins.
Senescence can be harmful if it occurs too early in life and interferes with normal growth. Severe birth defects—including fatal heart problems and limb malformations—occur if senescence is inappropriately triggered early in development. Mutations in a gene encoding a protein called TBX3 have been linked to these severe birth defects.
Normally, TBX3 stops the production of other proteins that trigger senescence in early development, and helps to maintain stable conditions in adult cells. Understanding how it does so could help scientists understand normal cell function and aging, and also help to find ways to trigger senescence in cancerous cells.
Kumar et al. found that a protein called CAPERα—for short Coactivator of AP1 and Estrogen Receptor—forms a complex with TBX3 that stops cells dividing in living organisms in at least two different ways. One way is by altering how DNA is folded. The other way involves a non-coding strand of RNA from a gene called UCA1: this RNA prevents the degradation of proteins that stop cell division.
In normal proliferating cells, the CAPERα/TBX3 protein complex prevents the production of UCA1 RNA. In contrast, in cells that received a cancer-causing stimulus, TBX3 and CAPERα physically separate: this activates production of UCA1 RNA and causes senescence. Further studies will be required to establish exactly how the CAPERα/TBX3 protein complex interacts with DNA and RNA to control senescence and prevent cancer.
senescence; oncogenesis; development; p16; mouse
Medicago truncatula, a close relative of alfalfa, is a preeminent model for studying nitrogen fixation, symbiosis, and legume genomics. The Medicago sequencing project began in 2003 with the goal of deciphering sequences originating from the euchromatic portion of the genome. The initial sequencing approach was based on a BAC tiling path, culminating in a BAC-based assembly (Mt3.5) as well as an in-depth analysis of the genome published in 2011.
Here we describe a further improved and refined version of the M. truncatula genome (Mt4.0) based on de novo whole genome shotgun assembly of a majority of Illumina and 454 reads using ALLPATHS-LG. The ALLPATHS-LG scaffolds were anchored onto the pseudomolecules on the basis of alignments to both the optical map and the genotyping-by-sequencing (GBS) map. The Mt4.0 pseudomolecules encompass ~360 Mb of actual sequences spanning 390 Mb, of which ~330 Mb align perfectly with the optical map, presenting a drastic improvement over the BAC-based Mt3.5, which contained only 70% (~250 Mb) of the sequences in the current version. Most of the sequences and genes that previously resided on the unanchored portion of Mt3.5 have now been incorporated into the Mt4.0 pseudomolecules, with the exception of ~28 Mb of unplaced sequences. With regard to gene annotation, the genome has been re-annotated through our gene prediction pipeline, which integrates EST, RNA-seq, protein and gene prediction evidence. A total of 50,894 genes (31,661 high confidence and 19,233 low confidence) are included in Mt4.0, which overlap with ~82% of the gene loci annotated in Mt3.5. Of the remaining genes, 14% of the Mt3.5 genes have been deprecated to an “unsupported” status and 4% are absent from the Mt4.0 predictions.
Mt4.0 and its associated resources, such as genome browsers, BLAST-able datasets and gene information pages, can be found on the JCVI Medicago web site (http://www.jcvi.org/medicago). The assembly and annotation have been deposited in GenBank (BioProject: PRJNA10791). The heavily curated chromosomal sequences and associated gene models of Medicago will serve as a better reference for legume biology and comparative genomics.
Medicago; Legume; Genome assembly; Gene annotation; Optical map
TBX3 is a member of the T-box family of transcription factors with critical roles in development, oncogenesis, cell fate, and tissue homeostasis. TBX3 mutations in humans cause complex congenital malformations and Ulnar-mammary syndrome. Previous investigations into TBX3 function focused on its activity as a transcriptional repressor. We used an unbiased proteomic approach to identify TBX3 interacting proteins in vivo and discovered that TBX3 interacts with multiple mRNA splicing factors and RNA metabolic proteins. We discovered that TBX3 regulates alternative splicing in vivo and can promote or inhibit splicing depending on context and transcript. TBX3 associates with alternatively spliced mRNAs and binds RNA directly. TBX3 binds RNAs containing TBX binding motifs, and these motifs are required for regulation of splicing. Our study reveals that TBX3 mutations seen in humans with UMS disrupt its splicing regulatory function. The pleiotropic effects of TBX3 mutations in humans and mice likely result from disrupting at least two molecular functions of this protein: transcriptional regulation and pre-mRNA splicing.
TBX3 is a protein with essential roles in development and tissue homeostasis, and is implicated in cancer pathogenesis. TBX3 mutations in humans cause a complex of birth defects called Ulnar-mammary syndrome (UMS). Despite the importance of TBX3 and decades of investigation, few TBX3 partner proteins have been identified and little is known about how it functions in cells. Unlike previous investigations focused on TBX3 as a DNA-binding factor that represses transcription, we took an unbiased approach to identify TBX3 partner proteins in mouse embryos and human cells. We discovered that TBX3 interacts with RNA binding proteins and binds mRNAs to regulate how they are spliced. The different mutations seen in human UMS patients produce mutant proteins that interact with different partners and have different splicing activities. TBX3 promotes or inhibits splicing depending on cellular context, its partner proteins, and the target mRNA. Eukaryotic cells have many more proteins than genes: alternative splicing is critical to generate the different mRNAs needed for production of the specific and vast repertoire of proteins a cell produces. Our finding that TBX3 regulates this process provides fundamental new insights into how altered quantity and molecular function of TBX3 contribute to human developmental disorders and cancer.
There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance.
A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization.
The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.
The size and complexity of conifer genomes have, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination.
We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome.
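The N50 scaffold size quoted above is a standard contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name and the toy lengths in the usage note are illustrative assumptions, not values from the loblolly pine assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point division of the total.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy scaffold lengths; the total is 54, so half is 27.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

In the toy example the three longest scaffolds (10, 9 and 8) sum to 27, which is half of the total, so the N50 is 8.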
In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied.
Nonsense-mediated messenger RNA (mRNA) decay (NMD) is an mRNA degradation pathway that regulates a significant portion of the transcriptome. The expression levels of numerous genes are known to be altered in NMD mutants, but it is not known which of these transcripts are direct pathway targets. Here, we present the first genome-wide analysis of direct NMD targeting in an intact animal. By using rapid reactivation of the NMD pathway in a Drosophila melanogaster NMD mutant and globally monitoring changes in mRNA expression levels, we can distinguish between primary and secondary effects of NMD on gene expression. Using this procedure, we identified 168 candidate direct NMD targets in vivo. Remarkably, we found that 81% of direct target genes do not show increased expression levels in an NMD mutant, presumably due to feedback regulation. Because most previous studies have used up-regulation of mRNA expression as the only means to identify NMD-regulated transcripts, our results provide new directions for understanding the roles of the NMD pathway in endogenous gene regulation during animal development and physiology. For instance, we show clearly that direct target genes have longer 3′ untranslated regions compared with nontargets, suggesting long 3′ untranslated regions target mRNAs for NMD in vivo. In addition, we investigated the role of NMD in suppressing transcriptional noise and found that although the transposable element Copia is up-regulated in NMD mutants, this effect appears to be indirect.
Upf2; reactivation; NMD; Drosophila; RNA-seq
It was a zoological sensation when a living specimen of the coelacanth was first discovered in 1938, as this lineage of lobe-finned fish was thought to have gone extinct 70 million years ago. The modern coelacanth looks remarkably similar to many of its ancient relatives, and its evolutionary proximity to our own fish ancestors provides a glimpse of the fish that first walked on land. Here we report the genome sequence of the African coelacanth, Latimeria chalumnae. Through a phylogenomic analysis, we conclude that the lungfish, and not the coelacanth, is the closest living relative of tetrapods. Coelacanth protein-coding genes are significantly more slowly evolving than those of tetrapods, unlike other genomic features. Analyses of changes in genes and regulatory elements during the vertebrate adaptation to land highlight genes involved in immunity, nitrogen excretion and the development of fins, tail, ear, eye, brain, and olfaction. Functional assays of enhancers involved in the fin-to-limb transition and in the emergence of extra-embryonic tissues demonstrate the importance of the coelacanth genome as a blueprint for understanding tetrapod evolution.
Lampreys are representatives of an ancient vertebrate lineage that diverged from our own ~500 million years ago. By virtue of this deeply shared ancestry, the sea lamprey (P. marinus) genome is uniquely poised to provide insight into the ancestry of vertebrate genomes and the underlying principles of vertebrate biology. Here, we present the first lamprey whole-genome sequence and assembly. We note challenges faced owing to its high content of repetitive elements and GC bases, as well as the absence of broad-scale sequence information from closely related species. Analyses of the assembly indicate that two whole-genome duplications likely occurred before the divergence of ancestral lamprey and gnathostome lineages. Moreover, the results help define key evolutionary events within vertebrate lineages, including the origin of myelin-associated proteins and the development of appendages. The lamprey genome provides an important resource for reconstructing vertebrate origins and the evolutionary events that have shaped the genomes of extant organisms.
The geographic origins of breeds and genetic basis of variation within the widely distributed and phenotypically diverse domestic rock pigeon (Columba livia) remain largely unknown. We generated a rock pigeon reference genome and additional genome sequences representing domestic and feral populations. We find evidence for the origins of major breed groups in the Middle East, and contributions from a racing breed to North American feral populations. We identify EphB2 as a strong candidate for the derived head crest phenotype shared by numerous breeds, an important trait in mate selection in many avian species. We also find evidence that this trait evolved just once and spread throughout the species, and that the crest originates early in development by the localized molecular reversal of feather bud polarity.
ImagePlane is a modular pipeline for automated, high-throughput image analysis and information extraction. Designed to support planarian research, ImagePlane offers a self-parameterizing adaptive thresholding algorithm; an algorithm that can automatically segment animals into anterior–posterior/left–right quadrants for automated identification of region-specific differences in gene and protein expression; and a novel algorithm for quantification of morphology of animals, independent of their orientations and sizes. ImagePlane also provides methods for automatic report generation, and its outputs can be easily imported into third-party tools such as R and Excel. Here we demonstrate the pipeline's utility for identification of genes involved in stem cell proliferation in the planarian Schmidtea mediterranea. Although designed to support planarian studies, ImagePlane will prove useful for cell-based studies as well.
biology; functional genomics; genomics
Sacred lotus is a basal eudicot with agricultural, medicinal, cultural and religious importance. It was domesticated in Asia about 7,000 years ago, and cultivated for its rhizomes and seeds as a food crop. It is particularly noted for its 1,300-year seed longevity and exceptional water repellency, known as the lotus effect. The latter property is due to the nanoscopic closely packed protuberances of its self-cleaning leaf surface, which have been adapted for the manufacture of a self-cleaning industrial paint, Lotusan.
The genome of the China Antique variety of the sacred lotus was sequenced with Illumina and 454 technologies, at respective depths of 101× and 5.2×. The final assembly has a contig N50 of 38.8 kbp and a scaffold N50 of 3.4 Mbp, and covers 86.5% of the estimated 929 Mbp total genome size. The genome notably lacks the paleo-triplication observed in other eudicots, but reveals a lineage-specific duplication. The genome has evidence of slow evolution, with a 30% slower nucleotide mutation rate than observed in grape. Comparisons of the available sequenced genomes suggest a minimum gene set for vascular plants of 4,223 genes. Strikingly, the sacred lotus has 16 COG2132 multi-copper oxidase family proteins with root-specific expression; these are involved in root meristem phosphate starvation, reflecting adaptation to limited nutrient availability in an aquatic environment.
The slow nucleotide substitution rate makes the sacred lotus a better resource than the current standard, grape, for reconstructing the pan-eudicot genome, and should therefore accelerate comparative analysis between eudicots and monocots.
Advances in vertebrate genomics have uncovered thousands of loci encoding long noncoding RNAs (lncRNAs). While progress has been made in elucidating the regulatory functions of lncRNAs, little is known about their origins and evolution. Here we explore the contribution of transposable elements (TEs) to the makeup and regulation of lncRNAs in human, mouse, and zebrafish. Surprisingly, TEs occur in more than two thirds of mature lncRNA transcripts and account for a substantial portion of total lncRNA sequence (∼30% in human), whereas they seldom occur in protein-coding transcripts. While TEs contribute less to lncRNA exons than expected, several TE families are strongly enriched in lncRNAs. There is also substantial interspecific variation in the coverage and types of TEs embedded in lncRNAs, partially reflecting differences in the TE landscapes of the genomes surveyed. In human, TE sequences in lncRNAs evolve under greater evolutionary constraint than their non–TE sequences, than their intronic TEs, or than random DNA. Consistent with functional constraint, we found that TEs contribute signals essential for the biogenesis of many lncRNAs, including ∼30,000 unique sites for transcription initiation, splicing, or polyadenylation in human. In addition, we identified ∼35,000 TEs marked as open chromatin located within 10 kb upstream of lncRNA genes. The density of these marks in one cell type correlates with elevated expression of the downstream lncRNA in the same cell type, suggesting that these TEs contribute to cis-regulation. These global trends are recapitulated in several lncRNAs with established functions. Finally, a subset of TEs embedded in lncRNAs are subject to RNA editing and predicted to form secondary structures likely important for function. In conclusion, TEs are nearly ubiquitous in lncRNAs and have played an important role in the lineage-specific diversification of vertebrate lncRNA repertoires.
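The share of lncRNA sequence contributed by TEs, as reported above, is in essence an interval-overlap calculation between exon coordinates and TE annotations. The sketch below shows one generic way to compute it and is not the authors' pipeline; the function names, the half-open (start, end) coordinate convention, and the toy coordinates are all assumptions made for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def te_fraction(exons, tes):
    """Fraction of exonic bases that fall inside any TE annotation.

    Both inputs are lists of half-open (start, end) pairs on the same
    sequence. Intervals are merged first so no base is counted twice.
    """
    exons = merge_intervals(exons)
    tes = merge_intervals(tes)
    exon_bases = sum(end - start for start, end in exons)
    if exon_bases == 0:
        return 0.0
    covered = 0
    for e_start, e_end in exons:
        for t_start, t_end in tes:
            # Overlap length is zero when the two intervals are disjoint.
            covered += max(0, min(e_end, t_end) - max(e_start, t_start))
    return covered / exon_bases


if __name__ == "__main__":
    # Hypothetical transcript with two 100-base exons and two overlapping TEs.
    exons = [(0, 100), (200, 300)]
    tes = [(50, 120), (110, 250)]
    print(te_fraction(exons, tes))
```

The nested loop is quadratic in the number of intervals, which is adequate for a single transcript; a genome-wide analysis would more plausibly use a sweep over sorted intervals or a dedicated interval tool.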
An unexpected layer of complexity in the genomes of humans and other vertebrates lies in the abundance of genes that do not appear to encode proteins but produce a variety of non-coding RNAs. In particular, the human genome is currently predicted to contain 5,000–10,000 independent gene units generating long (>200 nucleotides) noncoding RNAs (lncRNAs). While there is growing evidence that a large fraction of these lncRNAs have cellular functions, notably to regulate protein-coding gene expression, almost nothing is known about the processes underlying the evolutionary origins and diversification of lncRNA genes. Here we show that transposable elements, through their capacity to move and spread in genomes in a lineage-specific fashion, as well as their ability to introduce regulatory sequences upon chromosomal insertion, represent a major force shaping the lncRNA repertoire of humans, mice, and zebrafish. Not only do TEs make up a substantial fraction of mature lncRNA transcripts, they are also enriched in the vicinity of lncRNA genes, where they frequently contribute to their transcriptional regulation. Through specific examples we provide evidence that some TE sequences embedded in lncRNAs are critical for the biogenesis of lncRNAs and likely important for their function.
Understanding how sequence variants within healthy genomes are distributed with respect to ethnicity and disease-implicated genes is an essential first step toward establishing baselines for personalized genomic medicine.
In this study, we present an analysis of 10 genomes from healthy individuals of various ethnicities, produced using six different sequencing technologies. In total, these genomes contain more than 34 million single-nucleotide variants.
We have analyzed these variants from a clinical perspective, assaying the influence of sequencing technology and ethnicity on prognosis. We have also examined the utility of OMIM and the disease-gene literature for determining the impact of rare, personal variants on an individual’s health.
Our analyses demonstrate that clinical prognoses are complicated by sequencing platform-specific errors and ethnicity. We show that disease-causing alleles are globally distributed along ethnic lines, with alleles known to be disease causing in Eurasians being significantly more likely to be homozygous in Africans.
personal genomes; genome analysis; personalized genomics
Exome sequencing has identified the causes of several Mendelian diseases, although it has rarely been used in a clinical setting to diagnose the genetic cause of an idiopathic disorder in a single patient. We performed exome sequencing on a pedigree with several members affected with attention deficit/hyperactivity disorder (ADHD), in an effort to identify candidate variants predisposing to this complex disease. While we did identify some rare variants that might predispose to ADHD, we have not yet proven the causality for any of them. However, over the course of the study, one subject was discovered to have idiopathic hemolytic anemia (IHA), which was suspected to be genetic in origin. Analysis of this subject’s exome readily identified two rare non-synonymous mutations in the PKLR gene as the most likely cause of the IHA, although these two mutations had not been documented before in a single individual. We further confirmed the deficiency by functional biochemical testing, consistent with a diagnosis of red blood cell pyruvate kinase deficiency. Our study implies that exome and genome sequencing will certainly reveal additional rare variation causative for even well-studied classical Mendelian diseases, while also revealing variants that might play a role in complex diseases. Furthermore, our study has clinical and ethical implications for exome and genome sequencing in a research setting; how to handle unrelated findings of clinical significance, in the context of originally planned complex disease research, remains a largely uncharted area for clinicians and researchers.
Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. Initial success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes; in addition to automatic annotation, some of these were manually inspected to predict the biochemical repertoire of this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided.
The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing academic community focused on this genus.
Algae are a highly diverse group of organisms that have become the focus of renewed interest due to their potential for producing biofuel feedstocks, nutraceuticals, and biomaterials. Their high photosynthetic yields and ability to grow in areas unsuitable for agriculture provide a potential sustainable alternative to using traditional agricultural crops for biofuels. Because none of the algae currently in use have a history of domestication, and bioengineering of algae is still in its infancy, there is a need to develop algal strains adapted to cultivation for industrial large-scale production of desired compounds. Model organisms ranging from mice to baker's yeast have been instrumental in providing insights into fundamental biological structures and functions. The algal field needs versatile models to develop a fundamental understanding of photosynthetic production of biomass and valuable compounds in unicellular, marine, oleaginous algal species. To contribute to the development of such an algal model system for basic discovery, we sequenced the genome and two sets of transcriptomes of N. oceanica CCMP1779, assembled the genomic sequence, identified putative genes, and began to interpret the function of selected genes. This species was chosen because it is readily transformable with foreign DNA and grows well in culture.
The fish-hunting cone snail, Conus geographus, is the deadliest snail on earth. In the absence of medical intervention, 70% of human stinging cases are fatal. Although its venom is known to consist of a cocktail of small peptides targeting different ion channels and receptors, the bulk of its venom constituents, their sites of manufacture, relative abundances and how they function collectively in envenomation have remained unknown.
We have used transcriptome sequencing to systematically elucidate the contents of the C. geographus venom duct, dividing it into four segments in order to investigate each segment’s mRNA contents. Three different types of calcium channel (each targeted by unrelated, entirely distinct venom peptides) and at least two different nicotinic receptors appear to be targeted by the venom. Moreover, the most highly expressed venom component is not paralytic, but causes sensory disorientation and is expressed in a different segment of the venom duct from venoms believed to cause sensory disruption. We have also identified several new toxins of interest for pharmaceutical and neuroscience research.
Conus geographus is believed to prey on fish hiding in reef crevices at night. Our data suggest that disorientation of prey is central to its envenomation strategy. Furthermore, venom expression profiles also suggest a sophisticated layering of venom-expression patterns within the venom duct, with disorientating and paralytic venoms expressed in different regions. Thus, our transcriptome analysis provides a new physiological framework for understanding the molecular envenomation strategy of this deadly snail.
Conus geographus; Conotoxins; RNA-seq; Venom duct compartmentalization
Second-generation sequencing technologies are precipitating major shifts with regard to what kinds of genomes are being sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation, because unlike first-generation projects, there are no pre-existing 'gold-standard' gene-models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects are thus in need of new genome annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies.
We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training-data are limited, of low quality or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality; and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review.
MAKER2 is the first annotation engine specifically designed for second-generation genome projects. MAKER2 scales to datasets of any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.
Leaf-cutter ants are one of the most important herbivorous insects in the Neotropics, harvesting vast quantities of fresh leaf material. The ants use leaves to cultivate a fungus that serves as the colony's primary food source. This obligate ant-fungus mutualism is one of the few occurrences of farming by non-humans and likely facilitated the formation of their massive colonies. Mature leaf-cutter ant colonies contain millions of workers ranging in size from small garden tenders to large soldiers, resulting in one of the most complex polymorphic caste systems within ants. To begin uncovering the genomic underpinnings of this system, we sequenced the genome of Atta cephalotes using 454 pyrosequencing. One prediction from this ant's lifestyle is that it has undergone genetic modifications that reflect its obligate dependence on the fungus for nutrients. Analysis of this genome sequence is consistent with this hypothesis, as we find evidence for reductions in genes related to nutrient acquisition. These include extensive reductions in serine proteases (which are likely unnecessary because proteolysis is not a primary mechanism used to process nutrients obtained from the fungus), a loss of genes involved in arginine biosynthesis (suggesting that this amino acid is obtained from the fungus), and the absence of a hexamerin (which sequesters amino acids during larval development in other insects). Following recent reports of genome sequences from other insects that engage in symbioses with beneficial microbes, the A. cephalotes genome provides new insights into the symbiotic lifestyle of this ant and advances our understanding of host–microbe symbioses.
Leaf-cutter ant workers forage for and cut leaves that they use to support the growth of a specialized fungus, which serves as the colony's primary food source. The ability of these ants to grow their own food likely facilitated their emergence as one of the most dominant herbivores in New World tropical ecosystems, where leaf-cutter ants harvest more plant biomass than any other herbivore species. These ants have also evolved one of the most complex forms of division of labor, with colonies composed of different-sized workers specialized for different tasks. To gain insight into the biology of these ants, we sequenced the first genome of a leaf-cutter ant, Atta cephalotes. Our analysis of this genome reveals characteristics reflecting the obligate nutritional dependency of these ants on their fungus. These findings represent the first genetic evidence of a reduced capacity for nutrient acquisition in leaf-cutter ants, which is likely compensated for by their fungal symbiont. These findings parallel other nutritional host–microbe symbioses, suggesting convergent genomic modifications in these types of associations.