StellaBase, the Nematostella vectensis Genomics Database, is a web-based resource that will facilitate desktop and bench-top studies of the starlet sea anemone. Nematostella is an emerging model organism that has already proven useful for addressing fundamental questions in developmental evolution and evolutionary genomics. StellaBase allows users to query the assembled Nematostella genome, a confirmed gene library, and a predicted genome using both keyword and homology-based search functions. Data provided by these searches will elucidate gene family evolution in early animals. Unique research tools, including a Nematostella genetic stock library, a primer library, a literature repository, and a gene expression library, will provide support to the burgeoning Nematostella research community. The development of StellaBase accompanies significant upgrades to CnidBase, the Cnidarian Evolutionary Genomics Database. With the completion of the first sequenced cnidarian genome, genome comparison tools have been added to CnidBase. In addition, StellaBase provides a framework for the integration of additional species-specific databases into CnidBase. StellaBase is available at .
Color markings among felid species display both a remarkable diversity and a common underlying periodicity. A similar range of patterns in domestic cats suggests a conserved mechanism whose appearance can be altered by selection. We identified the gene responsible for tabby pattern variation in domestic cats as Transmembrane aminopeptidase Q (Taqpep), which encodes a membrane-bound metalloprotease. Analyzing 31 other felid species, we identified Taqpep as the cause of the rare king cheetah phenotype, in which spots coalesce into blotches and stripes. Histologic, genomic expression, and transgenic mouse studies indicate that paracrine expression of Endothelin3 (Edn3) coordinates localized color differences. We propose a two-stage model in which Taqpep helps to establish a periodic pre-pattern during skin development that is later implemented by differential expression of Edn3.
The current genetic and recombination maps of the cat have fewer than 3,000 markers and a resolution limit greater than 1 Mb. To complement the first-generation domestic cat maps, support higher-resolution mapping studies, and aid genome assembly in specific areas as well as in the whole genome, a 15,000Rad radiation hybrid (RH) panel for the domestic cat was generated. Fibroblasts from the female Abyssinian cat that was used to generate the cat genomic sequence were fused to a Chinese hamster cell line (A23), producing 150 hybrid lines. The clones were initially characterized using 39 STR and 1536 SNP markers. The utility of whole genome amplification (WGA) in preserving and extending RH panel DNA was also tested using ten STR markers; no significant difference in retention was observed. The resolution of the 15,000Rad RH panel was established by constructing framework maps across ten different 1 Mb regions on different feline chromosomes. In these regions, two-point analysis was used to estimate RH distances, which compared favorably with the estimated physical distances. The study demonstrates that the 15,000Rad RH panel constitutes a powerful tool for constructing high-resolution maps, having an average resolution of 40.1 kb per marker across the ten 1 Mb regions. In addition, the RH panel will complement existing genomic resources for the domestic cat, aid in the accurate reassemblies of the forthcoming cat genomic sequence, and support cross-species genomic comparisons.
Endometrial cancer is the 6th most commonly diagnosed cancer among women worldwide, causing ~74,000 deaths annually 1. Serous endometrial cancers are a clinically aggressive subtype with a poorly defined genetic etiology 2-4. We used whole exome sequencing (WES) to comprehensively search for somatic mutations within ~22,000 protein-encoding genes among 13 primary serous endometrial tumors. We subsequently resequenced 18 genes that were mutated in more than one tumor, and/or were genes that formed an enriched functional grouping, from 40 additional serous tumors. We identified high frequencies of somatic mutations in CHD4 (17%), EP300 (8%), ARID1A (6%), TSPYL2 (6%), FBXW7 (29%), SPOP (8%), MAP3K4 (6%) and ABCC9 (6%). Overall, 36.5% of serous tumors had mutated a chromatin-remodeling gene and 35% had mutated a ubiquitin ligase complex gene, implicating the frequent mutational disruption of these processes in the molecular pathogenesis of one of the deadliest forms of endometrial cancer.
In this study we assess exome sequencing (ES) as a diagnostic alternative for genetically heterogeneous disorders. Since ES readily identified a previously reported homozygous mutation in the CAPN3 gene for an individual with an undiagnosed limb girdle muscular dystrophy, we evaluated ES as a generalizable clinical diagnostic tool by assessing the targeting efficiency and sequencing-coverage of 88 genes associated with muscle disease (MD) and spastic paraplegia (SPG). We used three exome-capture kits on 125 individuals. Exons constituting each gene were defined using the UCSC and CCDS databases. The three exome-capture kits targeted 47–92% of bases within the UCSC-defined exons, and 97%–99% of bases within the CCDS-defined exons. An average of 61.2–99.5% and 19.1–99.5% of targeted bases per gene were sequenced to 20X coverage within the CCDS-defined MD and SPG coding exons, respectively. Greater than 95–99% of targeted known mutation positions were sequenced to ≥1X coverage and 55–87% to ≥20X coverage in every exome. We conclude therefore that ES is a rapid and efficient first tier method to screen for mutations, particularly within the CCDS annotated exons, although its application requires disclosure of the extent of coverage for each targeted gene and supplementation with second tier Sanger sequencing for full coverage.
CAPN3; exome; LGMD; HSP; neuromuscular disorders; clinical genetic testing
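The per-gene coverage figures quoted above (fraction of targeted bases sequenced to ≥1X or ≥20X) can be illustrated with a minimal Python sketch. It assumes per-base read depths are already available as plain lists; the gene names and depth values are hypothetical and are not taken from the study.

```python
def fraction_covered(depths, min_depth=20):
    """Return the fraction of positions whose read depth is >= min_depth."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d >= min_depth) / len(depths)


# Hypothetical per-base depths over the targeted bases of two genes.
per_gene_depths = {
    "GENE_A": [35, 40, 22, 19, 5, 60, 48, 20],
    "GENE_B": [0, 3, 25, 30],
}

# For each gene, report (fraction at >=1X, fraction at >=20X).
report = {
    gene: (fraction_covered(depths, 1), fraction_covered(depths, 20))
    for gene, depths in per_gene_depths.items()
}

for gene, (at_1x, at_20x) in report.items():
    print(f"{gene}: >=1X {at_1x:.0%}, >=20X {at_20x:.0%}")
```

Reporting both thresholds per gene mirrors the disclosure the abstract recommends: a gene can be almost fully covered at 1X yet fall well short at 20X.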
Large data sets on human genetic variation have been collected recently, but their usefulness for learning about history and natural selection has been limited by biases in the ways polymorphisms were chosen. We report large subsets of SNPs from the International HapMap Project1,2 that allow us to overcome these biases and to provide accurate measurement of a quantity of crucial importance for understanding genetic variation: the allele frequency spectrum. Our analysis shows that East Asian and northern European ancestors shared the same population bottleneck expanding out of Africa but that both also experienced more recent genetic drift, which was greater in East Asians.
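As an illustration of the allele frequency spectrum mentioned above, the following Python sketch tallies how many polymorphic sites carry each allele count in a sample. It assumes the count of the derived (or otherwise designated) allele is known at every site; the input counts are hypothetical, and this is not the authors' pipeline.

```python
from collections import Counter


def allele_frequency_spectrum(allele_counts, n_chromosomes):
    """Tally sites by allele count, for counts 1 .. n_chromosomes - 1.

    Sites where the allele is absent (0) or fixed (n_chromosomes) are not
    polymorphic in the sample and are skipped.
    """
    tally = Counter(c for c in allele_counts if 0 < c < n_chromosomes)
    return [tally.get(k, 0) for k in range(1, n_chromosomes)]


# Hypothetical allele counts at nine sites in a sample of six chromosomes.
counts = [1, 1, 2, 5, 1, 3, 0, 6, 2]
spectrum = allele_frequency_spectrum(counts, 6)
print(spectrum)
```

Each entry of `spectrum` is the number of sites at which the allele appears on exactly that many chromosomes, which is the quantity that ascertainment bias distorts when SNPs are chosen non-randomly.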
The utility of induced pluripotent stem cells (iPSCs) as models to study diseases and as sources for cell therapy depends on the integrity of their genomes. Despite recent publications of DNA sequence variations in the iPSCs, the true scope of such changes for the entire genome is not clear. Here we report the whole-genome sequencing of three human iPSC lines derived from two cell types of an adult donor by episomal vectors. The vector sequence was undetectable in the deeply sequenced iPSC lines. We identified 1058–1808 heterozygous single nucleotide variants (SNVs), but no copy number variants, in each iPSC line. Six to twelve of these SNVs were within coding regions in each iPSC line, but ~50% of them are synonymous changes and the remaining are not selectively enriched for known genes associated with cancers. Our data thus suggest that episome-mediated reprogramming is not inherently mutagenic during integration-free iPSC induction.
Human iPS cells; Reprogramming; Episomal vectors; Integration-free; Genetic mutations; Whole Genome Sequencing
Summary: VarSifter is a graphical software tool for desktop computers that allows investigators of varying computational skills to easily and quickly sort, filter, and sift through sequence variation data. A variety of filters and a custom query framework allow filtering based on any combination of sample and annotation information. By simplifying visualization and analyses of exome-scale sequence variation data, this program will help bring the power and promise of massively-parallel DNA sequencing to a broader group of researchers.
Availability and Implementation: VarSifter is written in Java, and is freely available in source and binary versions, along with a User Guide, at http://research.nhgri.nih.gov/software/VarSifter/.
Supplementary Information: Additional figures and methods available online at the journal's website.
Domestic cats have a unique breeding history and can be used as models for human hereditary and infectious diseases. In the current era of genome-wide association studies, insights regarding linkage disequilibrium (LD) are essential for efficient association studies. The objective of this study is to investigate the extent of LD in the domestic cat, Felis silvestris catus, particularly within its breeds. A custom Illumina GoldenGate Assay consisting of 1536 single nucleotide polymorphisms (SNPs) equally divided over ten 1 Mb chromosomal regions was developed and genotyped across 18 globally recognized cat breeds and two distinct random bred populations. The pair-wise LD descriptive measure (r2) was calculated between the SNPs in each region and within each population independently. LD decay was estimated by fitting all pair-wise estimates as a function of distance by non-linear least squares using established models. The point of 50% decay of r2 was used to compare the extent of LD between breeds. The longest extent of LD was observed in the Burmese breed, where the distance at which r2 ≈ 0.25 was ∼380 kb, comparable to several horse and dog breeds. The shortest extent of LD was found in the Siberian breed, with an r2 ≈ 0.25 at approximately 17 kb, comparable to random bred cats and human populations. A comprehensive haplotype analysis was also conducted. The haplotype structure of each region within each breed mirrored the LD estimates. The LD of cat breeds largely reflects the breeds’ population history and breeding strategies. Understanding LD in diverse populations will contribute to an efficient use of the newly developed SNP array for the cat in the design of genome-wide association studies, as well as to the interpretation of results for the fine mapping of disease and phenotypic traits.
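The pair-wise r2 measure used above can be sketched in Python as follows. This is only the textbook definition, r2 = D² / (pA(1 − pA) pB(1 − pB)), computed under the assumption that phased haplotypes with alleles coded 0/1 are available; the study itself worked from genotype data, and the example haplotypes are hypothetical.

```python
def r_squared(haplotypes):
    """Pairwise LD r^2 for two biallelic SNPs.

    haplotypes: list of (a, b) tuples, one per chromosome, alleles coded 0/1.
    Returns NaN when either SNP is monomorphic, where r^2 is undefined.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b                              # LD coefficient D
    return d * d / denom


# Hypothetical examples: perfectly correlated SNPs, then independent SNPs.
complete_ld = r_squared([(1, 1), (1, 1), (0, 0), (0, 0)])
no_ld = r_squared([(1, 1), (1, 0), (0, 1), (0, 0)])
print(complete_ld, no_ld)
```

Computing this value for every SNP pair in a region and plotting it against physical distance gives the decay curve to which a model is then fitted.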
Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours1–4, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.
Recent advances in sequencing technology have led to a rapid accumulation of mitochondrial DNA (mtDNA) sequences, which now represent the wide spectrum of animal diversity. However, one animal phylum – Ctenophora – has, to date, remained completely unsampled. Ctenophores, a small group of marine animals, are of interest due to their unusual biology, controversial phylogenetic position, and devastating impact as an invasive species. Using data from the Mnemiopsis leidyi genome sequencing project, we PCR amplified and analyzed its complete mitochondrial (mt-) genome. At just over 10kb, the mt-genome of M. leidyi is the smallest animal mtDNA ever reported and is among the most derived. It has lost at least 25 genes, including atp6 and all tRNA genes. We show that atp6 has been relocated to the nuclear genome and has acquired introns and a mitochondrial targeting presequence, while tRNA genes have been genuinely lost, along with nuclear-encoded mt-aminoacyl tRNA synthetases. The mt-genome of M. leidyi also displays extremely high rates of sequence evolution, which likely led to the degeneration of both protein and rRNA genes. In particular, encoded rRNA molecules possess little similarity with their homologues in other organisms and have highly reduced secondary structures. At the same time, nuclear encoded mt-ribosomal proteins have undergone expansions, probably to compensate for the reductions in mt-rRNA. The unusual features identified in M. leidyi mtDNA make this organism an interesting system for the study of various aspects of mitochondrial biology, particularly protein and tRNA import and mt-ribosome structures, and add to its value as an emerging model species. Furthermore, the fast-evolving M. leidyi mtDNA should be a convenient molecular marker for species- and population-level studies.
Ctenophora; comparative genomics; cytonuclear coevolution
Select HIV-1-infected individuals develop sera capable of neutralizing diverse viral strains. The molecular basis of this neutralization is currently being deciphered by the isolation of HIV-1-neutralizing antibodies. In one infected donor, three neutralizing antibodies, PGT135–137, were identified by assessment of neutralization from individually sorted B cells and found to recognize an epitope containing an N-linked glycan at residue 332 on HIV-1 gp120. Here we use next-generation sequencing and bioinformatics methods to interrogate the B cell record of this donor to gain a more complete understanding of the humoral immune response. PGT135–137-gene family specific primers were used to amplify heavy-chain and light-chain variable-domain sequences. Pyrosequencing produced 141,298 heavy-chain sequences of IGHV4-39 origin and 87,229 light-chain sequences of IGKV3-15 origin. A number of heavy and light-chain sequences of ∼90% identity to PGT137, several to PGT136, and none of high identity to PGT135 were identified. After expansion of these sequences to include close phylogenetic relatives, a total of 202 heavy-chain sequences and 72 light-chain sequences were identified. These sequences were clustered into populations of 95% identity comprising 15 for heavy chain and 10 for light chain, and a select sequence from each population was synthesized and reconstituted with a PGT137-partner chain. Reconstituted antibodies showed varied neutralization phenotypes for HIV-1 clade A and D isolates. Sequence diversity of the antibody population represented by these tested sequences was notably higher than observed with a 454 pyrosequencing-control analysis on 10 antibodies of defined sequence, suggesting that this diversity results primarily from somatic maturation. 
Our results thus provide an example of how pathogens like HIV-1 are opposed by a varied humoral immune response, derived from intrinsic mechanisms of antibody development, and embodied by somatic populations of diverse antibodies.
antibody bioinformatics; high-throughput sequencing; HIV-1; immunity; N-linked glycan
Defining the genetic contribution of rare variants to common diseases is a major basic and clinical science challenge that could offer new insights into disease etiology and provide potential for directed gene- and pathway-based prevention and treatment. Common and rare nonsynonymous variants in the GCKR gene are associated with alterations in metabolic traits, most notably serum triglyceride levels. GCKR encodes glucokinase regulatory protein (GKRP), a predominantly nuclear protein that inhibits hepatic glucokinase (GCK) and plays a critical role in glucose homeostasis. The mode of action of rare GCKR variants remains unexplored. We identified 19 nonsynonymous GCKR variants among 800 individuals from the ClinSeq medical sequencing project. Excluding the previously described common missense variant p.Pro446Leu, all variants were rare in the cohort. Accordingly, we functionally characterized all variants to evaluate their potential phenotypic effects. Defects were observed for the majority of the rare variants after assessment of cellular localization, ability to interact with GCK, and kinetic activity of the encoded proteins. Comparing the individuals with functional rare variants to those without such variants showed associations with lipid phenotypes. Our findings suggest that, while nonsynonymous GCKR variants, excluding p.Pro446Leu, are rare in individuals of mixed European descent, the majority do affect protein function. In sum, this study utilizes computational, cell biological, and biochemical methods to present a model for interpreting the clinical significance of rare genetic variants in common disease.
Ciliary dysfunction leads to a broad range of overlapping phenotypes, termed collectively as ciliopathies. This grouping is underscored by genetic overlap, where causal genes can also contribute modifying alleles to clinically distinct disorders. Here we show that mutations in TTC21B/IFT139, encoding a retrograde intraflagellar transport (IFT) protein, cause both isolated nephronophthisis (NPHP) and syndromic Jeune Asphyxiating Thoracic Dystrophy (JATD). Moreover, although systematic medical resequencing of a large, clinically diverse ciliopathy cohort and matched controls showed a similar frequency of rare changes, in vivo and in vitro evaluations unmasked a significant enrichment of pathogenic alleles in cases, suggesting that TTC21B contributes pathogenic alleles to ∼5% of ciliopathy patients. Our data illustrate how genetic lesions can be both causally associated with diverse ciliopathies, as well as interact in trans with other disease-causing genes, and highlight how saturated resequencing followed by functional analysis of all variants informs the genetic architecture of disorders.
ClinSeq is a large-scale medical sequencing (LSMS) project at the National Institutes of Health (NIH), the goal of which is to pilot the feasibility of using high-throughput genome sequencing for clinical research and eventually to improve the delivery of healthcare. In phase one, 1000 participants are being clinically evaluated for cardiovascular phenotypes and DNA is being collected for sequencing of 400 candidate genes to identify genetic variants that may predispose to the early development of atherosclerosis. We report on an individual with familial hypercholesterolemia (OMIM #143890) who has a novel mutation, c.261_262invGA, that predicts a premature stop (p.Trp87X) in the LDLR gene. Although the p.Trp87X predicted protein mutation has been reported, c.261_262invGA is distinct from mutations reported in prior families. This case emphasizes the importance of describing mutations according to the underlying DNA change, as multiple nucleotide changes may underlie a single predicted protein change.
Nuclear receptors (NRs) are an ancient superfamily of metazoan transcription factors that play critical roles in regulation of reproduction, development, and energetic homeostasis. Although the evolutionary relationships among NRs are well-described in two prominent clades of animals (deuterostomes and protostomes), comparatively little information has been reported on the diversity of NRs in early diverging metazoans. Here, we identified NRs from the phylum Ctenophora and used a phylogenomic approach to explore the emergence of the NR superfamily in the animal kingdom. In addition, to gain insight into conserved or novel functions, we examined NR expression during ctenophore development.
We report the first described NRs from the phylum Ctenophora: two from Mnemiopsis leidyi and one from Pleurobrachia pileus. All ctenophore NRs contained a ligand-binding domain and grouped with NRs from the subfamily NR2A (HNF4). Surprisingly, all the ctenophore NRs lacked the highly conserved DNA-binding domain (DBD). NRs from Mnemiopsis were expressed in different regions of developing ctenophores. One was broadly expressed in the endoderm during gastrulation. The second was initially expressed in the ectoderm during gastrulation, in regions corresponding to the future tentacles; subsequent expression was restricted to the apical organ. Phylogenetic analyses of NRs from ctenophores, sponges, cnidarians, and a placozoan support the hypothesis that expansion of the superfamily occurred in a step-wise fashion, with initial radiations in NR family 2, followed by representatives of NR families 3, 6, and 1/4 originating prior to the appearance of the bilaterian ancestor.
Our study provides the first description of NRs from ctenophores, including the full complement from Mnemiopsis. Ctenophores have the least diverse NR complement of any animal phylum with representatives that cluster with only one subfamily (NR2A). Ctenophores and sponges have a similarly restricted NR complement supporting the hypothesis that the original NR was HNF4-like and that these lineages are the first two branches from the animal tree. The absence of a zinc-finger DNA-binding domain in the two ctenophore species suggests two hypotheses: this domain may have been secondarily lost within the ctenophore lineage or, if ctenophores are the first branch off the animal tree, the original NR may have lacked the canonical DBD. Phylogenomic analyses and categorization of NRs from all four early diverging animal phyla compared with the complement from bilaterians suggest the rate of NR diversification prior to the cnidarian-bilaterian split was relatively modest, with independent radiations of several NR subfamilies within the cnidarian lineage.
Intercellular signaling pathways are a fundamental component of the integrating cellular behavior required for the evolution of multicellularity. The genomes of three of the four early branching animal phyla (Cnidaria, Placozoa and Porifera) have been surveyed for key components, but not the fourth (Ctenophora). Genomic data from ctenophores could be particularly relevant, as ctenophores have been proposed to be one of the earliest branching metazoan phyla.
A preliminary assembly of the lobate ctenophore Mnemiopsis leidyi genome, generated using next-generation sequencing technologies, was searched for components of a developmentally important signaling pathway, the Wnt/β-catenin pathway. Molecular phylogenetic analysis shows four distinct Wnt ligands (MlWnt6, MlWnt9, MlWntA and MlWntX), and most, but not all, components of the receptor and intracellular signaling pathway were detected. In situ hybridization of the four Wnt ligands showed that they are expressed in discrete regions associated with the aboral pole, tentacle apparati and apical organ.
Ctenophores show a minimal (but not obviously simple) complement of Wnt signaling components. Furthermore, it is difficult to compare the Mnemiopsis Wnt expression patterns with those of other metazoans. mRNA expression of Wnt pathway components appears later in development than expected, and zygotic gene expression does not appear to play a role in early axis specification. Notably absent in the Mnemiopsis genome are most major secreted antagonists, which suggests that complex regulation of this secreted signaling pathway probably evolved later in animal evolution.
The much-debated phylogenetic relationships of the five early branching metazoan lineages (Bilateria, Cnidaria, Ctenophora, Placozoa and Porifera) are of fundamental importance in piecing together events that occurred early in animal evolution. Comparisons of gene content between organismal lineages have been identified as a potentially useful methodology for phylogenetic reconstruction. However, these comparisons require complete genomes that, until now, did not exist for the ctenophore lineage. The homeobox superfamily of genes is particularly suited for these kinds of gene content comparisons, since it is large, diverse, and features a highly conserved domain.
We have used a next-generation sequencing approach to generate a high-quality rough draft of the genome of the ctenophore Mnemiopsis leidyi and subsequently identified a set of 76 homeobox-containing genes from this draft. We phylogenetically categorized this set into established gene families and classes and then compared this set to the homeodomain repertoire of species from the other four early branching metazoan lineages. We have identified several important classes and subclasses of homeodomains that appear to be absent from Mnemiopsis and from the poriferan Amphimedon queenslandica. We have also determined that, based on lineage-specific paralog retention and average branch lengths, it is unlikely that these missing classes and subclasses are due to extensive gene loss or unusually high rates of evolution in Mnemiopsis.
This paper provides a first glimpse of the first sequenced ctenophore genome. We have characterized the full complement of Mnemiopsis homeodomains from this species and have compared them to species from other early branching lineages. Our results suggest that Porifera and Ctenophora were the first two extant lineages to diverge from the rest of animals. Based on this analysis, we also propose a new name - ParaHoxozoa - for the remaining group that includes Placozoa, Cnidaria and Bilateria.
Stuttering is a disorder of unknown cause characterized by repetitions, prolongations, and interruptions in the flow of speech. Genetic factors have been implicated in this disorder, and previous studies of stuttering have identified linkage to markers on chromosome 12.
We analyzed the chromosome 12q23.3 genomic region in consanguineous Pakistani families, some members of which had nonsyndromic stuttering, and in unrelated case and control subjects from Pakistan and North America.
We identified a missense mutation in the N-acetylglucosamine-1-phosphate transferase gene (GNPTAB), which encodes the alpha and beta catalytic subunits of GlcNAc-phosphotransferase (GNPT [EC 2.7.8.17]), that was associated with stuttering in a large, consanguineous Pakistani family. This mutation occurred in the affected members of approximately 10% of Pakistani families studied, but it occurred only once in 192 chromosomes from unaffected, unrelated Pakistani control subjects and was not observed in 552 chromosomes from unaffected, unrelated North American control subjects. This and three other mutations in GNPTAB occurred in unrelated subjects with stuttering but not in control subjects. We also identified three mutations in the GNPTG gene, which encodes the gamma subunit of GNPT, in affected subjects of Asian and European descent but not in control subjects. Furthermore, we identified three mutations in the NAGPA gene, which encodes the so-called uncovering enzyme, in other affected subjects but not in control subjects. These genes encode enzymes that generate the mannose-6-phosphate signal, which directs a diverse group of hydrolases to the lysosome. Deficits in this system are associated with the mucolipidoses, rare lysosomal storage disorders that are most commonly associated with bone, connective tissue, and neurologic symptoms.
Susceptibility to nonsyndromic stuttering is associated with variations in genes governing lysosomal metabolism.
The development of massively parallel sequencing technologies, coupled with new massively parallel DNA enrichment technologies (genomic capture), has allowed the sequencing of targeted regions of the human genome in rapidly increasing numbers of samples. Genomic capture can target specific areas in the genome, including genes of interest and linkage regions, but this limits the study to what is already known. Exome capture allows an unbiased investigation of the complete protein-coding regions in the genome. Researchers can use exome capture to focus on a critical part of the human genome, allowing larger numbers of samples than are currently practical with whole-genome sequencing. In this review, we briefly describe some of the methodologies currently used for genomic and exome capture and highlight recent applications of this technology.
The domestic cat has offered enormous genomic potential in the veterinary description of over 250 hereditary disease models as well as the occurrence of several deadly feline viruses (feline leukemia virus, FeLV; feline coronavirus, FECV; feline immunodeficiency virus, FIV) that are homologues of human scourges (cancer, SARS, and AIDS, respectively). However, to realize this biomedical potential, a high-density single nucleotide polymorphism (SNP) map is required in order to accomplish disease and phenotype association discovery.
To remedy this, we generated 3,178,297 paired fosmid-end Sanger sequence reads from seven cats, and combined these data with the publicly available 2X cat whole genome sequence. All sequence reads were assembled together to form a 3X whole genome assembly, allowing the discovery of over three million SNPs. To reduce potential false positive SNPs due to the low-coverage assembly, a low upper limit was placed on sequence coverage and a high lower limit on the quality of the discrepant bases at a potential variant site. In all, domestic cats of different breeds (female Abyssinian, female American shorthair, male Cornish Rex, female European Burmese, female Persian, female Siamese, and male Ragdoll) and a female African wildcat were sequenced lightly. We report a total of 964 k common SNPs suitable for a domestic cat SNP genotyping array and an additional 900 k SNPs detected between African wildcat and domestic cat breeds. An empirical sample of 94 discovered SNPs was tested in the sequenced cats, resulting in a SNP validation rate of 99%.
These data provide a large collection of mapped feline SNPs across the cat genome that will allow for the development of SNP genotyping platforms for mapping feline diseases.
Microsatellite length mutations are often modeled using the generalized stepwise mutation process, which is a type of random walk. If this model is sufficiently accurate, one can estimate the coalescence time between alleles of a locus after a mathematical transformation of the allele lengths. When large-scale microsatellite genotyping first became possible, there was substantial interest in using this approach to make inferences about time and demography, but that interest has waned because it has not been possible to empirically validate the clock by comparing it with data in which the mutation process is well understood. We analyzed data from 783 microsatellite loci in human populations and 292 loci in chimpanzee populations, and compared them with up to one gigabase of aligned sequence data, where the molecular clock based upon nucleotide substitutions is believed to be reliable. We empirically demonstrate a remarkable linearity (r2 > 0.95) between the microsatellite average square distance statistic and sequence divergence. We demonstrate that microsatellites are accurate molecular clocks for coalescent times of at least 2 million years (My). We apply this insight to confirm that the African populations San, Biaka Pygmy, and Mbuti Pygmy have the deepest coalescent times among populations in the Human Genome Diversity Project. Furthermore, we show that microsatellites support unbiased estimates of population differentiation (FST) that are less subject to ascertainment bias than single nucleotide polymorphism (SNP) FST. These results raise the prospect of using microsatellite data sets to determine parameters of population history. When genotyped along with SNPs, microsatellite data can also be used to correct for SNP ascertainment bias.
microsatellite evolution; molecular clocks; coalescent time; average square distance; FST; SNP ascertainment bias
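To make the average square distance (ASD) statistic concrete, here is a minimal Python sketch. It assumes alleles are recorded as integer repeat counts and that ASD between two samples is the mean squared difference in repeat count over all cross-sample allele pairs, averaged over loci; the repeat counts are hypothetical and the function names are illustrative only.

```python
from itertools import product


def asd_locus(lengths_x, lengths_y):
    """Mean squared repeat-count difference over all cross-sample allele pairs."""
    pairs = list(product(lengths_x, lengths_y))
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def asd(loci):
    """Average the per-locus statistic over loci.

    loci: list of (lengths_x, lengths_y), one entry per microsatellite locus.
    """
    return sum(asd_locus(x, y) for x, y in loci) / len(loci)


# Hypothetical repeat counts at two loci for two samples.
loci = [
    ([10, 11], [12, 14]),
    ([7, 7], [7, 9]),
]
print(asd_locus(*loci[0]), asd_locus(*loci[1]), asd(loci))
```

Under a stepwise mutation model, this quantity is expected to grow with time since coalescence, which is why the abstract compares it against sequence divergence.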
To utilize high-throughput sequencing to determine the etiology of juvenile-onset neurodegeneration in a 19-year-old woman with progressive motor and cognitive decline.
Exome sequencing identified an initial list of 133,555 variants in the proband's family, which were filtered using segregation analysis, presence in dbSNP, and an empirically derived gene exclusion list. The filtered list comprised 52 genes: 21 homozygous variants and 31 compound heterozygous variants. These variants were subsequently scrutinized with predicted pathogenicity programs and for association with appropriate clinical syndromes.
Exome sequencing data identified 2 GLB1 variants (c.602G>A, p.R201H; c.785G>T, p.G262V). β-Galactosidase enzyme analysis prior to our evaluation was reported as normal; however, subsequent testing was consistent with juvenile-onset GM1-gangliosidosis. Urine oligosaccharide analysis was positive for multiple oligosaccharides with terminal galactose residues.
We describe a patient with juvenile-onset neurodegeneration that had eluded diagnosis for over a decade. GM1-gangliosidosis had previously been excluded from consideration, but was subsequently identified as the correct diagnosis using exome sequencing. Exome sequencing can evaluate genes not previously associated with neurodegeneration, as well as most known neurodegeneration-associated genes. Our results demonstrate the utility of “agnostic” exome sequencing to evaluate patients with undiagnosed disorders, without prejudice from prior testing results.
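The filtering strategy described above (segregation, presence in dbSNP, and a gene exclusion list, yielding homozygous and compound heterozygous candidates) might be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: it assumes a proband-and-parents trio, a recessive model, genotypes coded as alternate-allele counts (0, 1, 2), and hypothetical field and gene names.

```python
def candidate_genes(variants, excluded_genes):
    """Return (homozygous_genes, compound_het_genes) under a recessive trio model.

    variants: list of dicts with keys 'gene', 'in_dbsnp', 'proband',
    'mother', 'father'; genotypes are alternate-allele counts (0, 1 or 2).
    """
    homozygous = set()
    het_origins = {}  # gene -> set of parents transmitting a heterozygous variant
    for v in variants:
        # Drop known polymorphisms and genes on the exclusion list.
        if v["in_dbsnp"] or v["gene"] in excluded_genes:
            continue
        p, m, f = v["proband"], v["mother"], v["father"]
        if p == 2 and m == 1 and f == 1:
            # Homozygous in the proband, both parents carriers.
            homozygous.add(v["gene"])
        elif p == 1 and (m, f) in ((1, 0), (0, 1)):
            origin = "mother" if m == 1 else "father"
            het_origins.setdefault(v["gene"], set()).add(origin)
    # Compound heterozygous: at least one variant inherited from each parent.
    compound = {g for g, o in het_origins.items() if o == {"mother", "father"}}
    return homozygous, compound


# Hypothetical variant records.
variants = [
    {"gene": "GENE_A", "in_dbsnp": False, "proband": 2, "mother": 1, "father": 1},
    {"gene": "GENE_B", "in_dbsnp": False, "proband": 1, "mother": 1, "father": 0},
    {"gene": "GENE_B", "in_dbsnp": False, "proband": 1, "mother": 0, "father": 1},
    {"gene": "GENE_C", "in_dbsnp": False, "proband": 1, "mother": 1, "father": 0},
    {"gene": "GENE_D", "in_dbsnp": True, "proband": 2, "mother": 1, "father": 1},
    {"gene": "GENE_E", "in_dbsnp": False, "proband": 2, "mother": 1, "father": 1},
]
homozygous, compound = candidate_genes(variants, excluded_genes={"GENE_E"})
print(sorted(homozygous), sorted(compound))
```

Genes surviving such a filter would still need the downstream scrutiny the abstract describes, namely pathogenicity prediction and comparison with the clinical picture.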
A comparative analysis of SNPs and their exonic and intronic environments identifies the features predictive of splice affecting variants.
Single point mutations at both synonymous and non-synonymous positions within exons can have severe effects on gene function through disruption of splicing. Predicting these mutations in silico purely from the genomic sequence is difficult due to an incomplete understanding of the multiple factors that may be responsible. In addition, little is known about which computational prediction approaches, such as those involving exonic splicing enhancers and exonic splicing silencers, are most informative.
We assessed the features of single-nucleotide genomic variants verified to cause exon skipping and compared them to a large set of coding SNPs common in the human population, which are likely to have no effect on splicing. Our findings implicate a number of features important for their ability to discriminate splice-affecting variants, including the naturally occurring density of exonic splicing enhancers and exonic splicing silencers of the exon and intronic environment, extensive changes in the number of predicted exonic splicing enhancers and exonic splicing silencers, proximity to the splice junctions and evolutionary constraint of the region surrounding the variant. By extending this approach to additional datasets, we also identified relevant features of variants that cause increased exon inclusion and ectopic splice site activation.
We identified a number of features that have statistically significant representation among exonic variants that modulate splicing. These analyses highlight putative mechanisms responsible for splicing outcome and emphasize the role of features important for exon definition. We developed a web-tool, Skippy, to score coding variants for these relevant splice-modulating features.