Using DNA extracted from a finger bone found in Denisova Cave in southern Siberia, we have sequenced the genome of an archaic hominin to about 1.9-fold coverage. This individual is from a group that shares a common origin with Neanderthals. This population was not involved in the putative gene flow from Neanderthals into Eurasians; however, the data suggest that it contributed 4–6% of its genetic material to the genomes of present-day Melanesians. We designate this hominin population ‘Denisovans’ and suggest that it may have been widespread in Asia during the Late Pleistocene epoch. A tooth found in Denisova Cave carries a mitochondrial genome highly similar to that of the finger bone. This tooth shares no derived morphological features with Neanderthals or modern humans, further indicating that Denisovans have an evolutionary history distinct from Neanderthals and modern humans.
Despite detailed clinical definition and refinement of neurodevelopmental disorders and neuropsychiatric conditions, the underlying genetic etiology has proved elusive. Recent genetic studies have revealed some common themes: considerable locus heterogeneity, variable expressivity for the same mutation, and a role for multiple disruptive events in the same individual affecting genes in common pathways. Recurrent copy number variation (CNV), in particular, has emphasized the importance of either de novo or essentially private mutations creating imbalances for multiple genes. CNVs have foreshadowed a model where the distinction between milder neuropsychiatric conditions and those of severe developmental impairment may be a consequence of increased mutational burden affecting more genes.
copy number variants; variable penetrance; genomic disorders; autism; schizophrenia; intellectual disability
All genetic variation arises via new mutations, and therefore determining the rate and biases for different classes of mutation is essential for understanding the genetics of human disease and evolution. Decades of mutation rate analyses have focused on a relatively small number of loci because of technical limitations. However, advances in sequencing technology have allowed for empirical assessments of genome-wide rates of mutation. Recent studies have shown that 76% of new mutations originate in the paternal lineage and provide unequivocal evidence for an increase in mutation with paternal age. Although most analyses have been focused on single nucleotide variants (SNVs), studies have begun to provide insight into the mutation rate for other classes of variation, including copy number variants (CNVs), microsatellites, and mobile element insertions. Here, we review the genome-wide analyses for the mutation rate of several types of variants and suggest areas for future research.
germline mutation rate; de novo mutation; paternal bias; paternal age; genome-wide
Despite a high heritability, a genetic diagnosis can only be established in a minority of patients with autism spectrum disorder (ASD), characterized by persistent deficits in social communication and interaction and restricted, repetitive patterns of behavior, interests or activities1. Known genetic causes include chromosomal aberrations, such as the duplication of the 15q11-13 region, and monogenic causes, such as the Rett and Fragile X syndromes. The genetic heterogeneity within ASD is striking, with even the most frequent causes responsible for only 1% of cases at the most. Even with the recent developments in next generation sequencing, for the large majority of cases no molecular diagnosis can be established 2-7. Here, we report 10 patients with ASD and other shared clinical characteristics, including intellectual disability and facial dysmorphisms caused by a mutation in ADNP, a transcription factor involved in the SWI/SNF remodeling complex. We estimate this gene to be mutated in at least 0.17% of ASD cases, making it one of the most frequent ASD genes known to date.
We report a novel gene for a parkinsonian disorder. X-linked parkinsonism with spasticity (XPDS) presents either as typical adult onset Parkinson's disease or earlier onset spasticity followed by parkinsonism. We previously mapped the XPDS gene to a 28 Mb region on Xp11.2–X13.3. Exome sequencing of one affected individual identified five rare variants in this region, of which none was missense, nonsense or frame shift. Using patient-derived cells, we tested the effect of these variants on expression/splicing of the relevant genes. A synonymous variant in ATP6AP2, c.345C>T (p.S115S), markedly increased exon 4 skipping, resulting in the overexpression of a minor splice isoform that produces a protein with internal deletion of 32 amino acids in up to 50% of the total pool, with concomitant reduction of isoforms containing exon 4. ATP6AP2 is an essential accessory component of the vacuolar ATPase required for lysosomal degradative functions and autophagy, a pathway frequently affected in Parkinson's disease. Reduction of the full-size ATP6AP2 transcript in XPDS cells and decreased level of ATP6AP2 protein in XPDS brain may compromise V-ATPase function, as seen with siRNA knockdown in HEK293 cells, and may ultimately be responsible for the pathology. Another synonymous mutation in the same exon, c.321C>T (p.D107D), has a similar molecular defect of exon inclusion and causes X-linked mental retardation Hedera type (MRXSH). Mutations in XPDS and MRXSH alter binding sites for different splicing factors, which may explain the marked differences in age of onset and manifestations.
Asthma is a complex genetic disease caused by a combination of genetic and environmental risk factors. We sought to test classes of genetic variants largely missed by genome-wide association studies (GWAS), including copy number variants (CNVs) and low-frequency variants, by performing whole-genome sequencing (WGS) on 16 individuals from asthma-enriched and asthma-depleted families. The samples were obtained from an extended 13-generation Hutterite pedigree with reduced genetic heterogeneity due to a small founding gene pool and reduced environmental heterogeneity as a result of a communal lifestyle. We sequenced each individual to an average depth of 13-fold, generated a comprehensive catalog of genetic variants, and tested the most severe mutations for association with asthma. We identified and validated 1960 CNVs, 19 nonsense or splice-site single nucleotide variants (SNVs), and 18 insertions or deletions that were out of frame. As follow-up, we performed targeted sequencing of 16 genes in 837 cases and 540 controls of Puerto Rican ancestry and found that controls carry a significantly higher burden of mutations in IL27RA (2.0% of controls; 0.23% of cases; nominal p = 0.004; Bonferroni p = 0.21). We also genotyped 593 CNVs in 1199 Hutterite individuals. We identified a nominally significant association (p = 0.03; Odds ratio (OR) = 3.13) between a 6 kbp deletion in an intron of NEDD4L and increased risk of asthma. We genotyped this deletion in an additional 4787 non-Hutterite individuals (nominal p = 0.056; OR = 1.69). NEDD4L is expressed in bronchial epithelial cells, and conditional knockout of this gene in the lung in mice leads to severe inflammation and mucus accumulation. Our study represents one of the early instances of applying WGS to complex disease with a large environmental component and demonstrates how WGS can identify risk variants, including CNVs and low-frequency variants, largely untested in GWAS.
Comparisons of human genomes show that more base pairs are altered as a result of structural variation, including copy number variation, than as a result of point mutations. Here we review advances and challenges in the discovery and genotyping of structural variation. The recent application of massively parallel sequencing methods has complemented microarray-based methods and has led to an exponential increase in the discovery of smaller structural-variation events. Some global discovery biases remain, but the integration of experimental and computational approaches is proving fruitful for accurate characterization of the copy, content and structure of variable regions. We argue that the long-term goal should be routine, cost-effective and high quality de novo assembly of human genomes to comprehensively assess all classes of structural variation.
We present a high-quality genome sequence of a Neandertal woman from Siberia. We show that her parents were related at the level of half siblings and that mating among close relatives was common among her recent ancestors. We also sequenced the genome of a Neandertal from the Caucasus to low coverage. An analysis of the relationships and population history of available archaic genomes and 25 present-day human genomes shows that several gene flow events occurred among Neandertals, Denisovans and early modern humans, possibly including gene flow into Denisovans from an unknown archaic group. Thus, interbreeding, albeit of low magnitude, occurred among many hominin groups in the Late Pleistocene. In addition, the high quality Neandertal genome allows us to establish a definitive list of substitutions that became fixed in modern humans after their separation from the ancestors of Neandertals and Denisovans.
de novo SNV mutation; autozygosity; mutation rate
The genetic basis of neurodevelopmental and neuropsychiatric diseases has been advanced by the discovery of large and recurrent copy number variants significantly enriched in cases when compared to controls. The pattern of this variation strongly implies that rare variants contribute significantly to neurological disease; that different genes will be responsible for similar diseases in different families; and that the same “primary” genetic lesions can result in a different disease outcome depending potentially on the genetic background. Next-generation sequencing technologies are beginning to broaden the spectrum of disease-causing variation and provide specificity by pinpointing both genes and pathways for future diagnostics and therapeutics.
High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the ‘best’ mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves on mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Note that all of the above features are implemented within the index structure rather than as simple post-processing steps, and are therefore performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST.
In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net.
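The two features described above, SNP-aware mismatch counting and reporting only reads with at most n mapping loci, can be illustrated with a minimal Python sketch. This is not the mrsFAST-Ultra implementation: it uses a brute-force scan instead of the compact index the abstract describes, and the function names (`snp_aware_mismatches`, `map_read`) and parameters (`snp_positions`, `max_errors`, `max_loci`) are hypothetical names chosen for illustration.

```python
def snp_aware_mismatches(read, ref, pos, snp_positions):
    """Count mismatches between read and ref[pos:pos+len(read)],
    skipping reference positions listed as known SNP sites."""
    mismatches = 0
    for offset, base in enumerate(read):
        ref_pos = pos + offset
        if base != ref[ref_pos] and ref_pos not in snp_positions:
            mismatches += 1
    return mismatches


def map_read(read, ref, snp_positions, max_errors, max_loci=None):
    """Return every locus where the read aligns within max_errors mismatches.

    Brute-force scan for illustration only (assumption: no index).
    If max_loci is given and the read maps to more loci than that,
    the read is dropped and an empty list is returned."""
    hits = []
    for pos in range(len(ref) - len(read) + 1):
        if snp_aware_mismatches(read, ref, pos, snp_positions) <= max_errors:
            hits.append(pos)
    if max_loci is not None and len(hits) > max_loci:
        return []
    return hits
```

With a toy reference in which position 5 is marked as a known SNP, a read that differs from the reference only at that position gains an extra mapping locus, which is the effect the abstract credits for the increase in mappable reads.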
Over 900 genes have been annotated within duplicated regions of the human genome, yet their functions and potential roles in disease remain largely unknown. One major obstacle has been our inability to accurately and comprehensively assay genetic variation for these genes in a high-throughput manner. We developed a sequencing-based method for rapid and high-throughput genotyping of duplicated genes using molecular inversion probes designed to unique paralogous sequence variants. We apply this method to genotype all members of two gene families, SRGAP2 and RH, among a diversity panel of 1,056 humans. The approach can accurately distinguish copy number in paralogs having up to ∼99.6% sequence identity, identify small gene-disruptive deletions, detect single nucleotide variants, define breakpoints of unequal crossover, and discover regions of interlocus gene conversion. Our analysis of SRGAP2 suggests that nonreciprocal genetic exchange akin to interlocus gene conversion can occur over long distances (> 80 Mbp) between paralogs. The ability to rapidly and accurately genotype multiple gene families in thousands of individuals at low cost enables the development of genome-wide gene conversion maps and unlocks many duplicated genes for association with human traits.
Gene duplication is an important source of phenotypic change and adaptive evolution. We use a novel genomic approach to identify highly identical sequence missing from the reference genome, confirming the cortical development gene Slit-Robo Rho GTPase activating protein 2 (SRGAP2) duplicated three times in humans. We show that the promoter and first nine exons of SRGAP2 duplicated from 1q32.1 (SRGAP2A) to 1q21.1 (SRGAP2B) ~3.4 million years ago (mya). Two larger duplications later copied SRGAP2B to chromosome 1p12 (SRGAP2C) and to proximal 1q21.1 (SRGAP2D), ~2.4 and ~1 mya, respectively. Sequence and expression analyses show that SRGAP2C is the most likely duplicate to encode a functional protein and is among the most fixed human-specific duplicate genes. Our data suggest a mechanism where incomplete duplication created a novel function at birth, antagonizing parental SRGAP2 function, 2–3 mya, a time corresponding to the transition from Australopithecus to Homo and the beginning of neocortex expansion.
It is well established that autism spectrum disorders (ASD) have a strong genetic component. However, for at least 70% of cases, the underlying genetic cause is unknown1. Under the hypothesis that de novo mutations underlie a substantial fraction of the risk for developing ASD in families with no previous history of ASD or related phenotypes (so-called sporadic or simplex families2,3), we sequenced all coding regions of the genome, i.e. the exome, for parent-child trios exhibiting sporadic ASD, including 189 new trios and 20 previously reported4. We also sequenced the exomes of 50 unaffected siblings corresponding to these new (n = 31) and previously reported trios (n = 19)4, for a total of 677 individual exomes from 209 families. Here we show de novo point mutations are overwhelmingly paternal in origin (4:1 bias) and positively correlated with paternal age, consistent with the modest increased risk for children of older fathers to develop ASD5. Moreover, 39% (49/126) of the most severe or disruptive de novo mutations map to a highly interconnected beta-catenin/chromatin remodeling protein network ranked significantly for autism candidate genes. In proband exomes, recurrent protein-altering mutations were observed in two genes, CHD8 and NTNG1. Mutation screening of six candidate genes in 1,703 ASD probands identified additional de novo, protein-altering mutations in GRIN2B, LAMC3, and SCN1A. Combined with copy number variant (CNV) data, these results suggest extreme locus heterogeneity but also provide a target for future discovery, diagnostics, and therapeutics.
We report an algorithm to detect structural variation and indels from 1 base pair to 1 megabase pair within exome sequence datasets. Splitread uses one-end anchored placements to cluster the mappings of subsequences of unanchored ends to identify the size, content and location of variants with good specificity and high sensitivity. The algorithm discovers indels, structural variants, de novo events and copy-number polymorphic processed pseudogenes missed by other methods.
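The split-read idea can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the Splitread algorithm itself: it splits a read at its midpoint only, requires exact matches for both halves, and calls a deletion when the halves map collinearly with a gap between them. The function names (`find_all`, `split_read_deletion`) are hypothetical. A real caller would restrict the search to the neighborhood of the anchored mate and cluster such signals across many reads, as the abstract describes.

```python
def find_all(ref, seq):
    """Return all start positions of exact matches of seq in ref."""
    hits = []
    start = ref.find(seq)
    while start != -1:
        hits.append(start)
        start = ref.find(seq, start + 1)
    return hits


def split_read_deletion(ref, read):
    """Split the read into two halves and look for a deletion signature.

    Returns a list of (breakpoint, deletion_size) tuples, one for each
    pair of placements where the right half maps downstream of the left
    half with a gap. Assumption: the breakpoint coincides with the read
    midpoint, so both halves match the reference exactly."""
    half = len(read) // 2
    left, right = read[:half], read[half:]
    calls = []
    for left_pos in find_all(ref, left):
        left_end = left_pos + half  # first reference base after the left half
        for right_pos in find_all(ref, right):
            gap = right_pos - left_end
            if gap > 0:
                calls.append((left_end, gap))
    return calls
```

A read that maps contiguously yields no call because the gap between its halves is zero, while a read spanning a deleted segment reports the position and size of the missing sequence.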
Identifying rare, highly penetrant risk mutations may be an important step in dissecting the molecular etiology of schizophrenia. We conducted a gene-based analysis of large (>100 kb), rare copy-number variants (CNVs) in the Wellcome Trust Case Control Consortium 2 (WTCCC2) schizophrenia sample of 1564 cases and 1748 controls all from Ireland, and further extended the analysis to include an additional 5196 UK controls. We found association with duplications at chr20p12.2 (P = 0.007) and evidence of replication in large independent European schizophrenia (P = 0.052) and UK bipolar disorder case-control cohorts (P = 0.047). A combined analysis of Irish/UK subjects including additional psychosis cases (schizophrenia and bipolar disorder) identified 22 carriers in 11,707 cases and 10 carriers in 21,204 controls [meta-analysis Cochran–Mantel–Haenszel P-value = 2 × 10^−4; odds ratio (OR) = 11.3, 95% CI = 3.7, ∞]. Nineteen of the 22 cases and 8 of the 10 controls carried duplications starting at 9.68 Mb with similar breakpoints across samples. By haplotype analysis and sequencing, we identified a tandem ∼149 kb duplication overlapping the gene p21 Protein-Activated Kinase 7 (PAK7, also called PAK5) which was in linkage disequilibrium with local haplotypes (P = 2.5 × 10^−21), indicative of a single ancestral duplication event. We confirmed the breakpoints in 8/8 carriers tested and found co-segregation of the duplication with illness in two additional family members of one of the affected probands. We demonstrate that PAK7 is developmentally co-expressed with another known psychosis risk gene (DISC1), suggesting a potential molecular mechanism involving aberrant synapse development and plasticity.
Kohlschütter–Tönz syndrome (KTS) is a rare autosomal recessive disorder characterized by amelogenesis imperfecta, psychomotor delay or regression and seizures starting early in childhood. KTS was established as a distinct clinical entity after the first report by Kohlschütter in 1974, and to date, only a total of 20 pedigrees have been reported. The genetic etiology of KTS remained elusive until recently when mutations in ROGDI were independently identified in three unrelated families and in five likely related Druze families. Herein, we report a clinical and genetic study of 10 KTS families. By using a combination of whole exome sequencing, linkage analysis, and Sanger sequencing, we identify novel homozygous or compound heterozygous ROGDI mutations in five families, all presenting with a typical KTS phenotype. The other families, mostly presenting with additional atypical features, were negative for ROGDI mutations, suggesting genetic heterogeneity of atypical forms of the disease.
Kohlschütter–Tönz; ROGDI; amelogenesis imperfecta; epilepsy
The genetic structure of the indigenous hunter-gatherer peoples of southern Africa, the oldest known lineage of modern human, is important for understanding human diversity. Studies based on mitochondrial1 and small sets of nuclear markers2 have shown that these hunter-gatherers, known as Khoisan, San, or Bushmen, are genetically divergent from other humans1,3. However, until now, fully sequenced human genomes have been limited to recently diverged populations4–8. Here we present the complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and a Bantu from southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, including 13,146 novel amino acid variants. In terms of nucleotide substitutions, the Bushmen seem to be, on average, more different from each other than, for example, a European and an Asian. Observed genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle. Adding the described variants to current databases will facilitate inclusion of southern Africans in medical research efforts, particularly when family and medical histories can be correlated with genome-wide data.
To understand the genetic heterogeneity underlying developmental delay, we compare copy-number variants (CNVs) in 15,767 children with intellectual disability and various congenital defects to 8,329 adult controls. We estimate that ~14.2% of disease in these individuals is due to large CNVs > 400 kbp. We find greater CNV enrichment in patients with craniofacial anomalies and cardiovascular defects than in those with epilepsy or autism. We identify 59 pathogenic CNVs including 14 novel or previously weakly supported candidates. We refine the critical interval for several genomic disorders such as the 17q21.31 microdeletion syndrome and identify 940 candidate dosage-sensitive genes. We also develop methods to opportunistically discover small, disruptive CNVs within the large and growing diagnostic array datasets. This evolving CNV morbidity map combined with exome/genome sequencing will be critical for deciphering the genetic basis of developmental delay, intellectual disability, and autism spectrum disorders.
The FMR1 premutation is defined as having 55 to 200 CGG repeats in the 5′ untranslated region of the fragile X mental retardation 1 gene (FMR1). The clinical involvement has been well characterized for fragile X-associated tremor/ataxia syndrome (FXTAS) and fragile X-associated primary ovarian insufficiency (FXPOI). The behavior/psychiatric and other neurological manifestations remain to be specified as well as the molecular mechanisms that will explain the phenotypic variability observed in individuals with the FMR1 premutation.
Here we describe a small pilot study of copy number variants (CNVs) in 56 participants with a premutation ranging from 55 to 192 repeats. The participants were divided into four different clinical groups for the analysis: those with behavioral problems but no autism spectrum disorder (ASD); those with ASD but without neurological problems; those with ASD and neurological problems including seizures; and those with neurological problems without ASD.
We found 12 rare CNVs (eight duplications and four deletions) in 11 cases (19.6%) that were not found in approximately 8,000 controls. Three of them were at 10q26 and two at Xp22.3, with small areas of overlap. The CNVs were more commonly identified in individuals with neurological involvement and ASD.
The CNV frequencies did not differ significantly across the groups. There were no significant differences in the psychometric and behavior scores among the groups. Further studies are necessary to determine the frequency of second genetic hits in individuals with the FMR1 premutation; however, these preliminary results suggest that genomic studies can be useful in understanding the molecular etiology of clinical involvement in premutation carriers with ASD and neurological involvement.
Premutation; FMR1 gene; Autism; Second hit; ASD; Neurodevelopmental disorders; Neurological disorders
Evidence for the etiology of autism spectrum disorders (ASD) has consistently pointed to a strong genetic component complicated by substantial locus heterogeneity1,2. We sequenced the exomes of 20 sporadic cases of ASD and their parents, reasoning that these families would be enriched for de novo mutations of major effect. We identified 21 de novo mutations, of which 11 were protein-altering. Protein-altering mutations were significantly enriched for changes at highly conserved residues. We identified potentially causative de novo events in 4/20 probands, particularly among more severely affected individuals, in FOXP1, GRIN2B, SCN1A, and LAMC3. In the FOXP1 mutation carrier, we also observed a rare inherited CNTNAP2 mutation and provide functional support for a multihit model for disease risk3. Our results demonstrate that trio-based exome sequencing is a powerful approach for identifying novel candidate genes for ASD and suggest that de novo mutations may contribute substantially to the genetic risk for ASD.
Although an increasing number of copy-number variations are being identified as susceptibility loci for a variety of pediatric diseases, the penetrance of these copy-number variations remains mostly unknown. This poses challenges for counseling, both for recurrence risks and prenatal diagnosis. We sought to provide empiric estimates for penetrance for some of these recurrent, disease-susceptibility loci.
We conducted a Bayesian analysis, based on the copy-number variation frequencies in control populations (n = 22,246) and in our database of >48,000 postnatal microarray-based comparative genomic hybridization samples. The background risk for congenital anomalies/developmental delay/intellectual disability was assumed to be ~5%. Copy-number variations studied were 1q21.1 proximal duplications, 1q21.1 distal deletions and duplications, 15q11.2 deletions, 16p13.11 deletions, 16p12.1 deletions, 16p11.2 proximal and distal deletions and duplications, 17q12 deletions and duplications, and 22q11.21 duplications.
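A penetrance estimate of this kind can be sketched with Bayes' rule. The abstract does not spell out the exact formula, so the following assumes the standard form, in which the probability of an abnormal phenotype given the CNV is P(D|G) = P(G|D)P(D) / [P(G|D)P(D) + P(G|not D)(1 − P(D))], with P(G|D) taken from the case frequency, P(G|not D) from the control frequency, and P(D) from the assumed background risk. The function name `penetrance` and the example frequencies in the usage note are hypothetical and are not values from the study.

```python
def penetrance(freq_cases, freq_controls, background_risk=0.05):
    """Posterior probability of an abnormal phenotype given a CNV.

    freq_cases      -- CNV frequency among affected individuals, P(G|D)
    freq_controls   -- CNV frequency among controls, P(G|not D)
    background_risk -- prior probability of the phenotype, P(D);
                       the default of 0.05 follows the ~5% stated above
    """
    numerator = freq_cases * background_risk
    denominator = numerator + freq_controls * (1.0 - background_risk)
    return numerator / denominator
```

For example, with made-up frequencies of 0.4% in cases and 0.04% in controls, `penetrance(0.004, 0.0004)` evaluates to 0.0002 / 0.00058, or roughly 34%. When the CNV is equally frequent in cases and controls, the estimate reduces to the background risk, as expected for a variant that carries no information.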
Estimates for the risk of an abnormal phenotype ranged from 10.4% for 15q11.2 deletions to 62.4% for distal 16p11.2 deletions.
This model can be used to provide more precise estimates for the chance of an abnormal phenotype for many copy-number variations encountered in the prenatal setting. By providing the penetrance, additional, critical information can be given to prospective parents in the genetic counseling session.
copy-number variation; genomic disorder; microarray; penetrance; prenatal diagnosis
Understanding the prevailing mutational mechanisms responsible for human genome structural variation requires uniformity in the discovery of allelic variants and precision in terms of breakpoint delineation. We develop a resource based on capillary end-sequencing of 13.8 million fosmid clones from 17 human genomes and characterize the complete sequence of 1,054 large structural variants corresponding to 589 deletions, 384 insertions, and 81 inversions. We analyze the 2,081 breakpoint junctions and infer potential mechanism of origin. Three mechanisms account for the bulk of germline structural variation: microhomology-mediated processes involving short (2–20 bp) stretches of sequence (28%), non-allelic homologous recombination (NAHR) (22%) and L1 retrotransposition (19%). The high quality and long-range continuity of the sequence reveals more complex mutational mechanisms including repeat-mediated inversions and gene conversion that are most often missed by other methods including comparative genomic hybridization, SNP microarrays and next-generation sequencing.
Standard methods of DNA sequence analysis assume that sequences evolve independently, yet this assumption may not be appropriate for segmental duplications that exchange variants via interlocus gene conversion (IGC). Here, we use high quality multiple sequence alignments from well-annotated segmental duplications to systematically identify IGC signals in the human reference genome. Our analysis combines two complementary methods: (i) a paralog quartet method that uses DNA sequence simulations to identify a statistical excess of sites consistent with inter-paralog exchange, and (ii) the alignment-based method implemented in the GENECONV program. One-quarter (25.4%) of the paralog families in our analysis harbor clear IGC signals by the quartet approach. Using GENECONV, we identify 1477 gene conversion tracks that cumulatively span 1.54 Mb of the genome. Our analyses confirm the previously reported high rates of IGC in subtelomeric regions and Y-chromosome palindromes, and identify multiple novel IGC hotspots, including the pregnancy specific glycoproteins and the neuroblastoma breakpoint gene families. Although the duplication history of a paralog family is described by a single tree, we show that IGC has introduced incredible site-to-site variation in the evolutionary relationships among paralogs in the human genome. Our findings indicate that IGC has left significant footprints in patterns of sequence diversity across segmental duplications in the human genome, out-pacing the contributions of single base mutation by orders of magnitude. Collectively, the IGC signals we report comprise a catalog that will provide a critical reference for interpreting observed patterns of DNA sequence variation across duplicated genomic regions, including targets of recent adaptive evolution in humans.