The genetic basis of neurodevelopmental and neuropsychiatric diseases has been advanced by the discovery of large and recurrent copy number variants significantly enriched in cases when compared to controls. The pattern of this variation strongly implies that rare variants contribute significantly to neurological disease; that different genes will be responsible for similar diseases in different families; and that the same “primary” genetic lesions can result in a different disease outcome depending potentially on the genetic background. Next-generation sequencing technologies are beginning to broaden the spectrum of disease-causing variation and provide specificity by pinpointing both genes and pathways for future diagnostics and therapeutics.
Standard methods of DNA sequence analysis assume that sequences evolve independently, yet this assumption may not be appropriate for segmental duplications that exchange variants via interlocus gene conversion (IGC). Here, we use high quality multiple sequence alignments from well-annotated segmental duplications to systematically identify IGC signals in the human reference genome. Our analysis combines two complementary methods: (i) a paralog quartet method that uses DNA sequence simulations to identify a statistical excess of sites consistent with inter-paralog exchange, and (ii) the alignment-based method implemented in the GENECONV program. One-quarter (25.4%) of the paralog families in our analysis harbor clear IGC signals by the quartet approach. Using GENECONV, we identify 1477 gene conversion tracks that cumulatively span 1.54 Mb of the genome. Our analyses confirm the previously reported high rates of IGC in subtelomeric regions and Y-chromosome palindromes, and identify multiple novel IGC hotspots, including the pregnancy specific glycoproteins and the neuroblastoma breakpoint gene families. Although the duplication history of a paralog family is described by a single tree, we show that IGC has introduced incredible site-to-site variation in the evolutionary relationships among paralogs in the human genome. Our findings indicate that IGC has left significant footprints in patterns of sequence diversity across segmental duplications in the human genome, out-pacing the contributions of single base mutation by orders of magnitude. Collectively, the IGC signals we report comprise a catalog that will provide a critical reference for interpreting observed patterns of DNA sequence variation across duplicated genomic regions, including targets of recent adaptive evolution in humans.
We report an algorithm to detect structural variation and indels from 1 base pair to 1 megabase pair within exome sequence datasets. Splitread uses one-end anchored placements to cluster the mappings of subsequences of unanchored ends to identify the size, content and location of variants with good specificity and high sensitivity. The algorithm discovers indels, structural variants, de novo events and copy-number polymorphic processed pseudogenes missed by other methods.
de novo SNV mutation; autozygosity; mutation rate
Familial dyskinesia with facial myokymia (FDFM) is an autosomal dominant disorder that is exacerbated by anxiety. In a five-generation family of German ancestry we previously mapped FDFM to chromosome 3p21-3q21. The 72.5 Mbp linkage region was too large for traditional positional mutation identification.
To identify the gene responsible for FDFM by exome resequencing of a single affected individual.
Design, Setting and Participants
We performed whole exome sequencing in one affected individual and used a series of bioinformatic filters, including functional significance and presence in dbSNP or 1000 Genomes project, to reduce the number of candidate variants. Co-segregation analysis was performed in 15 additional individuals in three generations.
The exome contained 23428 single nucleotide variants, of which 9391 were missense, nonsense or splice site alterations. The critical region contained 323 variants, five of which were not present in one of the sequence databases. Adenylate cyclase 5 (ADCY5) was the only gene in which the variant (c.2176G>A) was co-transmitted perfectly with disease status and was not present in 3510 control Caucasian exomes. This residue is highly conserved and the change is nonconservative and predicted to be damaging.
ADCY5 is highly expressed in striatum. Mice deficient in Adcy5 develop a movement disorder that is worsened by stress. We conclude that FDFM likely results from a missense mutation in ADCY5. This study demonstrates the power of a single exome sequence in combination with linkage information to identify causative genes for rare autosomal dominant Mendelian diseases.
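The filtering strategy described above (functional class, absence from dbSNP and the 1000 Genomes project, location within the linkage region, then co-segregation) can be sketched as follows. This is a minimal illustration, not the study's pipeline: the `Variant` fields, the effect labels, the function names and the toy coordinates are all hypothetical, and the sketch assumes that a candidate must be absent from both databases.

```python
from dataclasses import dataclass

# Functional classes kept by the filter (the abstract names missense,
# nonsense and splice site alterations). The label strings are hypothetical.
PROTEIN_ALTERING = {"missense", "nonsense", "splice_site"}


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int        # 1-based position
    effect: str     # predicted functional class
    in_dbsnp: bool
    in_1000g: bool


def filter_candidates(variants, chrom, start, end):
    """Keep protein-altering variants inside [start, end] on `chrom` that are
    absent from both databases (an assumed reading of the filter)."""
    return [
        v for v in variants
        if v.effect in PROTEIN_ALTERING
        and v.chrom == chrom
        and start <= v.pos <= end
        and not (v.in_dbsnp or v.in_1000g)
    ]


def cosegregates(carrier, affected):
    """True when carrier status equals disease status for every relative."""
    return all(carrier[person] == affected[person] for person in affected)


# Toy data only: coordinates and calls are invented for illustration.
toy_variants = [
    Variant("3", 100, "missense", False, False),
    Variant("3", 150, "synonymous", False, False),
    Variant("3", 200, "missense", True, False),
    Variant("5", 120, "nonsense", False, False),
]
candidates = filter_candidates(toy_variants, "3", 50, 300)
print([v.pos for v in candidates])
```

Applying the cheap database and region filters first leaves only a handful of variants for the more expensive co-segregation genotyping in relatives.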
Children with autism have an elevated frequency of large, rare copy number variants (CNVs). However, the global load of deletions or duplications, per se, and their size, location and relationship to clinical manifestations of autism have not been documented. We examined CNV data from 516 individuals with autism or typical development from the population-based Childhood Autism Risks from Genetics and Environment (CHARGE) study. We interrogated 120 regions flanked by segmental duplications (genomic hotspots) for events >50 kbp and the entire genomic backbone for variants >300 kbp using a custom targeted DNA microarray. This analysis was complemented by a separate study of five highly dynamic hotspots associated with autism or developmental delay syndromes, using a finely tiled array platform (>1 kbp) in 142 children matched for gender and ethnicity. In both studies, a significant increase in the number of base pairs of duplication, but not deletion, was associated with autism. Significantly elevated levels of CNV load remained after the removal of rare and likely pathogenic events. Further, the entire CNV load detected with the finely tiled array was contributed by common variants. The impact of this variation was assessed by examining the correlation of clinical outcomes with CNV load. The level of personal and social skills, measured by Vineland Adaptive Behavior Scales, negatively correlated (Spearman's r = −0.13, P = 0.034) with the duplication CNV load for the affected children; the strongest association was found for communication (P = 0.048) and socialization (P = 0.022) scores. We propose that CNV load, predominantly increased genomic base pairs of duplication, predisposes to autism.
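The correlation reported above between duplication load and Vineland scores is a Spearman rank correlation. A minimal sketch of that statistic is shown below, assuming average ranks for ties and Pearson correlation on the ranks; the function names and the toy numbers are hypothetical and are not data from the study.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman's r: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy values only (invented): duplication load in kbp and an adaptive score.
dup_load_kbp = [120, 300, 450, 800]
adaptive_score = [95, 90, 70, 60]
print(spearman_r(dup_load_kbp, adaptive_score))
```

A rank-based statistic is a natural choice here because CNV load is heavily skewed, so a few individuals with very large events would otherwise dominate a linear correlation.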
The lamprey (Petromyzon marinus) undergoes developmentally programmed genome rearrangements (PGRs) that mediate deletion of ~20% of germline DNA from somatic cells during early embryogenesis. This genomic differentiation of germline and soma is intriguing, because the germline plays a unique biological role wherein it must possess the ability to undergo meiotic recombination and the capacity to differentiate into every cell type. These evolutionarily indispensable functions set the germline at odds with somatic tissues, as factors that promote recombination and pluripotency can potentially disrupt genome integrity or specification of cell fate when misexpressed in somatic cell lineages (e.g. in oncogenesis). Here, we describe the development of new genomic and transcriptomic resources for lamprey and use these to identify hundreds of genes that are targeted for programmed deletion from somatic cell lineages. Transcriptome sequencing and targeted validation studies further confirm that somatically deleted genes function both in adult (meiotic) germline and in the development of primordial germ cells during embryogenesis. Inferred functional information from deleted regions indicates that developmentally programmed rearrangement serves as a (perhaps ancient) biological strategy to ensure segregation of pluripotency functions to the germline, effectively eliminating the potential for somatic misexpression.
Atrioventricular septal defects (AVSDs) are a frequent but not universal component of Down syndrome (DS), while AVSDs in otherwise normal individuals have no well-defined genetic basis. The contribution of copy number variation (CNV) to specific congenital heart disease (CHD) phenotypes including AVSD is unknown. We hypothesized that de novo CNVs on chromosome 21 might cause isolated sporadic AVSDs, and separately that CNVs throughout the genome might constitute an additional genetic risk factor for AVSD in patients with DS. We utilized custom oligonucleotide arrays targeted to CNV hotspots that are flanked by large duplicated segments of high sequence identity. We assayed 29 euploid and 50 DS individuals with AVSD, and compared them to general population controls. In patients with isolated sporadic AVSD we identified two large unique deletions outside of chromosome 21 not seen in the expanded set of 8,635 controls, each overlapping with larger deletions associated with similar CHD reported in the DECIPHER database. There was a small duplication in one patient with DS and AVSD. We conclude that isolated sporadic AVSDs may be occasionally associated with large de novo genomic structural variation outside of chromosome 21. The absence of CNVs on chromosome 21 in patients with isolated sporadic AVSD suggests that sub-chromosomal duplications or deletions of greater than 150 kbp on chromosome 21 do not cause sporadic AVSDs. Large CNVs do not appear to be an additive risk factor for AVSD in the DS population.
Down syndrome; atrioventricular septal defects; copy number variation; array CGH; congenital heart disease
The 17q21.31 inversion polymorphism exists either as direct (H1) or inverted (H2) haplotypes with differential predispositions to disease and selection. We investigated its genetic diversity in 2700 individuals with an emphasis on African populations. We characterize eight structural haplotypes that vary in size from 1.08 to 1.49 Mbp as a result of complex rearrangements and provide evidence for a 30 kbp H1/H2 double recombination event. We show that recurrent partial duplications of the KANSL1 (previously known as KIAA1267) gene have occurred on both H1 and H2 haplotypes and risen to high frequency in European populations. We identify a likely ancestral H2 haplotype (H2′) lacking these duplications, enriched among African hunter-gatherer groups yet essentially absent from West African populations. While H1 and H2 segmental duplications arose independently and prior to the human migration out of Africa, they have reached high frequencies recently among Europeans either due to extraordinary genetic drift or selective sweeps.
Analysis of cell-free fetal DNA in maternal plasma holds great promise for the development of non-invasive prenatal genetic diagnostics. However, previous studies have been restricted to detection of fetal trisomies (1, 2) or specific, paternally inherited mutations (3), or to genotyping common polymorphisms using invasively sampled material (4). Here, we combine genome sequencing of two parents, genome-wide maternal haplotyping (5), and deep sequencing of maternal plasma to non-invasively determine the genome sequence of a human fetus at 18.5 weeks gestation. Inheritance was predicted at 2.8×10^6 parentally heterozygous sites with 98.1% accuracy. Furthermore, 39 of 44 de novo point mutations in the fetal genome were detected, albeit with limited specificity. Subsampling these data and analyzing a second family trio by the same approach indicate that ~300 kilobase parental haplotype blocks combined with shallow sequencing of maternal plasma are sufficient to substantially determine the inherited complement of a fetal genome. However, ultra-deep sequencing of maternal plasma is necessary for the practical detection of fetal de novo mutations genome-wide. Although technical and analytical challenges remain, we anticipate that non-invasive analysis of inherited variation and de novo mutations in fetal genomes will facilitate the comprehensive prenatal diagnosis of both recessive and dominant Mendelian disorders.
The human genome is a highly dynamic structure that shows a wide range of genetic polymorphic variation. Unlike other types of structural variation, little is known about inversion variants within normal individuals because such events are typically balanced and are difficult to detect and analyze by standard molecular approaches. Using sequence-based, cytogenetic and genotyping approaches, we characterized six large inversion polymorphisms that map to regions associated with genomic disorders with complex segmental duplications mapping at the breakpoints. We developed a metaphase FISH-based assay to genotype inversions and analyzed the chromosomes of 27 individuals from three HapMap populations. In this subset, we find that these inversions are less frequent or absent in Asians when compared with European and Yoruban populations. Analyzing multiple individuals from outgroup species of great apes, we show that most of these large inversion polymorphisms are specific to the human lineage with two exceptions, 17q21.31 and 8p23 inversions, which are found to be similarly polymorphic in other great ape species and where the inverted allele represents the ancestral state. Investigating linkage disequilibrium relationships with genotyped SNPs, we provide evidence that most of these inversions appear to have arisen on at least two different haplotype backgrounds. In these cases, discovery and genotyping methods based on SNPs may be confounded and molecular cytogenetics remains the only method to genotype these inversions.
Rare copy number variants (CNVs) – deletions and duplications – have recently been established as important risk factors for both generalized and focal epilepsies. A systematic assessment of the role of CNVs in epileptic encephalopathies, the most devastating and often etiologically obscure group of epilepsies, has not been performed.
We evaluated 315 patients with epileptic encephalopathies characterized by epilepsy and progressive cognitive impairment for rare CNVs using a high-density, exon-focused whole-genome oligonucleotide array.
We found that 25/315 (7.9%) of our patients carried rare CNVs that may contribute to their phenotype, with at least half being clearly or likely pathogenic. We identified two patients with overlapping deletions at 7q21 and two patients with identical duplications of 16p11.2. In our cohort, large deletions were enriched in affected individuals compared to controls, and four patients harbored two rare CNVs. We screened two novel candidate genes found within the rare CNVs in our cohort but found no mutations in our patients with epileptic encephalopathies. We highlight several additional novel candidate genes located in CNV regions.
Our data highlight the significance of rare copy number variants in the epileptic encephalopathies, and we suggest that CNV analysis should be considered in the genetic evaluation of these patients. Our findings also highlight novel candidate genes for further study.
Gene duplication is an important source of phenotypic change and adaptive evolution. We use a novel genomic approach to identify highly identical sequence missing from the reference genome, confirming that the cortical development gene Slit-Robo Rho GTPase activating protein 2 (SRGAP2) duplicated three times in humans. We show that the promoter and first nine exons of SRGAP2 duplicated from 1q32.1 (SRGAP2A) to 1q21.1 (SRGAP2B) ~3.4 million years ago (mya). Two larger duplications later copied SRGAP2B to chromosome 1p12 (SRGAP2C) and to proximal 1q21.1 (SRGAP2D), ~2.4 and ~1 mya, respectively. Sequence and expression analysis shows SRGAP2C is the most likely duplicate to encode a functional protein and among the most fixed human-specific duplicate genes. Our data suggest a mechanism where incomplete duplication created a novel function at birth, antagonizing parental SRGAP2 function 2–3 mya, a time corresponding to the transition from Australopithecus to Homo and the beginning of neocortex expansion.
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at http://drfast.sourceforge.net
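The di-base "color-space" encoding mentioned above can be illustrated with a short sketch. It assumes the standard SOLiD scheme in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). This is not drFAST's code: the function names are hypothetical, and the sketch simplifies by treating the first base of the sequence as the known leading base.

```python
# 2-bit base codes; the color of a dinucleotide is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colorspace(seq):
    """Return the leading base followed by one color digit per adjacent pair."""
    seq = seq.upper()
    if not seq:
        return ""
    colors = [str(BASE_CODE[a] ^ BASE_CODE[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)


def decode_colorspace(cs):
    """Rebuild the base sequence by applying each color to the previous base."""
    if not cs:
        return ""
    bases = [cs[0].upper()]
    for digit in cs[1:]:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(digit)])
    return "".join(bases)


print(encode_colorspace("ATGGCA"))   # A31031
print(decode_colorspace("A31031"))   # ATGGCA
print(decode_colorspace("A32031"))   # ATCCGT: one wrong color alters every later base
```

Because a single miscalled color changes every downstream base after decoding, mappers for this platform generally compare reads and reference in color space rather than translating reads to bases first.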
Structural variations in the chromosome 22q11.2 region mediated by non-allelic homologous recombination result in 22q11.2 deletion (del22q11.2) and 22q11.2 duplication (dup22q11.2) syndromes. The majority of del22q11.2 cases have facial and cardiac malformations, immunologic impairments, specific cognitive profile and increased risk for schizophrenia and autism spectrum disorders. The phenotype of dup22q11.2 is frequently without physical features but includes the spectrum of neurocognitive abnormalities. Although there is substantial evidence that haploinsufficiency for TBX1 plays a role in the physical features of del22q11.2, it is not known which gene(s) in the critical 1.5 Mb region are responsible for the observed spectrum of behavioral phenotypes. We identified an individual with a balanced translocation 46,XY,t(1;22)(p36.1;q11.2) and a behavioral phenotype characterized by cognitive impairment, autism and schizophrenia in the absence of congenital malformations. Using somatic cell hybrids and comparative genomic hybridization we mapped the chromosome-22 breakpoint within intron 7 of the GNB1L gene. Copy number evaluations and direct DNA sequencing of GNB1L in 271 schizophrenia and 513 autism cases revealed dup22q11.2 in two families with autism and private GNB1L missense variants in conserved residues in three families (p=0.036). The identified missense variants affect residues in the WD40 repeat domains and are predicted to have deleterious effects on the protein. Prior studies provided evidence that GNB1L may have a role in schizophrenia. Our findings support involvement of GNB1L in autism spectrum disorders as well.
22q11.2; translocation; neurodevelopmental disorders
Copy number variants (CNVs) are known to be associated with complex neuropsychiatric disorders (e.g., schizophrenia and autism) but have not been explored in the isolated features of aggressive behaviors such as intermittent explosive disorder (IED). IED is characterized by recurrent episodes of aggression in which individuals act impulsively and grossly out of proportion from the involved stressors. Previous studies have identified genetic variants in the serotonergic pathway that play a role in susceptibility to this behavior, but additional contributors have not been identified. Therefore, to further delineate possible genetic influences, we investigated CNVs in individuals diagnosed with IED and/or personality disorder (PD). We carried out array comparative genomic hybridization on 113 samples of individuals with isolated features of IED (n = 90) or PD (n = 23). We detected a recurrent 1.35-Mbp deletion on chromosome 1q21.1 in one IED subject and a novel ~350-kbp deletion on chromosome 16q22.3q23.1 in another IED subject. While five recent reports have suggested the involvement of an ~1.6-Mbp 15q13.3 deletion in individuals with behavioral problems, particularly aggression, we report an absence of such events in our study of individuals specifically selected for aggression. We did, however, detect a smaller ~430-kbp 15q13.3 duplication containing CHRNA7 in one individual with PD. While these results suggest a possible role for rare CNVs in identifying genes underlying IED or PD, further studies on a large number of well-characterized individuals are necessary.
Aggression; array CGH; genomic disorders; segmental duplication; 15q13.3
15q13.3 microdeletions are the most common genetic findings in Idiopathic Generalized Epilepsies identified to date, present in up to 1% of patients. In addition, 15q13.3 microdeletions have been described in patients with epilepsy as part of a complex neurodevelopmental phenotype. We analyzed a cohort of 570 patients with various pediatric epilepsies for 15q13.3 microdeletions. Screening was performed using quantitative polymerase chain reaction, and deletions were confirmed by array comparative genomic hybridization. We carried out detailed phenotyping of deletion carriers. In total, we identified four pediatric patients with 15q13.3 microdeletions including one previously described patient. Two of the four deletions were de novo, one was inherited from an unaffected parent, and in one patient inheritance is unknown. All four patients had absence epilepsy with various degrees of intellectual disability. We suggest that absence epilepsy accompanied by intellectual disability may represent a common phenotype of the 15q13.3 microdeletion in pediatric epilepsy patients.
Intellectual disability; IGE
Evidence for the etiology of autism spectrum disorders (ASD) has consistently pointed to a strong genetic component complicated by substantial locus heterogeneity (1, 2). We sequenced the exomes of 20 sporadic cases of ASD and their parents, reasoning that these families would be enriched for de novo mutations of major effect. We identified 21 de novo mutations, of which 11 were protein-altering. Protein-altering mutations were significantly enriched for changes at highly conserved residues. We identified potentially causative de novo events in 4/20 probands, particularly among more severely affected individuals, in FOXP1, GRIN2B, SCN1A, and LAMC3. In the FOXP1 mutation carrier, we also observed a rare inherited CNTNAP2 mutation and provide functional support for a multihit model for disease risk (3). Our results demonstrate that trio-based exome sequencing is a powerful approach for identifying novel candidate genes for ASD and suggest that de novo mutations may contribute substantially to the genetic risk for ASD.
The duplication architecture of the human genome predisposes our species to recurrent copy number variation and disease. Emerging data suggest that this mechanism of mutation contributes to both common and rare diseases. Two features regarding this form of mutation have emerged. First, common structural polymorphisms create susceptible and protective chromosomal architectures. These structural polymorphisms occur at varying frequencies in populations, leading to different susceptibility and ethnic predilection. Second, a subset of rearrangements shows extreme variability in expressivity. We propose that two types of genomic disorders may be distinguished: syndromic forms where the phenotypic features are largely invariant and those where the same molecular lesion associates with a diverse set of diagnoses including epilepsy, schizophrenia, autism, intellectual disability and congenital malformations. Copy number variation analyses of patient genomes reveal that disease type and severity may be explained by the occurrence of additional rare events and their inheritance within families. We propose that the overall burden of copy number variants creates differing sensitized backgrounds during development leading to different thresholds and disease outcomes. We suggest that the accumulation of multiple high-penetrant alleles of low frequency may serve as a more general model for complex genetic diseases, posing a significant challenge for diagnostics and disease management.
Structural variation contributes to the rich genetic and phenotypic diversity of the modern domestic dog, Canis lupus familiaris, although compared to other organisms, catalogs of canine copy number variants (CNVs) are poorly defined. To this end, we developed a customized high-density tiling array across the canine genome and used it to discover CNVs in nine genetically diverse dogs and a gray wolf.
In total, we identified 403 CNVs that overlap 401 genes, which are enriched for defense/immunity, oxidoreductase, protease, receptor, signaling molecule and transporter genes. Furthermore, we performed detailed comparisons between CNVs located within versus outside of segmental duplications (SDs) and find that CNVs in SDs are enriched for gene content and complexity. Finally, we compiled all known dog CNV regions and genotyped them with a custom aCGH chip in 61 dogs from 12 diverse breeds. These data allowed us to perform the first population genetics analysis of canine structural variation and identify CNVs that potentially contribute to breed specific traits.
Our comprehensive analysis of canine CNVs will be an important resource in genetically dissecting canine phenotypic and behavioral variation.
Psoriasis is a common inflammatory skin disease with a prevalence of 2% to 3% in Caucasians (1). In a genome-wide search for copy number variants (CNV) using a sample pooling approach we have identified a deletion comprising LCE3B and LCE3C, members of the late cornified envelope (LCE) gene cluster (2). The absence of LCE3B and LCE3C (LCE3C-LCE3B-del) is significantly associated (p=1.38E-08) with risk of psoriasis in 2,831 samples from Spain, The Netherlands, Italy and the USA, and in a family-based study (p=5.4E-04). LCE3C-LCE3B-del is tagged by rs4112788 (r^2=0.93), which is also strongly associated with psoriasis (p<6.6E-09). LCE3C-LCE3B-del shows epistatic effects with the HLA-Cw6 allele on the development of psoriasis in Dutch samples, and multiplicative effects in the other samples. LCE expression can be induced in normal epidermis by skin barrier disruption and is strongly expressed in psoriatic lesions, suggesting that compromised skin barrier function plays a role in psoriasis susceptibility.
High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.
Haplotype information is essential to the complete description and interpretation of genomes (1), genetic diversity (2) and genetic ancestry (3). Although individual human genome sequencing is increasingly routine (4), nearly all such genomes are unresolved with respect to haplotype. Here we combine the throughput of massively parallel sequencing (5) with the contiguity information provided by large-insert cloning (6) to experimentally determine the haplotype-resolved genome of a South Asian individual. A single fosmid library was split into a modest number of pools, each providing ~3% physical coverage of the diploid genome. Sequencing of each pool yielded reads overwhelmingly derived from only one homologous chromosome at any given location. These data were combined with whole-genome shotgun sequence to directly phase 94% of ascertained heterozygous single nucleotide polymorphisms (SNPs) into long haplotype blocks (N50 of 386 kilobases (kbp)). This method also facilitates the analysis of structural variation, for example, to anchor novel insertions (7, 8) to specific locations and haplotypes.
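The N50 statistic quoted above for haplotype blocks has a simple definition: it is the block length L such that blocks of length L or longer together contain at least half of the total phased sequence. A minimal sketch follows; the function name and the toy block lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the length L such that blocks of length >= L together cover
    at least half of the summed length of all blocks."""
    if not lengths:
        raise ValueError("need at least one block length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest block down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy block lengths in kbp (invented for illustration).
print(n50([2, 3, 4, 5, 6]))   # 5: the blocks of 6 and 5 cover 11 of 20 kbp
```

Unlike the mean block length, N50 is weighted by sequence content, so a large number of very short blocks does little to lower it.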