We report an algorithm to detect structural variation and indels from 1 base pair to 1 megabase pair within exome sequence datasets. Splitread uses one-end anchored placements to cluster the mappings of subsequences of unanchored ends to identify the size, content and location of variants with good specificity and high sensitivity. The algorithm discovers indels, structural variants, de novo events and copy-number polymorphic processed pseudogenes missed by other methods.
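As a rough illustration of the split-read idea described above (not the Splitread implementation itself), the following Python sketch splits an unanchored read into two halves, locates each half in a reference string by exact matching, and infers a deletion from the gap between the two placements. The function name `infer_deletion`, the exact-match search, the assumption that the breakpoint falls at the read midpoint, and the toy sequences are all assumptions made for illustration; a real tool would tolerate mismatches and cluster evidence from many reads.

```python
def infer_deletion(reference, read):
    """Toy split-read check: return (start, size) of the deletion implied by the read, or None.

    Hypothetical helper for illustration; uses exact matching on the two read halves.
    """
    half = len(read) // 2
    left, right = read[:half], read[half:]
    left_pos = reference.find(left)
    if left_pos < 0:
        return None  # left half does not map
    left_end = left_pos + len(left)
    # Look for the right half at or beyond the end of the left half
    right_pos = reference.find(right, left_end)
    if right_pos < 0:
        return None  # right half does not map downstream
    gap = right_pos - left_end
    if gap == 0:
        return None  # read matches the reference contiguously
    return (left_end, gap)


# Made-up reference: two flanks around a 12-base segment
reference = "ACGTTGCA" + "GGGCCCAAATTT" + "CTAGGATC"
# Read spanning the junction created by deleting the 12 middle bases
read = "ACGTTGCA" + "CTAGGATC"
print(infer_deletion(reference, read))
```

The returned pair gives the reference coordinate where the left half ends and the number of reference bases skipped before the right half begins.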
The lamprey (Petromyzon marinus) undergoes developmentally programmed genome rearrangements (PGRs) that mediate deletion of ~20% of germline DNA from somatic cells during early embryogenesis. This genomic differentiation of germline and soma is intriguing, because the germline plays a unique biological role wherein it must possess the ability to undergo meiotic recombination and the capacity to differentiate into every cell type. These evolutionarily indispensable functions set the germline at odds with somatic tissues, as factors that promote recombination and pluripotency can potentially disrupt genome integrity or specification of cell fate when misexpressed in somatic cell lineages (e.g. in oncogenesis). Here, we describe the development of new genomic and transcriptomic resources for lamprey and use these to identify hundreds of genes that are targeted for programmed deletion from somatic cell lineages. Transcriptome sequencing and targeted validation studies further confirm that somatically deleted genes function both in adult (meiotic) germline and in the development of primordial germ cells during embryogenesis. Inferred functional information from deleted regions indicates that developmentally programmed rearrangement serves as a (perhaps ancient) biological strategy to ensure segregation of pluripotency functions to the germline, effectively eliminating the potential for somatic misexpression.
Atrioventricular septal defects (AVSDs) are a frequent but not universal component of Down syndrome (DS), while AVSDs in otherwise normal individuals have no well-defined genetic basis. The contribution of copy number variation (CNV) to specific congenital heart disease (CHD) phenotypes including AVSD is unknown. We hypothesized that de novo CNVs on chromosome 21 might cause isolated sporadic AVSDs, and separately that CNVs throughout the genome might constitute an additional genetic risk factor for AVSD in patients with DS. We used custom oligonucleotide arrays targeted to CNV hotspots that are flanked by large duplicated segments of high sequence identity. We assayed 29 euploid and 50 DS individuals with AVSD, and compared them to general population controls. In patients with isolated sporadic AVSD we identified two large unique deletions outside of chromosome 21 not seen in the expanded set of 8,635 controls, each overlapping with larger deletions associated with similar CHD reported in the DECIPHER database. There was a small duplication in one patient with DS and AVSD. We conclude that isolated sporadic AVSDs may be occasionally associated with large de novo genomic structural variation outside of chromosome 21. The absence of CNVs on chromosome 21 in patients with isolated sporadic AVSD suggests that sub-chromosomal duplications or deletions of greater than 150 kbp on chromosome 21 do not cause sporadic AVSDs. Large CNVs do not appear to be an additive risk factor for AVSD in the DS population.
Down syndrome; atrioventricular septal defects; copy number variation; array CGH; congenital heart disease
The 17q21.31 inversion polymorphism exists either as direct (H1) or inverted (H2) haplotypes with differential predispositions to disease and selection. We investigated its genetic diversity in 2700 individuals with an emphasis on African populations. We characterize eight structural haplotypes that vary in size from 1.08 to 1.49 Mbp as a result of complex rearrangements and provide evidence for a 30 kbp H1/H2 double recombination event. We show that recurrent partial duplications of the KANSL1 (previously known as KIAA1267) gene have occurred on both H1 and H2 haplotypes and risen to high frequency in European populations. We identify a likely ancestral H2 haplotype (H2′) lacking these duplications, enriched among African hunter-gatherer groups yet essentially absent from West African populations. While H1 and H2 segmental duplications arose independently and prior to the human migration out of Africa, they have reached high frequencies recently among Europeans either due to extraordinary genetic drift or selective sweeps.
Analysis of cell-free fetal DNA in maternal plasma holds great promise for the development of non-invasive prenatal genetic diagnostics. However, previous studies have been restricted to detection of fetal trisomies (1, 2) or specific, paternally inherited mutations (3), or to genotyping common polymorphisms using invasively sampled material (4). Here, we combine genome sequencing of two parents, genome-wide maternal haplotyping (5), and deep sequencing of maternal plasma to non-invasively determine the genome sequence of a human fetus at 18.5 weeks gestation. Inheritance was predicted at 2.8 × 10^6 parentally heterozygous sites with 98.1% accuracy. Furthermore, 39 of 44 de novo point mutations in the fetal genome were detected, albeit with limited specificity. Subsampling these data and analyzing a second family trio by the same approach indicate that ~300 kilobase parental haplotype blocks combined with shallow sequencing of maternal plasma are sufficient to substantially determine the inherited complement of a fetal genome. However, ultra-deep sequencing of maternal plasma is necessary for the practical detection of fetal de novo mutations genome-wide. Although technical and analytical challenges remain, we anticipate that non-invasive analysis of inherited variation and de novo mutations in fetal genomes will facilitate the comprehensive prenatal diagnosis of both recessive and dominant Mendelian disorders.
Rare copy number variants (CNVs) – deletions and duplications – have recently been established as important risk factors for both generalized and focal epilepsies. A systematic assessment of the role of CNVs in epileptic encephalopathies, the most devastating and often etiologically obscure group of epilepsies, has not been performed.
We evaluated 315 patients with epileptic encephalopathies characterized by epilepsy and progressive cognitive impairment for rare CNVs using a high-density, exon-focused whole-genome oligonucleotide array.
We found that 25/315 (7.9%) of our patients carried rare CNVs that may contribute to their phenotype, with at least half being clearly or likely pathogenic. We identified two patients with overlapping deletions at 7q21 and two patients with identical duplications of 16p11.2. In our cohort, large deletions were enriched in affected individuals compared to controls, and four patients harbored two rare CNVs. We screened two novel candidate genes found within the rare CNVs in our cohort but found no mutations in our patients with epileptic encephalopathies. We highlight several additional novel candidate genes located in CNV regions.
Our data highlight the significance of rare copy number variants in the epileptic encephalopathies, and we suggest that CNV analysis should be considered in the genetic evaluation of these patients. Our findings also highlight novel candidate genes for further study.
Gene duplication is an important source of phenotypic change and adaptive evolution. We use a novel genomic approach to identify highly identical sequence missing from the reference genome, confirming that the cortical development gene Slit-Robo Rho GTPase activating protein 2 (SRGAP2) duplicated three times in humans. We show that the promoter and first nine exons of SRGAP2 duplicated from 1q32.1 (SRGAP2A) to 1q21.1 (SRGAP2B) ~3.4 million years ago (mya). Two larger duplications later copied SRGAP2B to chromosome 1p12 (SRGAP2C) and to proximal 1q21.1 (SRGAP2D), ~2.4 and ~1 mya, respectively. Sequence and expression analysis shows SRGAP2C is the most likely duplicate to encode a functional protein and among the most fixed human-specific duplicate genes. Our data suggest a mechanism where incomplete duplication created a novel function at birth, antagonizing parental SRGAP2 function 2–3 mya, a time corresponding to the transition from Australopithecus to Homo and the beginning of neocortex expansion.
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at http://drfast.sourceforge.net
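To make the di-base 'color-space' encoding mentioned above concrete, here is a minimal Python sketch of the standard SOLiD scheme, in which each pair of adjacent bases maps to one of four colors (equivalently, the XOR of 2-bit base codes). This is not drFAST code; the function names `encode_colorspace` and `decode_colorspace` are hypothetical, and the sketch assumes a non-empty sequence over A, C, G and T.

```python
# 2-bit codes for each base; the color of a dinucleotide is the XOR of its two codes
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colorspace(seq):
    """Return the first base followed by one color (0-3) per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_colorspace(cs):
    """Invert encode_colorspace by walking the colors from the known first base."""
    bases = [cs[0]]
    for color in cs[1:]:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


encoded = encode_colorspace("ATGGCA")
print(encoded)                     # A31031
print(decode_colorspace(encoded))  # ATGGCA
```

Because each decoded base depends on the previous one, a single miscalled color changes every base downstream of it when decoding, which is one reason color-space reads are usually aligned in color space rather than translated first.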
Structural variations in the chromosome 22q11.2 region mediated by non-allelic homologous recombination result in 22q11.2 deletion (del22q11.2) and 22q11.2 duplication (dup22q11.2) syndromes. The majority of del22q11.2 cases have facial and cardiac malformations, immunologic impairments, specific cognitive profile and increased risk for schizophrenia and autism spectrum disorders. The phenotype of dup22q11.2 is frequently without physical features but includes the spectrum of neurocognitive abnormalities. Although there is substantial evidence that haploinsufficiency for TBX1 plays a role in the physical features of del22q11.2, it is not known which gene(s) in the critical 1.5 Mb region are responsible for the observed spectrum of behavioral phenotypes. We identified an individual with a balanced translocation 46,XY,t(1;22)(p36.1;q11.2) and a behavioral phenotype characterized by cognitive impairment, autism and schizophrenia in the absence of congenital malformations. Using somatic cell hybrids and comparative genomic hybridization we mapped the chromosome-22 breakpoint within intron 7 of the GNB1L gene. Copy number evaluations and direct DNA sequencing of GNB1L in 271 schizophrenia and 513 autism cases revealed dup22q11.2 in two families with autism and private GNB1L missense variants in conserved residues in three families (p=0.036). The identified missense variants affect residues in the WD40 repeat domains and are predicted to have deleterious effects on the protein. Prior studies provided evidence that GNB1L may have a role in schizophrenia. Our findings support involvement of GNB1L in autism spectrum disorders as well.
22q11.2; translocation; neurodevelopmental disorders
The human genome is a highly dynamic structure that shows a wide range of genetic polymorphic variation. Unlike other types of structural variation, little is known about inversion variants within normal individuals because such events are typically balanced and are difficult to detect and analyze by standard molecular approaches. Using sequence-based, cytogenetic and genotyping approaches, we characterized six large inversion polymorphisms that map to regions associated with genomic disorders with complex segmental duplications mapping at the breakpoints. We developed a metaphase FISH-based assay to genotype inversions and analyzed the chromosomes of 27 individuals from three HapMap populations. In this subset, we find that these inversions are less frequent or absent in Asians when compared with European and Yoruban populations. Analyzing multiple individuals from outgroup species of great apes, we show that most of these large inversion polymorphisms are specific to the human lineage with two exceptions, 17q21.31 and 8p23 inversions, which are found to be similarly polymorphic in other great ape species and where the inverted allele represents the ancestral state. Investigating linkage disequilibrium relationships with genotyped SNPs, we provide evidence that most of these inversions appear to have arisen on at least two different haplotype backgrounds. In these cases, discovery and genotyping methods based on SNPs may be confounded and molecular cytogenetics remains the only method to genotype these inversions.
Copy number variants (CNVs) are known to be associated with complex neuropsychiatric disorders (e.g., schizophrenia and autism) but have not been explored in the isolated features of aggressive behaviors such as intermittent explosive disorder (IED). IED is characterized by recurrent episodes of aggression in which individuals act impulsively and grossly out of proportion to the involved stressors. Previous studies have identified genetic variants in the serotonergic pathway that play a role in susceptibility to this behavior, but additional contributors have not been identified. Therefore, to further delineate possible genetic influences, we investigated CNVs in individuals diagnosed with IED and/or personality disorder (PD). We carried out array comparative genomic hybridization on 113 samples of individuals with isolated features of IED (n = 90) or PD (n = 23). We detected a recurrent 1.35-Mbp deletion on chromosome 1q21.1 in one IED subject and a novel ~350-kbp deletion on chromosome 16q22.3q23.1 in another IED subject. While five recent reports have suggested the involvement of an ~1.6-Mbp 15q13.3 deletion in individuals with behavioral problems, particularly aggression, we report an absence of such events in our study of individuals specifically selected for aggression. We did, however, detect a smaller ~430-kbp 15q13.3 duplication containing CHRNA7 in one individual with PD. While these results suggest a possible role for rare CNVs in identifying genes underlying IED or PD, further studies on a large number of well-characterized individuals are necessary.
Aggression; array CGH; genomic disorders; segmental duplication; 15q13.3
15q13.3 microdeletions are the most common genetic findings in Idiopathic Generalized Epilepsies identified to date, present in up to 1% of patients. In addition, 15q13.3 microdeletions have been described in patients with epilepsy as part of a complex neurodevelopmental phenotype. We analyzed a cohort of 570 patients with various pediatric epilepsies for 15q13.3 microdeletions. Screening was performed using quantitative polymerase chain reaction, and deletions were confirmed by array comparative genomic hybridization. We carried out detailed phenotyping of deletion carriers. In total, we identified four pediatric patients with 15q13.3 microdeletions, including one previously described patient. Two of the four deletions were de novo, one was inherited from an unaffected parent, and in one patient inheritance is unknown. All four patients had absence epilepsy with various degrees of intellectual disability. We suggest that absence epilepsy accompanied by intellectual disability may represent a common phenotype of the 15q13.3 microdeletion in pediatric epilepsy patients.
Intellectual disability; IGE
Evidence for the etiology of autism spectrum disorders (ASD) has consistently pointed to a strong genetic component complicated by substantial locus heterogeneity1,2. We sequenced the exomes of 20 sporadic cases of ASD and their parents, reasoning that these families would be enriched for de novo mutations of major effect. We identified 21 de novo mutations, of which 11 were protein-altering. Protein-altering mutations were significantly enriched for changes at highly conserved residues. We identified potentially causative de novo events in 4/20 probands, particularly among more severely affected individuals, in FOXP1, GRIN2B, SCN1A, and LAMC3. In the FOXP1 mutation carrier, we also observed a rare inherited CNTNAP2 mutation and provide functional support for a multihit model for disease risk3. Our results demonstrate that trio-based exome sequencing is a powerful approach for identifying novel candidate genes for ASD and suggest that de novo mutations may contribute substantially to the genetic risk for ASD.
The duplication architecture of the human genome predisposes our species to recurrent copy number variation and disease. Emerging data suggest that this mechanism of mutation contributes to both common and rare diseases. Two features regarding this form of mutation have emerged. First, common structural polymorphisms create susceptible and protective chromosomal architectures. These structural polymorphisms occur at varying frequencies in populations, leading to different susceptibility and ethnic predilection. Second, a subset of rearrangements shows extreme variability in expressivity. We propose that two types of genomic disorders may be distinguished: syndromic forms where the phenotypic features are largely invariant and those where the same molecular lesion associates with a diverse set of diagnoses including epilepsy, schizophrenia, autism, intellectual disability and congenital malformations. Copy number variation analyses of patient genomes reveal that disease type and severity may be explained by the occurrence of additional rare events and their inheritance within families. We propose that the overall burden of copy number variants creates differing sensitized backgrounds during development leading to different thresholds and disease outcomes. We suggest that the accumulation of multiple high-penetrant alleles of low frequency may serve as a more general model for complex genetic diseases, posing a significant challenge for diagnostics and disease management.
Structural variation contributes to the rich genetic and phenotypic diversity of the modern domestic dog, Canis lupus familiaris, although compared to other organisms, catalogs of canine copy number variants (CNVs) are poorly defined. To this end, we developed a customized high-density tiling array across the canine genome and used it to discover CNVs in nine genetically diverse dogs and a gray wolf.
In total, we identified 403 CNVs that overlap 401 genes, which are enriched for defense/immunity, oxidoreductase, protease, receptor, signaling molecule and transporter genes. Furthermore, we performed detailed comparisons between CNVs located within versus outside of segmental duplications (SDs) and find that CNVs in SDs are enriched for gene content and complexity. Finally, we compiled all known dog CNV regions and genotyped them with a custom aCGH chip in 61 dogs from 12 diverse breeds. These data allowed us to perform the first population genetics analysis of canine structural variation and identify CNVs that potentially contribute to breed specific traits.
Our comprehensive analysis of canine CNVs will be an important resource in genetically dissecting canine phenotypic and behavioral variation.
Psoriasis is a common inflammatory skin disease with a prevalence of 2% to 3% in Caucasians1. In a genome-wide search for copy number variants (CNV) using a sample pooling approach we have identified a deletion comprising LCE3B and LCE3C, members of the late cornified envelope (LCE) gene cluster2. The absence of LCE3B and LCE3C (LCE3C-LCE3B-del) is significantly associated (p=1.38E-08) with risk of psoriasis in 2,831 samples from Spain, The Netherlands, Italy and the USA, and in a family-based study (p=5.4E-04). LCE3C-LCE3B-del is tagged by rs4112788 (r²=0.93), which is also strongly associated with psoriasis (p<6.6E-09). LCE3C-LCE3B-del shows epistatic effects with the HLA-Cw6 allele on the development of psoriasis in Dutch samples, and multiplicative effects in the other samples. LCE expression can be induced in normal epidermis by skin barrier disruption and is strongly expressed in psoriatic lesions, suggesting that compromised skin barrier function plays a role in psoriasis susceptibility.
High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.
Haplotype information is essential to the complete description and interpretation of genomes1, genetic diversity2 and genetic ancestry3. Although individual human genome sequencing is increasingly routine4, nearly all such genomes are unresolved with respect to haplotype. Here we combine the throughput of massively parallel sequencing5 with the contiguity information provided by large-insert cloning6 to experimentally determine the haplotype-resolved genome of a South Asian individual. A single fosmid library was split into a modest number of pools, each providing ~3% physical coverage of the diploid genome. Sequencing of each pool yielded reads overwhelmingly derived from only one homologous chromosome at any given location. These data were combined with whole-genome shotgun sequence to directly phase 94% of ascertained heterozygous single nucleotide polymorphisms (SNPs) into long haplotype blocks (N50 of 386 kilobases (kbp)). This method also facilitates the analysis of structural variation, for example, to anchor novel insertions7,8 to specific locations and haplotypes.
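For readers unfamiliar with the N50 statistic quoted above, a minimal Python sketch follows. N50 is the block length such that blocks of that length or longer contain at least half of the total length. The function name `n50` and the block lengths are made up for illustration and are not data from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Accumulate from the longest block down until half the total is reached
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


blocks_kbp = [500, 386, 300, 200, 100, 50]  # made-up haplotype block lengths
print(n50(blocks_kbp))  # 386
```

Here the total is 1,536 kbp; the two longest blocks together span 886 kbp, which exceeds half the total, so the N50 is the length of the second block.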
Microdeletions and microduplications encompassing a ~593-kb region of 16p11.2 have been implicated as one of the most common genetic causes of susceptibility to autism/autism spectrum disorder (ASD). We report 45 microdeletions and 32 microduplications of 16p11.2, representing 0.78% of 9,773 individuals referred to our laboratory for microarray-based comparative genomic hybridization (aCGH) testing for neurodevelopmental and congenital anomalies. The microdeletion was de novo in 17 individuals and maternally inherited in five individuals for whom parental testing was available. Detailed histories of 18 individuals with 16p11.2 microdeletions were reviewed; all had developmental delays with below-average intelligence, and a majority had speech or language problems or delays and various behavioral problems. Of the 16 individuals old enough to be evaluated for autism, the speech/behavior profiles of seven did not suggest the need for ASD evaluation. Of the remaining nine individuals who had speech/behavior profiles that aroused clinical suspicion of ASD, five had formal evaluations, and three had PDD-NOS. Of the 19 microduplications with parental testing, five were de novo, nine were maternally inherited, and five were paternally inherited. A majority with the microduplication had delayed development and/or specific deficits in speech or language, though these features were not as consistent as seen with the microdeletions. This study, which is the largest cohort of individuals with 16p11.2 alterations reported to date, suggests that 16p11.2 microdeletions and microduplications are associated with a high frequency of cognitive, developmental, and speech delay and behavior abnormalities. Furthermore, although features associated with these alterations can be found in individuals with ASD, additional factors are likely required to lead to the development of ASD.
Array CGH; 16p11.2; Microdeletion; Microduplication; Autism; ASD
Little is known about genes that underlie isolated single suture craniosynostosis. In this study, we hypothesize that rare copy number variants in patients with isolated single suture craniosynostosis contain genes important for cranial development. Using whole genome array comparative genomic hybridization (CGH), we evaluated DNA from 187 individuals with single suture craniosynostosis for submicroscopic deletions and duplications. We identified a 1.1-Mb duplication encompassing RUNX2 in two affected cousins with metopic synostosis and hypodontia. Given that RUNX2 is required as a master switch for osteoblast differentiation and interacts with TWIST1, mutations in which also cause craniosynostosis, we conclude that the duplication in this family is pathogenic, albeit with reduced penetrance. In addition, we find that a total of 7.4% of individuals with single-suture synostosis in our series have at least one rare deletion or duplication that contains genes and that has not been previously reported in unaffected individuals. The genes within and disrupted by copy number variants in this cohort are potential novel candidate genes for craniosynostosis.
craniosynostosis; copy number variant; array comparative genomic hybridization; RUNX2
Understanding the prevailing mutational mechanisms responsible for human genome structural variation requires uniformity in the discovery of allelic variants and precision in terms of breakpoint delineation. We develop a resource based on capillary end-sequencing of 13.8 million fosmid clones from 17 human genomes and characterize the complete sequence of 1,054 large structural variants corresponding to 589 deletions, 384 insertions, and 81 inversions. We analyze the 2,081 breakpoint junctions and infer potential mechanisms of origin. Three mechanisms account for the bulk of germline structural variation: microhomology-mediated processes involving short (2–20 bp) stretches of sequence (28%), non-allelic homologous recombination (NAHR) (22%) and L1 retrotransposition (19%). The high quality and long-range continuity of the sequence reveal more complex mutational mechanisms, including repeat-mediated inversions and gene conversion, that are most often missed by other methods including comparative genomic hybridization, SNP microarrays and next-generation sequencing.
Motivation: In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies were published to characterize short insertions, deletions, duplications and inversions, and associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data; however, the ‘detectable’ sequence length with read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched.
Results: We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by the next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.), thus it requires significantly fewer computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing with the insertions discovered in the same genome using various sources of sequence data.
Availability: The implementation of the NovelSeq pipeline is available at http://compbio.cs.sfu.ca/strvar.htm
Copy number variants affect both disease and normal phenotypic variation, but those lying within heavily duplicated, highly identical sequence have been difficult to assay. By analyzing short-read mapping depth for 159 human genomes, we demonstrated accurate estimation of absolute copy number for duplications as small as 1.9 kilobase pairs, ranging from 0 to 48 copies. We identified 4.1 million “singly unique nucleotide” positions informative in distinguishing specific copies and used them to genotype the copy and content of specific paralogs within highly duplicated gene families. These data identify human-specific expansions in genes associated with brain development, reveal extensive population genetic diversity, and detect signatures consistent with gene conversion in the human species. Our approach makes ~1000 genes accessible to genetic studies of disease association.
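The basic relationship between read depth and copy number can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method used in the study: it scales a region's mean depth by the median depth of control regions assumed to be diploid, and it omits corrections (for example for GC content) that a real analysis would likely need. The function name `estimate_copy_number` and all depth values are hypothetical.

```python
from statistics import median


def estimate_copy_number(region_depth, control_depths, control_copies=2):
    """Scale a region's mean read depth by the median depth of control regions of known copy number."""
    baseline = median(control_depths)
    if baseline <= 0:
        raise ValueError("control depth must be positive")
    return control_copies * region_depth / baseline


controls = [29.0, 30.0, 31.0, 30.5, 29.5]  # made-up depths for diploid control regions
for depth in (0.0, 15.0, 30.0, 90.0):
    # Round the continuous estimate to the nearest integer copy number
    print(depth, round(estimate_copy_number(depth, controls)))
```

With a control median of 30x, a region at 15x is estimated at one copy and a region at 90x at six copies.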