From previous genetic studies, we expected that de novo mutation plays a large role in autism incidence and introduces variation that is short-lived in the human gene pool because such variation is deleterious and highly penetrant. Sequencing reveals the type and rates of small-scale mutation and pinpoints the responsible gene targets more definitively than does copy number or karyotypic analysis. Our study is a partial confirmation of our expectations, provides sources and rates of some classes of mutation, and strengthens the notion that a convergent set of events might explain a good portion of autism: a class of neuronal genes, defined empirically as FMRP-associated genes, overlaps significantly with autism target genes.
Our data set is the largest set of family exome data to be reported so far, and it is derived from whole-blood DNA to avoid the perils of immortalized cell lines. While we focused on the role of de novo mutation of different types in autistic spectrum disorders, we have looked at additional questions related to new mutation. We project overall rates of de novo mutation to be 120 per diploid genome per birth. Most small-scale de novo mutation comes from fathers, and is related to parental age. Per event (and probably en masse), missense mutations have far less impact on the individual than do gene-disrupting mutations such as nonsense, splice variants, and frame shifts. This is evident both from the overall differential in de novo mutations and from the effects of purifying selection on sets of genes.
Differential Signal from De Novo Missense and Gene-Disrupting Mutations
Missense mutation should contribute to autism to some degree, as gene function can be severely altered by single-amino-acid substitutions. However, we see no statistical evidence in our work for the hypothesis that de novo missense mutations contribute to autism. The number of de novo missense events we observe is not greater in probands than in siblings. Moreover, the ratio of numbers of missense mutations in probands to siblings is not significantly different from the observed ratio of numbers of synonymous mutations. Even when we filter for genes expressed in brain, count missense mutations that cause nonconservative amino acid changes, or count missense mutations at positions conserved among vertebrates (Table S1, columns BA–BJ), we see no statistical evidence for contribution from this type of mutation. This is also true when we look for overlap of de novo missense mutations with FMRP-associated genes. The lack of signal is not attributable to the type of population we study, as we observe de novo copy number imbalance of the expected magnitude in this very same population (Levy et al., 2011; Sanders et al., 2011). But given the size of the population and background mutation rate, we are unable to find signal in the present study. A simple power calculation indicates that we cannot rule out confidently even a 20% contribution to autism from de novo missense mutation. Despite these caveats, it is worth considering that de novo mutation causing merely amino acid substitution may only rarely create a dominant allele of strong effect.
We make a strikingly different observation for mutations that are likely to disrupt gene function. In contrast to de novo missense mutation, we do get signal from de novo mutations likely to severely disrupt coding: mutations at splice sites, nonsense mutations, and small indels, particularly indels that cause frame shifts. We observe 59 likely gene disruptions (LGD) in affected and 28 in siblings, a ratio of two to one. We note that girls on the autistic spectrum have a higher rate (9/29) than boys (50/314), a bias we have previously noted for de novo CNV events. The total contribution from LGD mutations can be estimated as 31 events in 343 families (59 events in probands minus 28 events in siblings), or roughly 10% of affected children.
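The arithmetic behind these figures can be rechecked with a short sketch. It uses only the counts quoted in the paragraph above; the variable names are illustrative and are not taken from the study's own analysis code.

```python
# Counts quoted in the text (343 families, one proband each).
families = 343
lgd_probands, lgd_siblings = 59, 28
girls_hit, girls_total = 9, 29
boys_hit, boys_total = 50, 314

# Internal consistency of the quoted counts.
assert girls_hit + boys_hit == lgd_probands
assert girls_total + boys_total == families

# Excess of likely gene disruptions (LGD) in probands over siblings.
excess_events = lgd_probands - lgd_siblings
excess_fraction = excess_events / families

# Proband-to-sibling ratio and per-sex rates in probands.
proband_to_sibling_ratio = lgd_probands / lgd_siblings
girl_rate = girls_hit / girls_total
boy_rate = boys_hit / boys_total

print(f"excess events: {excess_events}")
print(f"excess as fraction of families: {excess_fraction:.3f}")
print(f"proband/sibling ratio: {proband_to_sibling_ratio:.2f}")
print(f"LGD rate, girls: {girl_rate:.2f}; boys: {boy_rate:.2f}")
```

The excess works out to about 9% of families, which the text rounds to roughly 10%.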
Germline Origin: Rates, Parental Age, and Paternal Variation
We observed de novo point mutations in children at the rate expected from other studies (Awadalla et al., 2010; Conrad et al., 2011), about 120 point mutations per genome per generation. We observe that the frequency of de novo mutation is dependent on parental age, and know this with a high degree of statistical certainty. This observation is in keeping with, and potentially explains, other studies that have shown increased incidence of certain genetic disorders in the progeny of older parents, including ASD (Saha et al., 2009).
From sequencing adjacent linked polymorphisms in children and parents, we infer that on the order of 3/4 of new point mutations (50 of 67) derive from the father’s germline. Although we have less data, this conclusion holds as well for de novo small indels (6 of 7). These data confirm the paternal line is the main source for these types of new human variation. The data also indicate that the majority of the de novo calls in this study are not somatic in origin, but occur prior to conception. We infer this by assuming that after zygote formation, the mother’s and father’s genomes are equally vulnerable to subsequent somatic mutation. By contrast, a previous study indicated that for de novo copy number variation both parents contribute almost equally (Sanders et al., 2011).
We observe very few cases where two siblings share the same de novo mutation, about one for every fifty occurrences, suggesting that the parent is rarely a broad mosaic. However, this conclusion could reflect ascertainment bias, because our operational identification of “de novo” precludes observing the mutation in the parent at levels higher than expected from sequencing error. As presented, we do observe some evidence of parental mosaicism, and this is a subject of ongoing scrutiny using enhanced statistical modeling and validation.
Total Contribution from De Novo Mutation
Finding the correct contribution from each genetic mechanism is critical for understanding the nature of the factors causing autistic spectrum disorders. Adding the 6% differential for large-scale de novo copy number mutation previously observed (Levy et al., 2011; Sanders et al., 2011) to the 10% differential for LGDs, we reach a total differential of 16% between affected children and siblings. This is far less than our predictions, based on modeling the AGRE population (Zhao et al., 2007), that causal de novo mutations would occur in about 50% of the SSC. This gap could be attributable to having modeled a more severely affected population. The SSC is skewed to higher functioning cases with a male to female ratio of 6:1 (Fischbach and Lord, 2010), so there may be more borderline cases in that collection than in the AGRE collection (male to female ratio of 3:1), from which we built our model (Zhao et al., 2007).
But our differential must underestimate the contribution from de novo events. First, we use extremely stringent criteria meant to eliminate false positives, and we fail to detect many true positives as a consequence. Second, even among the de novo events we do observe, we may be missing gene-disruptive events, for example, mutations outside the consensus that disrupt splicing and in-frame indels that disrupt the spacing of the peptide backbone. We could also easily miss even a 5% differential from de novo missense mutation in a study of this size, given the high background rate of neutral missense mutation. Third, our coverage of the genome is incomplete. Some of this arises by chance, and some is systematic due to the exome capture reagents or errors in the reference genome. Fourth, large classes of mutations are eliminated by our filters, such as those that originate in a parent who is a mosaic, and in children who suffer somatic mutation early after zygote formation. Fifth, there are biases in correctly mapping reads covering regions of the genome that are highly rearranged in the child. Sixth, we have not implemented tools that can reliably detect large indels and rearrangements. Our present tool is efficient only for small indels, less than seven base pairs. Seventh, an entire class of events involving repetitive elements is presently unexplored by us because we currently demand that reads have unique mappings. Eighth, we make calls from only coding regions and thus are not able to assess noncoding events that might affect RNA expression or processing. From all these presently hidden sources, the contribution of de novo mutation could easily double or more.
While there is still a gap between the incidence of de novo gene-disrupting events and our expectations from population analysis, especially in males, this gap may yet be filled by deeper coverage, more refined genomic tools, and whole-genome sequencing. Interpretation of a richer data set will undoubtedly require a greater understanding of biology, such as the role for noncoding RNAs and how transcript expression and processing are controlled. By contrast, the differential incidence of de novo mutation in females is very strong and, from CNV and exome sequencing data, runs at nearly twice the differential seen in males.
Transmission Genetics and Gene Dosage
We find almost no evidence of a role for transmission genetics. We do not think the present study of only 343 families would display statistical evidence for any of the plausible models of contribution from transmission. Such studies will require greater power, and previous larger copy number studies of the SSC have found such evidence (Levy et al., 2011). There is, however, a weak signal from the increased ratio of compound heterozygotes of rare coding variants in probands to siblings (242 versus 224). This would be consistent with a 5% contribution from this genetic mechanism, but is also consistent with virtually no contribution (p value = 0.4). We can virtually rule out that such events are contributory in more than 20% of children on the spectrum. Fortunately, even a modestly larger study will resolve the strength of contribution from this source.
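To illustrate why a 242 versus 224 split is only a weak signal, the sketch below runs an exact two-sided binomial (sign) test of the null hypothesis that such events are equally likely in probands and siblings. This choice of test is an assumption made for illustration and is not necessarily the one used in the study; the function name is hypothetical.

```python
from math import comb

def two_sided_sign_test(k, n):
    """Exact two-sided binomial p value for k successes in n trials at p = 0.5.

    Assumption: under the null, each event falls in a proband or a sibling
    with equal probability, independently of the others.
    """
    upper = max(k, n - k)
    # Number of outcomes at least as extreme as observed, in the upper tail.
    tail = sum(comb(n, i) for i in range(upper, n + 1))
    # Double the tail by symmetry; cap at 1 for the k == n / 2 case.
    return min(1.0, 2 * tail / 2 ** n)

probands, siblings = 242, 224
p_value = two_sided_sign_test(probands, probands + siblings)
print(f"two-sided p value: {p_value:.2f}")
```

Under this simple model the p value comes out close to the 0.4 reported in the text, so the observed excess is well within what chance alone would produce.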
We do not find evidence of compound heterozygosity at the vast majority of loci where one allele was hit by a disruptive mutation. These events are thus likely to have high impact by altering gene dosage, although we cannot rule out at present that the mutant allele acts by dominant interference.
Individual Vulnerability to New Mutation
Conceptually, any individual of a given genetic lineage has a “vulnerability” to a disorder caused by new mutation in that lineage. We can speak of the “naive genetic lineage” of the zygote as that which is inherited from the grandparents before the action of any mutation acquired during passage through the parental germline. We then define the number of individual vulnerability genes as the number of genes which if disrupted (either in the parental germline or by early somatic mutation after the zygote is formed) will result in the development of the disorder. The size of individual vulnerability is not the same as the target size of autism genes because the former depends on genetic background and future history. Children do not necessarily have the same set of vulnerability genes. The average individual vulnerability over a population can be measured from the ratio of number of de novo LGD events in probands and siblings, as follows.
We will solve for the general case. Assume the rate for a given mutation class in unaffecteds is $R$, and the rate in probands is $AR$. In a population of size $P$, roughly $RP$ mutations of that class will occur, neglecting the small surplus coming from the small number of affected individuals. The number of affected individuals will be $P/N$, where $1/N$ is the incidence in the population. Thus, $ARP/N$ mutations of the class will be found in affecteds. $RP/N$ of these will be present by chance and not contributory, whereas $(A-1)RP/N$ events are contributory. Thus the proportion of all de novo mutations in a population of size $P$ that contribute to the condition is

$$S = \frac{(A-1)RP/N}{RP} = \frac{A-1}{N}.$$

$S$ is the probability that a de novo mutation of the particular class will contribute to the condition, and $S$ is a function only of $A$ and $N$.
If each of $G$ total genes had a uniform probability of being a target for a de novo mutation, and $T$ was the mean number of vulnerability genes per affected, and mutations of the class were completely penetrant, we also have $S = T/G$, so that $T = G(A-1)/N$.
Now, for LGD in autism, taking $N = 150$, $A = 2$, and $G = 25{,}000$, we can compute the average individual vulnerability per child as 167 genes.
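The estimate can be reproduced numerically. The sketch below assumes the relations implied by the argument above, namely that the contributory proportion is S = (A - 1)/N and that S = T/G under uniform targeting and complete penetrance; the function and parameter names are illustrative only.

```python
def individual_vulnerability(inverse_incidence, rate_ratio, total_genes):
    """Mean number of vulnerability genes per affected individual, T.

    inverse_incidence: N, where 1/N is the incidence of the condition
    rate_ratio:        A, mutation rate in probands relative to unaffecteds
    total_genes:       G, genes assumed to be equally likely mutation targets
    """
    # Proportion of de novo mutations of this class that are contributory.
    s = (rate_ratio - 1) / inverse_incidence
    # With uniform targeting and complete penetrance, S = T / G.
    return s * total_genes

t = individual_vulnerability(inverse_incidence=150, rate_ratio=2, total_genes=25_000)
print(round(t))  # 167
```

Because T scales linearly with A - 1, the estimate is sensitive to the observed proband-to-sibling ratio: a ratio of 1.5 instead of 2 would halve it.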
This of course is only a crude argument because genes do not have a uniform mutation rate, and not every LGD in a target gene will have complete penetrance. Nevertheless we make note that the size of individual vulnerability appears to be roughly half the target size of all autism genes (see last section of the Discussion).
Other than NRXN1, we did not see any genes among the detected de novo LGD targets that had been conclusively linked to ASD (independent of FMR1 association), although CTTNBP2 (encoding a cortactin-binding protein) was suggested as a potential candidate for the autism susceptibility locus (AUTS1) at 7q31 (Cheung et al., 2001). We now provide evidence, based on a de novo 2 bp frame shift deletion, that mutations in CTTNBP2 may cause ASD. In addition, a number of other candidates stood out as being potentially causal due to a combination of provocative expression patterns, known roles in human disease, and suggestive mouse mutant phenotypes. Among these was RIMS1, a Ras superfamily member necessary for presynaptic long-term potentiation (Castillo et al., 2002). A targeted Rims1 mutation in the mouse leads to increased postsynaptic density and impaired associative learning as well as memory and cognition deficits (Powell et al., 2004; Schoch et al., 2002), and the frame shift allele we found may lead to a similarly severe condition. Another intriguing candidate was the serine/threonine-specific protein kinase DYRK1A, which is located within the Down syndrome critical region of chromosome 21 and believed to underlie at least some of the pathogenesis of Down syndrome as a consequence of increased dosage. Several reports describe likely inactivating mutations in DYRK1A that result in symptoms including developmental delay, behavioral problems, impaired speech and mental retardation (Møller et al., 2008; van Bon et al., 2011), and a heterozygous knockout in the mouse also led to developmental delay and increased neuronal densities (Fotaki et al., 2002). Truncating mutations in ZFYVE26 (encoding a zinc finger protein) are known to cause autosomal recessive spastic paraplegia-15, consisting of lower limb spasticity, cognitive deterioration, axonal neuropathy and white matter abnormalities (Hanein et al., 2008). It is possible that a heterozygous truncating mutation such as the de novo frame shift allele found in our study might cause a less severe version of this condition resulting in an ASD diagnosis. Other de novo mutations of interest were a 4 bp deletion in DST (encoding the basement membrane glycoprotein dystonin), which is associated with FMRP (Darnell et al., 2011) and produces a neurodegeneration phenotype when inactivated in the mouse, and a nonsense mutation in ANK2 (an ankyrin protein involved in synaptic stability [Koch et al., 2008]). A nonsense mutation in UNC80 has been linked to control of “slow” neuronal excitability (Lu et al., 2010).
We also note that thirteen of the 59 LGD candidates appear to be involved in either transcription regulation or chromatin remodeling. Among the latter are three proteins involved in epigenetic modification of histones: ASH1L, a histone H3/H4 methyltransferase that activates transcription (Gregory et al., 2007); KDM6B, a histone H3 demethylase implicated in multiple developmental processes (Swigut and Wysocka, 2007); and MLL5, a histone H3 methyltransferase involved in cell lineage determination (Fujiki et al., 2009). These three are also FMRP-associated genes.
Relation of Candidate Genes to FMRP-Associated Genes
Fragile X syndrome (FXS) is one of the most common genetic causes of intellectual disability, with up to 90% of affected children exhibiting autistic symptoms. This has suggested overlaying recent understanding of FXS biology onto candidate ASD genes (Darnell et al., 2011). The FMR1 gene is expressed in neurons and controls the translation of many products. A set of 842 FMRP-associated genes has been enumerated by cross-linking, immunoprecipitation, and high-throughput sequencing (HITS-CLIP), and this set was previously noted to overlap candidate genes from de novo CNVs (Darnell et al., 2011). Hence, we compared the list of FMRP-associated genes with our lists of 59 LGD targets and 72 most likely autism candidate genes from de novo CNVs, and found a remarkable overlap: 14 and 13 with one in common, thus 26/129, with a p value of $10^{-13}$ determined on a per gene basis (842 FMRP-associated genes out of 25,000 genes). This overlap is remarkable because half of the LGD targets should not be ASD related, and probably a similar number of the most likely CNV genes. We found no unusual overlap between the FMRP-associated genes and de novo LGD targets in unaffected siblings, or between FMRP-associated genes and de novo missense targets in either affected or unaffected children.
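A rough check of the scale of this enrichment is sketched below. It assumes a simple binomial model in which each of the 129 candidate genes independently falls in the FMRP-associated set with probability 842/25,000; this per-gene model is an illustrative assumption and may differ in detail from the calculation the authors performed. The function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly over the tail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

fmrp_genes, total_genes = 842, 25_000   # quoted in the text
candidates, overlap = 129, 26           # quoted in the text

p_gene = fmrp_genes / total_genes       # chance a random gene is FMRP-associated
expected_overlap = candidates * p_gene  # overlap expected by chance alone
p_value = binomial_upper_tail(overlap, candidates, p_gene)

print(f"expected overlap by chance: {expected_overlap:.1f} genes")
print(f"P(overlap >= {overlap}) under the binomial model: {p_value:.1e}")
```

Under these assumptions only about four overlapping genes would be expected by chance, so an overlap of 26 lies far out in the tail, on the order of the p value quoted above.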
As a follow-up to this striking observation, we searched for de novo mutations in targets upstream of FMR1 and found an intriguing one: GRM5. It is hit by a deletion that is not a frame shift but removes a single amino acid and causes an additional substitution at the deletion site. GRM5 encodes mGluR5, a glutamate receptor coupled to a G protein (Bear et al., 2004). Defects in mGluR5 compensate for some of the fragile X symptoms in mice (Dölen et al., 2007), and mGluR5 antagonists are currently in clinical trial (Jacquemont et al., 2011).
Lack of LGD Variants in FMRP-Associated Genes in the Population
FMRP has been proposed to inhibit protein translation of certain critical transcripts involved in neuroplasticity, the coordinated sensitization or desensitization of neurons in response to activity. Hence, it is reasonable to suppose that the physiological mechanisms modulated by FMRP depend on protein concentration, which in turn might be sensitive to gene dosage.
Direct support for this idea comes from surveying the entire parental population for carriers of potentially disruptive gene variants. Compared with a well-annotated set of human genes used as controls, FMRP-associated genes are strongly depleted for mutations that affect splicing or introduce stop codons. The statistical significance of the numbers is striking, whether computed as a rate relative to synonymous mutations or on a per gene basis. We see a similar depletion of LGDs in a set of human orthologs of mouse genes that are enriched for essential genes, but we do not see this extreme depletion in a set of 250 genes linked to known disabling genetic disorders. This difference may reflect the strong purifying selection in humans against disruptions of even a single allele of genes in this set. The hypothesis that the majority of the FMRP-associated genes are dosage-sensitive requires a more thorough analysis.
Mediators of Neuroplasticity in Cognitive and Behavioral Disorders
FMRP may act as one component of a central regulator of synaptic plasticity, among others such as TSC2 (Darnell et al., 2011; Auerbach et al., 2011). Impairment of its function, or the components it regulates, or other regulators like it, might produce a deficit in human adaptive responses. This study shows these components may be dosage-sensitive targets in autism. By extension, neuroplasticity, the hallmark function of our nervous system that enables learning and adaptation in response to stimulation, might have a general vulnerability to mutation affecting gene dosage. Mediators of neuroplasticity could be searched profitably for involvement in other cognitive disorders.
Three Recent Studies
While our manuscript was under review, three similar but smaller studies were published: Neale et al., 2012 (N), O’Roak et al., 2012 (O), and Sanders et al., 2012 (S). Each reported exome sequence of about 200 family trios (N) or a mixture of trios and quads (O and S). (O) and (S) report on families from the SSC collection. None of the SSC samples overlapped with ours, but unlike our random selection from the SSC, (O) was enriched for females and severely affected children, and (S) was enriched for families with > 1 normal sibling.
We summarize the findings in these papers that overlap ours: more de novo point mutation in children with older parents (all three), higher incidence in female than male probands (N), paternal origin of most de novo mutations (O), an elevated ratio (≥2:1) of de novo gene disruptions in probands versus siblings (S), no segregation distortion of rare polymorphisms from parents (S), and a de novo point mutation rate of about $2.0 \times 10^{-8}$ per base pair per generation (O and N). The single point of slight disagreement concerns differential signal from de novo missense mutation, which is marginal in (S) and not evident in our data.
All groups report de novo gene disruptions (nonsense, splice, and frame shifts) in probands, 18 in (N), 33 in (O), and 17 in (S), for a total of 68. With the 59 from this study, a total of 127 hits in probands have been found. Judging from our two-fold differential rate in probands and siblings, we expect that at least half of the 127 hits, about 65, are causal. Five genes were hit twice. DYRK1A and POGZ are the new recurrences found by combining our data with theirs. With our projected differential between probands and sibling controls, these five genes that are recurrent targets of de novo disruptions in probands are almost certainly autism targets.
From our estimate of 65 causal gene disruptions and 5 recurrent gene targets, we project that the total number of dosage-sensitive targets for autism is about 370 genes. We made a similar estimate from de novo CNVs (Levy et al., 2011; see Recurrence Analysis in Supplemental Experimental Procedures). With this target size, and an expected 50% increase in rate of discovery of de novo gene disruptions, similar studies of all 2800 SSC families should yield about 116 autism genes, thereby identifying unequivocally about a third of the dosage-sensitive gene targets.
The other groups did not report on the number of gene disruptions occurring within the FMRP-associated genes. However, 15 of their 68 do hit these genes, a rate similar to what we observed (14 of 59). Combining data, we now compute a p value of $2 \times 10^{-4}$ that this is mere coincidence. We project that nearly half of autism target genes will be among the list of FMRP-associated genes.