|Home | About | Journals | Submit | Contact Us | Français|
By analyzing the exomes of 12,332 unrelated Swedish individuals – including 4,877 affected with schizophrenia – in ways informed by exome sequences from 45,376 other individuals, we identified 244,246 coding-sequence and splice-site ultra-rare variants (URVs) that were unique to individual Swedes. We found that gene-disruptive and putatively protein-damaging URVs (but not synonymous URVs) were more abundant in schizophrenia cases than controls (P = 1.3 × 10−10). This elevation of protein-compromising URVs was several times larger than an analogously elevated rate for de novo mutations, suggesting that most rare-variant effects on schizophrenia risk are inherited. Among individuals with schizophrenia, the elevated frequency of protein-compromising URVs was concentrated in brain-expressed genes, particularly in neuronally expressed genes; most of this genetic signal arose from large sets of genes whose RNAs have been found to interact with synaptically localized proteins. Our results suggest that synaptic dysfunction may mediate a large fraction of strong, individually rare genetic influences on schizophrenia risk.
Schizophrenia is a psychiatric disorder with a lifetime risk of about 0.7%1 and a heritability of 60–80%2,3 despite greatly reduced reproductive fecundity4,5. Because individuals affected with schizophrenia have fewer offspring, purifying selection is expected to prevent high-risk alleles from reaching even modest allele frequencies6. Indeed, estimates of selection (when based only on the reproductive costs of schizophrenia) may underestimate the actual selective pressure against such alleles given emerging evidence that such alleles have multiple adverse effects: for example, rare copy number variations (CNVs), implicated with penetrances ranging from 2–30% (the latter observed for 22q11.2 deletions), negatively impact cognition and fecundity even in their more-typical presentation without schizophrenia7. To the extent that such observations condition expectations for rare single-nucleotide variants, variants with a large effect on schizophrenia risk are likely to be rare in populations, requiring sequencing to find them.
Distinguishing those variants that are extremely rare from variants that are segregating in a population is ideally informed by sequencing very-large numbers of individuals from the same population. We thus analyzed the sequences of 12,332 unrelated individuals (4,946 affected with schizophrenia, 6,242 unaffected controls, and 1,144 with other psychiatric illnesses whose analysis is beyond the scope of the current study) from Sweden (Online Methods). We further informed this analysis with a much larger set of exome sequencing data from 45,376 individuals from multiple non-psychiatric cohorts ascertained by the Exome Aggregation Consortium8. This made it possible to identify among the Swedish research participants 244,246 coding-sequence and splice-site ultra-rare variants (URVs) that were present in single individuals – a set of variants that is greatly enriched for recent mutations, relative to the vastly larger fraction of heterozygosity that is due to less-rare variants (Fig. 1a). This large set of variants made it possible to identify broad biological patterns among an excess of more than 1,000 protein-damaging URVs that we found in the exomes of 4,946 individuals affected with schizophrenia.
We analyzed the protein-coding sequences (the exomes) of 12,332 unrelated Swedish individuals, including 4,946 affected with schizophrenia (2,951 males and 1,995 females), 6,242 unaffected controls (3,182 males and 3,060 females), and 1,144 affected with other disorders (443 males and 701 females, used for population genetic analyses but not as cases or controls). After removing 119 individuals for quality control reasons (mostly due to divergent ancestry, Online Methods), we identified 244,246 coding-sequence and splice-site URVs (among 4,877 schizophrenia cases and 6,203 controls) that were present in only one of the 12,332 unrelated Swedish exomes analyzed and never seen in the Exome Aggregation Consortium (ExAC) cohort (which numbered 45,376 individuals after excluding the subjects from this cohort and other subjects ascertained for psychiatric disorders).
We focused on URVs in most analyses because such variants – although comprising a tiny fraction (less than 0.2%) of the heterozygous sites in an individual – will be greatly enriched for recent mutations and thus have been exposed to fewer generations of purifying selection. The size of the Swedish cohort analyzed, and the additional sequence data for the Exome Aggregation Consortium, allowed us to greatly refine the identification of URVs; for example, among 5,092 (of the 12,332) individuals who were also part of an earlier sequencing study9, the additional data allowed us to re-classify ~66% of variants that had been “singletons” as segregating variants (not URVs in the current analysis). This may have been particularly helpful for refining analyses of challenging-to-interpret missense variants, as we describe below.
We classified coding-sequence and splice-site variants into four groups (Fig. 1b):
The terms protein-“damaging” and gene-“disruptive” refer to predicted effects on individual gene copies and the encoded proteins, rather than effects on phenotypes; effects on phenotypes can be inferred only from association analysis.
Missense damaging URVs accounted for approximately 15% of all missense URVs (Supplementary Fig. 1). There was a median of 2 disruptive and 2 damaging URVs per individual (4 total) (Supplementary Fig. 2).
To assess whether schizophrenia was associated with an increased number of coding-sequence and splice-site URVs (in specific genes, across the exome, or in sets of genes), we used a linear regression model to control for possible confounding variables, including each individual’s overall number of detected URVs (including non-coding URVs), sex, birth year, the hybrid selection kit used for exome enrichment, and the first 20 principal components estimated from exome-wide SNP and indel genotypes (Supplementary Table 1).
An important negative control – to address the possibility that analyses could be affected by population structure, differences in average relatedness within the case and control groups, or by technical variation – is to ask whether functionally neutral forms of variation show any apparent differences in frequency between case and control groups. We did not observe a significant difference in the rate of synonymous URVs between schizophrenia cases and controls (Fig. 1c). We also did not observe a significant difference for non-coding URVs (Supplementary Fig. 3a).
In contrast, we observed significant case-control differences in the rates of disruptive URVs (a difference of 0.12 variants/person; 95% confidence interval [CI]=0.07–0.17; P = 1.8 × 10−6) and damaging URVs (0.12 variants/person; 95% CI=0.07–0.18; P = 3.4 × 10−5). (P values determined by permuting the phenotype data 10 million times agreed with P values from the linear regression analysis, and were: P = 0.81 for synonymous, P = 0.045 for missense non-damaging, P = 3.5 × 10−5 for damaging, and P = 1.8 × 10−6 for disruptive; this suggests that the P values from the regression model are well-calibrated) Damaging and disruptive URVs showed similarly elevated frequencies in cases and were thus combined into a single category termed dURVs (disruptive and damaging ultra-rare variants) for subsequent analyses.
Adjusting for covariates, there were 7% more dURVs in affected individuals than in controls (odds ratios [OR]=1.07; 95% CI=1.05-1.09; P = 1.5 × 10−10), as the case-associated elevation in dURVs (of about 0.25 variants/patient; 95% CI=0.17-0.32) occurred on a background of about 4 dURVs per patient. The elevated frequency of dURVs among individuals affected with schizophrenia appeared to arise from multiple types of dURVs, including in-frame indels, protein-protein-contact, splice-acceptor, splice-donor, stop-gained, and frame-shift variants (Supplementary Fig. 3b).
To assure that this result was not the result of population stratification within Sweden, we further estimated the enrichment in a more genetically homogeneous subset of the Swedish cohort (3,554 schizophrenia cases and 5,164 controls) that excluded individuals with significant amounts of Finnish or Northern Sweden ancestry. Individuals with schizophrenia showed a similar dURV excess in this more genetically homogeneous group (excess of 0.25 dURVs/case, 95% CI 0.16-0.34; P = 2.2 × 10−8).
We next estimated the extent to which dURVs tend to be inherited or de novo. While parental DNA would be necessary to directly ascertain which specific dURVs are de novo mutations (DNMs), we can compare the schizophrenia-associated elevation in dURVs (~0.25 per exome) to an analogous elevation in DNMs detected in earlier studies of 617 affected and 1911 unaffected father-mother-offspring trios12,13. Using data from the trios, we estimated the frequencies of DNMs that were protein-damaging or -disruptive (dDNMs) by the same criteria we used to identify dURVs (including restricting to variants not previously observed in ExAC8). These data yielded an elevation of about 0.03 such DNMs per exome, based on the difference between rates of 0.185 (95% CI 0.151-0.219) for individuals with schizophrenia12 and 0.156 (95% CI 0.139-0.174) for unaffected individuals13. This estimate (0.03 per exome) was several times smaller than the elevation of dURVs in affected individuals in our population-based study (0.25 per exome). (We note that such a comparison requires the imperfect assumption of uniform technical ascertainment across the sequencing studies; even under plausible relaxations of this assumption, the dURV excess greatly exceeds the dDNM excess. Also, when estimated by this same approach, rates of synonymous and non-damaging DNMs were similar – 0.475 in affected and 0.459 in unaffected individuals – suggesting that the analysis is well-calibrated.) We conclude that the great majority of the dURVs driving the elevated rates in schizophrenia were inherited rather than de novo, though the very-low allele frequency of these variants suggests that they are on average just a few generations old.
Although the elevated frequency of dURVs among affected individuals was statistically significant (P = 1.4 × 10−10), it was still only a modest increase of 0.25 dURVs on a background of about 4 dURVs per individual. This excess could in principle be concentrated in individual genes or in sets of functionally related genes, possibilities we address below.
Joint analysis of many rare variants that affect the same gene or sets of genes can increase power to identify genes whose disruption increases the risk of schizophrenia. To find individual genes that had significantly more rare variants in cases or controls, we performed a burden test using SKAT14 adjusting for previously defined covariates (Online Methods). We tested for (a) disruptive, (b) damaging, (c) disruptive and damaging, and (d) missense variants that were either (i) ultra-rare, (ii) singletons in the Sweden cohort, (iii) had a minor allele count ≤5 (minor allele frequency <0.02%), (iv) had a minor allele count ≤10 (minor allele frequency <0.05%), (v) had a minor allele frequency <0.1%, or (vi) had a minor allele frequency <0.5%.
Given the sample size, our analysis would have >90% power (at α = 2.5 × 10−6) to detect any gene for which rare, disruptive and damaging variants were present in 1% of schizophrenia cases, even if such variants had only a relatively modest effect size15 (odds ratio of at least 3, i.e. about 2% penetrance), and still greater power if effect sizes were larger. No individual gene surpassed exome-wide significance in this analysis (Supplementary Fig. 4), suggesting that no one gene is likely to have rare variants that explain even 1% of schizophrenia cases. The individual gene with the strongest enrichment was KL (klotho) (Supplementary Table 2), in which we found eight different dURVs in cases and none in controls (P = 3.7 × 10−4), but this result was not significant given the number of genes tested. Other models, based on higher levels of polygenicity, therefore appear to be more plausible: in a model in which a hypothetical gene is affected in 0.1% of schizophrenia cases, we would have only ~4% power to conclusively find this effect at exome-wide significance, and a far-larger sample would be required. The finding that no individual gene surpassed exome-wide significance in this analysis (Supplementary Fig. 4), suggests that no one gene is likely to have rare variants that explain even 1% of schizophrenia cases.
Among genes previously reported to have potential connections between rare variants and schizophrenia, we identified an ultra-rare splice donor variant in TAF1312, an ultra-rare nonsense variant in SETD1A16,17, and a single ultra-rare nonsense variant in NRXN1, a gene in which exonic deletions associate with schizophrenia18. We did not find any evidence of enrichment of dURVs in DPYD19 (in which we found two dURVs in cases and six in controls), nor in DISC120 (one dURV among cases and two in controls) (Supplementary Table 3).
With this high level of polygenicity foreshadowed by earlier results9,16, it appears that definitive implication of individual genes will require sequencing still-larger numbers of exomes or whole genomes6. We therefore focused on sets of genes with plausibly overlapping biological functions as a way of concentrating a diffuse genetic signal.
We tested gene sets for an enrichment of dURVs (in cases relative to controls) by comparing each gene set’s enrichment level to that of the average gene (Online Methods). We made this stringent correction to account for the fact that any large gene set is more likely to encompass the exome-wide excess of dURVs we see in schizophrenia cases. Our practice greatly deflates the resulting P values.
Subsets of human genes have been previously identified as “missense constrained” (based on a lack of functional coding variation in controls) or “loss-of-function intolerant” (based on a smaller-than-expected number of loss-of-function mutations in population-scale data)13,21. Similar to recent findings in autism, we observed a significant enrichment (in cases relative to controls) of dURVs in missense constrained genes22 (OR=1.28; 95% CI=1.20–1.37; P = 3.2 × 10−8) and loss-of-function intolerant genes8 (OR=1.17; 95% CI=1.12-1.21; P = 1.7 × 10−8) (Fig. 2). Both missense constrained and loss-of-function intolerant genes were enriched for disruptive variants relative to damaging variants (Supplementary Fig. 5); for the latter set, this may reflect these genes having been ascertained specifically for intolerance to disruptive mutations.
In contrast, genes not meeting earlier criteria for loss-of-function intolerance or missense constraint were much less enriched for dURVs (Supplementary Fig. 6). This important negative control confirms that the schizophrenia-associated elevation we observe (for constrained genes) is not due to false positives disproportionally represented across disruptive and damaging variants in cases. The observed enrichment was consistent across data from previously analyzed exomes9 and newly generated data (Supplementary Fig. 5); as the previously analyzed exomes were sequenced across randomized batches with equal number of cases and controls in each batch, this provides additional evidence that the enrichment is not due to technical effects.
The excess of dURVs could in principle be concentrated in genes expressed within specific tissues. Distinct tissues have both shared and tissue-specific sets of expressed genes. We found that a set of 2,647 genes expressed specifically in brain tissue23 was strongly enriched for dURVs (OR=1.17; 95% CI=1.11–1.23; P = 1.2 × 10−4), whereas sets of genes with expression specific to other tissues (including immune cells) were not (Fig. 3a and Supplementary Fig. 7). (At the same time, the “brain-specific” genes explained only part of this signal, while a larger set of brain-expressed genes explained most of it, suggesting that much of the signal may come from genes that are expressed in brain as well as other tissues, Fig. 3a). This aligns with earlier findings that SNP haplotypes implicated in schizophrenia genome-wide association studies (GWAS) tend to overlap (to a non-random degree) with sequences identified as putative enhancers in chromatin-profiling experiments on brain tissue24,25.
The brain contains a complex mixture of cell types, each of which expresses different, only partially overlapping sets of genes. To identify cell types through which rare variants might act to affect risk of schizophrenia, we evaluated (for enrichment of dURVs in affected relative to unaffected individuals) sets of genes identified as specific to neurons, astrocytes, and oligodendrocytes by earlier cell sorting and transcriptional profiling experiments26. A set of 3,388 neuron-specific genes had a strong enrichment of mutations in schizophrenia cases (OR=1.17; 95% CI=1.12–1.22; P = 1.9 × 10−7), comparable to that observed for genes specific to brain tissue itself. Genes specifically expressed in other brain cell types, such as astrocytes and oligodendrocytes, were no more enriched than the average gene (Fig. 3b and Supplementary Fig. 8a). These results nominate neurons as the central nervous system (CNS) cell type in which genetic perturbations most affect schizophrenia risk, though they do not exclude more-modest contributions from other CNS cell types.
Neurons are broadly classified into excitatory and inhibitory classes. The case-control excess of dURVs showed a similar degree of concentration into genes expressed in excitatory and inhibitory neurons (Supplementary Fig. 8b). The small number of genes that were specific to excitatory or inhibitory neurons (relative to the other class) were insufficient to concentrate this genetic signal, which appeared to reside primarily in genes that were expressed in both neuronal classes (Supplementary Fig. 8b).
A strong and consistent finding in exome-sequencing studies of schizophrenia involves an excess of variants in genes whose mRNAs are bound by the fragile X mental retardation protein (FMRP)9,12,27. The large excess of dURVs ascertained in the current set of schizophrenia cases elevated evidence for this relationship (OR=1.23; 95% CI=1.17–1.30; P = 8.2 × 10−9).
The enrichment of dURVs among genes that encode FMRP-bound transcripts has multiple potential biological explanations. One potential explanation could involve the translational-inhibition capacity of FMRP, as implicit in the common description of such genes as FMRP “targets”. Another potential interpretation is that it is in fact the localization of these RNAs to neuronal processes and synapses by FMRP – its shuttling activity – that defines the important biological commonality among these genes. Yet a third possibility is that FMRP-binding experiments have simply been effective ways of ascertaining neuronally expressed genes.
To evaluate these possibilities, we first considered a different set of genes whose mRNAs are carried to synapses by a different shuttling protein, CELF428. The genes encoding CELF4-bound mRNAs also showed an enrichment of dURVs in schizophrenia cases; this enrichment (OR=1.14; 95% CI=1.09–1.19; P = 6.6 × 10−4) was greater than that of the average gene though less strong than that of genes encoding FMRP-bound RNAs.
We also investigated whether genes encoding mRNAs that are bound by RBFOX splicing factors, known to regulate synaptic genes29 and also previously observed at synapses30, could explain a substantial fraction of the dURVs. Earlier experimental work (based on the HITS-CLIP technique for identifying RNAs bound to proteins of interest) has defined constellations of genes whose RNAs are bound by RBFOX1, RBFOX2, or RBFOX3. (We considered RBFOX1 and RBFOX3 together below due to their largely overlapping sets of bound genes31). Genes whose transcripts are bound by RBFOX1 or RBFOX3 were enriched in dURVs (OR=1.16; 95% CI=1.11–1.21; P = 6.7 × 10−7). A somewhat stronger enrichment was apparent for genes whose RNAs are bound by RBFOX2 (OR=1.21; 95% CI=1.16–1.26; P = 6.3 × 10−12).
We also observed enrichment in synaptic genes as defined by the SynaptomeDB32 (OR=1.14; 95% CI=1.09–1.20; P = 0.0022), though this smaller set of genes explained a smaller fraction of the case-control difference in dURVs (Fig. 3c).
We were concerned that the enrichment for dURVs in genes with synaptically localized transcripts could, in principle, be simply due to these experiments having been highly effective at isolating transcripts that are present in neurons (which strongly express FMR1, CELF4, and RBFOX1/2/3); in this case, the importance of synaptic localization would be uncertain. To address this possibility, we identified, from earlier experimental data, sets of genes expressed in brain tissue23, neurons26, excitatory neurons, and inhibitory neurons33. Within each set, we defined a gene as “potentially synaptic” if it was in any of the previously constructed FMRP, CELF4, RBFOX2, or SynaptomeDB gene sets, then stratified each of the neuronal/brain expression gene sets based on whether or not the genes were potentially synaptic (Fig. 4). No matter how we defined neuronally expressed genes, we observed that this tendency to contain an excess of dURVs in schizophrenia cases distinguished the potentially synaptic genes (which showed elevated rates of dURVs in schizophrenia) from other neuronally expressed genes (which did not) (Fig. 4).
These large constellations of potentially synaptic genes appeared to explain a large fraction (collectively more than 70%) of the exome-wide enrichment in dURVs (Fig. 4).
Protein complexes have been used to define sets of genes with aligned activities, offering potentially meaningful ways to group genes for genetic analysis. We focused on genes encoding proteins that have been detected at synaptic complexes by co-immunoprecipitation with known synaptic components followed by mass spectrometric proteomic analyses. These gene sets have been the source of primary enrichment results in earlier studies of CNVs and rare and de novo SNVs in schizophrenia patients9,12,34. We observed case-control enrichment of dURVs among genes thus defined as encoding interactors with PSD-95 (OR=1.52; 95% CI 1.21–1.90; P = 0.0017), ARC and N-methyl-D-aspartate receptors (NMDAR)35 (OR=1.55; 95% CI=1.21–1.98; P = 0.0028) (Fig. 2). Despite these elevated levels of enrichment, these smaller gene sets explained much smaller fractions (collectively 4-12%) of the case-control enrichment in dURVs, perhaps reflecting that these gene sets include just a fraction of the proteins that are present at synapses.
More-complete ascertainment of the protein components of synaptic structures is an important future research direction that might advance functional analysis and interpretation of larger constellations of rare variants.
We tested whether genes within the 108 GWAS loci recently identified in schizophrenia also contain an excess of dURVs. We observed a nominally significant enrichment in genes overlapping regions near common variants associated with schizophrenia24 (OR=1.37; 95% CI=1.09–1.73; P = 0.027) (Fig. 2) hinting at some degree of convergence. This overlap was greater than could be explained by any individual gene or small set of genes. Predicted targets of microRNA-13736, previously identified as localized near common SNPs associated with schizophrenia37, were also significantly enriched for dURVs (OR=1.13; 95% CI=1.09–1.18; P = 6.8 × 10−4) (Fig. 2).
Mutations associated with intellectual disability and developmental disorders are often also substantial risk factors for syndromic forms of autism and perhaps schizophrenia38–40,17. We did observe the dURV elevation to be concentrated in X-linked intellectual disability (XLID) genes41,42 (OR=1.88; 95% CI=1.34–2.64; P = 9.5 × 10−4) and in developmental disorder (DD) genes43 (OR=1.67; 95% CI=1.31-2.13; P = 1.6 × 10−4) (Online Methods). Of potential interest, we identified four dURVs in schizophrenia cases (and none in controls) in XLID gene KDM5C, an H3K4 methylation eraser gene44, 11 dURVs in cases (and 2 in controls) in DD gene KDM5B, another H3K4 methylation eraser gene45, and 11 dURVs in cases (and 3 in controls) in DD gene ITPR1, which encodes an inositol triphosphate receptor46 (Supplementary Tables 2 and 3). The enrichment of XLID variants was not different between female and male cases (Supplementary Fig. 9).
We further tested for enrichment of dURVs in (i) genes overlapping de novo copy number variants (CNVs) previously found in individuals with schizophrenia, bipolar disorder, and autism (Supplementary Tables 4 and 5, see Online Methods), and (ii) genes in which de novo non-synonymous mutations were previously ascertained in individuals with autism, congenital heart disease, epilepsy, intellectual disability, and schizophrenia (Supplementary Tables 6 and 7, see Online Methods). Because de novo non-synonymous mutations have been ascertained in such a large number of genes, we sought to increase specificity by restricting this analysis to loss-of-function intolerant (LoF-intolerant) genes, as previously defined8. We observed a significant enrichment in genes within de novo deletions previously ascertained in schizophrenia cases (OR=1.34; 95% CI=1.13–1.59; P = 0.0052) (Fig. 5a), as well as an enrichment in loss-of-function intolerant genes with de novo non-synonymous mutations in schizophrenia cases (OR=1.41; 95% CI=1.25–1.60; P = 0.0011) (Fig. 5b).
By sequencing the exomes of 12,332 unrelated individuals from Sweden, including 4,946 affected with schizophrenia, we observed an exome-wide burden of dURVs in schizophrenia cases. This excess rare-variant burden – approximately 0.25 such variants per person (on a background of 4 such variants) – was several times greater than the schizophrenia-associated elevation in rates of gene-disruptive and protein-damaging de novo mutations, suggesting that the observed excess arose mostly from inherited variants. For less-rare (segregating) variants of even modest allele frequencies, we were unable to detect any excess in affected relative to unaffected individuals, consistent with a previous analysis of non-ultra-rare exonic variants in other cohorts47.
The excess of dURVs in schizophrenia cases largely resided in brain-expressed genes, and more specifically in genes that are expressed in neurons, rather than in other CNS cell types (Fig. 6). It is possible that earlier associations to small, protein-interaction-defined gene sets (such as PSD-95, NMDAR and ARC9,12,34), which appear to explain combined a much-smaller fraction of the exome-wide dURV burden in schizophrenia (collectively 4-12%), have been proxies for a far-wider set of rare-variant effects at synapses.
Most of the excess of dURVs in affected individuals’ exomes appeared to be concentrated in a larger set of genes encoding potentially synaptic proteins. Genes whose transcripts are bound by FMRP or CELF4 – which transport a subset of neuronal RNAs to neuronal processes and synapses – or RBFOX2 – which regulates many synaptic RNAs and has been observed at synapses – explained considerably larger fractions (collectively more than 70%) of the global rare-variant enrichment observed in cases. Genes encoding RNAs bound by FMRP or RBFOX proteins have previously been shown to be enriched for mutations in subjects with autism and/or schizophrenia48,9,12,13,49, though it has been unclear whether such potential effects are a small or a large fraction of strongly risk-increasing variants. While it is tempting to attribute the association of schizophrenia with dURVs in FMRP–associated, CELF4–associated, and RBFOX2–associated genes to the specific biological activities of these proteins, we propose that their association may simply reflect the synaptic localization and function of the transcripts and proteins encoded by these genes.
We observed a significant overlap of the dURV excess with genes in which de novo non-synonymous mutations and deletions have been found in schizophrenia cases. We also observed a significant enrichment across intellectual disability genes on the X chromosome and in developmental disorder genes. This enrichment is compatible with observations of the role of intellectual disability genes in some cases of autism38–40 and schizophrenia17, though the penetrance of such mutations for schizophrenia may be much less than their penetrance for intellectual disability, and they may reside primarily in syndromic cases in which schizophrenia is preceded by other developmental disorders17.
The fact that an analysis of the current scale (4,877 cases, 6,203 controls, and 45,376 other genomes used to help identify ultra-rare variants) did not implicate individual genes of large effect in an unbiased exome-wide search – while documenting a very clear exome-wide elevation of hundreds of pathogenic variants across 4,877 individuals affected with schizophrenia relative to controls – lends further support to the emerging impression that the high polygenicity of schizophrenia extends to rare as well as common variants9,16. Because of the rareness of these variants and the infrequency with which any individual gene is affected by them – even among schizophrenia cases – the sequencing of much larger cohorts will be needed to identify the specific individual genes in which rare variants shape risk for schizophrenia.
A total of 12,384 blood-derived DNA samples from Swedish research participants were collected from 2005 to 2013. Psychiatric cases with a diagnosis of schizophrenia or bipolar disorder were ascertained from the Swedish National Hospital Discharge Register as described in previous studies9,50, which captures all inpatient hospitalizations. Controls were randomly selected from population registers. Excluding subjects with bipolar disorder, age information at the time of DNA sampling was available for each individual. All subjects provided informed consent; institutional human subject committees approved the research (UNC IRB # 04-1465). All procedures were approved by the ethical committees in Sweden and in the United States.
The 12,384 samples collected were sequenced in twelve separate waves. The first wave employed an earlier version of the hybrid-capture procedure (Agilent SureSelect Human All Exon Kit), which targets ~28 million base pairs of the human genome, partitioned in ~160,000 intervals, whereas the samples from the other waves used a newer version (Agilent SureSelect Human All Exon v.2 Kit), which targets ~32 million base pairs of the human genome, partitioned in ~190,000 intervals. The first wave was sequenced using Illumina GAII instruments and the remaining waves were sequenced using Illumina HiSeq 2000 and HiSeq 2500 instruments, with pair ended sequencing reads of 76 base pairs across all waves. Sequencing was performed at the Broad Institute of MIT and Harvard across the period of time from 2010 to 2013. With the exception of the first wave, we did not observe significant differences across waves and cases status beyond what could be explained by ancestry (Supplementary Fig. 10).
This cohort has been previously analyzed in relation to schizophrenia for common variants51,37,52,50,24 and copy number variants52,53, and in relation to somatic mosaic mutations54. Exome sequence data for approximately half of the individuals in the cohort had already been analyzed in relation to schizophrenia phenotype in a previous study9 and in a more recent study17.
Exome sequence data from 12,384 samples was aligned against the GRCh37 human genome reference with bwa aln 0.5.955 and further processed using the GATK framework56. Genotype calls were generated using GATK Haplotype Caller version 3.1-144-g00f68a3 and best practices57,58. Variants filtered out by the GATK Variant Quality Score Recalibration (VQSR) tool were excluded. Genotypes over sites with less than 10x sequencing coverage were set to missing. We identified and removed 4 duplicate individuals and 48 individuals with a first degree relationship (Supplementary Fig. 11) with other individuals in the cohort using the plink59,60 software. We then computed the number of ultra-rare single nucleotide polymorphysms (SNPs) and indels never observed in ExAC8 for each of the remaining 12,332 samples and identified one individual with 1,757 ultra-rare SNPs and 22 ultra-rare indels from the sixth sequencing wave (see Supplementary Table S5F from Genovese et al.54), four individuals with between 92 and 127 ultra-rare indels from the first sequencing wave, five individuals from waves 11 and 12 with between 410 and 496 ultra-rare SNPs due to African ancestry. These 10 individuals were excluded from further analysis. The resulting individuals had a range of 5-259 ultra-rare SNPs and 0-19 ultra-rare indels (Supplementary Fig. 12a). We further removed 15 individuals for whom the reported sex and the inferred sex from inbreeding coefficients on the X chromosome mismatched (Supplementary Fig. 13), including 7 individuals with 47, XXY karyotype (Klinefelter syndrome), and 94 individuals with more than 100 URVs.
A logistic regression model was used to estimate association between single variants and schizophrenia phenotype correcting for sex and the first five principal components using plink59,60. We identified two loci (Supplementary Fig. 14) with statistically significant associations (p<10−6) replicating a couple of common variants associations previously observed for this cohort50: a single variant rs281766 on chromosome 2 in the UTR5 of genes TYW5 and C2orf47, a variant in strong linkage disequilibrium with the seventh strongest independently associated variant in the largest meta-analysis for schizophrenia24, and seven variants in the MHC region around the HLA genes, also a region with extensive linkage to known causal variants associated with schizophrenia61.
We annotated all genotyped variants with SnpEff 4.2 (build 2015-12-05)62 using Ensembl gene models from database GRCh37.75. We further annotated variants with SnpSift 4.2 (build 2015-12-05)63 using annotations from database dbNSFP 2.964,65. Variants identified within transcripts from UCSC known genes66 were further classified into four groups:
Notice that FATHMM74 predictions included in dbNSFP were not used due to poor performance with respect to minor allele count (Supplementary Fig. 1) and a small number of variants defined as damaging by the predictor (Supplementary Fig. 3b). The final predictor performed better than all other individual predictors (Supplementary Fig. 3b) but it was not overfit as to be the best predictor for this cohort (Supplementary Fig. 15).
Out of a total of 1,753,312 variants passing VQSR filters, 66,874 were identified as in common with variants from the 1000 Genomes project phase 1 dataset75 and included as part of the Omni2.5 genotype array. We used this subset of highly confident variants to estimate population stratification. We selected exclusively Omni2.5 polymorphic sites because more robust in the 1000 Genomes dataset to artifacts due to the heterogeneity of the sequencing technologies used within the 1000 Genomes project. We then further restricted to variants with minor allele frequency larger than 1% in both the Sweden and the 1000 Genomes dataset and we pruned for variants in linkage disequilibrium using plink59,60 (with command line '--indep 50 5 2'). We then merged the Sweden and the 1000 Genomes dataset and computed principal components using plink and GCTA76 (Supplementary Fig. 12c-d). Estimated 3rd and 5th principal components corresponded to previously observed Finnish and Northern-Southern Sweden clines12 (Supplementary Fig. 12d), while 1st, 2nd, and 4th principal components corresponded to the three main principal components in the 1000 Genomes project phase 1 distinguishing African, East Asian, and Native American ancestry. While principal components did correlate with overall amounts of URVs (Supplementary Fig. 12e-f), rather than removing individuals with exotic ancestry based on principal components loading, we simply removed individuals with more than 100 URVs and we included sex, year of birth (Supplementary Fig. 12b), exome capturing kit, the first 20 principal component loadings, and the total number of URVs for each individual as covariates in all statistical analysis involving URVs and dURVs.
Variants were excluded whether failing the GATK VQSR tool (117,629 variants), having inbreeding coefficient less than zero (that is, more observed heterozygotes than expected) while at the same time failing a Hardy-Weinberg disequilibrium test with a false discovery rate of 10−6 (8,306 additional variants), or whether associating with any of the 146 batches among which the cohort was split for sequencing in the sequencing facility at the Broad Institute (3,700 additional variants). Due to prevalent population stratification within batches, we estimated unusual associations with a logistic regression model including sex and the first 20 principal components loadings as covariates.
For each gene set, to estimate the excess of dURVs in cases with schizophrenia (or the odds ratios for schizophrenia phenotype) we used a linear (or logistic) regression model correcting for: (i) sex; (ii) overall URV count; (iii) birth year; (iv) the hybrid selection kit used to enrich for exome sequence; and (v) the first twenty principal components estimated from exome-wide SNP genotypes. When estimating P values, to estimate the importance of each gene set with respect to the observed exome-wide enrichment, we further corrected for exome-wide dURV count. This expedient allows to better estimate the importance of each gene set irrespectively of its size and to answer the more precise question of whether a gene set concentrates the exome-wide dURV enrichment better than the average gene.
We used different resources to build the gene sets for which the burden of dURVs was computed:
We used three different resources available online to define X-linked intellectual disability (XLID) genes:
We tested for enrichment of dURVs in the above gene sets and in genes including all of the above gene sets, and seperately genes believed to escape and not escape X-inactivation in humans105, genes from OMIM including autosomal genes causing intellectual disability, and genes involved in developmental disorders through de novo mutations43. We tested for enrichment separately in males and females, as well as combined (Supplementary Fig. 9).
While XLID genes and developmental disorders genes were strongly enriched in schizophrenia cases, autosomally linked intellectually disability genes were not. This discrepancy might reflect a better characterization of intellectual disability genes on the X chromosome due to a more straightforward study design for how these genes where discovered. We also observed that XLID genes which escape X-inactivation were more enriched than other XLID genes. This might reflect a disproportionate contribution to intellectual disability and psychosis from dosage sensitive brain-related genes on the X chromosome106–108. We did not observe a difference in effect sizes between males and females.
Similarly to rare variants enrichment, common variants associated with schizophrenia are localized near XLID genes CNKSR2 and NLGN4X, both of which escape X inactivation, as well as non-XLID gene PJA124.
Given the strong case-control enrichment of dURVs in potentially synaptic genes, we used these genes to perform a sensitive evaluation of whether we could observe an increased burden of less-rare disruptive and damaging variants in the same set. Using a standard burden test for non-ultra-rare variants with minor allele count 10 or less and controlling for covariates, we observed no statistically significant enrichment of disruptive and damaging variants in schizophrenia cases compared to controls (P = 0.59), wheareas the same test was highly significant when restricted to URVs (P = 1.7 × 10−19, without controlling for exome-wide enrichment) (Supplementary Table 8).
While a predictor based on common variants24,109 explained 15% of the variance in schizophrenia liability in this cohort, a predictor based on the cumulative burden of dURVs in all genes explained only 0.48% (P = 1.5 × 10−10), and a similar predictor in potentially synaptic genes explained only 0.92% (Nagelkerke's coefficient of determination) (P = 6.3 × 10−19). We also attempted to generate a polygenic score based on the cumulative number of dURVs in genes that had burden of dURVs in cases greater than or equal to controls. Using a leave-one-out strategy, the resulting predictor explained 0.47% (P = 2.3 × 10−10) of the phenotypic variability. These estimates are naturally lower bounds on the effect of rare variants; knowledge of the correct effect size of each variant would significantly increase the predictive value of dURVs, though obtaining such knowledge will require sequencing a vastly larger number of exomes.
We thank C. Usher for comments on the manuscript and work on the figures. This study was supported by grants from the National Human Genome Research Institute (U54 HG003067, R01 HG006855 to S.A.M.), the National Institute of Mental Health (R01 MH077139 to P.F.S. and RC2 MH089905), the Stanley Center for Psychiatric Research, the Alexander and Margaret Stewart Trust, and the Sylvan C. Herman Foundation.
Reprints and permissions information is available online at http://www.nature.com/reprints/index.html.
Accession codes dbGaP: phs000473.
Note: Any Supplementary Information and Source Data files are available in the online version of the paper.
G.G. and S.A.M. designed the analyses and wrote early drafts of the manuscript. G.G. performed the analyses. M.F. contributed to analyses of de novo mutated genes, D.M.R and E.A.S. contributed with the specific design of the analyses. K.C. contributed with sample processing and data management, M.L., J.L.M., S.M.P., P.S., P.F.S., and C.M.H. contributed with sample and phenotpye collection. All authors contributed to interpretation of the findings and revisions of the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Scripts used to perform all of the analyses are available at https://github.com/freeseek/gwaspipeline and data are available through dbGAP at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000473