We found two novel associations at a genome-wide level of significance near SCARB2 (rs6812193) and SREBF1 (rs11868035), both of which were replicated in data from . We also report two novel associations (near RIT2 and USP25) just under the level of significance, one of which (RIT2) was also replicated. While it is difficult to pinpoint any causal genes from a GWAS, there are a few biologically plausible candidates worthy of discussion.
The PD-associated SNP rs6812193 lies in an intron of the FAM47E gene, which gives rise to multiple alternatively spliced transcripts, many of which are protein-coding; the functions of these hypothetical proteins are unknown. A more attractive candidate, located kb centromeric to the SNP, is SCARB2 (scavenger receptor class B, member 2), which encodes the lysosomal integral membrane protein type 2 (LIMP-2). LIMP-2 deficiency causes the autosomal-recessive disorder Action Myoclonus-Renal Failure syndrome (AMRF), which combines renal glomerulosclerosis with progressive myoclonus epilepsy associated with storage material in the brain. LIMP-2 is involved in directing β-glucocerebrosidase to the lysosome, where it hydrolyzes the β-glycosyl linkage of glucosylceramide. Deficiency of this enzyme due to mutations in its gene (GBA) causes the most common lysosomal storage disorder, Gaucher's disease. Recently, mutations in GBA have also been identified in PD, pointing to a possible functional link between the newly identified candidate gene SCARB2
rs11868035 appears in an intron of the alternatively spliced gene SREBF1 (sterol regulatory element-binding transcription factor 1), within the Smith-Magenis syndrome (SMS) deletion region on 17p11.2. SREBF1 encodes SREBP-1 (sterol regulatory element-binding protein 1), a transcriptional activator required for lipid homeostasis, which regulates cholesterol synthesis and its cellular uptake from plasma LDL. Studies of neuronal cell cultures have implicated SREBP-1 as a mediator of NMDA-induced excitotoxicity. rs11868035 is directly adjacent to the acceptor splice site for the C-terminal exon of the SREBP-1c isoform of the protein, suggesting that the effect of the polymorphism may be specifically related to the splicing machinery for this protein. The mutation is also in strong LD with rs11649804, a nonsynonymous variant in the nearby gene RAI1 (retinoic acid-induced protein 1), which regulates transcription by remodeling chromatin and interacting with the basic transcriptional machinery. Heterozygous mutations in RAI1 reproduce the major symptoms of SMS, such as developmental and growth delay, self-injurious behaviors, sleep disturbance, and distinct craniofacial and skeletal anomalies. Future work is needed to identify the functionally important variant(s) responsible for this association.
The SNP rs4130047, slightly below the genome-wide significance threshold, lies in an intron of the RIT2 (Ras-like without CAAX 2) gene that encodes Rit2, a member of the Ras superfamily of small GTPases. Though we do not claim this SNP as a confirmed replication, there are a number of reasons to suspect that this association may also be real. Rit2 binds calmodulin in a calcium-dependent manner, and is thought to regulate signaling pathways and cellular processes distinct from those controlled by Ras. It localizes to both the nucleus and the cytoplasm. Independent of our study, RIT2 was previously proposed as a candidate gene for PD, based on the possibility that dopaminergic neurons may be especially vulnerable to high intracellular calcium levels, perhaps through an interaction with
. The PD-associated region contains another biologically plausible candidate gene, SYT4 (synaptotagmin IV), which encodes synaptotagmin-4, an integral membrane protein of synaptic vesicles thought to serve as a Ca²⁺ sensor in the process of vesicular trafficking and exocytosis. It is expressed widely in the brain but not in extraneural tissues. Homozygous Syt4−/− mouse mutants have impaired motor coordination. SYT4 is particularly interesting as a SNP near SYT11 (synaptotagmin XI) has been associated with PD in , and the encoded protein, synaptotagmin-11, is known to interact with parkin.
The suggestively associated SNP rs28233572 lies in a gene-poor region with only one candidate gene downstream, USP25, encoding ubiquitin specific peptidase 25, which regulates intracellular protein breakdown by disassembly of the polyubiquitin chains. Other ubiquitin-specific proteases (USP24 ) have been proposed as candidate genes for PD
fails to replicate here, see ).
Our heritability estimates, which suggest that genetic factors account for at least one-fourth of the total variation in liability to PD, represent the tightest confidence bounds determined for the heritability of PD to date. These estimates, which rely on observed genetic sharing rather than predicted relationship coefficients, avoid confounding from shared environmental covariance by restricting attention to very distantly related individuals. Furthermore, they complement estimates of heritability from twin studies by considering large numbers of individuals with low amounts of genetic sharing, rather than small numbers of twin pairs with large amounts of genetic sharing.
These estimates should only be interpreted as lower bounds on the actual heritability of liability to PD for two reasons. First, they only reflect phenotypic variation due to causal variants in LD with SNPs on the genotyping platform. Second, they only capture the contribution to additive variance that arises from a polygenic model of many SNPs of small effect, but do not include the variance arising from known specific associations. This limitation is most apparent in our estimate of heritability based on only early-onset cases ( ), which is considerably lower than reported in prior twin studies (e.g., ). In early-onset PD, mutations in six specific genes (SNCA , and GBA ) have been reported to account for 16% of cases; these specific mutations are not directly accounted for in our estimate, which is based on a polygenic model. We note that a similar effect may explain the low heritability estimate for early-onset PD in . Thus, the actual heritability of PD, and the corresponding true upper bound on discriminative accuracy achievable through genetic factors, may be even higher than the estimates we provide.
Our estimates also indicate a substantial genetic component for late-onset PD ( ), for which previous estimates of heritability have been inconclusive due to the lack of statistical power (e.g., 0.068 in and 0.453 in ). One might ask, if late-onset PD is indeed so heritable, why do cases frequently appear sporadically in the general population? Following the analysis of , if one were to assume a heritability of and an average of three children per family, then the proportion of sporadic cases (i.e., no parent, child, sibling, grandparent, aunt or uncle, or first cousin with PD) among all PD cases would be 64% for a prevalence of ; in the 23andMe cohort, 69% of PD cases would be considered sporadic by this definition based on self-reported family history. Similarly, the expected proportion of PD cases with no affected parent or sibling would be 88% under the same assumptions, compared with 84% as reported in , or 89% based on the cohort in . These examples illustrate the fact that the presence or absence of a familial pattern cannot always be used to determine pathogenesis, especially for diseases that are rare and have a complex etiology.
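The reasoning about sporadic cases can be illustrated with a small Monte Carlo sketch of a liability threshold model. This is not the analysis cited above: it is a simplified simulation that only covers the narrower definition (no affected parent or sibling), ignores age of onset, and uses hypothetical values for heritability and prevalence, since the specific values are not reproduced here. The function name `sporadic_fraction` and all parameters are illustrative assumptions.

```python
import random
from statistics import NormalDist


def sporadic_fraction(h2, prevalence, n_children=3, n_families=200_000, seed=0):
    """Share of affected index children with no affected parent or sibling.

    Liability = additive genetic value + independent environment (total variance 1).
    A person is affected when liability exceeds the threshold implied by prevalence.
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    sd_g = h2 ** 0.5            # SD of genetic values in the population
    sd_seg = (h2 / 2.0) ** 0.5  # SD of Mendelian segregation within a family
    sd_e = (1.0 - h2) ** 0.5    # SD of the non-genetic part of liability

    def affected(g):
        return g + rng.gauss(0.0, sd_e) > threshold

    cases = sporadic = 0
    for _ in range(n_families):
        g_father, g_mother = rng.gauss(0.0, sd_g), rng.gauss(0.0, sd_g)
        parents = [affected(g_father), affected(g_mother)]
        mid = 0.5 * (g_father + g_mother)
        children = [affected(mid + rng.gauss(0.0, sd_seg)) for _ in range(n_children)]
        if children[0]:  # treat the first child as the index case
            cases += 1
            if not any(parents) and not any(children[1:]):
                sporadic += 1
    return sporadic / cases


# Hypothetical inputs: heritability 0.25 and prevalence 0.02 are assumptions.
print(sporadic_fraction(h2=0.25, prevalence=0.02))
```

Under these assumptions, most simulated cases have no affected first-degree relative even when heritability is substantial, because the disease is rare and each relative's own liability still has to cross the threshold.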
Overall, our risk prediction results are consistent with a measured AUC of roughly 0.6. The cross-validated AUCs presented here should be distinguished from more usual measurements of AUC in genome-wide association studies, which are typically only estimated on the development set, and which rely on weighted combinations of SNPs with independently estimated odds ratios. In some cases, the bias resulting from lack of proper external validation can be quite large. For example, a simple genetic profile score based on multiplying together odds ratios for the SNPs in appears to achieve an AUC of in the 23andMe data (or if no covariate adjustment is performed), making it appear competitive with some of the best models described in . However, when the same model is evaluated in the NINDS data, the AUC drops to , exhibiting a drop in performance characteristic of models that have been overfit to their training data. In contrast, the consistency between the internal and external validation results in the models shown in demonstrates not only the predictiveness of our models within the 23andMe cohort but also their ability to generalize to other populations.
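As a minimal sketch of the kind of profile score discussed above, the following computes a score by multiplying per-allele odds ratios (done on the log scale) and measures AUC as the probability that a random case outscores a random control. The SNP names, odds ratios, and genotypes are hypothetical; the point is only that when the odds ratios are estimated on the same individuals that are then scored, the resulting AUC is optimistic, and it should be re-measured on an independent cohort.

```python
import math


def profile_score(genotype, odds_ratios):
    """Log of the product of per-allele odds ratios, one factor per risk allele carried."""
    return sum(count * math.log(odds_ratios[snp]) for snp, count in genotype.items())


def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control (ties count one half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            wins += 1.0 if c > k else 0.5 if c == k else 0.0
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical odds ratios and genotypes (risk-allele counts of 0, 1, or 2).
odds_ratios = {"snp_a": 1.3, "snp_b": 0.8, "snp_c": 1.1}
cases = [{"snp_a": 2, "snp_b": 0, "snp_c": 1}, {"snp_a": 1, "snp_b": 1, "snp_c": 2}]
controls = [{"snp_a": 0, "snp_b": 2, "snp_c": 0}, {"snp_a": 1, "snp_b": 1, "snp_c": 0}]

case_scores = [profile_score(g, odds_ratios) for g in cases]
control_scores = [profile_score(g, odds_ratios) for g in controls]
print(auc(case_scores, control_scores))
```

Running the same `auc` function on scores from a held-out or external cohort, with the odds ratios kept fixed, gives the comparison between internal and external performance described in the text.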
Our empirical demonstration that including SNPs beyond the genome-wide significant level provides improved discriminative power mirrors the recent results of , which also studied the performance of sparse regression methods in a risk prediction setting. In an applied setting where the goal is to achieve the best predictive accuracy rather than to isolate the contribution of individual genetic factors, however, even higher discriminative accuracies may be possible if one were to incorporate these covariates as part of the predictive models. Even without these, however, significant improvements in risk prediction are likely still possible, with our heritability analyses indicating asymptotic target AUCs above 0.8.
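The link between heritability and an asymptotic AUC can be sketched with a standard normal approximation under the liability threshold model: a perfect genetic predictor has variance equal to the heritability of liability, and the AUC follows from the difference in its mean between cases and controls. Whether this matches the exact calculation behind the figure above is an assumption, and the prevalence of 0.01 used here is hypothetical; the heritability of 0.25 corresponds to the "at least one-fourth" estimate discussed earlier.

```python
from statistics import NormalDist


def max_auc(h2_liability, prevalence):
    """Approximate AUC of a perfect genetic predictor under a liability threshold model."""
    std = NormalDist()
    t = std.inv_cdf(1.0 - prevalence)   # liability threshold
    z = std.pdf(t)                      # normal density at the threshold
    i_case = z / prevalence             # mean liability of cases
    i_ctrl = -z / (1.0 - prevalence)    # mean liability of controls
    # Mean and variance of the genetic value within cases and within controls.
    mean_diff = h2_liability * (i_case - i_ctrl)
    var_case = h2_liability * (1.0 - h2_liability * i_case * (i_case - t))
    var_ctrl = h2_liability * (1.0 - h2_liability * i_ctrl * (i_ctrl - t))
    return std.cdf(mean_diff / (var_case + var_ctrl) ** 0.5)


# Hypothetical prevalence; heritability taken as one-fourth.
print(max_auc(h2_liability=0.25, prevalence=0.01))
```

With these inputs the approximation gives a value above 0.8, and the bound rises as the assumed heritability increases.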
Our AUCs are generally conservative for a number of reasons. In the internal experiments, they were obtained by training on only 80% of the data. In the external experiments, the models included only the SNPs in common between the 23andMe and NINDS datasets and thus excluded several SNPs with large effects in LRRK2 and GBA that may add a percent or more to the AUC if included. Furthermore, our analyses adjusted for confounding from population structure and other covariates so as to ensure that the discriminative accuracies we reported were specifically due to genetic effects.
Finally, we note that data for the 23andMe cohort used in this study were acquired in a novel manner, using genotype and survey data acquired through a commercial online personal genetic testing service. The use of self-reported phenotype data raised some unique challenges. For example, our cohort was not a true population sample for a number of reasons, such as the general bias toward higher socioeconomic status, as is typical of 23andMe customers. In general, however, we would not expect these ascertainment biases to substantially affect our conclusions unless their effects varied differentially between the case and control sets.
As another example, in compiling the cohort, we used participants with varying levels of completeness in their self-reported data (see Materials and Methods). Out of the 3,426 cases in the 23andMe cohort, though most cases reported having PD in a questionnaire, 482 affirmatively stated they had PD upon entry to the research study but did not fill out any PD-related questionnaire during the study. However, we did not see a large difference between those who answered questionnaires and those who did not. Among the 11 associations presented in , only the association with MAPT showed a significant difference between the cohort who answered a questionnaire and those who did not (see Table S7). Also, approximately 84% of the cases filled out a questionnaire, and of them, over 96% reported a PD diagnosis. Even if a larger fraction (say 10–15%) of those who did not take a questionnaire did not have PD, the gain in power from the additional cases would more than offset the loss of power from having some 50 more false positive cases.
Despite the challenges associated with using self-reported data collected through online surveys, ultimately, our results lend credibility to the accuracy of this novel research design. For example, the agreement between our study and previous studies in terms of the ORs estimated for the 19 associations replicated in strongly suggests that our cohort is similar to those used in other PD studies. Similarly, the consistency of AUCs and heritability estimates across our cohort and the NINDS cohort suggests a limited role of bias in our study.
Importantly, our mode of data collection also provided a number of clear benefits. The use of internet-based techniques enabled rapid recruitment of a large patient community. The 3,426 cases in this study were enrolled in about 18 months, with over half joining in the first month of the study. Also adding significantly to the power and robustness of this study was the availability of a large cohort of controls derived from the 23andMe customer base. By using a non-traditional recruitment approach, we thus were able to attain good power for our study through large sample sizes. To our knowledge, this study represents the largest genome-wide association study of Parkinson's disease conducted on a single cohort to date, with only a recent meta-analysis achieving a larger number of cases. We suggest that this methodology for study design may prove useful for other conditions where a large cohort is paramount for detecting subtle genetic effects.
In summary, we have for the first time used a rapid, web-based enrollment method to assemble a large population for a genome-wide association study of PD. We have replicated results from numerous previous studies, providing support for the utility of our study design. We have also identified two new associations, both in genes related to pathways that have been previously implicated in the pathogenesis of PD. Using cross-validation, we have provided evidence that many suggestive associations in our data may also play an important role. Using recently developed analytic approaches for GWAS that take into account the ascertainment bias inherent in a case-control population, we have estimated the genetic contribution to PD in this sample. These findings confirm the hypothesis that PD is a complex disorder, with both genetic and environmental determinants. Future investigations, expanded to include environmental as well as genetic factors, will likely further refine our understanding of the pathogenesis of PD, and, ultimately, lead to new approaches to treatment.