Our exome capture, sequencing, and analyses are among the first to examine the presence and effect of rare variation in ischemic stroke. Given that the optimal methodologies for exome analysis are not fully established in the study of complex disease, including stroke, we developed a variety of analyses to examine our exome data. First, pursuant to our hypothesis that excess rare variation may play an important role in gene-specific stroke risk, we evaluated specific genes chosen on the basis of our predominant stroke subtype, lacunar stroke, as well as genes selected from prior candidate gene studies. The exome analyses informed by our lacunar stroke GWAS results identified two genes, CSN3 and HLA-DPB1, which appeared to have excess variation in our stroke cases. Further analyses of CSN3 are ongoing; however, given the expected high variability of HLA genes, no further analysis of HLA-DPB1 has been performed. None of the previously identified candidate genes studied demonstrated ‘excess’ rare variation in our exome samples. A compound heterozygote analysis, pursuant to the hypothesis that stroke susceptibility genes may be enriched for novel variants across cases, was also performed, identifying several genes with rare variation across all 10 cases. In particular, CTPB2 appeared to contain a nonsense codon in all 10 cases; however, after further analysis this was identified as a false positive.
As a pilot study, we identified several important methodological considerations and pitfalls that should be considered when implementing such methods. First, we emphasize the importance of selecting a highly defined phenotype among the study subjects, such as selecting on one stroke subtype, few established vascular risk factors, and a positive family history of stroke. Further selecting study subjects from the extremes of the phenotypic tail (i.e., young-onset stroke) may also maximize the genetic contribution to stroke risk.
Second, as our results demonstrate, investigators must evaluate for potential false positive findings, a major pitfall in this type of research. Utilizing existing databases for non-diseased control samples may be fraught with difficulties, as a control population ascertained to study a different disease may contain an individual with the disease under study (in this case, stroke). When studying rare variation, introducing such an error would be particularly troublesome.
While our gene-specific analyses failed to identify excess rare variation in any of the ischemic stroke candidate genes evaluated, our analyses evaluating our exome data in light of our prior GWAS results identified one gene, kappa-casein (CSN3), which warrants additional discussion. As described, by screening our prior GWAS results for genes containing coding SNPs in lacunar stroke, and then evaluating the implicated genes in our exome data, we identified CSN3 (OMIM: 601695; http://omim.org/entry/601695 - accessed 2011 July) as containing an interesting coding polymorphism as well as excess rare variation. CSN3 maps to chromosome 4 at 4q21.1 and contains 5 exons covering 8.85 kb on the direct strand from position 71108305 to 71117153 (NCBI 37, August 2010). The predicted protein has 182 aa (20.3 kDa, pI 8.1). While the function and tissue expression pattern of this gene are not fully established, kappa-casein is known to stabilize micelle formation, preventing casein precipitation in human breast milk (http://www.genecards.org/ - accessed 2011 July). Expression studies have demonstrated the presence of kappa-casein in a variety of soft tissue/muscle tissue cancers [29] and also in coronary artery atherosclerosis [30]. In the coronary atherosclerosis study, the investigators performed a comprehensive gene-level expression assessment of coronary atherosclerosis using 51 coronary artery segments isolated from the explanted hearts of 22 cardiac transplant patients, demonstrating that CSN3 was consistently highly expressed in all arterial segments analyzed.
Other studies have indicated that kappa-casein may play a role in immune and inflammatory responses via regulation of the transcription factor nuclear factor kappaB (NF-kappaB) [31]. Activation of NF-kappaB requires the activity of IKK, a kinase complex that contains two catalytic subunits, IKKalpha and IKKbeta, and a regulatory subunit, IKKgamma. In one such study, the investigators sought to understand how IKK activity is regulated and searched for IKKgamma-interacting proteins using a yeast two-hybrid system. Screening identified CSN3, a component of the COP9 signalosome, as a protein specifically interacting with IKKgamma. Over-expression of CSN3 inhibited NF-kappaB activation triggered by tumor necrosis factor (TNF), but not interleukin-1 (IL-1). Moreover, over-expression of CSN3 also inhibited NF-kappaB activation triggered by proteins involved in TNF signaling, including TNF-R1, TRAF2, RIP, and NIK, but not by TRAF6, a protein involved in IL-1 signaling. These data suggest that CSN3 is a specific negative regulator of TNF-induced, but not IL-1-induced, NF-kappaB activation pathways.
CSN3 was also identified as a novel candidate gene for type 2 diabetes mellitus (T2-DM) in a genome-wide association scan in the Old Order Amish [32]. In this study of 124 type 2 diabetic case subjects compared with 295 control subjects, CSN3 SNP rs3775745 was found to be associated with T2-DM in the Amish (p = 0.002), with this result replicating in a Mexican-American population (p = 0.003). Of note, none of our exome study subjects had diabetes.
Exome-based approaches offer different, yet complementary, information to ongoing stroke genome-wide association study (GWAS) efforts, including ongoing projects with substantially larger sample sizes than our GWAS sample. While we are optimistic about the success of GWAS in identifying stroke-associated variants, we do not believe that GWAS results will account for all stroke risk. This point of view was recently supported by the results of a Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium GWAS evaluating ischemic stroke risk, in which only 2 intergenic SNPs achieved genome-wide significance [27]. In the CHARGE study, SNP rs12425791 on chromosome 12p13 was associated with an increased risk of ischemic stroke with a hazard ratio of 1.33 (95% CI, 1.21 to 1.47), yielding a population attributable risk of 12%. A corresponding hazard ratio of 1.42 (95% CI, 1.06 to 1.91; p = 0.02) was obtained in a large follow-up cohort of black subjects. While these results were extremely important, they did not demonstrate that a few common variants account for the majority of stroke risk. Furthermore, our research group and several other groups participating in the ISGC failed to replicate the findings of the CHARGE study, thereby calling into question the validity of the CHARGE results [33].
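To make the relationship between a hazard ratio and a population attributable risk concrete, the following is a minimal sketch using Levin's formula, one common definition of attributable risk. The text does not state which formula or which risk-allele carrier prevalence the CHARGE investigators used, so both the formula choice and the function names (`levin_par`, `implied_prevalence`) are assumptions made for illustration, and the hazard ratio is treated here as an approximation of the relative risk.

```python
def levin_par(prevalence, relative_risk):
    """Population attributable risk by Levin's formula (an assumed definition).

    PAR = p * (RR - 1) / (1 + p * (RR - 1)), where p is the proportion of the
    population carrying the risk factor and RR is the relative risk.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def implied_prevalence(par, relative_risk):
    """Invert Levin's formula: the carrier prevalence implied by a PAR and an RR."""
    if not 0.0 <= par < 1.0:
        raise ValueError("par must lie in [0, 1)")
    if relative_risk == 1.0:
        raise ValueError("relative_risk of 1 carries no attributable risk")
    return par / ((1.0 - par) * (relative_risk - 1.0))


if __name__ == "__main__":
    # Values quoted in the text: hazard ratio 1.33, attributable risk 12%.
    # The hazard ratio stands in for the relative risk (an approximation).
    p = implied_prevalence(0.12, 1.33)
    print(f"implied carrier prevalence: {p:.3f}")
    print(f"round-trip PAR: {levin_par(p, 1.33):.3f}")
```

Under these assumptions, even a modest hazard ratio can produce a double-digit attributable risk when the risk allele is common, while leaving most stroke risk unexplained.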
While GWAS can identify common variants, they are not suited to situations in which multiple rare disease-causing variants contribute significantly to disease risk. This is because GWAS chips most often carry common variants identified through the HapMap project, and these variants do not serve as markers for rare variation [34]. As such, we believe that sequencing will ultimately identify rare risk variants and that exome sequencing, given its ability to identify rare variants of high penetrance, is an excellent methodology with which to begin these efforts. Large-scale exome projects are now well underway, most notably the National Heart, Lung, and Blood Institute (NHLBI) GO Exome Sequencing Project (ESP). The ESP is funded by NHLBI and managed by both NHLBI and the National Human Genome Research Institute (NHGRI). The goal of the ESP is to develop and validate a cost-effective, high-throughput sequencing application for all protein-coding regions of the human genome in the study of several diseases, including ischemic stroke. The purpose of developing this resequencing application is to enable the sequencing of tens of thousands of individual samples from NHLBI's well-phenotyped populations in a cost-effective manner. These data will be publicly available and free to use by the scientific community. Among these will be a deeply phenotyped reference sample not selected on the basis of disease, consisting of 750 Caucasians of European origin and 250 African-Americans. The first samples have recently been submitted to dbGaP, with many more expected over the next year.
Our pilot study has several limitations, most notably the small sample size, which was directly attributable to the costs associated with exome sequencing. Further, given that stroke is a prototypical complex disease with numerous risk factors and subtypes, the genetic mechanisms underlying each stroke subtype included in our study are likely somewhat different. However, as described in the Methods section, we worked to limit the presence of risk factors in our young-onset stroke samples. Our study was also hampered by a lack of standardized programs for exome analysis. At present, analytic methodologies for rare variants are an area of active methodological research [35]. While many different analyses could potentially be performed on our dataset, we focused our efforts on what we felt might yield the ‘low-hanging fruit’; that is, non-synonymous coding variants leading to missense or nonsense mutations. This was motivated by the fact that over half of all known disease mutations come from such replacement polymorphisms; such an assumption may or may not be appropriate in the setting of ischemic stroke. Here we also highlight that our definition of “non-common variants” refers to variants that are not shared between the exome data and the combined 3 control populations. For example, a variant shared between the African-American control data (YRI or ASW) and the 8 African-American exome cases that was not seen in the Caucasian control data would be considered “non-common”. We formulated our definition in this fashion because we were seeking a shared gene-specific stroke mechanism across ethnicities rather than an ethnicity-specific mechanism. In other words, we were seeking global excess variation among the cases in the genes evaluated as compared to the control population as a whole. While it would certainly be interesting to perform an ethnicity-stratified analysis, we felt that our current sample size precludes such an analysis. Another important analysis issue we considered is the potentially confounding effect of population substructure, which could induce false positive (or false negative) findings. Applying PLINK [36], [37] to our GWAS data allowed us to identify 47 population outliers, who were removed from the pool considered for our exome analyses. Lastly, by including African-Americans in our exome analyses we inherently complicated our study. While there is little data available on the allelic spectrum of the exome in African-Americans, HapMap data have shown up to a 2-fold greater number of SNPs in the Yoruba (Ibadan, Nigeria) (YRI) samples compared with the Caucasian (CEU) samples. However, given that African-American adults are twice as likely to have a stroke as their Caucasian adult counterparts [38], with this especially true in the young [39], [40], we felt that the additional scientific opportunity offset the additional exome complexity inherent in studying African-Americans.
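The “non-common variant” definition given above can be sketched as a set operation. This is one reading of the definition, not the authors' actual pipeline: a variant is treated as common only if it appears in every control population, so a case variant seen in YRI or ASW but absent from CEU remains “non-common”, as in the example in the text. The function name, the variant representation as `(chromosome, position, ref, alt)` tuples, and the coordinates are all hypothetical.

```python
def non_common_variants(case_variants, control_populations):
    """Return case variants that are absent from at least one control population.

    case_variants: iterable of hashable variant keys (hypothetical tuples of
        chromosome, position, reference allele, alternate allele).
    control_populations: dict mapping a population label to its variant keys.
    """
    if not control_populations:
        # With no controls, nothing can be classified as common.
        return set(case_variants)
    # Variants observed in every control population are treated as common.
    control_sets = [set(variants) for variants in control_populations.values()]
    shared_by_all_controls = set.intersection(*control_sets)
    return set(case_variants) - shared_by_all_controls


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    v1 = ("4", 71110000, "A", "G")  # in YRI and ASW, not in CEU
    v2 = ("4", 71112000, "C", "T")  # in all three control populations
    v3 = ("6", 33050000, "G", "A")  # in no control population
    controls = {"CEU": {v2}, "YRI": {v1, v2}, "ASW": {v1, v2}}
    for variant in sorted(non_common_variants({v1, v2, v3}, controls)):
        print(variant)
```

An ethnicity-stratified analysis would instead compare each case only against its ancestry-matched controls, which this sketch deliberately does not do.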
Our study also has several advantages. First, our study population is an ideal sample to evaluate, since young age of onset is an extreme phenotype that is likely enriched with rare variants, a premise further supported by the fact that numerous familial aggregation studies implicate a greater genetic risk at younger ages. Another strength is that our exome samples were also genotyped on the Illumina 1M Quad [Omni Quad] as part of our group's participation in the GENEVA consortium [41], thereby allowing us to perform the combined exome and GWAS analyses as described. Of note, our GWAS data were rigorously evaluated by the GENEVA Data Cleaning Group (University of Washington, Seattle, WA) and the genotyping center, the Center for Inherited Disease Research (CIDR) (http://www.cidr.jhmi.edu/ - accessed 2011 July). Quality metrics included missing genotype call rates by sample and SNP, Hardy-Weinberg analysis, and detailed global and local population substructure analysis using principal components analysis [18].
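Two of the quality metrics named above, the missing call rate and Hardy-Weinberg analysis, can be illustrated with a short sketch. The text does not say which Hardy-Weinberg test the data cleaning group applied; the example below assumes a simple one-degree-of-freedom chi-square goodness-of-fit test, and the function names and genotype encoding (`None` for a missing call) are hypothetical.

```python
import math


def missing_call_rate(genotypes):
    """Fraction of missing calls, where a missing genotype is encoded as None."""
    if not genotypes:
        raise ValueError("genotypes must not be empty")
    return sum(1 for g in genotypes if g is None) / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    Takes observed counts of the three genotypes at a biallelic SNP and
    returns (chi-square statistic, p-value on 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    print(missing_call_rate(["AA", "AB", None, "BB"]))
    print(hwe_chi_square(25, 50, 25))
    print(hwe_chi_square(50, 0, 50))
```

For SNPs with low genotype counts an exact test is generally preferred over the chi-square approximation; the approximation is used here only to keep the sketch dependency-free.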
In closing, our study evaluated a highly prevalent complex disease for which the genetic architecture remains uncertain. Prior and ongoing studies implementing genome-wide association techniques are only capable of identifying common variants associated with disease. The extent to which ischemic stroke is determined by rare or low-frequency variants has not yet been explored. Our study demonstrates the feasibility of utilizing exome-based techniques to address this important research gap by implementing several different analyses at the variant and gene levels. Our study also highlights several important considerations in this type of research, such as attaining a highly specific phenotype and utilizing an extreme phenotype that is likely enriched with rare variants; in our case, young-onset subtype-specific stroke. One gene, CSN3, identified by screening our prior GWAS results for genes containing coding SNPs in lacunar stroke, was found to contain both an interesting coding polymorphism and excess rare variation in our exome data as compared with the other genes evaluated. Additional research will be required to determine whether CSN3 variants are truly associated with ischemic stroke risk. Lastly, while rare coding variants may predispose to ischemic stroke, this has yet to be definitively proven. Our study demonstrates the complexities of such research and highlights that, while exome data can be obtained, the optimal analytical methods have yet to be determined.