To characterize the role of rare complete human knockouts in autism spectrum disorders (ASD), we identify genes with homozygous or compound heterozygous loss-of-function (LoF) variants (defined as nonsense and essential splice sites) from exome sequencing of 933 cases and 869 controls. We identify a two-fold increase in complete knockouts of autosomal genes with low rates of LoF variation (≤5% frequency) in cases and estimate a 3% contribution to ASD risk by these events, confirming this observation in an independent set of 563 probands and 4,605 controls. Outside the pseudo-autosomal regions on the X-chromosome, we similarly observe a significant 1.5-fold increase in rare hemizygous knockouts in males, contributing to another 2% of ASDs in males. Taken together these results provide compelling evidence that rare autosomal and X-chromosome complete gene knockouts are important inherited risk factors for ASD.
Genome wide association studies revealed that variation in the Melatonin Receptor 1B gene (MTNR1B) is associated with insulin and glucose concentrations. Here we show that the risk genotype of this SNP predicts future type 2 diabetes (T2D) in two large prospective studies. Specifically, the risk genotype was associated with impairment of early insulin response to both oral and intravenous glucose and with faster deterioration of insulin secretion over time. We also show that the Melatonin Receptor 1B mRNA is expressed in human islets, and immunocytochemistry confirms that it is primarily localized in β-cells in islets. Non-diabetic individuals carrying the risk allele and patients with T2D showed increased expression of the receptor in islets. Insulin release from clonal β-cells in response to glucose was inhibited in the presence of melatonin. These data suggest that the circulating hormone melatonin, which is predominantly released from the pineal gland in the brain, is involved in the pathogenesis of T2D. Given the increased expression of Melatonin Receptor 1B in individuals at risk of T2D, the pathogenic effects are likely exerted via a direct inhibitory effect on β-cells. In view of these results, blocking the melatonin ligand-receptor system could be a therapeutic avenue in T2D.
As a first step toward understanding how rare variants contribute to risk for complex diseases, we sequenced 15,585 human protein-coding genes to an average median depth of 111× in 2440 individuals of European (n = 1351) and African (n = 1088) ancestry. We identified over 500,000 single-nucleotide variants (SNVs), the majority of which were rare (86% with a minor allele frequency less than 0.5%), previously unknown (82%), and population-specific (82%). On average, 2.3% of the 13,595 SNVs each person carried were predicted to affect protein function of ∼313 genes per genome, and ∼95.7% of SNVs predicted to be functionally important were rare. This excess of rare functional variants is due to the combined effects of explosive, recent accelerated population growth and weak purifying selection. Furthermore, we show that large sample sizes will be required to associate rare variants with complex traits.
Establishing the age of each mutation segregating in contemporary human populations is important to fully understand our evolutionary history1,2 and will help facilitate the development of new approaches for disease gene discovery3. Large-scale surveys of human genetic variation have reported signatures of recent explosive population growth4-6, notable for an excess of rare genetic variants, qualitatively suggesting that many mutations arose recently. To more quantitatively assess the distribution of mutation ages, we resequenced 15,336 genes in 6,515 individuals of European (n=4,298) and African (n=2,217) American ancestry and inferred the age of 1,146,401 autosomal single nucleotide variants (SNVs). We estimate that ~73% of all protein-coding SNVs and ~86% of SNVs predicted to be deleterious arose in the past 5,000-10,000 years. The average age of deleterious SNVs varied significantly across molecular pathways, and disease genes contained a significantly higher proportion of recently arisen deleterious SNVs compared to other genes. Furthermore, European Americans had an excess of deleterious variants in essential and Mendelian disease genes compared to African Americans, consistent with weaker purifying selection due to the out-of-Africa dispersal. Our results better delimit the historical details of human protein-coding variation, illustrate the profound effect recent human history has had on the burden of deleterious SNVs segregating in contemporary populations, and provide important practical information that can be used to prioritize variants in disease gene discovery.
Systemic lupus erythematosus (SLE) is a multisystem complex autoimmune disease of uncertain etiology (OMIM 152700). Over recent years a genetic component to SLE susceptibility has been established (refs. 1–3). Recent successes with association studies in SLE have identified genes including IRF5 (refs. 4,5) and FCGR3B (ref. 6). Two tumor necrosis factor (TNF) superfamily members located within intervals showing genetic linkage with SLE are TNFSF4 (also known as OX40L; 1q25), which is expressed on activated antigen-presenting cells (APCs; refs. 7,8) and vascular endothelial cells (ref. 9), and also its unique receptor, TNFRSF4 (also known as OX40; 1p36), which is primarily expressed on activated CD4+ T cells (ref. 10). TNFSF4 produces a potent co-stimulatory signal for activated CD4+ T cells after engagement of TNFRSF4 (ref. 11). Using both a family-based and a case-control study design, we show that the upstream region of TNFSF4 contains a single risk haplotype for SLE, which is correlated with increased expression of both cell-surface TNFSF4 and the TNFSF4 transcript. We hypothesize that increased expression of TNFSF4 predisposes to SLE either by quantitatively augmenting T cell–APC interaction or by influencing the functional consequences of T cell activation via TNFRSF4.
Motivation: The question of how to best use information from known associated variants when conducting disease association studies has yet to be answered. Some studies compute a marginal P-value for each single nucleotide polymorphism (SNP) independently, ignoring previously discovered variants. Other studies include known variants as covariates in logistic regression, but a weakness of this standard conditioning strategy is that it does not account for disease prevalence and non-random ascertainment, which can induce a correlation structure between candidate variants and known associated variants even if the variants lie on different chromosomes. Here, we propose a new conditioning approach, which is based in part on the classical technique of liability threshold modeling. Roughly, this method estimates model parameters for each known variant while accounting for the published disease prevalence from the epidemiological literature.
Results: We show via simulation and application to empirical datasets that our approach outperforms both the no conditioning strategy and the standard conditioning strategy, with a properly controlled false-positive rate. Furthermore, in multiple data sets involving diseases of low prevalence, standard conditioning produces a severe drop in test statistics whereas our approach generally performs as well as or better than no conditioning. Our approach may substantially improve disease gene discovery for diseases with many known risk variants.
Availability: LTSOFT software is available online at http://www.hsph.harvard.edu/faculty/alkes-price/software/
Supplementary information: Supplementary data are available at Bioinformatics online.
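The liability threshold idea referred to in the Motivation can be illustrated with a minimal sketch. This is not the LTSOFT implementation: the function names, the standard-normal liability scale, and the parameter `var_g` (the share of liability variance explained by known variants) are assumptions introduced here for illustration only.

```python
from statistics import NormalDist


def liability_threshold(prevalence: float) -> float:
    """Threshold t on a standard-normal liability scale with P(L > t) = prevalence."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - prevalence)


def risk_given_genetic_liability(g: float, prevalence: float, var_g: float) -> float:
    """P(case | genetic liability g).

    Assumption (hypothetical model): total liability L = g + e, where the
    residual e ~ N(0, 1 - var_g), and an individual is a case when L > t.
    """
    if not 0.0 <= var_g < 1.0:
        raise ValueError("var_g must be in [0, 1)")
    t = liability_threshold(prevalence)
    residual_sd = (1.0 - var_g) ** 0.5
    return 1.0 - NormalDist(0.0, residual_sd).cdf(t - g)


if __name__ == "__main__":
    # Illustrative values only: a 1% prevalence disease, known variants
    # explaining 10% of liability variance.
    for g in (-0.5, 0.0, 0.5):
        print(g, risk_given_genetic_liability(g, prevalence=0.01, var_g=0.10))
```

Under this assumed model, a lower prevalence pushes the threshold further into the tail, which is one way to see why the published prevalence matters when conditioning on known variants.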
Coronary artery calcification (CAC) detected by computed tomography is a non-invasive measure of coronary atherosclerosis, which underlies most cases of myocardial infarction (MI). We aimed to identify common genetic variants associated with CAC and further investigate their associations with MI.
Methods and Results
Computed tomography was used to assess quantity of CAC. A meta-analysis of genome-wide association studies for CAC was carried out in 9,961 men and women from five independent community-based cohorts, with replication in three additional independent cohorts (n=6,032). We examined the top single nucleotide polymorphisms (SNPs) associated with CAC quantity for association with MI in multiple large genome-wide association studies of MI. Genome-wide significant associations with CAC for SNPs on chromosome 9p21 near CDKN2A and CDKN2B (top SNP: rs1333049, P=7.58×10−19) and 6p24 (top SNP: rs9349379, within the PHACTR1 gene, P=2.65×10−11) replicated for CAC and for MI. Additionally, there is evidence for concordance of SNP associations with both CAC and with MI at a number of other loci, including 3q22 (MRAS gene), 13q34 (COL4A1/COL4A2 genes), and 1p13 (SORT1 gene).
SNPs in the 9p21 and PHACTR1 gene loci were strongly associated with CAC and MI, and there are suggestive associations with both CAC and MI of SNPs in additional loci. Multiple genetic loci are associated with development of both underlying coronary atherosclerosis and clinical events.
cardiac computed tomography; coronary artery calcification; coronary atherosclerosis; genome-wide association studies; myocardial infarction
Weight-loss interventions generally improve lipid profiles and reduce cardiovascular disease risk, but effects are variable and may depend on genetic factors. We performed a genetic association analysis of data from 2,993 participants in the Diabetes Prevention Program to test the hypotheses that a genetic risk score (GRS) based on deleterious alleles at 32 lipid-associated single-nucleotide polymorphisms modifies the effects of lifestyle and/or metformin interventions on lipid levels and nuclear magnetic resonance (NMR) lipoprotein subfraction size and number. Twenty-three loci previously associated with fasting LDL-C, HDL-C, or triglycerides replicated (P = 0.04–1×10−17). Except for total HDL particles (r = −0.03, P = 0.26), all components of the lipid profile correlated with the GRS (partial |r| = 0.07–0.17, P = 5×10−5–1×10−19). The GRS was associated with higher baseline-adjusted 1-year LDL cholesterol levels (β = +0.87, SEE±0.22 mg/dl/allele, P = 8×10−5, Pinteraction = 0.02) in the lifestyle intervention group, but not in the placebo (β = +0.20, SEE±0.22 mg/dl/allele, P = 0.35) or metformin (β = −0.03, SEE±0.22 mg/dl/allele, P = 0.90; Pinteraction = 0.64) groups. Similarly, a higher GRS predicted a greater number of baseline-adjusted small LDL particles at 1 year in the lifestyle intervention arm (β = +0.30, SEE±0.012 ln nmol/L/allele, P = 0.01, Pinteraction = 0.01) but not in the placebo (β = −0.002, SEE±0.008 ln nmol/L/allele, P = 0.74) or metformin (β = +0.013, SEE±0.008 nmol/L/allele, P = 0.12; Pinteraction = 0.24) groups. Our findings suggest that a high genetic burden confers an adverse lipid profile and predicts attenuated response in LDL-C levels and small LDL particle number to dietary and physical activity interventions aimed at weight loss.
The study included 2,993 participants from the Diabetes Prevention Program, a randomized clinical trial of intensive lifestyle intervention, metformin treatment, and placebo control. We examined associations between 32 gene variants that have been reproducibly associated with dyslipidemia and concentrations of lipids and NMR lipoprotein particle sizes and numbers. We also examined whether genetic background influences a person's response to cardioprotective interventions on lipid levels. Our analysis, which focused on determining whether common genetic variants impact the effects of cardioprotective interventions on lipid and lipoprotein particle size, shows that in persons with a high genetic risk score the benefit of intensive lifestyle intervention on LDL and small LDL particle levels is substantially diminished; this finding may inform the targeted prevention of dyslipidemia, as it suggests that genetics might help identify persons in whom lifestyle intervention is likely to be an effective treatment for elevated lipids and lipoproteins. The NMR subfraction analyses provide novel insight into the biology of dyslipidemia by illustrating how numerous genetic variants that have previously been associated with lipid levels also modulate NMR lipoprotein particle sizes and number. This information may also inform the targeted prevention of cardiovascular disease.
The genetic loci that have been found by genome-wide association studies to modulate risk of coronary heart disease explain only a fraction of its total variance, and gene-gene interactions have been proposed as a potential source of the remaining heritability. Given the potentially large testing burden, we sought to enrich our search space with real interactions by analyzing variants that may be more likely to interact on the basis of two distinct hypotheses: a biological hypothesis, under which MI risk is modulated by interactions between variants that are known to be relevant for its risk factors; and a statistical hypothesis, under which interacting variants individually show weak marginal association with MI. In a discovery sample of 2,967 cases of early-onset myocardial infarction (MI) and 3,075 controls from the MIGen study, we performed pair-wise SNP interaction testing using a logistic regression framework. Despite having reasonable power to detect interaction effects of plausible magnitudes, we observed no statistically significant evidence of interaction under these hypotheses, and no clear consistency between the top results in our discovery sample and those in a large validation sample of 1,766 cases of coronary heart disease and 2,938 controls from the Wellcome Trust Case-Control Consortium. Our results do not support the existence of strong interaction effects as a common risk factor for MI. Within the scope of the hypotheses we have explored, this study places a modest upper limit on the magnitude that epistatic risk effects are likely to have at the population level (odds ratio for MI risk 1.3–2.0, depending on allele frequency and interaction model).
High coverage whole genome sequencing provides near complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1 M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms, when low coverage sequence reads are added to dense genome-wide SNP arrays; the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome wide association studies and supports development of improved methods for genotype calling.
In this work we address a series of questions prompted by the rise of next-generation sequencing as a data collection strategy for genetic studies. How does low coverage sequencing compare to traditional microarray based genotyping? Do studies increase sensitivity by collecting both sequencing and array data? What can we learn about technology error modes based on analysis of SNPs for which sequence and array data disagree? To answer these questions, we developed a statistical framework to estimate genotypes from sequence reads, array intensities, and imputation. Through experiments with intensity and read data from the HapMap and 1000 Genomes (1000 G) Projects, we show that 1 M SNP arrays used for genome wide association studies perform similarly to 1× sequencing. We find that adding low coverage sequence reads to dense array data significantly increases rare variant sensitivity, but adding dense array data to low coverage sequencing has only a small impact. Finally, we describe an improved SNP calling algorithm used in the 1000 G project, inspired by a novel next-generation sequencing error mode identified through analysis of disputed SNPs. These results inform the use of next-generation sequencing in genetic studies and model an approach to further improve genotype calling methods.
More than a thousand disease susceptibility loci have been identified via genome-wide association studies (GWAS) of common variants; however, the specific genes and full allelic spectrum of causal variants underlying these findings generally remain to be defined. We utilize pooled next-generation sequencing to study 56 genes in regions associated with Crohn’s disease (CD) in 350 cases and 350 controls. Follow-up genotyping of 70 rare and low-frequency protein-altering variants (MAF ~ .001-.05) in nine independent case-control series (16054 CD patients, 12153 ulcerative colitis patients, 17575 healthy controls) identifies four additional independent risk factors in NOD2, two additional protective variants in IL23R, a highly significant association with a novel, protective splice variant in CARD9 (p < 1e-16, OR ~ 0.29), as well as additional associations with coding variants in IL18RAP, CUL2, C1orf106, PTPN22 and MUC19. We extend the results of successful GWAS by providing novel, rare, and likely functional variants that will empower functional experiments and predictive models.
The let-7 tumor suppressor microRNAs are known for their regulation of oncogenes, while the RNA-binding proteins Lin28a/b promote malignancy by blocking let-7 biogenesis. In studies of the Lin28/let-7 pathway, we discovered unexpected roles in regulating metabolism. When overexpressed in mice, both Lin28a and LIN28B promoted an insulin-sensitized state that resisted high fat diet-induced diabetes, whereas muscle-specific loss of Lin28a and overexpression of let-7 resulted in insulin resistance and impaired glucose tolerance. These phenomena occurred in part through let-7-mediated repression of multiple components of the insulin-PI3K-mTOR pathway, including IGF1R, INSR, and IRS2. The mTOR inhibitor rapamycin abrogated the enhanced glucose uptake and insulin-sensitivity conferred by Lin28a in vitro and in vivo. In addition, we found that let-7 targets were enriched for genes that contain SNPs associated with type 2 diabetes and fasting glucose in human genome-wide association studies. These data establish the Lin28/let-7 pathway as a central regulator of mammalian glucose metabolism.
Over 30 loci have been associated with risk of type 2 diabetes at genome-wide statistical significance. Genetic risk scores (GRSs) developed from these loci predict diabetes in the general population. We tested if a GRS based on an updated list of 34 type 2 diabetes–associated loci predicted progression to diabetes or regression toward normal glucose regulation (NGR) in the Diabetes Prevention Program (DPP).
RESEARCH DESIGN AND METHODS
We genotyped 34 type 2 diabetes–associated variants in 2,843 DPP participants at high risk of type 2 diabetes from five ethnic groups representative of the U.S. population, who had been randomized to placebo, metformin, or lifestyle intervention. We built a GRS by weighting each risk allele by its reported effect size on type 2 diabetes risk and summing these values. We tested its ability to predict diabetes incidence or regression to NGR in models adjusted for age, sex, ethnicity, waist circumference, and treatment assignment.
In multivariate-adjusted models, the GRS was significantly associated with increased risk of progression to diabetes (hazard ratio [HR] = 1.02 per risk allele [95% CI 1.00–1.05]; P = 0.03) and a lower probability of regression to NGR (HR = 0.95 per risk allele [95% CI 0.93–0.98]; P < 0.0001). At baseline, a higher GRS was associated with a lower insulinogenic index (P < 0.001), confirming an impairment in β-cell function. We detected no significant interaction between GRS and treatment, but the lifestyle intervention was effective in the highest quartile of GRS (P < 0.0001).
A high GRS is associated with increased risk of developing diabetes and lower probability of returning to NGR in high-risk individuals, but a lifestyle intervention attenuates this risk.
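The weighted GRS construction described in the methods (each risk allele weighted by its reported effect size, then summed) can be sketched as follows. The SNP identifiers, the weights, and the choice of natural-log odds ratios as the effect-size scale are hypothetical placeholders, not values from the study.

```python
from typing import Mapping


def weighted_grs(risk_allele_counts: Mapping[str, int],
                 weights: Mapping[str, float]) -> float:
    """Sum over variants of (risk-allele count) x (per-allele weight).

    Counts must be 0, 1, or 2 for a diploid genotype. A variant present in
    `weights` but missing from `risk_allele_counts` raises KeyError, so
    missing genotypes are never silently treated as zero.
    """
    score = 0.0
    for snp, weight in weights.items():
        count = risk_allele_counts[snp]
        if count not in (0, 1, 2):
            raise ValueError(f"{snp}: risk-allele count must be 0, 1, or 2")
        score += weight * count
    return score


if __name__ == "__main__":
    # Hypothetical per-allele weights (assumed to be ln odds ratios).
    example_weights = {"snp_a": 0.10, "snp_b": 0.20, "snp_c": 0.30}
    # Hypothetical genotype for one participant.
    example_counts = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
    print(weighted_grs(example_counts, example_weights))
```

Setting every weight to 1 reduces this to an unweighted count of risk alleles.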
The risk of type 2 diabetes is approximately 2-fold higher in African Americans than in European Americans even after adjusting for known environmental risk factors, including socioeconomic status (SES), suggesting that genetic factors may explain some of this population difference in disease risk. However, relatively few genetic studies have examined this hypothesis in a large sample of African Americans with and without diabetes. Therefore, we performed an admixture analysis using 2,189 ancestry-informative markers in 7,021 African Americans (2,373 with type 2 diabetes and 4,648 without) from the Atherosclerosis Risk in Communities Study, the Jackson Heart Study, and the Multiethnic Cohort to 1) determine the association of type 2 diabetes and its related quantitative traits with African ancestry controlling for measures of SES and 2) identify genetic loci for type 2 diabetes through a genome-wide admixture mapping scan. The median percentage of African ancestry of diabetic participants was slightly greater than that of non-diabetic participants (study-adjusted difference = 1.6%, P<0.001). The odds ratio for diabetes comparing participants in the highest vs. lowest tertile of African ancestry was 1.33 (95% confidence interval 1.13–1.55), after adjustment for age, sex, study, body mass index (BMI), and SES. Admixture scans identified two potential loci for diabetes at 12p13.31 (LOD = 4.0) and 13q14.3 (Z score = 4.5, P = 6.6×10−6). In conclusion, genetic ancestry has a significant association with type 2 diabetes above and beyond its association with non-genetic risk factors for type 2 diabetes in African Americans, but no single gene with a major effect is sufficient to explain a large portion of the observed population difference in risk of diabetes. There undoubtedly is a complex interplay among specific genetic loci and non-genetic factors, which may both be associated with overall admixture, leading to the observed ethnic differences in diabetes risk.
The potential benefits of using population isolates in genetic mapping, such as reduced genetic, phenotypic and environmental heterogeneity, are offset by the challenges posed by the large amounts of direct and cryptic relatedness in these populations confounding basic assumptions of independence. We have evaluated four representative specialized methods for association testing in the presence of relatedness: (i) within-family, (ii) within- and between-family, and (iii) mixed-models methods, using simulated traits for 2906 subjects with known genome-wide genotype data from an extremely isolated population, the Island of Kosrae, Federated States of Micronesia. We report that mixed models optimally extract association information from such samples, demonstrating 88% power to rank the true variant as among the top 10 genome-wide with 56% achieving genome-wide significance, a >80% improvement over the other methods, and demonstrate that population isolates have similar power to non-isolate populations for observing variants of known effects. We then used the mixed-model method to reanalyze data for 17 published phenotypes relating to metabolic traits and electrocardiographic measures, along with another 8 previously unreported. We replicate nine genome-wide significant associations with known loci of plasma cholesterol, high-density lipoprotein, low-density lipoprotein, triglycerides, thyroid stimulating hormone, homocysteine, C-reactive protein and uric acid, with only one detected in the previous analysis of the same traits. Further, we leveraged shared identity-by-descent genetic segments in the region of the uric acid locus to fine-map the signal, refining the known locus by a factor of 4. Finally, we report a novel association for height (rs17629022, P< 2.1 × 10−8).
We performed a meta-analysis of 14 genome-wide association studies of coronary artery disease (CAD) comprising 22,233 cases and 64,762 controls of European descent, followed by genotyping of top association signals in 60,738 additional individuals. This genomic analysis identified 13 novel loci harboring one or more SNPs that were associated with CAD at P<5×10−8 and confirmed the association of 10 of 12 previously reported CAD loci. The 13 novel loci displayed risk allele frequencies ranging from 0.13 to 0.91 and were associated with a 6 to 17 percent increase in the risk of CAD per allele. Notably, only three of the novel loci displayed significant association with traditional CAD risk factors, while the majority lie in gene regions not previously implicated in the pathogenesis of CAD. Finally, five of the novel CAD risk loci appear to have pleiotropic effects, showing strong association with various other human diseases or traits.
Genome-wide association studies have begun to elucidate the genetic architecture of type 2 diabetes. We examined whether single nucleotide polymorphisms (SNPs) identified through targeted complementary approaches affect diabetes incidence in the at-risk population of the Diabetes Prevention Program (DPP) and whether they influence a response to preventive interventions.
RESEARCH DESIGN AND METHODS
We selected SNPs identified by prior genome-wide association studies for type 2 diabetes and related traits, or capturing common variation in 40 candidate genes previously associated with type 2 diabetes, implicated in monogenic diabetes, encoding type 2 diabetes drug targets or drug-metabolizing/transporting enzymes, or involved in relevant physiological processes. We analyzed 1,590 SNPs for association with incident diabetes and their interaction with response to metformin or lifestyle interventions in 2,994 DPP participants. We controlled for multiple hypothesis testing by assessing false discovery rates.
We replicated the association of variants in the metformin transporter gene SLC47A1 with metformin response and detected nominal interactions in the AMP kinase (AMPK) gene STK11, the AMPK subunit genes PRKAA1 and PRKAA2, and a missense SNP in SLC22A1, which encodes another metformin transporter. The most significant association with diabetes incidence occurred in the AMPK subunit gene PRKAG2 (hazard ratio 1.24, 95% CI 1.09–1.40, P = 7 × 10−4). Overall, there were nominal associations with diabetes incidence at 85 SNPs and nominal interactions with the metformin and lifestyle interventions at 91 and 69 mostly nonoverlapping SNPs, respectively. The lowest P values were consistent with experiment-wide 33% false discovery rates.
We have identified potential genetic determinants of metformin response. These results merit confirmation in independent samples.
Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.
The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.
This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
The number and volume of cells in the blood affect a wide range of disorders including cancer and cardiovascular, metabolic, infectious and immune conditions. We consider here the genetic variation in eight clinically relevant hematological parameters, including hemoglobin levels, red and white blood cell counts and platelet counts and volume. We describe common variants within 22 genetic loci reproducibly associated with these hematological parameters in 13,943 samples from six European population-based studies, including 6 associated with red blood cell parameters, 15 associated with platelet parameters and 1 associated with total white blood cell count. We further identified a long-range haplotype at 12q24 associated with coronary artery disease in 9,479 cases and 10,527 controls. We show that this haplotype demonstrates extensive disease pleiotropy, as it contains known risk loci for type 1 diabetes, hypertension and celiac disease and has been spread by a selective sweep specific to European and geographically nearby populations.
We sequenced all protein-coding regions of the genome (the “exome”) in two family members with combined hypolipidemia, marked by extremely low plasma levels of low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol, and triglycerides. These two participants were compound heterozygotes for two distinct nonsense mutations in ANGPTL3 (encoding the angiopoietin-like 3 protein). ANGPTL3 has been reported to inhibit lipoprotein lipase and endothelial lipase, thereby increasing plasma triglyceride and HDL cholesterol levels in rodents. Our finding of ANGPTL3 mutations highlights a role for the gene in LDL cholesterol metabolism in humans and shows the usefulness of exome sequencing for identification of novel genetic causes of inherited disorders. (Funded by the National Human Genome Research Institute and others.)
The CHEK2-1100delC mutation is recurrent in the population and is a moderate risk factor for breast cancer. To identify additional CHEK2 mutations potentially contributing to breast cancer susceptibility, we sequenced 248 cases with early-onset disease, functionally characterized new variants, and conducted a population-based case–control analysis to evaluate their contribution to breast cancer risk. We identified 1 additional null mutation and 5 missense variants in the germline of cancer patients. In vitro, the CHEK2-H143Y variant resulted in gross protein destabilization, while others had variable suppression of in vitro kinase activity using BRCA1 as a substrate. The germline CHEK2-1100delC mutation was present among 8/1,646 (0.5%) sporadic, 2/400 (0.5%) early-onset and 3/302 (1%) familial breast cancer cases, but undetectable amongst 2,105 multiethnic controls, including 633 from the US. CHEK2-positive breast cancer families also carried a deleterious BRCA1 mutation. 1100delC appears to be the only recurrent CHEK2 mutation associated with a potentially significant contribution to breast cancer risk in the general population. Another recurrent mutation with attenuated in vitro function, CHEK2-P85L, is not associated with increased breast cancer susceptibility, but exhibits a striking difference in frequency across populations with different ancestral histories. These observations illustrate the importance of genotyping ethnically diverse groups when assessing the impact of low-penetrance susceptibility alleles on population risk. Our findings highlight the notion that clinical testing for rare missense mutations within CHEK2 may have limited value in predicting breast cancer risk, but that testing for the 1100delC variant may be valuable in phenotypically- and geographically-selected populations.
CHEK2; susceptibility; breast; cancer; mutation
Discovering the molecular basis of mitochondrial respiratory chain disease is challenging given the large number of both mitochondrial and nuclear genes involved. We report a strategy of focused candidate gene prediction, high-throughput sequencing, and experimental validation to uncover the molecular basis of mitochondrial complex I (CI) disorders. We created five pools of DNA from a cohort of 103 patients and then performed deep sequencing of 103 candidate genes to identify 151 rare variants predicted to impact protein function. We used confirmatory experiments to establish genetic diagnoses in 22% of previously unsolved cases, and discovered that defects in NUBPL and FOXRED1 can cause CI deficiency. Our study illustrates how large-scale sequencing, coupled with functional prediction and experimental validation, can reveal novel disease-causing mutations in individual patients.
Technological advances make it possible to use high-throughput sequencing as a primary discovery tool in medical genetics, specifically for assaying rare variation. Still, this approach faces the analytic challenge that the influence of very rare variants can only be evaluated effectively as a group. A further complication is that any given rare variant could have no effect, could increase risk, or could be protective. We propose here the C-alpha test statistic as a novel approach for testing for the presence of this mixture of effects across a set of rare variants. Unlike existing burden tests, C-alpha, by testing the variance rather than the mean, maintains consistent power when the target set contains both risk and protective variants. Through simulations and analysis of case/control data, we demonstrate good power relative to existing methods that assess the burden of rare variants in individuals.
Developments in sequencing technology now enable us to assay all genetic variation, much of which is extremely rare. We propose to test the distribution of rare variants we observe in cases versus controls. To do so, we present a novel application of the C-alpha statistic to test these rare variants. C-alpha aims to determine whether the set of variants observed in cases and controls is a mixture, such that some of the variants confer risk or protection or are phenotypically neutral. Risk variants are expected to be more common in cases; protective variants more common in controls. C-alpha is sensitive to this imbalance, regardless of its origin (risk, protective, or both), but is ideally suited for a mixture of protective and risk variants. Variation in APOB nicely illustrates a mixture, in that certain rare variants increase triglyceride levels while others decrease them. The hallmark feature of C-alpha is that it uses the distribution of variation observed in cases and controls to detect the presence of a mixture, thus implicating genes or pathways as risk factors for disease.
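The idea of testing the variance rather than the mean can be illustrated with a small sketch. It assumes that, under the null, the copies of each variant fall in cases according to a binomial distribution with probability p0 (the share of cases in the sample), and that the statistic sums the excess of squared deviations over the binomial variance. The function name `c_alpha` and its arguments are hypothetical, and the sketch omits refinements such as the pooling of singletons; it is not the authors' implementation.

```python
from math import comb, erfc, sqrt

def c_alpha(case_counts, total_counts, p0):
    """Return (T, variance, z, one-sided p) for a C-alpha-style test.

    case_counts[i]  : copies of variant i seen in cases (assumed input)
    total_counts[i] : copies of variant i seen in cases plus controls
    p0              : expected share of copies in cases under the null
    """
    t_stat = 0.0
    variance = 0.0
    for y, n in zip(case_counts, total_counts):
        null_var = n * p0 * (1 - p0)
        # Excess of the observed squared deviation over the binomial variance
        t_stat += (y - n * p0) ** 2 - null_var
        # Null variance of that term, summed over all outcomes of Binomial(n, p0)
        for u in range(n + 1):
            prob = comb(n, u) * p0 ** u * (1 - p0) ** (n - u)
            variance += ((u - n * p0) ** 2 - null_var) ** 2 * prob
    if variance == 0.0:
        raise ValueError("no informative variants (null variance is zero)")
    z = t_stat / sqrt(variance)
    p_value = 0.5 * erfc(z / sqrt(2))  # upper tail of a standard normal
    return t_stat, variance, z, p_value

# Two variants seen only in cases, two seen only in controls
t_stat, variance, z, p_value = c_alpha([4, 0, 3, 0], [4, 4, 3, 3], 0.5)
```

In this toy input, cases carry 7 of the 14 copies, exactly the null expectation, so a test on the mean would see nothing. The variance-based statistic is T = 9 with null variance 4.5, giving z of about 4.2, because each variant is individually skewed toward cases or toward controls.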
Investigators have linked rare copy number variants (CNVs) to neuropsychiatric diseases, such as schizophrenia. One hypothesis is that CNV events cause disease by affecting genes with specific brain functions. Under these circumstances, we expect that CNV events in cases should impact brain-function genes more frequently than those events in controls. Previous publications have applied “pathway” analyses to genes within neuropsychiatric case CNVs to show enrichment for brain functions. While such analyses have been suggestive, they often have not rigorously compared the rates of CNVs impacting genes with brain function in cases to controls, and therefore do not address important confounders such as the large size of brain genes and overall differences in rates and sizes of CNVs. To demonstrate the potential impact of confounders, we genotyped rare CNV events in 2,415 unaffected controls with Affymetrix 6.0; we then applied standard pathway analyses using four sets of brain-function genes and observed an apparently highly significant enrichment for each set. The enrichment is simply driven by the large size of brain-function genes. Instead, we propose a case-control statistical test, cnv-enrichment-test, to compare the rate of CNVs impacting specific gene sets in cases versus controls. With simulations, we demonstrate that cnv-enrichment-test is robust to case-control differences in CNV size, CNV rate, and systematic differences in gene size. Finally, we apply cnv-enrichment-test to rare CNV events published by the International Schizophrenia Consortium (ISC). This approach reveals nominal evidence of case-association in the neuronal-activity and learning gene sets, but not the other two examined gene sets. The neuronal-activity genes have been associated in a separate set of schizophrenia cases and controls; however, testing in independent samples is necessary to definitively confirm this association. Our method is implemented in the PLINK software package.
Specific rare deletion and duplication events in the genome have now been shown to be associated with neuropsychiatric diseases, for example 16p11.2 with autism and 22q11.21 with schizophrenia. However, controversy remains as to whether rare events impacting certain pathways as a group increase the risk of disease, and if so, what those pathways are. Other studies have used standard gene-set enrichment approaches to demonstrate that events discovered in cases contain more genes in neuro-developmental pathways than would be expected by chance. However, these analyses do not explicitly compare the relative enrichment in cases to any enrichment that may also be present in controls. Therefore, they can be confounded by the large size of brain genes or by larger size or frequency of CNVs in cases. Here we propose a case-control statistical test to assess whether a key pathway is differentially impacted by CNVs in cases compared to controls. Our approach is robust to skewed gene sizes and case-control differences in CNV rate and size.
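One way to picture a case-control comparison that adjusts for CNV rate and size is a logistic regression of case status on the number of pathway genes hit, with CNV count and mean CNV size as covariates. The sketch below takes that formulation as an assumption for illustration; it is not the PLINK implementation. The helper names (`solve`, `fit_logistic`, `simulate`, `pathway_z`) and all simulation parameters are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_logistic(rows, labels, iters=15):
    """Newton-Raphson logistic fit; returns (coefficients, standard errors)."""
    xs = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(xs[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y in zip(xs, labels):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    # Standard errors from the diagonal of the inverse information matrix
    # (evaluated at the last iterate, which has converged by this point)
    ses = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        ses.append(math.sqrt(solve(hess, unit)[i]))
    return beta, ses

def simulate(n, gene_effect, seed):
    """Toy individuals: (CNV count, mean CNV size in Mb, pathway genes hit)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        n_cnv = rng.randint(1, 3)
        size_mb = rng.uniform(0.1, 1.0)
        # Larger CNVs are more likely to overlap a pathway gene
        genes = sum(rng.random() < 0.5 * size_mb for _ in range(n_cnv))
        eta = -1.0 + 0.3 * n_cnv + 0.8 * size_mb + gene_effect * genes
        labels.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0)
        rows.append((n_cnv, size_mb, genes))
    return rows, labels

def pathway_z(rows, labels):
    """Z-score for the pathway-gene coefficient, adjusted for rate and size."""
    beta, ses = fit_logistic(rows, labels)
    return beta[3] / ses[3]

rows, labels = simulate(2000, gene_effect=1.0, seed=1)
print(round(pathway_z(rows, labels), 2))
```

Because CNV count and mean size enter the model as covariates, a case-control difference in overall CNV burden is absorbed by those terms rather than being attributed to the pathway. Setting `gene_effect=0.0` in the simulation gives a scenario where cases still carry more and larger CNVs but the pathway itself has no effect.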