Pinpointing the small number of causal variants among the abundant naturally occurring genetic variation is a difficult challenge, but a crucial one for understanding precise molecular mechanisms of disease and follow-up functional studies. We propose and investigate two complementary statistical approaches for identification of rare causal variants in sequencing studies: a backward elimination procedure based on groupwise association tests, and a hierarchical approach that can integrate sequencing data with diverse functional and evolutionary conservation annotations for individual variants. Using simulations, we show that incorporation of multiple bioinformatic predictors of deleteriousness, such as PolyPhen-2, SIFT and GERP++ scores, can improve the power to discover truly causal variants. As proof of principle, we apply the proposed methods to VPS13B, a gene mutated in the rare neurodevelopmental disorder called Cohen syndrome, and recently reported with recessive variants in autism. We identify a small set of promising candidates for causal variants, including two loss-of-function variants and a rare, homozygous probably-damaging variant that could contribute to autism risk.
Sequencing technologies allow identification of genetic variants down to single base resolution for a whole human genome. The vast majority of these variants (over 90%) are rare, with population frequencies less than 1%. Furthermore, in a specific study, many of the variants identified are not associated with the disease of interest, and identification of the small proportion of truly causal variants is a difficult task. Clearly, for causal variants that are rare enough to only appear a few times in a study, observed frequencies in cases and controls are not enough to distinguish them from the vast majority of random variation, and rich functional annotations can help identify the causal variants. Here we propose to develop a set of statistical methods that leverage diverse functional genomics annotations with sequencing data to identify a small set of potentially causal variants and estimate their effects. Pinpointing a subset of potentially causal variants is crucial for understanding precise biological mechanisms, and for further experimental functional studies.
Recent advances in high-throughput sequencing technologies make it increasingly efficient to sequence large cohorts for many complex traits. We discuss here a class of sequence-based association tests for family-based designs that corresponds naturally to previously proposed population-based tests, including the classical burden and variance-component tests. This framework allows for a direct comparison between the powers of sequence-based association tests with family- vs population-based designs. We show that, for dichotomous traits, using family-based controls yields power similar to that of the population-based design (although at an increased sequencing cost for the family-based design), while for continuous traits (in random samples, no ascertainment) the population-based design can be substantially more powerful. A possible disadvantage of population-based designs is that they can lead to increased false-positive rates in the presence of population stratification, while the family-based designs are robust to population stratification. We also show an application to a small exome-sequencing family-based study on autism spectrum disorders. The tests are implemented in publicly available software.
family- and population-based association tests; sequence data; burden and variance-component tests
We previously reported genome-wide significant evidence for linkage between chromosome 6q and bipolar I disorder (BPI) by performing a meta-analysis of original genotype data from 11 genome scan linkage studies. We now present follow-up linkage disequilibrium mapping of the linked region utilizing 3,047 single nucleotide polymorphism (SNP) markers in a case–control sample (N = 530 cases, 534 controls) and family-based sample (N = 256 nuclear families, 1,301 individuals). The strongest single SNP result (rs6938431, P = 6.72 × 10⁻⁵) was observed in the case–control sample, near the solute carrier family 22, member 16 gene (SLC22A16). In a replication study, we genotyped 151 SNPs in an independent sample (N = 622 cases, 1,181 controls) and observed further evidence of association between variants at SLC22A16 and BPI. Although consistent evidence of association with any single variant was not seen across samples, SNP-wise and gene-based test results in the three samples provided convergent evidence for association with SLC22A16, a carnitine transporter, implicating this gene as a novel candidate for BPI risk. Further studies in larger samples are warranted to clarify which, if any, genes in the 6q region confer risk for bipolar disorder.
bipolar disorder; genetic; association; SLC22A16; 6q
We are now well into the sequencing era of genetic analysis, and methods to investigate rare variants associated with disease remain in high demand. Currently, the more common rare variant analysis methods are burden tests and variance component tests. This report introduces a burden test known as the modified replication-based sum statistic and evaluates its performance, along with that of other common burden and variance component tests, in a small-sample setting (103 total cases and controls) using the Genetic Analysis Workshop 18 simulated data with complete knowledge of the simulation model. Specifically, we look at the variable threshold sum statistic, replication-based sum statistics, the C-alpha test, and the sequence kernel association test. Using minor allele frequency thresholds of less than 0.05, we find that the modified replication-based sum statistic is competitive with all methods and that using 103 individuals leaves all methods vastly underpowered. Much larger sample sizes are needed to confidently find truly associated genes.
We apply a family-based extension of the sequence kernel association test (SKAT) to 93 trios extracted from the 20 pedigrees in the Genetic Analysis Workshop 18 simulated data. Each extracted trio includes a unique set of parents to ensure conditionally independent trios are sampled. We compare the empirical type I error and power between the family-based SKAT and the burden test under varying percentages of causal single-nucleotide polymorphisms included in the analysis. Our investigation using simulated data suggests that, under the setting used for Genetic Analysis Workshop 18 data, both the family-based SKAT and the burden test have limited power, and that there is no substantial impact of percentage of signal on the power of either test. The low power is partially a result of the small sample size. However, we find that both the family-based SKAT and the burden test are more powerful when we use only rare variants, rather than common variants, to test the association.
Genetic studies have identified numerous genes reproducibly associated with asthma, yet these studies have focused almost entirely on single nucleotide polymorphisms (SNPs), and virtually ignored another highly prevalent form of genetic variation: Copy Number Variants (CNVs).
To survey the prevalence of CNVs in genes previously associated with asthma, and to assess whether CNVs represent the functional asthma-susceptibility variants at these loci.
We genotyped 383 asthmatic trios participating in the Childhood Asthma Management Program (CAMP) using a comparative genomic hybridization (CGH) array designed to interrogate 20,092 CNVs. To ensure comprehensive assessment of all potential asthma candidate genes, we purposely used liberal asthma gene inclusion criteria, resulting in consideration of 270 candidate genes previously implicated in asthma. We performed statistical testing using FBAT-CNV.
Copy number variation in asthma candidate genes was prevalent, with 21% of tested genes residing near or within one of 69 CNVs. In 6 instances, the complete candidate gene sequence resides within the CNV boundaries. On average, asthmatic probands carried 6 asthma-candidate CNVs (range 1–29). However, the vast majority of identified CNVs were of rare frequency (< 5%), and were not statistically associated with asthma. Modest evidence for association with asthma was observed for 2 CNVs near NOS1 and SERPINA3. Linkage disequilibrium analysis suggests that CNV effects are unlikely to explain previously detected SNP associations with asthma.
Although a substantial proportion of asthma-susceptibility genes harbor polymorphic CNVs, the majority of these variants do not confer increased asthma risk. The lack of linkage disequilibrium (LD) between CNVs and asthma-associated SNPs suggests that these CNVs are unlikely to represent the functional variant responsible for most known asthma associations.
Essential tremor (ET) is a progressive disorder, worsening gradually with time in most patients. Yet there are few data on the factors that influence rate of progression. ET is a highly familial disorder, and physicians often care for patients who have other affected family members. Do ET families differ from one another with respect to rate of progression? Are some families slower progressors and other families faster progressors? We are unaware of published data.
ET probands and relatives were enrolled in a cross-sectional genetic study at Columbia University. Rate of progression was calculated as total tremor score ÷ log disease duration.
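The rate-of-progression definition above can be written out as a short calculation. This is a minimal sketch only: the abstract does not state the base of the logarithm or the unit of disease duration, so the natural log and years are assumed here, and the function name is hypothetical.

```python
import math


def rate_of_progression(total_tremor_score, disease_duration_years):
    """Total tremor score divided by log disease duration.

    Assumptions not stated in the abstract: natural logarithm,
    duration measured in years.
    """
    if disease_duration_years <= 1:
        # log(duration) is zero or negative at or below 1, so the ratio
        # is undefined or changes sign; reject such inputs explicitly.
        raise ValueError("disease duration must exceed 1 for a positive log")
    return total_tremor_score / math.log(disease_duration_years)
```

Under these assumptions, a tremor score of 20 after e² (about 7.4) years gives a rate of 10. Dividing by the log of duration rather than by duration itself damps the influence of very long disease courses.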
There were 100 enrollees (28 probands, 72 relatives). Data from 78 enrollees (23 probands, 55 relatives) were selected for final analysis. The mean familial rate of progression ranged from as little as 8.4 to as much as 34.3, a >4-fold difference. In an analysis of variance, we found significant evidence of heterogeneity in the log rate of progression across families (p < 0.001), with more than one-half (i.e., 55.4%) of the total variance in the log rate of progression explained by the family grouping.
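The "variance explained by the family grouping" in a one-way analysis of variance is the between-family sum of squares divided by the total sum of squares. The sketch below shows that arithmetic on made-up numbers; it is not the study's analysis code, the function name is hypothetical, and it omits the p-value, which would need an F distribution.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F statistic, fraction of variance explained by grouping).

    `groups` is a list of lists, one inner list of values per family.
    In the setting above the values would be log rates of progression.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: spread of family means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of individuals around their family mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    explained = ss_between / (ss_between + ss_within)
    return f_stat, explained
```

For two hypothetical families with values `[1, 2, 3]` and `[4, 5, 6]`, the between-family sum of squares is 13.5 and the within-family sum is 4, so the grouping explains 13.5 / 17.5 of the total variance.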
Familial factors seem to affect rate of tremor progression in ET. There was a 4-fold difference across families in observed mean rate of progression; thus, some families seemed to be more rapid progressors than others. We hope these data may be used by clinicians to provide basic prognostic and family guidance information to their patients and families with ET.
essential tremor; genetics; familial; clinical; rate of progression
We investigated 4 members of a family with type 2C Charcot-Marie-Tooth (CMT) and self-reported essential tremor (ET). A heterozygous missense mutation, R269H, in the TRPV4 gene was previously reported in this family. Our genotypic data provided a rare opportunity to determine the etiology of the tremor.
Family study; the 4 tremor cases underwent a detailed neurological assessment.
The clinical diagnosis of ET was confirmed in all 4 tremor cases based on stringent published research criteria. Two of these also had CMT. We genotyped all 4 family members for the TRPV4 R269H mutation. We confirmed the presence of the TRPV4 R269H mutation in the 2 family members with ET and CMT; however, the TRPV4 R269H mutation did not segregate with ET in the same family.
In this particular CMT family, the tremor was clinically attributed to ET. Furthermore, genotype data indicated that the tremor was unlikely to be caused by incomplete penetrance or variable expressivity of the TRPV4 R269H mutation. Hence, the tremor likely represents ET. This establishes that in some CMT families the tremor diathesis likely represents a second disorder, namely ET.
Essential tremor; Charcot-Marie-Tooth; Neuropathy; Genetics
To evaluate evidence for de novo etiologies in schizophrenia, we sequenced at high coverage the exomes of families recruited from two populations with distinct demographic structure and history. We sequenced a total of 795 exomes from 231 parent-proband trios enriched for sporadic schizophrenia cases, as well as 34 unaffected trios. We observed in cases an excess of non-synonymous single nucleotide variants as well as a higher prevalence of gene-disruptive de novo mutations. We found four genes (LAMA2, DPYD, TRRAP and VPS39) affected by recurrent de novo events within or across the two populations, a finding unlikely to have occurred by chance. We show that de novo mutations affect genes with diverse functions and developmental profiles but we also find a substantial contribution of mutations in genes with higher expression in early fetal life. Our results help define the pattern of genomic and neural architecture of schizophrenia.
In recent years there has been a growing interest in the role of copy number variations (CNV) in genetic diseases. Though there has been rapid development of technologies and statistical methods devoted to the detection of CNVs from array data, the inherent data quality problems associated with most hybridization techniques remain a challenge in CNV association studies.
To help address these data quality issues in the context of family-based association studies, we introduce a statistical framework for intensity-based array data that takes family information into account for copy-number assignment. The method is an adaptation of traditional methods for modeling SNP genotype data that assume a Gaussian mixture model, whereby CNV calling is performed for all family members simultaneously, leveraging within-family data to reduce CNV calls that are incompatible with Mendelian inheritance while still allowing de novo CNVs. Applying this method to simulation studies and a genome-wide association study in asthma, we find that our approach significantly improves CNV calling accuracy and reduces Mendelian inconsistency rates and false positive genotype calls. The results were validated using qPCR experiments.
In conclusion, we have demonstrated that the use of family information can improve the quality of CNV calling and may yield more powerful association tests of CNVs.
In families with autosomal dominant partial epilepsy with auditory features (ADPEAF) with mutations in the LGI1 gene, we evaluated clustering of mutations within the gene and associations of penetrance and phenotypic features with mutation location and predicted effect (truncation or missense).
We abstracted clinical and molecular information from the literature for all 36 previously published ADPEAF families with LGI1 mutations. We used a sliding window approach to analyze mutation clustering within the gene. Each mutation was mapped to one of the gene's 2 major functional domains, N-terminal leucine-rich repeats (LRRs) and C-terminal epitempin (EPTP) repeats, and classified according to predicted effect on the encoded protein (truncation vs missense). Analyses of phenotypic features (age at onset and occurrence of auditory symptoms) in relation to mutation site and predicted effect included 160 patients with idiopathic focal unprovoked seizures from the 36 families.
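A sliding-window clustering analysis of this kind can be sketched as follows. The abstract does not give the window size, the coordinate system, or the null model, so everything here is an assumption for illustration: positions are residue numbers, the null places each mutation uniformly along the protein, and the function names are hypothetical.

```python
import random


def max_window_count(positions, length, window):
    """Largest number of mutations falling in any run of `window` residues."""
    best = 0
    for start in range(1, length - window + 2):
        end = start + window - 1
        count = sum(start <= p <= end for p in positions)
        best = max(best, count)
    return best


def clustering_p_value(positions, length, window, n_perm=2000, seed=0):
    """Monte Carlo p-value for the observed maximum window count.

    Assumed null model: each mutation lands uniformly at random on
    residues 1..length, independently of the others.
    """
    rng = random.Random(seed)
    observed = max_window_count(positions, length, window)
    hits = 0
    for _ in range(n_perm):
        simulated = [rng.randint(1, length) for _ in positions]
        if max_window_count(simulated, length, window) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

Taking the maximum over all windows, and comparing it with the maximum under the null, accounts for the fact that the densest window is chosen after looking at the data.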
ADPEAF-causing mutations clustered significantly in the LRR domain (exons 3–5) of LGI1 (p = 0.026). Auditory symptoms were less frequent in individuals with truncation mutations in the EPTP domain than in those with other mutation type/domain combinations (58% vs 80%, p = 0.018).
The LRR region of the LGI1 gene is likely to play a major role in pathogenesis of ADPEAF.
Genome-wide association studies have been able to identify disease associations with many common variants; however, most of the estimated genetic contribution explained by these variants appears to be very modest. Rare variants are thought to have larger effect sizes than common SNPs, but effects of rare variants cannot be tested in the GWAS setting. Here we propose a novel method to test for association of rare variants obtained by sequencing in family-based samples by collapsing the standard family-based association test (FBAT) statistic over a region of interest. We also propose a suitable weighting scheme so that low frequency SNPs that may be enriched in functional variants can be upweighted compared to common variants. Using simulations we show that the family-based methods perform on par with the population-based methods when there is no population stratification. By construction, family-based tests are completely robust to population stratification; we show that our proposed methods remain valid even when population stratification is present.
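The idea of collapsing a transmission-based statistic over a region can be illustrated for affected-offspring trios. This is a sketch under stated assumptions, not the published statistic: it assumes complete trios with an affected child, uses the Mendelian expectation (half the sum of the parental minor-allele counts) as the null mean, estimates the variance empirically from per-family contributions, and uses an inverse-standard-deviation weight as one possible way of upweighting low-frequency variants. All function names are hypothetical.

```python
import math


def frequency_weights(allele_freqs):
    """Assumed weighting: 1 / sqrt(p * (1 - p)), larger for rarer variants."""
    return [1.0 / math.sqrt(p * (1.0 - p)) for p in allele_freqs]


def collapsed_trio_z(trios, weights):
    """Weighted, region-level transmission statistic for affected-child trios.

    `trios` is a list of (father, mother, child) tuples; each element is a
    list of minor-allele counts (0, 1 or 2), one entry per variant.
    """
    per_family = []
    for father, mother, child in trios:
        u = 0.0
        for w, f, m, c in zip(weights, father, mother, child):
            expected = (f + m) / 2.0  # Mendelian expectation given the parents
            u += w * (c - expected)
        per_family.append(u)
    total = sum(per_family)
    # Empirical variance: families are treated as independent contributions.
    variance = sum(u * u for u in per_family)
    if variance == 0:
        return 0.0
    return total / math.sqrt(variance)
```

Because each child's genotype is compared only with the expectation given its own parents, population allele-frequency differences between families cancel out, which is the source of the robustness to stratification mentioned above.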
Copy number variants (CNVs), defined as losses and gains of segments of genomic DNA, are a major source of genomic variation.
In this study, we identified over 2,000 human CNVs that overlap with orthologous chimpanzee or orthologous macaque CNVs. Of these, 170 CNVs overlap with both chimpanzee and macaque CNVs, and these were collapsed into 34 hotspot regions of CNV formation. Many of these hotspot regions of CNV formation are functionally relevant, with a bias toward genes involved in immune function, some of which were previously shown to evolve under balancing selection in humans. The genes in these primate CNV formation hotspots have significant differential expression levels between species and show evidence for positive selection, indicating that they have evolved under species-specific, directional selection.
These hotspots of primate CNV formation provide a novel perspective on divergence and selective pressures acting on these genomic regions.
Genome-wide association studies have been successful at identifying common disease variants associated with complex diseases, but the common variants identified have small effect sizes and account for only a small fraction of the estimated heritability for common diseases. Theoretical and empirical studies suggest that rare variants, which are much less frequent in populations and are poorly captured by single-nucleotide polymorphism chips, could play a significant role in complex diseases. Several new statistical methods have been developed for the analysis of rare variants, for example, the combined multivariate and collapsing method, the weighted-sum method and a replication-based method. Here, we apply and compare these methods to the simulated data sets of Genetic Analysis Workshop 17 and thereby explore the contribution of rare variants to disease risk. In addition, we investigate the usefulness of extreme phenotypes in identifying rare risk variants when dealing with quantitative traits. Finally, we perform a pathway analysis and show the importance of the vascular endothelial growth factor pathway in explaining different phenotypes.
The recent emergence of massively parallel sequencing technologies has enabled an increasing number of human genome re-sequencing studies, notable among them being the 1000 Genomes Project. The main aim of these studies is to identify the yet unknown genetic variants in a genomic region, mostly low frequency variants (frequency less than 5%). We propose here a set of statistical tools that address how to optimally design such studies in order to increase the number of genetic variants we expect to discover. Within this framework, the tradeoff between lower coverage for more individuals and higher coverage for fewer individuals can be naturally solved.
The methods here are also useful for estimating the number of genetic variants missed in a discovery study performed at low coverage.
We show applications to simulated data based on coalescent models and to sequence data from the ENCODE project. In particular, we show the extent to which combining data from multiple populations in a discovery study may increase the number of genetic variants identified relative to studies on single populations.
species problem; variant discovery studies; sequencing technologies
Rapid advances in sequencing technologies set the stage for the large-scale medical sequencing efforts to be performed in the near future, with the goal of assessing the importance of rare variants in complex diseases. The discovery of new disease susceptibility genes requires powerful statistical methods for rare variant analysis. The low frequency and the expected large number of such variants pose great difficulties for the analysis of these data. We propose here a robust and powerful testing strategy to study the role rare variants may play in affecting susceptibility to complex traits. The strategy is based on assessing whether rare variants in a genetic region collectively occur at significantly higher frequencies in cases compared with controls (or vice versa). A main feature of the proposed methodology is that, although it is an overall test assessing a possibly large number of rare variants simultaneously, the disease variants can be both protective and risk variants, with moderate decreases in statistical power when both types of variants are present. Using simulations, we show that this approach can be powerful under complex and general disease models, as well as in larger genetic regions where the proportion of disease susceptibility variants may be small. Comparisons with previously published tests on simulated data show that the proposed approach can have better power than the existing methods. An application to a recently published study on Type-1 Diabetes finds rare variants in gene IFIH1 to be protective against Type-1 Diabetes.
Risk to common diseases, such as diabetes, heart disease, etc., is influenced by a complex interaction among genetic and environmental factors. Most of the disease-association studies conducted so far have focused on common variants, widely available on genotyping platforms. However, recent advances in sequencing technologies pave the way for large-scale medical sequencing studies with the goal of elucidating the role rare variants may play in affecting susceptibility to complex traits. The large number of rare variants and their low frequencies pose great challenges for the analysis of these data. We present here a novel testing strategy, based on a weighted-sum statistic, that is less sensitive than existing methods to the presence of both risk and protective variants in the genetic region under investigation. We show applications to simulated data and to a real dataset on Type-1 Diabetes.
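For orientation, a generic one-sided weighted-sum burden statistic for case-control data can be sketched as below. It is not the statistic proposed in the abstract: the weights (based on a smoothed minor-allele frequency in controls) and the mean-difference comparison are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random


def burden_statistic(genotypes, is_case):
    """Mean weighted burden score in cases minus that in controls.

    `genotypes` holds one list of minor-allele counts (0/1/2) per individual.
    """
    n = len(genotypes)
    controls = [g for g, c in zip(genotypes, is_case) if not c]
    n_variants = len(genotypes[0])
    weights = []
    for j in range(n_variants):
        minor = sum(g[j] for g in controls)
        q = (minor + 1) / (2 * len(controls) + 2)  # smoothed control frequency
        weights.append(math.sqrt(n * q * (1 - q)))
    scores = [sum(g[j] / weights[j] for j in range(n_variants)) for g in genotypes]
    case_scores = [s for s, c in zip(scores, is_case) if c]
    control_scores = [s for s, c in zip(scores, is_case) if not c]
    return sum(case_scores) / len(case_scores) - sum(control_scores) / len(control_scores)


def permutation_p_value(genotypes, is_case, n_perm=1000, seed=0):
    """One-sided p-value; weights are recomputed for every label shuffle."""
    rng = random.Random(seed)
    observed = burden_statistic(genotypes, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if burden_statistic(genotypes, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A one-sided sum like this one loses power when risk and protective variants sit in the same region, because their contributions cancel; that sensitivity is the problem the strategy described above aims to reduce.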
The glutathione S-transferase M1 (GSTM1) null variant is a common copy number variant associated with adverse pulmonary outcomes, including asthma and airflow obstruction, with evidence of important gene-by-environment interactions with exposures to oxidative stress.
To explore the joint interactive effects of GSTM1 copy number and tobacco smoke exposure on the development of asthma and asthma-related phenotypes in a family-based cohort of childhood asthmatics.
We performed quantitative PCR-based genotyping for GSTM1 copy number in children of self-reported white ancestry with mild to moderate asthma in the Childhood Asthma Management Program. Questionnaire data regarding intrauterine (IUS) and postnatal, longitudinal environmental tobacco smoke exposure were available. We performed both family-based and population-based tests of association for the interaction between GSTM1 copy number and tobacco smoke exposure with asthma and asthma-related phenotypes.
Associations of GSTM1 null variants with asthma (p = .03), younger age of asthma symptom onset (p = .03), and greater airflow obstruction (reduced FEV1/FVC, p = .01) were observed among the 50 children (10% of the cohort) with exposure to IUS. In contrast, no associations were observed between GSTM1 null variants and asthma-related phenotypes among children without IUS exposure. Presence of at least one copy of GSTM1 conferred protection.
These findings support an important gene-by-environment interaction between two common factors: increased risk of asthma and asthma-related phenotypes conferred by GSTM1-null homozygosity in children is restricted to those with a history of IUS exposure.
Asthma; GSTM1; copy number variation (CNV); gene by environment; intrauterine smoke exposure; tobacco smoke
Structural genetic variation, including copy number variation (CNV), constitutes a substantial fraction of total genetic variability and the importance of structural genetic variants in modulating human disease is increasingly being recognized. Early successes in identifying disease-associated CNVs via a candidate gene approach mandate that future disease association studies need to include structural genetic variation. Such analyses should not rely on previously developed methodologies that were designed to evaluate single nucleotide polymorphisms (SNPs). Instead, development of novel technical, statistical, and epidemiologic methods will be necessary to optimally capture this newly-appreciated form of genetic variation in a meaningful manner.
Copy number variation; CNV; structural genetic variation; disease association study; complex trait
Motivation: Estimating the frequency distribution of copy number variants (CNVs) is an important aspect of the effort to characterize this new type of genetic variation. Currently, most studies report a strong skew toward low-frequency CNVs. In this article, our goal is to investigate the frequencies of CNVs. We employ a two-step procedure for the CNV frequency estimation process. We use family information a posteriori to select only the most reliable CNV regions, i.e. those showing high rates of Mendelian transmission.
Results: Our results suggest that the current skew toward low-frequency CNVs may not be representative of the true frequency distribution, but may be due, among other reasons, to the non-negligible false negative rates that characterize CNV detection methods. Moreover, false positives are also likely, as low-frequency CNVs are hard to detect with small sample sizes and technologies that are not ideally suited for their detection. Without appropriate validation methods, such as incorporation of biologically relevant information (for example, in our case, the transmission of heritable CNVs from parents to offspring), it is difficult to assess the validity of specific CNVs, and even harder to obtain reliable frequency estimates.
Availability: Software implementing the methods described in this article is available for download at the following address: http://www.isites.harvard.edu/icb/icb.do?keyword=k36162
Supplementary information: Supplementary data are available at Bioinformatics online.
Allele transmissions in pedigrees provide a natural way of evaluating the genotyping quality of a particular proband in a family-based, genome-wide association study. We propose a transmission test that is based on this feature and that can be used for quality control filtering of genome-wide genotype data for individual probands. The test has one degree of freedom and assesses the average genotyping error rate of the genotyped SNPs for a particular proband. As we show in simulation studies, the test is sufficiently powerful to identify probands with an unreliable genotyping quality that cannot be detected with standard quality control filters. This feature of the test is further exemplified by an application to the third release of the HapMap data. The test is ideally suited as the final layer of quality control filters in the cleaning process of genome-wide association studies. It identifies probands with insufficient genotyping quality that were not removed by standard quality control filtering.
Genome-wide association studies have led to the discovery of many novel, reproducible associations between genetic loci and disease phenotypes. An important step in the analysis of genome-wide association studies is the data cleaning/QC filtering step. The statistical analysis tools that are applied as QC filters typically include testing for Hardy-Weinberg equilibrium, testing for Mendelian inconsistencies, evaluating quality scores, etc. We propose a new genome-wide transmission test for family-based designs that is applied to the dataset after the QC filtering. It allows for the assessment of the genotyping error rate that is caused by miscalled genotypes that could not be detected by the QC filters. By applying the test to individual probands, probands with insufficient genotyping quality can be identified and removed from the dataset before the analysis.
The mRNA expression levels of genes have been shown to have discriminating power for the classification of breast cancer. Studying the heritability of gene expression levels on breast cancer related transcripts can lead to the identification of shared common regulators and inter-regulation patterns, which would be important for dissecting the etiology of breast cancer.
We applied multilocus association genome-wide scans to 18 breast cancer related transcripts and combined the results with traditional linkage scans. Regulatory hotspots for these transcripts were identified and some inter-regulation patterns were observed. We also derived evidence on interacting genetic regulatory loci shared by a number of these transcripts.
In this paper, by restricting to a set of related genes, we were able to employ a more detailed multilocus approach that evaluates both marginal and interaction association signals at each single-nucleotide polymorphism. Interesting inter-regulation patterns and significant overlaps of genetic regulators between transcripts were observed. Interaction association results returned more expression quantitative trait locus hotspots that are significant.
Rheumatoid arthritis (RA, MIM 180300) is a common and complex inflammatory disorder. The North American Rheumatoid Arthritis Consortium (NARAC) data, as part of the Genetic Analysis Workshop 15 data, consist of both genome scan and candidate gene studies on RA patients.
We applied the backward genotype-trait association (BGTA) algorithm to capture marginal and gene × gene interaction effects of multiple susceptibility loci on RA disease status. A two-stage screening approach was used for the genome scan, whereas a comprehensive study of all possible subsets was conducted for the candidate genes. For the genome scan, we constructed an association network among 39 genetic loci that demonstrated strong signals, 19 of which have been reported in the RA literature. For the candidate genes, we found strong signals for PTPN22 and SUMO4. Based on significant association evidence, we built an association network among the loci of PTPN22, PADI4, DLG5, SLC22A4, SUMO4, and CARD15. To control for false positives, we used permutation tests to constrain the family-wise type I error rate to 1%.
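One standard way to hold the family-wise type I error rate at a chosen level with permutations is the single-step maximum-statistic procedure, sketched below. The abstract does not say which permutation scheme was used, so this is an assumed stand-in rather than the authors' procedure; the per-locus statistic (absolute difference in mean allele count between cases and controls) and the function names are also hypothetical.

```python
import math
import random


def locus_statistics(genotypes, is_case):
    """Absolute case-control difference in mean allele count, per locus."""
    n_cases = sum(is_case)
    n_controls = len(is_case) - n_cases
    stats = []
    for j in range(len(genotypes[0])):
        case_mean = sum(g[j] for g, c in zip(genotypes, is_case) if c) / n_cases
        control_mean = sum(g[j] for g, c in zip(genotypes, is_case) if not c) / n_controls
        stats.append(abs(case_mean - control_mean))
    return stats


def significant_loci(genotypes, is_case, alpha=0.01, n_perm=500, seed=0):
    """Loci whose statistic exceeds the (1 - alpha) quantile of permutation maxima."""
    rng = random.Random(seed)
    labels = list(is_case)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        # The largest statistic over all loci is recorded for each shuffle.
        maxima.append(max(locus_statistics(genotypes, labels)))
    maxima.sort()
    index = min(n_perm - 1, math.ceil((1 - alpha) * n_perm) - 1)
    threshold = maxima[index]
    observed = locus_statistics(genotypes, is_case)
    return [j for j, s in enumerate(observed) if s > threshold]
```

Comparing every locus with the distribution of the maximum, rather than with its own permutation distribution, is what controls the chance of any false positive across all loci tested.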
Using the BGTA algorithm, we identified genetic loci and candidate genes that were associated with RA susceptibility and association networks among them. For the first time, we report possible interactions between single-nucleotide polymorphisms/genes, which may be useful for biological interpretation.