There is great interest in sequencing unrelated or pedigree samples to detect rare variant quantitative trait associations. In order to reduce the cost of sequencing and improve power, many studies sequence selected samples with extreme traits. Existing methods for detecting rare variant associations were developed for unrelated samples. Methods are needed to analyze (selected or randomly ascertained) pedigree samples.
We propose a unified framework of modeling extreme trait genetic associations (MEGA) with rare variants. Using MEGA and appropriate permutation algorithms, many rare variant tests can be extended to family data. As an application, we compared study designs using both sib-pairs and unrelated individuals. Extensive simulations were carried out using realistic population genetic and complex trait models.
It is demonstrated that when extreme sampling is implemented within equal-sized cohorts of unrelated individuals or sib-pairs, analyzing unrelated individuals is consistently more powerful than studying sib-pairs. A higher proportion of rare variants can be identified through sequencing unrelated samples compared to sibs. Alternatively, if samples are ascertained using fixed thresholds from an infinite-sized population, sequencing one sib with the most extreme trait from each extreme concordant sib-pair is consistently the most powerful design.
MEGA will play an important role in the analysis of sequence-based genetic association studies.
Extreme sampling; Next-generation sequencing; Pedigree samples; Quantitative trait loci; Rare variants
Principal components analysis of genetic data has benefited from advances in random matrix theory. The Tracy-Widom distribution has been identified as the limiting distribution of the lead eigenvalue, enabling formal hypothesis testing of population structure. Additionally, a phase change exists between small and large eigenvalues, such that population divergence below a threshold of FST is impossible to detect and above which it is always detectable. I show that the plug-in estimate of the effective number of markers in the EIGENSOFT software often exceeds the rank of the sample covariance matrix, leading to a systematic overestimation of the number of significant principal components. I describe an alternative plug-in estimate that eliminates the problem. This improvement is not just an asymptotic result but is directly applicable to finite samples. The minimum average partial test, based on minimizing the average squared partial correlation between individuals, can detect population structure at smaller FST values than the corrected test. The minimum average partial test is applicable to both unadmixed and admixed samples, with arbitrary numbers of discrete subpopulations or parental populations, respectively. Application of the minimum average partial test to the 11 HapMap Phase III samples, comprising 8 unadmixed samples and 3 admixed samples, revealed 13 significant principal components.
Admixture; Population stratification; Population structure; Principal components analysis
It is thought that a proportion of the genetic susceptibility to complex diseases is due to low-frequency and rare variants. Next-generation sequencing in large populations facilitates the detection of rare variant associations to disease risk. In order to achieve adequate power to detect association at low-frequency and rare variants, locus-specific statistical methods are being developed that combine information across variants within a functional unit and test for association with this enriched signal through so-called burden tests.
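The collapsing idea behind burden tests can be sketched in a few lines. This is a minimal illustration rather than any published test: the function name `burden_scores`, the 1% rarity cutoff, and the toy genotypes are all assumptions made for the example. Each individual's score is the number of minor alleles carried at variants below the cutoff; because the score sums over variants without regard to direction of effect, risk and protective alleles in the same region can cancel.

```python
def burden_scores(genotypes, mafs, maf_cutoff=0.01):
    """Count minor alleles carried at rare variants, per individual.

    genotypes:  one list per individual of minor-allele counts (0, 1 or 2)
    mafs:       minor allele frequency of each variant, in column order
    maf_cutoff: hypothetical rarity threshold (an assumption of this sketch)
    """
    rare = [j for j, f in enumerate(mafs) if f < maf_cutoff]
    return [sum(row[j] for j in rare) for row in genotypes]


# Toy data: three individuals, four variants in one functional unit.
genotypes = [
    [0, 1, 0, 2],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]
mafs = [0.004, 0.002, 0.008, 0.30]  # the last variant is common and is skipped

print(burden_scores(genotypes, mafs))
```

The resulting per-individual score would then be used as a single predictor of the phenotype in place of the individual variants.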
We propose a hierarchical clustering approach and a similarity kernel-based association test for continuous phenotypes. This method clusters individuals into groups, within which samples are assumed to be genetically similar, and subsequently tests the group effects among the different clusters.
The power of this approach is comparable to that of collapsing methods when causal variants have the same direction of effect, but its power is significantly higher compared to burden tests when both protective and risk variants are present in the region of interest. Overall, we observe that the Sequence Kernel Association Test (SKAT) is the most powerful approach under the allelic architectures considered.
In our overall comparison, we find the analytical framework within which SKAT operates to yield higher power and to control type I error appropriately.
Allele match kernel; Genetic similarity; Next-generation sequencing; Single nucleotide polymorphism
For genome-wide association studies (GWAS) with case-control designs, one of the most widely used association tests is the Cochran-Armitage (CA) trend test assuming an additive mode of inheritance. The CA trend test often has higher power than other association tests under additive and multiplicative disease models. However, it can have very low power under a recessive disease model in GWAS. Although tests (such as MAX3) robust to different genetic models have been developed, they often have relatively lower power than the CA trend test under additive and multiplicative models. The goal of this study is to propose an efficient method that not only has higher power than the CA trend test under dominant and recessive models but also maintains the power of the CA trend test under additive and multiplicative models.
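For orientation, the standard CA trend test on a 2 x 3 genotype table can be sketched as follows. This shows only the baseline test, not the GSB procedure proposed here; the function name `ca_trend_test` and the example counts are illustrative assumptions. Scores (0, 1, 2) encode the additive model; other scores, such as (0, 0, 1) for a recessive model, could be passed instead.

```python
import math


def ca_trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases, controls: genotype counts ordered by number of risk alleles
    scores:          genotype scores; (0, 1, 2) is the additive coding
    Returns the Z statistic and its two-sided normal p value.
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    col = [r + s for r, s in zip(cases, controls)]  # genotype totals
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * n for x, n in zip(scores, col))
    sum_x2n = sum(x * x * n for x, n in zip(scores, col))
    u = N * sum_xr - R * sum_xn                      # trend numerator
    var = (R * S / N) * (N * sum_x2n - sum_xn ** 2)  # variance under the null
    z = u / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical counts for genotypes carrying 0, 1 and 2 risk alleles.
z, p = ca_trend_test(cases=(20, 30, 50), controls=(50, 30, 20))
print(round(z, 3), p)
```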
We employed the generalized sequential Bonferroni (GSB) procedure of Holm to incorporate information from a Hardy-Weinberg disequilibrium (HWD) test into the CA trend test based on estimating weights from the p values of the HWD test. We proposed to smooth the weights to reduce possible noise.
Results and Conclusions
Results from extensive simulation studies showed that the proposed GSB procedure can achieve the goal described above.
Generalized sequential Bonferroni procedure; Genome-wide association studies; Hardy-Weinberg disequilibrium; Multiple testing; Smoothed weights
In genetic association studies, due to the varying underlying genetic models, no single statistical test can be the most powerful test under all situations. Current studies show that if the underlying genetic models are known, trend-based tests, which outperform the classical Pearson χ² test, can be constructed. However, when the underlying genetic models are unknown, the χ² test is usually more robust than trend-based tests. In this paper, we propose a new association test based on a generalized genetic model, namely the generalized order-restricted relative risks model. Through a Monte Carlo simulation study, we show that the proposed association test is generally more powerful than the χ² test, and more robust than those trend-based tests. The proposed methodologies are also illustrated by some real SNP datasets.
Genetic association; Robust test; Trend test; SNP
For the meta-analysis of genome-wide association studies, we propose a new method to adjust for population stratification and a linear mixed approach that combines family-based and unrelated samples. The proposed approach achieves power levels similar to those of a standard meta-analysis that combines the different test statistics or p values across studies. However, by virtue of its design, the proposed approach is robust against population admixture and stratification, and no adjustments for population admixture and stratification, even in unrelated samples, are required. Using simulation studies, we examine the power of the proposed method and compare it to standard approaches in the meta-analysis of genome-wide association studies. The practical features of the approach are illustrated with a meta-analysis of three genome-wide association studies for Alzheimer's disease. We identify three single nucleotide polymorphisms showing significant genome-wide association with affection status. Two single nucleotide polymorphisms are novel and will be verified in other populations in our follow-up study.
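As a point of comparison, one common form of the standard meta-analysis that combines test statistics across studies is the sample-size-weighted Z score. The sketch below illustrates that comparator only, not the proposed linear mixed approach; the name `stouffer_z` and the square-root-of-sample-size weights are choices made for this example.

```python
import math


def stouffer_z(z_scores, sample_sizes):
    """Combine per-study Z statistics with sqrt(sample size) weights.

    Returns the combined Z and its two-sided normal p value.
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    z = numerator / denominator
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Three hypothetical studies reporting Z scores for the same SNP.
z, p = stouffer_z([2.1, 1.4, 2.6], [1200, 800, 1500])
print(round(z, 3), p)
```

This kind of combination inherits any stratification bias present in the per-study statistics, which is the limitation the proposed approach is designed to avoid.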
Meta-analysis; Genome-wide study; Population stratification
Genotype imputations based on 1000 Genomes (1KG) Project data have the advantage of imputing many more SNPs than imputations based on HapMap data. They also provide an opportunity to discover associations with relatively rare variants. Recent investigations are increasingly using 1KG data for genotype imputations, but only limited evaluations of the performance of this approach are available. In this paper, we empirically evaluated imputation performance using 1KG data by comparing imputation results to those using the widely used HapMap Phase II data. We used three reference panels: two CEU panels of 120 haplotypes each, one from HapMap II and one from 1KG data (June 2010 release), and the EUR panel of 566 haplotypes, also from 1KG data (August 2010 release). We used 324,607 autosomal Illumina SNPs genotyped in 501 individuals of European ancestry. Our most important finding was that both 1KG reference panels provided much higher imputation yield than the HapMap II panel. There were more than twice as many successfully imputed SNPs as there were using the HapMap II panel (6.7 million vs. 2.5 million). Our second most important finding was that accuracy using both 1KG panels was high and almost identical to accuracy using the HapMap II panel. Furthermore, after removing SNPs with MACH Rsq <0.3, accuracy for both rare and low frequency SNPs was very high and almost identical to accuracy for common SNPs. We found that imputation using the 1KG-EUR panel had advantages in successfully imputing rare, low frequency and common variants. Our findings suggest that 1KG-based imputation can increase the opportunity to discover significant associations for SNPs across the allele frequency spectrum. Because the 1KG Project is still underway, we expect that later versions will provide even better imputation performance.
1000 Genomes Project; HapMap Project; Genome-wide association study; Imputation performance
Linkage and association analysis based on haplotype transmission disequilibrium can be more informative than single marker analysis. Several works have been proposed in recent years to extend the transmission disequilibrium test (TDT) to haplotypes. Among them, a powerful approach called the evolutionary tree TDT (ET-TDT) incorporates information about the evolutionary relationship among haplotypes using the cladogram of the locus.
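For readers unfamiliar with the basic test being extended, the classical single-allele TDT reduces to a McNemar-type statistic on transmissions from heterozygous parents. The sketch below shows only that baseline, not the ET-TDT or the sparse approach developed in this work; the function name `tdt` and the counts are illustrative.

```python
import math


def tdt(transmitted, untransmitted):
    """Classical TDT for one allele.

    transmitted / untransmitted: number of times heterozygous parents did or
    did not transmit the allele to an affected child.
    Returns the chi-square statistic (1 df) and its p value.
    """
    b, c = transmitted, untransmitted
    chi2 = (b - c) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p


chi2, p = tdt(transmitted=60, untransmitted=40)
print(chi2, round(p, 4))
```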
In this work we extend this approach by taking into consideration the sparsity of causal mutations in the evolutionary history. We first introduce the notion of a Bradley-Terry (BT) graph representation of a haplotype locus. The most important property of the BT graph is that sparsity of the edge set of the graph corresponds to a small number of causal mutations in the evolution of the haplotypes. We then propose a method to test the null hypothesis of no linkage and association against sparse alternatives under which a small number of edges on the BT graph have non-zero effects.
Results and Conclusion
We compare the performance of our approach to that of the ET-TDT through a power study, and show that incorporating sparsity of causal mutations can significantly improve the power of a haplotype-based TDT.
Case-parent trio; Family-based study; Linkage; Linkage disequilibrium; Penalized logistic regression
For both model-free and model-based linkage analysis the S.A.G.E. (Statistical Analysis for Genetic Epidemiology) program package has some unique capabilities in analyzing both continuous traits and binary traits with variable age of onset. Here we highlight model-based linkage analysis of a quantitative trait (plasma dopamine β hydroxylase) that is known to be largely determined by monogenic inheritance, using a prior segregation analysis to produce the best fitting model for the trait. For a binary trait with variable age of onset (schizophrenia), we illustrate how using age of onset information to obtain a quantitative susceptibility trait leads to more statistically significant linkage signals, suggesting better power.
Age of onset; Best linear unbiased predictor; Haseman-Elston; Individual-specific penetrance values; Multipoint; Power transform; Segregation model for a continuous trait
Computer simulation methods are under-used tools in genetic analysis because simulation approaches have been portrayed as inferior to analytic methods. Even when simulation is used, its advantages are not fully exploited. Here, I present SHIMSHON, our package of genetic simulation programs that have been developed, tested, used for research, and used to generate data for Genetic Analysis Workshops (GAW). These simulation programs, now web-accessible, can be used by anyone to answer questions about designing and analyzing genetic disease studies for locus identification. This work has three foci: (1) The historical context of SHIMSHON's development, suggesting why simulation has not been more widely used so far. (2) Advantages of simulation: computer simulation helps us to understand how genetic analysis methods work. It has advantages for understanding disease inheritance and methods for gene searches. Furthermore, simulation methods can be used to answer fundamental questions that either cannot be answered by analytical approaches or cannot even be defined until the problems are identified and studied using simulation. (3) I argue that, because simulation was not accepted, there was a failure to grasp the meaning of some simulation-based studies of linkage. This may have contributed to perceived weaknesses in linkage analysis, weaknesses that did not, in fact, exist.
Linkage analysis; Association analysis; Mode of inheritance; GWAS; Type 1 error; Affected sib pairs; LOD scores
Linkage analysis was developed to detect excess co-segregation of the putative alleles underlying a phenotype with the alleles at a marker locus in family data. Many different variations of this analysis and corresponding study design have been developed to detect this co-segregation. Linkage studies have been shown to have high power to detect loci that have alleles (or variants) with a large effect size, i.e. alleles that make large contributions to the risk of a disease or to the variation of a quantitative trait. However, alleles with a large effect size tend to be rare in the population. In contrast, association studies are designed to have high power to detect common alleles which tend to have a small effect size for most diseases or traits. Although genome-wide association studies have been successful in detecting many new loci with common alleles of small effect for many complex traits, these common variants often do not explain a large proportion of disease risk or variation of the trait. In the past, linkage studies were successful in detecting regions of the genome that were likely to harbor rare variants with large effect for many simple Mendelian diseases and for many complex traits. However, identifying the actual sequence variant(s) responsible for these linkage signals was challenging because of difficulties in sequencing the large regions implicated by each linkage peak. Current ‘next-generation’ DNA sequencing techniques have made it economically feasible to sequence all exons or the whole genomes of a reasonably large number of individuals. Studies have shown that rare variants are quite common in the general population, and it is now possible to combine these new DNA sequencing methods with linkage studies to identify rare causal variants with a large effect size. A brief review of linkage methods is presented here with examples of their relevance and usefulness for the interpretation of whole-exome and whole-genome sequence data.
Linkage; Genetics; DNA sequence; Whole-genome sequence; Whole-exome sequence
jPAP (Java Pedigree Analysis Package) performs variance components linkage analysis of either quantitative or discrete traits. Multivariate linkage analysis of two or more traits (all quantitative, all discrete, or any combination) allows the inference of pleiotropy between the traits. The inclusion of multiple quantitative trait loci in linkage analysis allows the inference of epistasis between loci. A user-friendly graphical user interface facilitates the usage of jPAP.
jPAP; Epistasis; Pleiotropy; Variance components linkage analysis; Quantitative trait loci
Multipoint (MP) linkage analysis represents a valuable tool for whole-genome studies but suffers from the disadvantage that its probability distribution is unknown and varies as a function of marker information and density, genetic model, number and structure of pedigrees, and the affection status distribution [Xing and Elston: Genet Epidemiol 2006;30:447–458; Hodge et al.: Genet Epidemiol 2008;32:800–815]. This implies that the MP significance criterion can differ for each marker and each dataset, and this fact makes planning and evaluation of MP linkage studies difficult. One way to circumvent this difficulty is to use simulations or permutation testing. Another approach is to use an alternative statistical paradigm to assess the statistical evidence for linkage, one that does not require computation of a p value. Here we show how to use the evidential statistical paradigm for planning, conducting, and interpreting MP linkage studies when the disease model is known (lod analysis) or unknown (mod analysis). As a key feature, the evidential paradigm decouples uncertainty (i.e. error probabilities) from statistical evidence. In the planning stage, the user calculates error probabilities, as functions of one's design choices (sample size, choice of alternative hypothesis, choice of likelihood ratio (LR) criterion k) in order to ensure a reliable study design. In the data analysis stage one no longer pays attention to those error probabilities. In this stage, one calculates the LR for two simple hypotheses (i.e. trait locus is unlinked vs. trait locus is located at a particular position) as a function of the parameter of interest (position). The LR directly measures the strength of evidence for linkage in a given data set and remains completely divorced from the error probabilities calculated in the planning stage. An important consequence of this procedure is that one can use the same criterion k for all analyses. 
This contrasts with the situation described above, in which the value one uses to conclude significance may differ for each marker and each dataset in order to accommodate a fixed test size, α. In this study we accomplish two goals that lead to a general algorithm for conducting evidential MP linkage studies. (1) We provide two theoretical results that translate into guidelines for investigators conducting evidential MP linkage: (a) Comparing mods to lods, error rates (including probabilities of weak evidence) are generally higher for mods when the null hypothesis is true, but lower for mods in the presence of true linkage. Royall [J Am Stat Assoc 2000;95:760–780] has shown that errors based on lods are bounded and generally small. Therefore when the true disease model is unknown and one chooses to use mods, one needs to control misleading evidence rates only under the null hypothesis; (b) for any given pair of contiguous marker loci, error rates under the null are greatest at the midpoint between the markers spaced furthest apart, which provides an obvious simple alternative hypothesis to specify for planning MP linkage studies. (2) We demonstrate through extensive simulation that this evidential approach can yield low error rates under the null and alternative hypotheses for both lods and mods, despite the fact that mod scores are not true LRs. Using these results we provide a coherent approach to implement a MP linkage study using the evidential paradigm.
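The data analysis stage described above can be illustrated with a small sketch. It assumes the usual evidential convention that an LR of at least k counts as strong evidence for one hypothesis, an LR of at most 1/k as strong evidence for the other, and anything in between as weak evidence. The function name `classify_evidence` and the value k = 32 are assumptions of this example, not recommendations from the study. For a true LR between two simple hypotheses, the probability of misleading evidence is bounded by 1/k; as noted above, mod scores are not true LRs, so that bound is shown only for the lod case.

```python
def classify_evidence(lod, k=32):
    """Interpret a lod score at one position against a fixed LR criterion k.

    lod: log10 likelihood ratio (linked at this position vs. unlinked)
    k:   LR criterion chosen at the planning stage (hypothetical default)
    """
    lr = 10 ** lod
    if lr >= k:
        return "strong evidence for linkage"
    if lr <= 1 / k:
        return "strong evidence against linkage"
    return "weak evidence"


def misleading_evidence_bound(k):
    """Universal bound on the probability of misleading evidence for a true LR."""
    return 1 / k


for lod in (2.0, 0.5, -2.0):
    print(lod, classify_evidence(lod))
print(misleading_evidence_bound(32))
```

Because the criterion is fixed in advance, the same k applies at every position and in every dataset, in contrast to a significance threshold that must be recalibrated to hold a test size α.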
Evidential paradigm; Likelihood; Parametric linkage; Complex disease
Linkage analysis identifies markers that appear to be co-inherited with a trait within pedigrees. The inheritance of a chromosomal segment may be probabilistically reconstructed, with missing data complicating inference. Inheritance patterns are further obscured in the analysis of complex traits, where variants in one or more genes may contribute to phenotypic variation within a pedigree. In this case, determining which relatives share a trait variant is not simple. We describe how to represent these patterns of inheritance for marker loci. We summarize how to sample patterns of inheritance consistent with genotypic and pedigree data using gl_auto, available in MORGAN v3.0. We describe identification of classes of equivalent inheritance patterns with the program IBDgraph. We finally provide an example of how these programs may be used to simplify interpretation of linkage analysis of complex traits in general pedigrees. We borrow information across loci in a parametric linkage analysis of a large pedigree. We explore the contribution of each equivalence class to a linkage signal, illustrate estimated patterns of identity-by-descent sharing, and identify a haplotype tagging the chromosomal segment driving the linkage signal. Haplotype carriers are more likely to share the linked trait variant, and can be prioritized for subsequent DNA sequencing.
Inheritance vector; Segregation; Genome scan; Haplotype; Equivalence class
Background and Methods
Association studies using unrelated individuals cannot detect intergenerational genetic effects contributing to disease. To detect these effects, we improve the extended maternal-fetal genotype (EMFG) incompatibility test to estimate any combination of maternal effects, offspring effects, and their interactions at polymorphic loci or multiple SNPs, using any size pedigrees. We explore the advantages of using extended pedigrees rather than nuclear families. We apply our methods to schizophrenia pedigrees to investigate whether the previously associated mother-daughter HLA-B matching is a genuine risk or the result of bias.
Simulations demonstrate that using the EMFG test with extended pedigrees increases power and precision, while partitioning extended pedigrees into nuclear families can underestimate intergenerational effects. Application to actual data demonstrates that mother-daughter HLA-B matching remains a schizophrenia risk factor. Furthermore, ascertainment and mate selection biases cannot by themselves explain the observed HLA-B matching and schizophrenia association.
Our results demonstrate the power of the EMFG test to examine intergenerational genetic effects, highlight the importance of pedigree rather than case/control or case-mother/control-mother designs, illustrate that pedigrees provide a means to examine alternative, non-causal mechanisms, and strongly support the hypothesis that HLA-B matching is causally involved in the etiology of schizophrenia in females.
Family-based association; Maternal-fetal genotype test; Extended maternal-fetal genotype test; Genotype interactions; Complex disease; Olfactory deficits; Gene-by-gene; Gene-by-environment; Major histocompatibility complex
We aim to quantify the effect of non-differential genotyping errors on the power of rare variant tests and identify those situations when genotyping errors are most harmful.
We simulated genotype and phenotype data for a range of sample sizes, minor allele frequencies, disease relative risks and numbers of rare variants. Genotype errors were then simulated using five different error models covering a wide range of error rates.
Even at very low error rates, misclassifying a common homozygote as a heterozygote translates into a substantial loss of power, a result that is exacerbated even further as the minor allele frequency decreases. While the power loss from heterozygote to common homozygote errors tends to be smaller for a given error rate, in practice heterozygote to homozygote errors are more frequent and, thus, will have measurable impact on power.
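A back-of-the-envelope calculation suggests why this error type becomes more damaging as the minor allele frequency falls. The sketch below is a simplification, not one of the five error models simulated in the study: it assumes Hardy-Weinberg proportions, assumes the only error is a common homozygote called as a heterozygote, and ignores rare homozygotes. The function name `false_het_fraction` is hypothetical. It returns the share of called heterozygotes that are not true carriers.

```python
def false_het_fraction(maf, error_rate):
    """Fraction of called heterozygotes that are really common homozygotes.

    Assumes Hardy-Weinberg proportions and a single error type
    (common homozygote -> heterozygote) occurring at rate error_rate.
    """
    hom_common = (1 - maf) ** 2
    het = 2 * maf * (1 - maf)
    false_het = error_rate * hom_common
    return false_het / (false_het + het)


# Same error rate, decreasing minor allele frequency.
for maf in (0.05, 0.01, 0.001):
    print(maf, round(false_het_fraction(maf, error_rate=0.001), 4))
```

Under these assumptions, an error rate of 0.1% at a variant with a minor allele frequency of 0.1% means that roughly a third of apparent carriers are spurious, which dilutes any carrier-based signal.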
Error rates from genotype-calling technology for next-generation sequencing data suggest that substantial power loss may be seen when applying current rare variant tests of association to called genotypes.
Sequencing data; Power; Case-control; Misclassification
We introduce an innovative multilocus test for disease association. It is an extension of an existing score test that gains power over alternative methods by incorporating a parsimonious one-degree-of-freedom model for interaction. We use our method in applications designed to detect interactions that generate hypotheses about the functionality of prostate cancer (PRCA) susceptibility regions.
Our proposed score test is designed to gain additional power through the use of a retrospective likelihood that exploits an assumption of independence between unlinked loci in the underlying population. Its performance is validated through simulation. The method is used in conditional scans with data from stage II of the Cancer Genetic Markers of Susceptibility PRCA genome-wide association study.
Our proposed method increases power to detect susceptibility loci in diverse settings. It identified two high-ranking, biologically interesting interactions: (1) rs748120 of NR2C2 and subregions of 8q24 that contain independent susceptibility loci specific to PRCA and (2) rs4810671 of SULF2 and both JAZF1 and HNF1B that are associated with PRCA and type 2 diabetes.
Our score test is a promising multilocus tool for genetic epidemiology. The results of our applications suggest functionality for poorly understood PRCA susceptibility regions. They motivate replication studies.
Gene-gene interaction; Score test; Prostate cancer
The measles virus (MV) interacts with two known cellular receptors: CD46 and SLAM. The transmembrane receptor CD209 interacts with MV and augments dendritic cell infection.
764 subjects previously immunized with measles-mumps-rubella vaccine were genotyped for 66 candidate SNPs in the CD46, SLAM and CD209 genes as part of a larger study.
A previously detected association of the CD46 SNP rs2724384 with measles-specific antibodies was successfully replicated in this study. Increased representation of the minor allele G for an intronic CD46 SNP was associated with an allele dose-related decrease (978 vs. 522 mIU/ml, p = 0.0007) in antibody levels. This polymorphism rs2724384 also demonstrated associations with IL-6 (p = 0.02), IFN-α (p = 0.007) and TNF-α (p = 0.0007) responses. Two polymorphisms (coding rs164288 and intronic rs11265452) in the SLAM gene that were associated with measles antibody levels in our previous study were associated with IFN-γ Elispot (p = 0.04) and IL-10 responses (p = 0.0008), respectively, in this study. We found associations between haplotypes, AACGGAATGGAAAG (p = 0.009) and GGCCGAGAGGAGAG (p < 0.001), in the CD46 gene and TNF-α secretion.
Understanding the functional and mechanistic consequences of these genetic polymorphisms on immune response variations could assist in directing new measles and potentially other viral vaccine design, and in better understanding measles immunogenetics.
Measles virus receptors; Single nucleotide polymorphisms; Measles vaccine immunity; SNP; CD46; SLAM; CD209; Replication study
Longitudinal measurements with bivariate response have been analyzed by several authors using two separate models for each response. However, for most of the biological or medical experiments, the two responses are highly correlated and hence a separate model for each response might not be a desirable way to analyze such data. A single model considering a bivariate response provides a more powerful inference as the correlation between the responses is modeled appropriately. In this article, we propose a dynamic statistical model to detect the genes controlling human blood pressure (systolic and diastolic).
By modeling the mean function with orthogonal Legendre polynomials and the covariance matrix with a stationary parametric structure, we incorporate the statistical ideas in functional genome-wide association studies to detect SNPs which have significant control on human blood pressure. The traditional false discovery rate is used for multiple comparisons.
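To make the mean-function model concrete, the sketch below evaluates Legendre polynomials with the standard three-term recurrence after rescaling measurement time to [-1, 1], and forms a genotype-specific mean curve as a weighted sum of them. The function names, the polynomial order, and the coefficient values are illustrative assumptions and are not taken from the Framingham analysis.

```python
def legendre_basis(t, t_min, t_max, order):
    """Legendre polynomials P_0..P_order at time t rescaled to [-1, 1]."""
    x = 2 * (t - t_min) / (t_max - t_min) - 1
    values = [1.0, x]
    for n in range(1, order):
        # Recurrence: (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]


def mean_curve(t, coefficients, t_min, t_max):
    """Mean response at time t for one genotype, given its coefficients."""
    basis = legendre_basis(t, t_min, t_max, len(coefficients) - 1)
    return sum(c * b for c, b in zip(coefficients, basis))


# Hypothetical coefficients for one genotype over a follow-up window of 0-4.
coefficients = [120.0, 6.0, -1.5, 0.4]
print(legendre_basis(3, 0, 4, 3))
print(mean_curve(3, coefficients, 0, 4))
```

Fitting one coefficient vector per genotype and comparing the resulting curves is the sense in which a SNP effect is tested over time rather than at a single visit.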
We analyze the data from the Framingham Heart Study to detect such SNPs by appropriately considering gender-gene interaction. We detect 8 SNPs for males and 7 for females which are most significant in controlling blood pressure. The genotype-specific mean curves and additive and dominant effects over time are shown for each significant SNP for both genders. Simulation studies are performed to examine the statistical properties of our model. The current model will be extremely useful in detecting genes controlling different traits and diseases for humans or non-human subjects.
Bivariate response; Functional GWAS; Multiple testing; SNP
Antibodies against infectious pathogens provide information on past or present exposure to infectious agents. While host genetic factors are known to affect the immune response, the influence of genetic factors on antibody levels to common infectious agents is largely unknown. Here we test whether antibody levels for 13 common infections are significantly heritable.
IgG antibodies to Chlamydophila pneumoniae, Helicobacter pylori, Toxoplasma gondii, adenovirus 36 (Ad36), hepatitis A virus, influenza A and B, cytomegalovirus, Epstein-Barr virus, herpes simplex virus (HSV)-1 and -2, human herpesvirus-6, and varicella zoster virus were determined for 1,227 Mexican Americans. Both quantitative and dichotomous (seropositive/seronegative) traits were analyzed. Influences of genetic and shared environmental factors were estimated using variance components pedigree analysis, and sharing of underlying genetic factors among traits was investigated using bivariate analyses.
Serological phenotypes were significantly heritable for most pathogens (h² = 0.17–0.39), except for Ad36 and HSV-2. Shared environment was significant for several pathogens (c² = 0.10–0.32). The underlying genetic etiology appears to be largely different for most pathogens.
Our results demonstrate, for the first time for many of these pathogens, that individual genetic differences of the human host contribute substantially to antibody levels to many common infectious agents, providing impetus for the identification of underlying genetic variants, which may be of clinical importance.
Pathogen; Infection; Antibody; Serology; Genetics; Heritability; Mexican Americans
Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. An understanding of the power of such methods for the discovery of genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR).
We use the algorithm-specific variable importance measures (VIMs) as statistics and employ permutation-based resampling to generate the null distribution and associated p values. The power of the three is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma.
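The permutation step can be sketched generically. In the example below, a simple absolute difference in mean genotype between cases and controls stands in for an algorithm-specific VIM; that stand-in, the function names, the number of permutations, and the toy data are all assumptions of this illustration. Phenotype labels are shuffled to break any genotype-phenotype association, and the p value is the proportion of permuted statistics at least as large as the observed one, with the usual add-one correction.

```python
import random


def permutation_p_value(statistic, genotypes, phenotypes, n_perm=999, seed=0):
    """Permutation p value for an arbitrary importance statistic."""
    rng = random.Random(seed)
    observed = statistic(genotypes, phenotypes)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        if statistic(genotypes, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


def mean_difference(genotypes, phenotypes):
    """Stand-in importance measure: |mean genotype in cases - in controls|."""
    cases = [g for g, y in zip(genotypes, phenotypes) if y == 1]
    controls = [g for g, y in zip(genotypes, phenotypes) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


# Toy SNP: 20 cases enriched for the minor allele, 20 controls.
genotypes = [2] * 8 + [1] * 8 + [0] * 4 + [2] * 1 + [1] * 5 + [0] * 14
phenotypes = [1] * 20 + [0] * 20

print(permutation_p_value(mean_difference, genotypes, phenotypes))
```

Any VIM that can be recomputed on relabeled data could be passed as `statistic`, at the cost of refitting the model once per permutation.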
The power of RF is highest in all simulation models, that of MCLR is similar to RF in half, and that of MDR is consistently the lowest.
Our study indicates that the power of RF VIMs is most reliable. However, in addition to tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.
Genetic associations; Power; Random forests; SNP; Variable importance measure
There has been increasing interest in detecting gene-gene and gene-environment interactions in genetic association studies. A major statistical challenge is how to deal with a large number of parameters measuring possible interaction effects, which leads to reduced power of any statistical test due to a large number of degrees of freedom or high cost of adjustment for multiple testing. Hence, a popular idea is to first apply some dimension reduction techniques before testing, while another is to apply only statistical tests that are developed for and robust to high-dimensional data. To combine both ideas, we propose applying an adaptive sum of squared score (SSU) test and several other adaptive tests. These adaptive tests are extensions of the adaptive Neyman test [Fan, 1996], which was originally proposed for high-dimensional data, providing a simple and effective way for dimension reduction. On the other hand, the original SSU test coincides with a version of a test specifically developed for high-dimensional data. We apply these adaptive tests and their original nonadaptive versions to simulated data to detect interactions between two groups of SNPs (e.g. multiple SNPs in two candidate regions). We found that for sparse models (i.e. with only a few non-zero interaction parameters), the adaptive SSU test and its close variant, an adaptive version of the weighted sum of squared score (SSUw) test, improved the power over their non-adaptive versions, and performed consistently well across various scenarios. The proposed adaptive tests are built in the general framework of regression analysis, and can thus be applied to various types of traits in the presence of covariates.
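As a rough illustration of the nonadaptive score-based statistics named above, the sketch below computes the score vector and the SSU, SSUw, and UminP statistics. It assumes a binary trait, an intercept-only null model with no covariates, and only the diagonal of the score covariance; these simplifications, and the function name score_statistics, are assumptions for illustration. The adaptive versions and the reference distributions needed for p values are not shown.

```python
def score_statistics(x, y):
    """x: list of rows (one per subject), each a list of p predictor values
    (e.g. interaction terms); y: list of 0/1 trait values.
    Assumes no covariates and that no predictor is constant."""
    n, p = len(y), len(x[0])
    y_bar = sum(y) / n
    x_bar = [sum(row[j] for row in x) / n for j in range(p)]
    # Score components under the null: U_j = sum_i (y_i - y_bar) * x_ij
    u = [sum((y[i] - y_bar) * x[i][j] for i in range(n)) for j in range(p)]
    # Diagonal of Cov(U) under the null:
    # V_jj = y_bar * (1 - y_bar) * sum_i (x_ij - x_bar_j)^2
    v = [
        y_bar * (1 - y_bar) * sum((x[i][j] - x_bar[j]) ** 2 for i in range(n))
        for j in range(p)
    ]
    return {
        "U": u,
        "SSU": sum(uj ** 2 for uj in u),                       # sum of squared scores
        "SSUw": sum(uj ** 2 / vj for uj, vj in zip(u, v)),     # variance-weighted sum
        "UminP": max(uj ** 2 / vj for uj, vj in zip(u, v)),    # largest single component
    }
```

For example, score_statistics([[0, 1], [1, 1], [2, 0], [2, 0]], [0, 0, 1, 1]) gives the score vector [1.5, -1.0] and an SSU statistic of 3.25.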
Complex traits; Epistasis; Logistic regression; Adaptive Neyman test; Simulation; SSU test; Sum test; UminP test
Multiple rare variants have been suggested as accounting for some of the associations with common single nucleotide polymorphisms identified in genome-wide association studies or possibly some of the as yet undiscovered heritability. We consider the power of various approaches to designing substudies aimed at using next-generation sequencing technologies to discover novel variants and to select some subsets that are possibly causal for genotyping in the original case-control study and testing for association using various weighted sum indices. We find that the selection of variants based on the statistical significance of the case-control difference in the subsample yields good power for testing rare variant indices in the main study, and that multivariate models including both the summary index of rare variants and the associated common single nucleotide polymorphisms can distinguish which is the causal factor. By simulation, we explore the effects of varying the size of the discovery subsample, choice of index, and true causal model.
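A weighted sum index of the kind referred to above can be sketched as a per-subject weighted count of minor alleles across the selected variants. The weighting shown (inverse of the binomial standard deviation of a pseudocount-adjusted allele frequency, so rarer variants count more) is only one commonly used choice and is an assumption here, as are the function names; the study compares several indices that are not specified in this summary.

```python
import math


def allele_frequencies(genotypes):
    """genotypes: list of rows (one per subject) of minor-allele counts (0/1/2).
    Returns pseudocount-adjusted frequencies so that no estimate is exactly 0."""
    n, p = len(genotypes), len(genotypes[0])
    return [(sum(row[j] for row in genotypes) + 1) / (2 * n + 2) for j in range(p)]


def inverse_sd_weights(freqs):
    """Hypothetical weighting choice: up-weight rarer variants."""
    return [1 / math.sqrt(q * (1 - q)) for q in freqs]


def weighted_sum_index(genotypes, weights):
    """Burden index for each subject: sum over variants of weight * allele count."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]
```

With unit weights the index reduces to a simple count of minor alleles carried by each subject; the resulting index would then be tested for association with case-control status in the main study.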
Next-generation sequencing; Multiple rare variants; Burden indices; Study design
Our goal was to evaluate the influence of quality control (QC) decisions using two genotype calling algorithms, CRLMM and Birdseed, designed for the Affymetrix SNP Array 6.0.
Various QC options were tried using the two algorithms and comparisons were made on subject and call rate and on association results using two data sets.
For Birdseed, we recommend using the contrast QC instead of QC call rate for sample QC. For CRLMM, we recommend using a signal-to-noise ratio ≥4 for sample QC and a posterior probability of 90% for genotype accuracy. For both algorithms, we recommend calling genotypes separately for each plate, and dropping SNPs with a lower call rate (<95%) before evaluating samples with lower call rates. To investigate whether the genotype calls from the two algorithms impacted the genome-wide association results, we performed association analysis using data from the GENOA cohort; we observed that the number of significant SNPs was similar using either CRLMM or Birdseed.
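The recommended ordering (drop low-call-rate SNPs first, then evaluate samples) can be sketched as below. This is an illustrative sketch only: the data layout, the function names, and the sample-level threshold are assumptions, and only the 95% SNP call rate comes from the recommendation above. It does not use the CRLMM or Birdseed software.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a no-call)."""
    return sum(c is not None for c in calls) / len(calls) if calls else 0.0


def qc_filter(genotypes, snp_min=0.95, sample_min=0.95):
    """genotypes: dict mapping sample ID -> list of calls, one per SNP.
    sample_min is a hypothetical threshold chosen for illustration."""
    samples = list(genotypes)
    n_snps = len(genotypes[samples[0]])
    # Step 1: drop SNPs whose call rate across all samples is below snp_min.
    kept_snps = [
        j for j in range(n_snps)
        if call_rate([genotypes[s][j] for s in samples]) >= snp_min
    ]
    # Step 2: evaluate each sample only on the SNPs that survived step 1,
    # so that poorly performing SNPs do not cause good samples to be removed.
    kept_samples = [
        s for s in samples
        if call_rate([genotypes[s][j] for j in kept_snps]) >= sample_min
    ]
    return kept_snps, kept_samples
```

In a toy data set where one SNP fails in most samples, evaluating samples before removing that SNP would lower every sample's call rate and could discard samples that are otherwise fully called; filtering SNPs first avoids this.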
Using our suggested workflow, both algorithms performed similarly; however, fewer samples were removed with CRLMM, which also took half the time to run our 854 study samples (4.2 h) compared to Birdseed (8.4 h).
Genotype call; Birdseed; CRLMM; Quality control decisions; Association
Studying one locus or one single nucleotide polymorphism (SNP) at a time may not be sufficient to understand complex diseases because they are unlikely to result from the effect of only one SNP. Each SNP alone may have little or no effect on the risk of the disease, but together they may increase the risk substantially. Analyses focusing on individual SNPs ignore the possibility of interaction among SNPs. In this paper, we propose a parsimonious model to assess the joint effect of a group of SNPs in a case-control study. The model implements a data reduction strategy within a likelihood framework and uses a test to assess the statistical significance of the effect of the group of SNPs on the binary trait. The primary advantage of the proposed approach is that the dimension reduction technique produces a test statistic with significantly fewer degrees of freedom than a multiple logistic regression model with only main effects of the SNPs, while our parsimonious model can still incorporate the possibility of interaction among the SNPs. Moreover, the proposed approach estimates the direction of association of each SNP with the disease and provides an estimate of the average effect of the SNPs that are positively associated with the disease and of those that are negatively associated with it in the given SNP set. We illustrate the proposed model on simulated and real data, and compare its performance with a few other existing approaches. Our proposed approach appeared to outperform the other approaches for independent SNPs in our simulation studies.
Case-control study; Gene-gene interaction; Dimension reduction