In genetic association studies, it is necessary to correct for population structure to avoid inference bias. During the past decade, prevailing corrections often only involved adjustments of global ancestry differences between sampled individuals. Nevertheless, population structure may vary across local genomic regions due to the variability of local ancestries associated with natural selection, migration, or random genetic drift. Adjusting for global ancestry alone may be inadequate when local population structure is an important confounding factor. In contrast, adjusting for local ancestry can more effectively prevent false-positives due to local population structure. To more accurately locate disease genes, we recommend adjusting for local ancestries by interrogating local structure. In practice, locus-specific ancestries are usually unknown and cannot be accurately inferred when ancestral population information is not available. For such scenarios, we propose employing local principal components (PCs) to represent local ancestries and adjusting for local PCs when testing for genotype–phenotype association. With an acceptable computational burden, the proposed algorithm successfully eliminates the known spurious association between SNPs in the LCT gene and height due to the population structure in European Americans.
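The idea of adjusting for a local PC can be illustrated with a minimal sketch. It is not the authors' algorithm: the window definition, the use of a single PC, the partial-correlation test, and all function names (`first_pc`, `residualize`, `adjusted_corr`) are assumptions made for illustration. Genotypes are assumed to be coded as 0/1/2 allele counts, and the window is assumed to contain at least one polymorphic SNP.

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ac, bc = center(a), center(b)
    num = sum(x * y for x, y in zip(ac, bc))
    return num / math.sqrt(sum(x * x for x in ac) * sum(y * y for y in bc))

def first_pc(window, iterations=200):
    """Leading PC scores (one per individual) of a local SNP window.

    `window` is a list of SNP columns, each a list of 0/1/2 genotypes.
    Power iteration is applied to X X^T, where X holds the centered columns.
    """
    cols = [center(c) for c in window]
    n = len(cols[0])
    u = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(c[i] * u[i] for i in range(n)) for c in cols]              # X^T u
        u = [sum(c[i] * w[j] for j, c in enumerate(cols)) for i in range(n)]  # X w
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u

def residualize(y, pc):
    """Residuals of y after simple linear regression on pc (with intercept)."""
    yc, pcc = center(y), center(pc)
    beta = sum(a * b for a, b in zip(yc, pcc)) / sum(b * b for b in pcc)
    return [a - beta * b for a, b in zip(yc, pcc)]

def adjusted_corr(snp, phenotype, pc):
    """Partial correlation of SNP and phenotype given the local PC."""
    return corr(residualize(snp, pc), residualize(phenotype, pc))

# Toy data (hypothetical): two ancestry groups of four individuals each.
window = [[0, 0, 0, 0, 2, 2, 2, 2], [0, 0, 0, 0, 2, 2, 2, 2]]
snp = [0, 1, 0, 1, 1, 2, 1, 2]        # frequency differs between groups
phenotype = [1, 1, 2, 2, 3, 3, 4, 4]  # mean differs between groups
pc = first_pc(window)
print(corr(snp, phenotype), adjusted_corr(snp, phenotype, pc))
```

In this toy data the SNP and the phenotype are unrelated within each group, so the unadjusted correlation (about 0.63) is driven entirely by the group difference, and it vanishes up to rounding error once the local PC is regressed out.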
Genome-wide association studies; Local ancestries; Local principal components; Migration; Random genetic drift; Natural selection; Genomic inflation factor; Genomic control; Local ancestry principal components correction; Fine mapping
It has been postulated that multiple-marker methods may have added ability, over single-marker methods, to detect genetic variants associated with disease. The Wellcome Trust Case Control Consortium (WTCCC) provided the first successful large genome-wide association studies (GWAS), which included single-marker association analyses for seven common complex diseases. Of those signals detected, only one was associated with coronary artery disease (CAD), and none were identified for hypertension (HTN). Our objective was to find additional genetic associations and pathways for cardiovascular disease by examining the WTCCC data for variants associated with CAD and HTN using two-marker testing methods. We applied two-marker association testing to the WTCCC dataset, which includes ~2,000 affected individuals with each disorder, and a shared pool of ~3,000 controls, all genotyped using Affymetrix GeneChip 500 K arrays. For CAD, we detected single nucleotide polymorphism (SNP) pairs in three genes showing genome-wide significance: HFE2, STK32B, and DIPC2. The most notable SNP pairs in a non-protein-coding region were at 9p21, a known major CAD-associated region. For HTN, we detected SNP pairs in five genes: GPR39, XRCC4, MYO6, ZFAT, and MACROD2. Four further associated SNP pair regions were at least 70 kb from any known gene. We have shown that novel, multiple-marker, statistical methods can be of use in finding variants in GWAS. We describe many new, associated variants for both CAD and HTN and describe their known genetic mechanisms.
Populations of ethnic mixtures can be useful in genetic studies. Admixture mapping, or mapping by admixture linkage disequilibrium (MALD), is specially developed for admixed populations and can supplement traditional genome-wide association analyses in the search for genetic variants underlying complex traits. Admixture mapping tests the association between a trait and locus-specific ancestries. The locus-specific ancestries are in linkage disequilibrium (LD) which is generated by the admixture process between genetically distinct ancestral populations. Because of highly correlated locus-specific ancestries, admixture mapping performs many fewer independent tests across the genome than current genome-wide association analysis. Therefore, admixture mapping can be more powerful because of the smaller penalty due to multiple tests. In this chapter, I introduce the theory behind admixture mapping and how we conduct the analysis in practice.
Admixture mapping; Population admixture; Ancestry information marker; Hidden Markov model
Interactions among genomic loci (also known as epistasis) have been suggested as one of the potential sources of missing heritability in single locus analysis of genome-wide association studies (GWAS). The computational burden of searching for interactions is compounded by the extremely low threshold for identifying significant p-values due to multiple hypothesis testing corrections. Utilizing prior biological knowledge to restrict the set of candidate SNP pairs to be tested can alleviate this problem, but systematic studies that investigate the relative merits of integrating different biological frameworks and GWAS data have not been conducted.
We developed four biologically based frameworks to identify pairwise interactions among candidate SNP pairs as follows: (1) for each human protein-coding gene, a set of SNPs associated with that gene was constructed providing a gene-based interaction model, (2) for each known biological pathway, a set of SNPs associated with the genes in the pathway was constructed providing a pathway-based interaction model, (3) a set of SNPs associated with genes in a disease-related subnetwork provides a network-based interaction model, and (4) a framework based on the function of SNPs. The last approach uses expression SNPs (eSNPs or eQTLs), which are SNPs or loci that have defined effects on the abundance of transcripts of other genes. We constructed pairs of eSNPs and SNPs located in the target genes whose expression is regulated by eSNPs. For all four frameworks the SNP sets were exhaustively tested for pairwise interactions within the sets using a traditional logistic regression model after excluding genes that were previously identified to associate with the trait. Using previously published GWAS data for type 2 diabetes (T2D) and the biologically based pairwise interaction modeling, we identify twelve genes not seen in the previous single locus analysis.
We present four approaches to detect interactions associated with complex diseases. The results show our approaches outperform the traditional single locus approaches in detecting genes that previously did not reach significance; the results also provide novel drug targets and biomarkers relevant to the underlying mechanisms of disease.
Although obstructive sleep apnea (OSA) is known to have a strong familial basis, no genetic polymorphisms influencing apnea risk have been identified in cross-cohort analyses. We utilized the National Heart, Lung, and Blood Institute (NHLBI) Candidate Gene Association Resource (CARe) to identify sleep apnea susceptibility loci. Using a panel of 46,449 polymorphisms from roughly 2,100 candidate genes on a customized Illumina iSelect chip, we tested for association with the apnea hypopnea index (AHI) as well as moderate to severe OSA (AHI≥15) in 3,551 participants of the Cleveland Family Study and two cohorts participating in the Sleep Heart Health Study.
Among 647 African-Americans, rs11126184 in the pleckstrin (PLEK) gene was associated with OSA while rs7030789 in the lysophosphatidic acid receptor 1 (LPAR1) gene was associated with AHI using a chip-wide significance threshold of p-value<2×10−6. Among 2,904 individuals of European ancestry, rs1409986 in the prostaglandin E2 receptor (PTGER3) gene was significantly associated with OSA. Consistency of effects between rs7030789 and rs1409986 in LPAR1 and PTGER3 and apnea phenotypes was observed in independent clinic-based cohorts.
Novel genetic loci for apnea phenotypes were identified through the use of customized gene chips and meta-analyses of cohort data with replication in clinic-based samples. The identified SNPs all lie in genes associated with inflammation suggesting inflammation may play a role in OSA pathogenesis.
Genome-wide genotyping of a cohort using pools rather than individual samples has long been proposed as a cost-saving alternative for performing genome-wide association (GWA) studies. However, successful disease gene mapping using pooled genotyping has thus far been limited to detecting common variants with large effect sizes, which tend not to exist for many complex common diseases or traits. Therefore, for DNA pooling to be a viable strategy for conducting GWA studies, it is important to determine whether commonly used genome-wide SNP array platforms such as the Affymetrix 6.0 array can reliably detect common variants of small effect sizes using pooled DNA. Taking obesity and age at menarche as examples of human complex traits, we assessed the feasibility of genome-wide genotyping of pooled DNA as a single-stage design for phenotype association. By individually genotyping the top associations identified by pooling, we obtained a 14- to 16-fold enrichment of SNPs nominally associated with the phenotype, but we likely missed the top true associations. In addition, we assessed whether genotyping pooled DNA can serve as an inexpensive screen as the second stage of a multi-stage design with a large number of samples by comparing the most cost-effective 3-stage designs with 80% power to detect common variants with genotypic relative risk of 1.1, with and without pooling. Given the current state of the specific technology we employed and the associated genotyping costs, we showed through simulation that a design involving pooling would be 1.07 times more expensive than a design without pooling. Thus, while a significant amount of information exists within the data from pooled DNA, our analysis does not support genotyping pooled DNA as a means to efficiently identify common variants contributing small effects to phenotypes of interest. 
While our conclusions were based on the specific technology and study design we employed, the approach presented here will be useful for evaluating the utility of other or future genome-wide genotyping platforms in pooled DNA studies.
Many complex diseases are influenced by genetic variations in multiple genes, each with only a small marginal effect on disease susceptibility. Pathway analysis, which identifies biological pathways associated with disease outcome, has become increasingly popular for genome-wide association studies (GWAS). In addition to combining weak signals from a number of SNPs in the same pathway, results from pathway analysis also shed light on the biological processes underlying disease. We propose a new pathway-based analysis method for GWAS, the supervised principal component analysis (SPCA) model. In the proposed SPCA model, a selected subset of SNPs most associated with disease outcome is used to estimate the latent variable for a pathway. The estimated latent variable for each pathway is an optimal linear combination of a selected subset of SNPs; therefore, the proposed SPCA model provides the ability to borrow strength across the SNPs in a pathway. In addition to identifying pathways associated with disease outcome, SPCA also carries out additional within-category selection to identify the most important SNPs within each gene set. The proposed model operates in a well-established statistical framework and can handle design information such as covariate adjustment and matching information in GWAS. We compare the proposed method with currently available methods using data with realistic linkage disequilibrium structures and we illustrate the SPCA method using the Wellcome Trust Case-Control Consortium Crohn Disease (CD) dataset.
SNPs; genome-wide association; pathway analysis; principal component analysis
Hyper-phosphorylation at the Y705 residue of signal transducer and activator of transcription 3 (STAT3) is implicated in tumorigenesis of leukemia and some solid tumors. However, its role in the development of colorectal cancer (CRC) is not well defined. To rigorously test the impact of this phosphorylation on colorectal tumorigenesis, we engineered a STAT3 Y705F knock-in to interrupt STAT3 activity in HCT116 and RKO CRC cells. These STAT3 Y705F mutant cells fail to respond to cytokine stimulation and grow slower than parental cells. These mutant cells are also greatly diminished in their abilities to form colonies in culture, to exhibit anchorage-independent growth in soft agar, and to grow as xenografts in nude mice. These observations strongly support the premise that STAT3 Y705 phosphorylation is crucial in colorectal tumorigenesis. Although it is generally believed that STAT3 functions as a transcription factor, recent studies indicate that transcription-independent functions of STAT3 also play an important role in tumorigenesis. We show here that wild-type STAT3, but not STAT3 Y705F mutant protein, associates with PLCγ1. PLCγ1 is a central signal transducer of growth factor and cytokine signaling pathways that are involved in tumorigenesis. In STAT3 Y705F mutant CRC cells, PLCγ1 activity is reduced. Moreover, over-expression of a constitutively active form of PLCγ1 rescues the transformation defect of STAT3 Y705F mutant cells. In aggregate, our study identifies previously unknown cross-talk between STAT3 and the PLCγ signaling pathways that may play a critical role in colorectal tumorigenesis.
STAT3; PLC; colorectal cancer; phosphorylation; PTPRT
It is generally known that risk variants segregate together with a disease within families, but this information has not been used in the existing statistical methods for detecting rare variants. Here we introduce two weighted sum statistics that can apply to either genome-wide association data or resequencing data for identifying rare disease variants: weights calculated based on sibpairs and odds ratios, respectively. We evaluated the two methods via extensive simulations under different disease models. We compared the proposed methods with the weighted sum statistic (WSS) proposed by Madsen and Browning, keeping the same genotyping or resequencing cost. Our methods clearly demonstrate more statistical power than the WSS. In addition, we found using sibpair information can increase power over using only unrelated samples by more than 40%. We applied our methods to the Framingham Heart Study (FHS) and Wellcome Trust Case Control Consortium (WTCCC) hypertension datasets. Although we did not identify any genes as reaching a genome-wide significance level, we found variants in the candidate gene angiotensinogen (AGT) significantly associated with hypertension at P=6.9×10−4, whereas the most significant single SNP association evidence is P=0.063. We further applied the odds ratio weighted method to the IFIH1 gene for type 1 diabetes in the WTCCC data. Our method yielded a P value of 4.82×10−4, much more significant than that obtained by haplotype-based methods. We demonstrated that family data are extremely informative in searching for rare variants underlying complex traits, and the odds ratio weighted sum statistic is more efficient than currently existing methods.
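For orientation, the baseline WSS of Madsen and Browning that serves as the comparison method can be sketched as follows. This is only the baseline, not the sibpair or odds ratio weights proposed here, whose formulas are not given in this abstract. The function names and the toy data are hypothetical, genotypes are assumed to be complete minor-allele counts (0/1/2), and the permutation step that yields a P value is omitted.

```python
import math

def wss_scores(genotypes, affected):
    """Per-individual weighted-sum scores.

    genotypes[j][i] is the minor-allele count of individual j at variant i;
    affected[j] is True for cases. Each variant is weighted by
    w_i = sqrt(n * q_i * (1 - q_i)), where q_i = (m_i + 1) / (2 * n_u + 2)
    is estimated from the m_i minor alleles seen in the n_u unaffected.
    """
    n = len(genotypes)
    n_var = len(genotypes[0])
    n_u = sum(1 for a in affected if not a)
    weights = []
    for i in range(n_var):
        m_u = sum(g[i] for g, a in zip(genotypes, affected) if not a)
        q = (m_u + 1) / (2 * n_u + 2)
        weights.append(math.sqrt(n * q * (1 - q)))
    return [sum(g[i] / weights[i] for i in range(n_var)) for g in genotypes]

def rank_sum_affected(scores, affected):
    """Sum of the mid-ranks of the affected individuals' scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mid = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mid
        pos = end + 1
    return sum(r for r, a in zip(ranks, affected) if a)

# Toy data (hypothetical): two cases each carry one rare allele, controls carry none.
genotypes = [[1, 0], [0, 1], [0, 0], [0, 0]]
affected = [True, True, False, False]
scores = wss_scores(genotypes, affected)
print(scores, rank_sum_affected(scores, affected))
```

Because the weights shrink as the allele becomes rarer in the unaffected, rare alleles contribute more to each score. In a full analysis the rank sum would be compared with its distribution under permuted affection status.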
Admixture mapping based on recently admixed populations is a powerful method to detect disease variants with substantial allele frequency differences in ancestral populations. We performed admixture mapping analysis for systolic blood pressure (SBP) and diastolic blood pressure (DBP), followed by trait-marker association analysis, in 6303 unrelated African-American participants of the Candidate Gene Association Resource (CARe) consortium. We identified five genomic regions (P< 0.001) harboring genetic variants contributing to inter-individual BP variation. In follow-up association analyses, correcting for all tests performed in this study, three loci were significantly associated with SBP and one significantly associated with DBP (P< 10−5). Further analyses suggested that six independent single-nucleotide polymorphisms (SNPs) contributed to the phenotypic variation observed in the admixture mapping analysis. These six SNPs were examined for replication in multiple, large, independent studies of African-Americans [Women's Health Initiative (WHI), Maywood, Genetic Epidemiology Network of Arteriopathy (GENOA) and Howard University Family Study (HUFS)] as well as one native African sample (Nigerian study), with a total replication sample size of 11 882. Meta-analysis of the replication set identified a novel variant (rs7726475) on chromosome 5 between the SUB1 and NPR3 genes, as being associated with SBP and DBP (P< 0.0015 for both); in meta-analyses combining the CARe samples with the replication data, we observed P-values of 4.45 × 10−7 for SBP and 7.52 × 10−7 for DBP for rs7726475 that were significant after accounting for all the tests performed. Our study highlights that admixture mapping analysis can help identify genetic variants missed by genome-wide association studies because of the drastically reduced number of tests across the whole genome.
The structure of 3-methyladenine DNA glycosylase I in complex with 3-methyladenine is reported.
The removal of chemically damaged DNA bases such as 3-methyladenine (3-MeA) is an essential process in all living organisms and is catalyzed by the enzyme 3-MeA DNA glycosylase I. A key question is how the enzyme selectively recognizes the alkylated 3-MeA over the much more abundant adenine. The crystal structures of native and Y16F-mutant 3-MeA DNA glycosylase I from Staphylococcus aureus in complex with 3-MeA are reported to 1.8 and 2.2 Å resolution, respectively. Isothermal titration calorimetry shows that protonation of 3-MeA decreases its binding affinity, confirming previous fluorescence studies that show that charge–charge recognition is not critical for the selection of 3-MeA over adenine. It is hypothesized that the hydrogen-bonding pattern of Glu38 and Tyr16 of 3-MeA DNA glycosylase I with a particular tautomer unique to 3-MeA contributes to recognition and selection.
3-methyladenine DNA glycosylase I; fluorescence measurements; ITC; DNA repair; recognition
Recently a fluorination enzyme was identified and isolated from Streptomyces cattleya, as the first committed step on the metabolic pathway to the fluorinated metabolites, fluoroacetate and 4-fluorothreonine. This enzyme, 5′-fluoro-5′-deoxy adenosine synthetase (FDAS), has been shown to catalyze C-F bond formation by nucleophilic attack of fluoride ion on S-adenosyl-L-methionine (SAM) with the concomitant displacement of L-methionine to generate 5′-fluoro-5′-deoxy adenosine (5′-FDA). Although the structures of FDAS bound to both SAM and products have been solved, the molecular mechanism remained to be elucidated. We now report site directed mutagenesis studies, structural analyses and isothermal titration calorimetry (ITC) experiments. The data establish the key residues required for catalysis and the order of substrate binding. Fluoride ion is not readily distinguished from water by protein X-ray crystallography; however, using chloride ion (also a substrate) with mutants of low activity has enabled the halide ion to be located in non-productive co-complexes with SAH and SAM. The kinetic data suggest the positively charged sulfur of SAM is a key requirement in stabilizing the transition state. We propose a molecular mechanism for FDAS in which fluoride weakly associates with the enzyme exchanging two water molecules for protein ligation. The binding of SAM expels remaining water associated with fluoride ion and traps the ion in a pocket positioned to react with SAM, generating L-methionine and 5′-FDA. L-SAM then dissociates from the enzyme followed by 5′-FDA.
Inadequate liver regeneration (LR) is still an unsolved problem in major liver resection and small-for-size syndrome post-living donor liver transplantation. A number of microRNAs have been shown to play important roles in cell proliferation. Herein, we investigated the role of miR-26a as a pivotal regulator of hepatocyte proliferation in LR.
Adult male C57BL/6J mice, undergoing 70% partial hepatectomy (PH), were treated with Ad5-anti-miR-26a-LUC or Ad5-miR-26a-LUC or Ad5-LUC vector via portal vein. The animals were subjected to in vivo bioluminescence imaging. Serum and liver samples were collected to test liver function, calculate liver-to-body weight ratio (LBWR), document hepatocyte proliferation (Ki-67 staining), and investigate potential targeted gene expression of miR-26a by quantitative real-time PCR and Western blot. The miR-26a level declined during LR after 70% PH. Down-regulation of miR-26a by anti-miR-26a expression led to enhanced proliferation of hepatocytes, and both LBWR and hepatocyte proliferation (Ki-67+ cells %) showed an increased tendency, while liver damage, indicated by aspartate aminotransferase (AST), alanine aminotransferase (ALT) and total bilirubin (T-Bil), was reduced. Furthermore, CCND2 and CCNE2, as possible targeted genes of miR-26a, were up-regulated. In addition, miR-26a over-expression showed converse results.
MiR-26a plays a crucial role in regulating the proliferative phase of LR, probably by repressing expression of the cell cycle proteins CCND2 and CCNE2. The current study reveals a novel miRNA-mediated regulation pattern during the proliferative phase of LR.
Motivation: Admixed populations offer a unique opportunity for mapping diseases that have large disease allele frequency differences between ancestral populations. However, association analysis in such populations is challenging because population stratification may lead to association with loci unlinked to the disease locus.
Methods and results: We show that local ancestry at a test single nucleotide polymorphism (SNP) may confound with the association signal and ignoring it can lead to spurious association. We demonstrate theoretically that adjustment for local ancestry at the test SNP is sufficient to remove the spurious association regardless of the mechanism of population stratification, whether due to local or global ancestry differences among study subjects; however, global ancestry adjustment procedures may not be effective. We further develop two novel association tests that adjust for local ancestry. Our first test is based on a conditional likelihood framework which models the distribution of the test SNP given disease status and flanking marker genotypes. A key advantage of this test lies in its ability to incorporate different directions of association in the ancestral populations. Our second test, which is computationally simpler, is based on logistic regression, with adjustment for local ancestry proportion. We conducted extensive simulations and found that the Type I error rates of our tests are under control; however, the global adjustment procedures yielded inflated Type I error rates when stratification is due to local ancestry difference.
Supplementary information: Supplementary data are available at Bioinformatics online.
Although recent studies have attempted to dispel the confusion that exists in regard to the definition, analysis and interpretation of interaction in genetics, there still remain aspects that are poorly understood by non-statisticians. After a brief discussion of the definition of gene-gene interaction, the main part of this study addresses the fundamental meaning of statistical interaction and its relationship to measurement scale, disproportionate sample sizes in the cells of a two-way table and gametic phase disequilibrium.
Epistasis; Gametic phase disequilibrium; Interaction; Transformation
Next-generation sequencing technology provides new opportunities and challenges in the search for genetic variants that underlie complex traits. It will also presumably uncover many new rare variants, but exactly how these variants should be incorporated into the data analysis remains a question. Several papers in our group from Genetic Analysis Workshop 17 evaluated different methods of rare variant analysis, including single-variant, gene-based, and pathway-based analyses and analyses that incorporated biological information. Although the performance of some of these methods strongly depends on the underlying disease model, integration of known biological information is helpful in detecting causal genes. Two work groups demonstrated that use of a Bayesian network and a collapsing receiver operating characteristic curve approach improves risk prediction when a disease is caused by many rare variants. Another work group suggested that modeling local rather than global ancestry may be beneficial when controlling the effect of population structure in rare variant association analysis.
rare variant; association analysis; risk prediction model; population structure; biological information; receiver operating characteristic; Bayesian network
Motivation: Adjustment for population structure is necessary to avoid bias in genetic association studies of susceptibility variants for complex diseases. Population structure may differ from one genomic region to another due to the variability of individual ancestry associated with migration, random genetic drift or natural selection. Current association methods for correcting population stratification usually involve adjustment of global ancestry between study subjects.
Results: We suggest interrogating local population structure for fine mapping to more accurately locate true causal genes by better adjusting for the confounding effect due to local ancestry. By extensive simulations on genome-wide datasets, we show that adjusting for global ancestry may lead to false positives when local population structure is an important confounding factor. In contrast, adjusting for local ancestry can effectively prevent false positives due to local population structure and thus can improve fine mapping for disease gene localization. We applied the local and global adjustments to the analysis of datasets from three genome-wide association studies, including European Americans, African Americans and Nigerians. Both European Americans and African Americans demonstrate greater variability in local ancestry than Nigerians. Adjusting for local ancestry successfully eliminated the known spurious association between SNPs in the LCT gene and height due to the population structure existing in European Americans.
Supplementary information: Supplementary data are available at Bioinformatics online.
Because obstructive sleep apnea (OSA) is associated with increased levels of inflammatory cytokines, we examined the relationship between OSA and polymorphisms for interleukin-6 (IL-6).
Six single nucleotide polymorphisms (SNPs) within IL-6 were genotyped in 259 African-Americans from the Cleveland Family Study with replication conducted in the Cardiovascular Health Study (n=124). OSA was dichotomized into present (apnea hypopnea index (AHI)>15 or on treatment) versus absent (AHI<5). Logistic regression was conducted, adjusting for age and sex in models with and without body mass index (BMI).
SNP IL6-6021 was associated with a decreased risk of OSA after adjusting for BMI (Odds Ratio for T allele 0.24; 95%CI [0.09–0.67]; p=0.006; q=0.07) under an additive model. This same allele was associated with increased BMI. The results from the replication sample were consistent in direction though not statistically significant (p=0.23). The SNPs were studied in European-Americans, although the minor allele frequency in IL6-6021 was too low (4%) for meaningful comparisons.
A synonymous SNP within the IL-6 coding region was protective against OSA in African-Americans, with qualitatively similar findings observed in another cohort. This suggests that variants in IL-6 may influence the risk of OSA in a pathway that is not explained by obesity.
Genetic Analysis Workshop 17 (GAW17) provided a platform for evaluating existing statistical genetic methods and for developing novel methods to analyze rare variants that modulate complex traits. In this article, we present an overview of the 1000 Genomes Project exome data and simulated phenotype data that were distributed to GAW17 participants for analyses, the different issues addressed by the participants, and the process of preparation of manuscripts resulting from the discussions during the workshop.
The Genetic Analysis Workshop 17 data we used comprise 697 unrelated individuals genotyped at 24,487 single-nucleotide polymorphisms (SNPs) from a mini-exome scan, using real sequence data for 3,205 genes annotated by the 1000 Genomes Project and simulated phenotypes. We studied 200 sets of simulated phenotypes of trait Q2. An important feature of this data set is that most SNPs are rare, with 87% of the SNPs having a minor allele frequency less than 0.05. For rare SNP detection, in this study we performed a least absolute shrinkage and selection operator (LASSO) regression and F tests at the gene level and calculated the generalized degrees of freedom to avoid any selection bias. For comparison, we also carried out linear regression and the collapsing method, which sums the rare SNPs, modified for a quantitative trait and with two different allele frequency thresholds. The aim of this paper is to evaluate these four approaches in this mini-exome data and compare their performance in terms of power and false positive rates. In most situations the LASSO approach is more powerful than linear regression and collapsing methods. We also note the difficulty in determining the optimal threshold for the collapsing method and the significant role that linkage disequilibrium plays in detecting rare causal SNPs. If a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, power will be much improved.
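The collapsing step for a quantitative trait can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact procedure used in the study: the function names, the toy data, and the 0.1 threshold are hypothetical, genotypes are assumed to count the minor allele (0/1/2), and the gene-level test is reduced to a simple regression slope.

```python
def collapse_rare(genotypes, maf_threshold):
    """Sum minor-allele counts over the rare SNPs of one gene.

    genotypes[j][i] is the minor-allele count of individual j at SNP i.
    A SNP is treated as rare if 0 < MAF < maf_threshold.
    Returns (burden per individual, indices of the rare SNPs).
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    mafs = [sum(g[i] for g in genotypes) / (2 * n) for i in range(n_snps)]
    rare = [i for i, f in enumerate(mafs) if 0 < f < maf_threshold]
    burden = [sum(g[i] for i in rare) for g in genotypes]
    return burden, rare

def regression_slope(x, y):
    """Least-squares slope of y on x; None if x has no variation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return None
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

# Toy gene (hypothetical): SNPs 0 and 1 are rare, SNP 2 is common.
genotypes = [[1, 0, 1], [0, 1, 1]] + [[0, 0, 1]] * 6 + [[0, 0, 0]] * 2
trait = [3.0, 3.0] + [1.0] * 8
burden, rare = collapse_rare(genotypes, maf_threshold=0.1)
print(rare, burden, regression_slope(burden, trait))
```

The choice of `maf_threshold` decides which SNPs enter the burden, which is why the result can change noticeably between thresholds, as noted above.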
We found from our analysis of the Genetic Analysis Workshop 17 data that the population structure of the 697 unrelated individuals was an important confounding factor for association studies, even if it was not explicitly considered when simulating the phenotypes. We uncovered structures beyond the reported ethnicities and found ample evidence of phenotype–population structure associations. The first 10 principal components of the genotype data of the 697 individuals demonstrated much stronger associations with Q1, Q2, and the disease than did the individuals’ ethnicities. In addition, we observed that population structure was a confounding factor for the Q1-gene association when identifying the significant genes both with and without adjusting for the causal single-nucleotide polymorphisms, the ethnicities, and the principal components. Many false discoveries remained after adjusting for the causal single-nucleotide polymorphisms. Adjusting for the principal components appeared more effective than did adjusting for ethnicity in terms of preventing false discoveries. This analysis was performed with knowledge of the causal loci.
Gene-based and single-nucleotide polymorphism (SNP) set association studies provide an important complement to SNP analysis. Kernel-based nonparametric regression has recently emerged as a powerful and flexible tool for this purpose. Our goal is to explore whether this approach can be extended to incorporate and test for interaction effects, especially for genes containing rare variant SNPs. Here, we construct nonparametric regression models that can be used to include a gene-environment interaction effect under the framework of the least-squares kernel machine and examine the performance of the proposed method on the Genetic Analysis Workshop 17 unrelated individuals data set. Two hundred simulated replicates were used to explore the power for detecting interaction. We demonstrate through a genome scan of the quantitative phenotype Q1 that the simulated gene-environment interaction effect in the data can be detected with reasonable power by using the least-squares kernel machine method.
For the family data from Genetic Analysis Workshop 17, we obtained heritability estimates of quantitative traits Q1 and Q4 using the ASSOC program in the S.A.G.E. software package. ASSOC is a family-based method that estimates heritability through the estimation of variance components. The covariate-adjusted mean heritability was 0.650 for Q1 and 0.745 for Q4. For the unrelated individuals data, we estimated the heritability of Q1 as the proportion of total variance that can be accounted for by all single-nucleotide polymorphisms under an additive model. We examined a novel ordinary least-squares method, a naïve restricted maximum-likelihood method, and a calibrated restricted maximum-likelihood method. We applied the different methods to all 200 replicates for Q1. We observed that the ordinary least-squares method yielded many estimates outside the interval [0, 1]. The restricted maximum-likelihood estimates were more stable than the ordinary least-squares estimates. The naïve restricted maximum-likelihood method yielded an average estimate of 0.462 ± 0.1, and the calibrated restricted maximum-likelihood method yielded an average of 0.535 ± 0.121. Our results demonstrate discrepancies in heritability estimates using the family data and the unrelated individuals data.
Next-generation sequencing allows for a new focus on rare variant density for conducting analyses of association to disease and for narrowing down the genomic regions that show evidence of functionality. In this study we use the 1000 Genomes Project pilot data as distributed by Genetic Analysis Workshop 17 to compare rare variant densities across seven populations. We made the comparisons using regressions of rare variants on total variant counts per gene for each population and Tajima’s D values calculated for each gene in each population, using data on 3,205 genes. We found that the populations clustered by continent for both the regression slopes and Tajima’s D values, with the African populations (Yoruba and Luhya) showing the highest density of rare variants, followed by the Asian populations (Han and Denver Chinese followed by the Japanese) and the European populations (CEPH [European-descent] and Tuscan) with the lowest densities. These significant differences in rare variant densities across populations seem to translate to measures of the rare variant density more commonly used in rare variant association analyses, suggesting the need to adjust for ancestry in such analyses. The selection signal was high for AHNAK, HLA-A, RANBP2, and RGPD4, among others. RANBP2 and RGPD4 showed a marked difference in rare variant density and potential selection between the Luhya and the other populations. This may suggest that differences between populations should be considered when delimiting genomic regions according to functionality and that these differences can create potential for disease heterogeneity.
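Tajima's D for a single gene can be computed from per-site minor-allele counts with the standard formula. The sketch below assumes complete data on `n` sampled chromosomes with `n` of at least 4; the function name and the toy inputs are hypothetical and are not taken from the study.

```python
import math

def tajimas_d(minor_counts, n):
    """Tajima's D from per-site minor-allele counts among n chromosomes.

    minor_counts holds one entry per segregating site. Returns None when
    there are no segregating sites or the variance term is not positive.
    """
    s = len(minor_counts)
    if s == 0 or n < 4:
        return None
    # Mean pairwise difference (pi), summed over sites.
    pi = sum(2 * k * (n - k) / (n * (n - 1)) for k in minor_counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * s + e2 * s * (s - 1)
    if var <= 0:
        return None
    return (pi - s / a1) / math.sqrt(var)

# Toy inputs (hypothetical): four sites among four chromosomes.
print(tajimas_d([1, 1, 1, 1], 4))  # every variant is a singleton
print(tajimas_d([2, 2, 2, 2], 4))  # every variant at intermediate frequency
```

An excess of rare variants pulls the mean pairwise difference below the value expected from the number of segregating sites, giving a negative D, so the singleton-only input yields a negative value and the intermediate-frequency input a positive one.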