In family-based association analysis, each family is typically ascertained from a single proband, which renders the effects of ascertainment bias heterogeneous among family members. This is contrary to case–control studies, and may introduce sample or ascertainment bias. Statistical efficiency is affected by ascertainment bias, and careful adjustment can lead to substantial improvements in statistical power. However, genetic association analysis has often been conducted using family-based designs, without addressing the fact that each proband in a family has had a great influence on the probability for each family member to be affected.
We propose a powerful and efficient statistic for genetic association analysis that considered the heterogeneity of ascertainment bias among family members, under the assumption that both prevalence and heritability of disease are available. With extensive simulation studies, we showed that the proposed method performed better than the existing methods, particularly for diseases with large heritability.
We applied the proposed method to the genome-wide association analysis of Alzheimer’s disease. Four significant associations with the proposed method were found.
Our significant findings illustrated the practical importance of this new analysis method.
Electronic supplementary material
The online version of this article (doi:10.1186/s12881-015-0198-6) contains supplementary material, which is available to authorized users.
Family-based association analysis; Ascertainment; Liability model
Owing to recent improvement of genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci and this successful finding has substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which have been a big hurdle to build disease prediction model. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called “large P and small N” problem. Penalized regressions including least absolute selection and shrinkage operator (LASSO) and ridge regression limit the space of parameters, and this constraint enables the estimation of effects for very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods for at least diseases under consideration.
For a family-based sample, the phenotypic variance-covariance matrix can be parameterized to include the variance of a polygenic effect that has then been estimated using a variance component analysis. However, with the advent of large-scale genomic data, the genetic relationship matrix (GRM) can be estimated and can be utilized to parameterize the variance of a polygenic effect for population-based samples. Therefore narrow sense heritability, which is both population and trait specific, can be estimated with both population- and family-based samples. In this study we estimate heritability from both family-based and population-based samples, collected in Korea, and the heritability estimates from the pooled samples were, for height, 0.60; body mass index (BMI), 0.32; log-transformed triglycerides (log TG), 0.24; total cholesterol (TCHL), 0.30; high-density lipoprotein (HDL), 0.38; low-density lipoprotein (LDL), 0.29; systolic blood pressure (SBP), 0.23; and diastolic blood pressure (DBP), 0.24. Furthermore, we found differences in how heritability is estimated—in particular the amount of variance attributable to common environment in twins can be substantial—which indicates heritability estimates should be interpreted with caution.
Many disease phenotypes are outcomes of the complicated interplay between multiple genes, and multiple phenotypes are affected by a single or multiple genotypes. Therefore, joint analysis of multiple phenotypes and multiple markers has been considered as an efficient strategy for genome-wide association analysis, and in this work we propose an omnibus family-based association test for the joint analysis of multiple genotypes and multiple phenotypes.
The proposed test can be applied for both quantitative and dichotomous phenotypes, and it is robust under the presence of population substructure, as long as large-scale genomic data is available. Using simulated data, we showed that our method is statistically more efficient than the existing methods, and the practical relevance is illustrated by application of the approach to obesity-related phenotypes.
The proposed method may be more statistically efficient than the existing methods. The application was developed in C++ and is available at the following URL: http://healthstat.snu.ac.kr/software/mfqls/.
Family-based association analysis; Multiple variants; Multiple phenotypes
Longitudinal data enables detecting the effect of aging/time, and as a repeated measures design is statistically more efficient compared to cross-sectional data if the correlations between repeated measurements are not large. In particular, when genotyping cost is more expensive than phenotyping cost, the collection of longitudinal data can be an efficient strategy for genetic association analysis. However, in spite of these advantages, genome-wide association studies (GWAS) with longitudinal data have rarely been analyzed taking this into account. In this report, we calculate the required sample size to achieve 80% power at the genome-wide significance level for both longitudinal and cross-sectional data, and compare their statistical efficiency. Furthermore, we analyzed the GWAS of eight phenotypes with three observations on each individual in the Korean Association Resource (KARE). A linear mixed model allowing for the correlations between observations for each individual was applied to analyze the longitudinal data, and linear regression was used to analyze the first observation on each individual as cross-sectional data. We found 12 novel genome-wide significant disease susceptibility loci that were then confirmed in the Health Examination cohort, as well as some significant interactions between age/sex and SNPs.
longitudinal data; cross-sectional data; Korean Association Resource (KARE) cohort; Health Examinee (HEXA) cohort
Although it has been suggested that rare coding variants could explain the substantial missing heritability, very few sequencing studies have been performed in rheumatoid arthritis (RA). We aimed to identify novel functional variants with rare to low frequency using targeted exon sequencing of RA in Korea.
We analyzed targeted exon sequencing data of 398 genes selected from a multifaceted approach in Korean RA patients (n = 1,217) and controls (n = 717). We conducted a single-marker association test and a gene-based analysis of rare variants. For meta-analysis or enrichment tests, we also used ethnically matched independent samples of Korean genome-wide association studies (GWAS) (n = 4,799) or immunochip data (n = 4,722).
After stringent quality control, we analyzed 10,588 variants of 398 genes from 1,934 Korean RA case controls. We identified 13 nonsynonymous variants with nominal association in single-variant association tests. In a meta-analysis, we did not find any novel variant with genome-wide significance for RA risk. Using a gene-based approach, we identified 17 genes with nominal burden signals. Among them, VSTM1 showed the greatest association with RA (P = 7.80 × 10−4). In the enrichment test using Korean GWAS, although the significant signal appeared to be driven by total genic variants, we found no evidence for enriched association of coding variants only with RA.
We were unable to identify rare coding variants with large effect to explain the missing heritability for RA in the current targeted resequencing study. Our study raises skepticism about exon sequencing of targeted genes for complex diseases like RA.
Electronic supplementary material
The online version of this article (doi:10.1186/s13075-014-0447-7) contains supplementary material, which is available to authorized users.
Multiple genetic factors are associated with chronic obstructive pulmonary disease (COPD). The association of gene encoding vitamin D binding protein (VDBP, GC) with COPD has been controversial. We sought to investigate the types of GC variants in the Korean population and determine the association of GC variants with COPD and lung function in the Korean population.
Materials and Methods
The study cohort consisted of 203 COPD patients and 157 control subjects. GC variants were genotyped by the restriction fragment-length polymorphism method. Repeated measures of lung function data were analyzed using a linear mixed model including sex, age, height, and pack-years of smoking to investigate the association of GC genetic factors and lung function.
GC1F variant was most frequently observed in COPD (46.1%) and controls (42.0%). GC1S variant (29.0% vs. 21.4%; p=0.020) and genotype 1S-1S (8.3% vs. 3.4%; p=0.047) were more commonly detected in control than COPD. According to linear mixed model analysis including controls and COPD, subjects with genotype 1S-1S had 0.427 L higher forced expiratory volume in 1 second (FEV1) than those with other genotypes (p=0.029). However, interaction between the genotype and smoking pack-year was found to be particularly significant among subjects with genotype 1S-1S; FEV1 decreased by 0.014 L per smoking pack-year (p=0.001).
This study suggested that GC polymorphism might be associated with lung function and risk of COPD in Korean population. GC1S variant and genotype 1S-1S were more frequently observed in control than in COPD. Moreover, GC1S variant was more common in non-decliners than in rapid decliners among COPD.
Vitamin D binding protein; polymorphism; lung function; chronic obstructive pulmonary disease
Genome-wide association studies have identified many genetic loci associated with blood pressure (BP). Genetic effects on BP can be altered by environmental exposures via multiple biological pathways. Especially, obesity is one of important environmental risk factors that can have considerable effect on BP and it may interact with genetic factors. Given that, we aimed to test whether genetic factors and obesity may jointly influence BP.
We performed meta-analyses of genome-wide association data for systolic blood pressure (SBP) and diastolic blood pressure (DBP) that included analyses of interaction between single nucleotide polymorphisms (SNPs) and the obesity-related anthropometric measures, body mass index (BMI), height, weight, and waist/hip ratio (WHR) in East-Asians (n = 12,030).
We identified that rs13390641 on 2q12.1 demonstrated significant association with SBP when the interaction between SNPs and BMI was considered (P < 5 × 10 -8). The gene located nearest to rs13390641, TMEM182, encodes transmembrane protein 182. In stratified analyses, the effect of rs13390641 on BP was much stronger in obese individuals (BMI ≥ 30) than non-obese individuals and the effect of BMI on BP was strongest in individuals with the homozygous A allele of rs13390641.
Our analyses that included interactions between SNPs and environmental factors identified a genetic variant associated with BP that was overlooked in standard analyses in which only genetic factors were included. This result also revealed a potential mechanism that integrates genetic factors and obesity related traits in the development of high BP.
Blood pressure; Genome-wide scan; Gene-environment interaction; Meta-analysis; Obesity
For the meta-analysis of genome-wide association studies, we propose a new method to adjust for the population stratification and a linear mixed approach that combines family-based and unrelated samples. The proposed approach achieves similar power levels as a standard meta-analysis which combines the different test statistics or p values across studies. However, by virtue of its design, the proposed approach is robust against population admixture and stratification, and no adjustments for population admixture and stratification, even in unrelated samples, are required. Using simulation studies, we examine the power of the proposed method and compare it to standard approaches in the meta-analysis of genome-wide association studies. The practical features of the approach are illustrated with a meta-analysis of three genome-wide association studies for Alzheimer's disease. We identify three single nucleotide polymorphisms showing significant genome-wide association with affection status. Two single nucleotide polymorphisms are novel and will be verified in other populations in our follow-up study.
Meta-analysis; Genome-wide study; Population stratification
Motivation: For the analysis of rare variants in sequence data, numerous approaches have been suggested. Fixed and flexible threshold approaches collapse the rare variant information of a genomic region into a test statistic with reduced dimensionality. Alternatively, the rare variant information can be combined in statistical frameworks that are based on suitable regression models, machine learning, etc. Although the existing approaches provide powerful tests that can incorporate information on allele frequencies and prior biological knowledge, differences in the spatial clustering of rare variants between cases and controls cannot be incorporated. Based on the assumption that deleterious variants and protective variants cluster or occur in different parts of the genomic region of interest, we propose a testing strategy for rare variants that builds on spatial cluster methodology and that guides the identification of the biological relevant segments of the region. Our approach does not require any assumption about the directions of the genetic effects.
Results: In simulation studies, we assess the power of the clustering approach and compare it with existing methodology. Our simulation results suggest that the clustering approach for rare variants is well powered, even in situations that are ideal for standard methods. The efficiency of our spatial clustering approach is not affected by the presence of rare variants that have opposite effect size directions. An application to a sequencing study for non-syndromic cleft lip with or without cleft palate (NSCL/P) demonstrates its practical relevance. The proposed testing strategy is applied to a genomic region on chromosome 15q13.3 that was implicated in NSCL/P etiology in a previous genome-wide association study, and its results are compared with standard approaches.
Availability: Source code and documentation for the implementation in R will be provided online. Currently, the R-implementation only supports genotype data. We currently are working on an extension for VCF files.
Cachexia, whether assessed by body mass index (BMI) or fat-free mass index (FFMI), affects a significant proportion of patients with chronic obstructive pulmonary disease (COPD), and is an independent risk factor for increased mortality, increased emphysema, and more severe airflow obstruction. The variable development of cachexia among patients with COPD suggests a role for genetic susceptibility. The objective of the present study was to determine genetic susceptibility loci involved in the development of low BMI and FFMI in subjects with COPD. A genome-wide association study (GWAS) of BMI was conducted in three independent cohorts of European descent with Global Initiative for Chronic Obstructive Lung Disease stage II or higher COPD: Evaluation of COPD Longitudinally to Identify Predictive Surrogate End-Points (ECLIPSE; n = 1,734); Norway-Bergen cohort (n = 851); and a subset of subjects from the National Emphysema Treatment Trial (NETT; n = 365). A genome-wide association of FFMI was conducted in two of the cohorts (ECLIPSE and Norway). In the combined analyses, a significant association was found between rs8050136, located in the first intron of the fat mass and obesity–associated (FTO) gene, and BMI (P = 4.97 × 10−7) and FFMI (P = 1.19 × 10−7). We replicated the association in a fourth, independent cohort consisting of 502 subjects with COPD from COPDGene (P = 6 × 10−3). Within the largest contributing cohort of our analysis, lung function, as assessed by forced expiratory volume at 1 second, varied significantly by FTO genotype. Our analysis suggests a potential role for the FTO locus in the determination of anthropomorphic measures associated with COPD.
chronic obstructive pulmonary disease genetics; chronic obstructive pulmonary disease epidemiology; chronic obstructive pulmonary disease metabolism; genome-wide association study
We propose a new approach for the analysis of copy number variants (CNVs)for genome-wide association studies in family-based designs. Our new overall association test combines the between-family component and the within-family component of the data so that the new test statistic is fully efficient and, at the same time, achieves the complete robustness against population-admixture and stratification, as classical family-based association tests that are based only on the between-family component. Although all data are incorporated into the test statistic, an adjustment for genetic confounding is not needed, not even for the between-family component. The new test statistic is valid for testing either quantitative or dichotomous phenotypes. If external CNV data are available, the approach can also be used in completely ascertained samples. Similar to the approach by Ionita-Laza et al.(1), the proposed test statistic does not required a CNV-calling algorithm and is based directly on the CNV probe intensity data. We show, via simulation studies, that our methodology increases the power of the FBAT statistic to levels comparable to those of population-based designs. The advantages of the approach in practice are demonstrated by an application to a genome-wide association study for body mass index (BMI).
For genome-wide association studies in family-based designs, a new, universally applicable approach is proposed. Using a modified Liptak’s method, we combine the p-value of the family-based association test (FBAT) statistic with the p-value for the Van Steen-statistic. The Van Steen-statistic is independent of the FBAT-statistic and utilizes information that is ignored by traditional FBAT-approaches. The new test statistic takes advantages of all available information about the genetic association, while, by virtue of its design, it achieves complete robustness against confounding due to population stratification. The approach is suitable for the analysis of almost any trait type for which FBATs are available, e.g. binary, continuous, time to-onset, multivariate, etc. The efficiency and the validity of the new approach depend on the specification of a nuisance/tuning parameter and the weight parameters in the modified Liptak’s method. For different trait types and ascertainment conditions, we discuss general guidelines for the optimal specification of the tuning parameter and the weight parameters. Our simulation experiments and an application to an Alzheimer study show the validity and the efficiency of the new method, which achieves power levels that are comparable to those of population-based approaches.
FBAT; Liptak’s method; Tuning parameter
Next-generation sequencing has great potential for application in bacterial transcriptomics. However, unlike eukaryotes, bacteria have no clear mechanism to select mRNAs over rRNAs; therefore, rRNA removal is a critical step in sequencing-based transcriptomics. Duplex-specific nuclease (DSN) is an enzyme that, at high temperatures, degrades duplex DNA in preference to single-stranded DNA. DSN treatment has been successfully used to normalize the relative transcript abundance in mRNA-enriched cDNA libraries from eukaryotic organisms. In this study, we demonstrate the utility of this method to remove rRNA from prokaryotic total RNA. We evaluated the efficacy of DSN to remove rRNA by comparing it with the conventional subtractive hybridization (Hyb) method. Illumina deep sequencing was performed to obtain transcriptomes from Escherichia coli grown under four growth conditions. The results clearly showed that our DSN treatment was more efficient at removing rRNA than the Hyb method was, while preserving the original relative abundance of mRNA species in bacterial cells. Therefore, we propose that, for bacterial mRNA-seq experiments, DSN treatment should be preferred to Hyb-based methods.
Current ongoing genome-wide association studies represent a powerful approach to uncover common unknown genetic variants causing common complex diseases. The discovery of these genetic variants offers an important opportunity for early disease prediction, prevention and individualized treatment. We describe here a method of combining multiple genetic variants for early disease prediction, based on the optimality theory of the likelihood ratio. Such theory simply shows that the receiver operating characteristic (ROC) curve based on the likelihood ratio (LR) has maximum performance at each cutoff point and that the area under the ROC curve (AUC) so obtained is highest among that of all approaches. Through simulations and a real data application, we compared it with the commonly used logistic regression and classification tree approaches. The three approaches show similar performance if we know the underlying disease model. However, for most common diseases we have little prior knowledge of the disease model and in this situation the new method has an advantage over logistic regression and classification tree approaches. We applied the new method to the Type 1 diabetes genome-wide association data from the Wellcome Trust Case Control Consortium. Based on five single nucleotide polymorphisms (SNPs), the test reaches medium level classification accuracy. With more genetic findings to be discovered in the future, we believe a predictive genetic test for Type 1 diabetes can be successfully constructed and eventually implemented for clinical use.
Backward clustering; Classification tree; Cross validation; Logistic regression
In case-control Single Nucleotide Polymorphism (SNP) data, the Allele frequency, Hardy Weinberg Disequilibrium (HWD) and Linkage Disequilibrium (LD) contrast tests are three distinct sources of information about genetic association. While all three tests are typically developed in a retrospective context, we show that prospective logistic regression models may be developed that correspond conceptually to the retrospective tests. This approach provides a flexible framework for conducting a systematic series of association analyses using unphased genotype data and any number of covariates. For a single stage study, two single-marker tests and four two-marker tests are discussed. The true association models are derived and they allow us to understand why a model with only a linear term will generally fit well for a SNP in weak LD with a causal SNP, whatever the disease model, but not for a SNP in high LD with a non-additive disease SNP. We investigate the power of the association tests using real LD parameters from chromosome 11 in the HapMap CEU population data. Among the single-marker tests, the allelic test has on average the most power in the case of an additive disease; but, for dominant, recessive and heterozygote disadvantage diseases, the genotypic test has the most power. Among the six two-marker tests, the Allelic-LD contrast test, which incorporates linear terms for two markers and their interaction term, provides the most reliable power overall for the cases studied. Therefore, our result supports incorporating an interaction term as well as linear terms in multi-marker tests.
Allele frequency contrast test; LD contrast test; HWD contrast test; Genome-wide Association
Substantial evidence suggests that there is genetic susceptibility to chronic obstructive pulmonary disease (COPD). To identify common genetic risk variants, we performed a genome-wide association study in 2940 cases and 1380 smoking controls with normal lung function. We demonstrate a novel susceptibility locus at 4q22.1 in FAM13A (rs7671167, OR=0.76, P=8.6×10−8) and provide evidence of replication in one case-control and two family-based cohorts (for all studies, combined P=1.2×10−11).
The possible evidence for association comprises three types of information: differences between cases and controls in allele frequencies, in parameters for Hardy Weinberg disequilibrium (HWD), and in parameters for linkage disequilibrium (LD). LD between marker and disease alleles results in a difference in at least one of the three types of parameters [Won and Elston, 2008]. However, the parameters for LD require knowledge about phase, which is usually unknown, making the LD contrast test without modification infeasible in practice. Methods for handling phase uncertainty are: (1) the most probable haplotype pair for each individual can be considered as the true phase; (2) a weighted average of haplotypes can be used; (3) we can consider the composite LD, which does not require any information about phase. We compare these methods to handle phase uncertainty in terms of validity and efficiency, and the effect on them of HWD in the population, at the same time confirming results for the three types of information. When the LD between markers is high, the LD contrast test that uses a weighted average of haplotypes or the most probable haplotypes to calculate the LD is recommended, but otherwise the LD contrast test that uses the composite LD is recommended. We conclude that, even though the difference in allele frequencies is usually the most informative test except in the case of a recessive disease, the LD contrast test can be more powerful if the markers are dense enough.
linkage disequilibrium; haplotype phase; self replication
While recently performed genome-wide association studies have advanced the identification of genetic variants predisposing to type 2 diabetes (T2D), the potential application of these novel findings for disease prediction and prevention has not been well studied. Diabetes prediction and prevention have become urgent issues owing to the rapidly increasing prevalence of diabetes and its associated mortality, morbidity, and health care cost. New prediction approaches using genetic markers could facilitate early identification of high risk sub-groups of the population so that appropriate prevention methods could be effectively applied to delay, or even prevent, disease onset.
This paper assessed 18 recently identified T2D loci for their potential role in diabetes prediction. We built a new predictive genetic test for T2D using the Framingham Heart Study dataset. Using logistic regression and 15 additional loci, the new test was slightly improved over the existing test using just three loci. A formal comparison between the two tests suggests no significant improvement. We further formed a predictive genetic test for identifying early onset T2D and found higher classification accuracy for this test, not only indicating that these 18 loci have great potential for predicting early onset T2D, but also suggesting that they may play important roles in causing early-onset T2D.
To further improve the test's accuracy, we applied a newly developed nonparametric method capable of capturing high order interactions to the data, but it did not outperform a logistic regression that only considers single-locus effects. This could be explained by the absence of gene-gene interactions among the 18 loci.
For genome-wide association studies in family-based designs, we propose a new, universally applicable approach. The new test statistic exploits all available information about the association, while, by virtue of its design, it maintains the same robustness against population admixture as traditional family-based approaches that are based exclusively on the within-family information. The approach is suitable for the analysis of almost any trait type, e.g. binary, continuous, time-to-onset, multivariate, etc., and combinations of those. We use simulation studies to verify all theoretically derived properties of the approach, estimate its power, and compare it with other standard approaches. We illustrate the practical implications of the new analysis method by an application to a lung-function phenotype, forced expiratory volume in one second (FEV1) in 4 genome-wide association studies.
In genome-wide association studies, the multiple testing problem and confounding due to population stratification have been intractable issues. Family-based designs have considered only the transmission of genotypes from founder to nonfounder to prevent sensitivity to the population stratification, which leads to the loss of information. Here we propose a novel analysis approach that combines mutually independent FBAT and screening statistics in a robust way. The proposed method is more powerful than any other, while it preserves the complete robustness of family-based association tests, which only achieves much smaller power level. Furthermore, the proposed method is virtually as powerful as population-based approaches/designs, even in the absence of population stratification. By nature of the proposed method, it is always robust as long as FBAT is valid, and the proposed method achieves the optimal efficiency if our linear model for screening test reasonably explains the observed data in terms of covariance structure and population admixture. We illustrate the practical relevance of the approach by an application in 4 genome-wide association studies.
Fisher  was the first to suggest a method of combining the p-values obtained from several statistics and many other methods have been proposed since then. However, there is no agreement about what is the best method. Motivated by a situation that now often arises in genetic epidemiology, we consider the problem when it is possible to define a simple alternative hypothesis of interest for which the expected effect size of each test statistic is known and we determine the most powerful test for this simple alternative hypothesis. Based on the proposed method, we show that information about the effect sizes can be used to obtain the best weights for Liptak’s method of combining p-values. We present extensive simulation results comparing methods of combining p-values and illustrate for a real example in genetic epidemiology how information about effect sizes can be deduced.
Fisher; Liptak; effect size
Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), is an enduring public health problem globally, particularly in sub-Saharan Africa. Several studies have suggested a role for host genetic susceptibility in increased risk for TB but results across studies have been equivocal. As part of a household contact study of Mtb infection and disease in Kampala, Uganda, we have taken a unique approach to the study of genetic susceptibility to TB, by studying three phenotypes. First, we analyzed culture confirmed TB disease compared to latent Mtb infection (LTBI) or lack of Mtb infection. Second, we analyzed resistance to Mtb infection in the face of continuous exposure, defined by a persistently negative tuberculin skin test (PTST-); this outcome was contrasted to LTBI. Third, we analyzed an intermediate phenotype, tumor necrosis factor-alpha (TNFα) expression in response to soluble Mtb ligands enriched with molecules secreted from Mtb (culture filtrate). We conducted a full microsatellite genome scan, using genotypes generated by the Center for Medical Genetics at Marshfield. Multipoint model-free linkage analysis was conducted using an extension of the Haseman-Elston regression model that includes half sibling pairs, and HIV status was included as a covariate in the model. The analysis included 803 individuals from 193 pedigrees, comprising 258 full sibling pairs and 175 half sibling pairs. Suggestive linkage (p<10−3) was observed on chromosomes 2q21-2q24 and 5p13-5q22 for PTST-, and on chromosome 7p22-7p21 for TB; these findings for PTST- are novel and the chromosome 7 region contains the IL6 gene. In addition, we replicated recent linkage findings on chromosome 20q13 for TB (p = 0.002). We also observed linkage at the nominal α = 0.05 threshold to a number of promising candidate genes, SLC11A1 (PTST- p = 0.02), IL-1 complex (TB p = 0.01), IL12BR2 (TNFα p = 0.006), IL12A (TB p = 0.02) and IFNGR2 (TNFα p = 0.002). These results confirm not only that genetic factors influence the interaction between humans and Mtb but more importantly that they differ according to the outcome of that interaction: exposure but no infection, infection without progression to disease, or progression of infection to disease. Many of the genetic factors for each of these stages are part of the innate immune system.
Among the various linkage-disequilibrium (LD) fine-mapping methods, two broad classes have received considerable development recently: those based on coalescent theory and those based on haplotype clustering. Using Genetic Analysis Workshop 15 Problem 3 simulated data, the ability of these two classes to localize the causal variation were compared. Our results suggest that a haplotype-clustering-based approach performs favorably, while at the same time requires much less computing than coalescent-based approaches. Further, we observe that 1) when marker density is low, a set of equally spaced single-nucleotide polymorphisms (SNPs) provides better localization than a set of tagging SNPs of equal number; 2) denser sets of SNPs generally lead to better localization, but the benefit diminishes beyond a certain density; 3) larger sample size may do more harm than good when poor selection of markers results in biased LD patterns around the disease locus. These results are explained by how well the set of selected markers jointly approximates the expected LD pattern around a disease locus.
Clustering of related haplotypes in haplotype-based association mapping has the potential to improve power by reducing the degrees of freedom without sacrificing important information about the underlying genetic structure. We have modified a generalized linear model approach for association analysis by incorporating a density-based clustering algorithm to reduce the number of coefficients in the model. Using the GAW 15 Problem 3 simulated data, we show that our novel method can substantially enhance power to detect association with the binary rheumatoid arthritis (RA) phenotype at the HLA-DRB1 locus on chromosome 6. In contrast, clustering did not appreciably improve performance at locus D, perhaps a consequence of a rare susceptibility allele and of the overwhelming effect of HLA-DRB1/locus C, 5 cM distal. Optimization of parameters governing the clustering algorithm identified a set of parameters that delivered nearly ideal performance in a variety of situations. The cluster-based score test was valid over a wide range of haplotype diversity, and was robust to severe departures from Hardy-Weinberg equilibrium encountered near HLA-DRB1 in RA case-control samples.