We propose an omnibus family-based association test (MFBAT), that can be applied to multiple markers and multiple phenotypes and that has only 1 degree of freedom. The proposed test statistic extends current FBAT methodology to incorporate multiple markers as well as multiple phenotypes. Using simulation studies, power estimates for the proposed methodology are compared with the standard methodologies. Based on these simulations, we find that MFBAT substantially outperforms other methods including some haplotypic approaches and doing multiple tests with single SNPs and single phenotypes. The practical relevance of the approach is illustrated by an application to asthma where SNPs/phenotype combinations are identified and reach overall significance that would not have been identified using other approaches. This methodology is directly applicable to cases where there are multiple SNPs, such as candidate gene studies, cases where there are multiple phenotypes, such as expression data, and cases where there are multiple phenotypes and genotypes, such as genome-wide association studies that incorporate expression profiles as phenotypes. This program is available in the PBAT analysis package1.
Family-based association testing (FBAT); genome-wide association studies; FBAT-PC; multiple marker; multiple phenotypes; multiple testing
For a dense set of genetic markers such as single nucleotide polymorphisms (SNPs) on high linkage disequilibrium within a small candidate region, a haplotype-based approach for testing association between a disease phenotype and the set of markers is attractive in reducing the data complexity and increasing the statistical power. However, due to unknown status of the underlying disease variant, a comprehensive association test may require consideration of various combinations of the SNPs, which often leads to severe multiple testing problems. In this paper, we propose a latent variable approach to test for association of multiple tightly linked SNPs in case-control studies. First, we introduce a latent variable into the penetrance model to characterize a putative disease susceptible locus (DSL) that may consist of a marker allele, a haplotype from a subset of the markers, or an allele at a putative locus between the markers. Next, through using of a retrospective likelihood to adjust for the case-control sampling ascertainment and appropriately handle the Hardy-Weinberg equilibrium constraint, we develop an expectation-maximization (EM)-based algorithm to fit the penetrance model and estimate the joint haplotype frequencies of the DSL and markers simultaneously. With the latent variable to describe a flexible role of the DSL, the likelihood ratio statistic can then provide a joint association test for the set of markers without requiring an adjustment for testing of multiple haplotypes. Our simulation results also reveal that the latent variable approach may have improved power under certain scenarios comparing with classical haplotype association methods.
haplotype association; retrospective likelihood; latent variable; logistic mixture model; EM algorithm
Complex diseases are believed to be the results of many genes and environmental factors. Hence, multi-marker methods that can use the information of markers from different genes are appropriate for mapping complex disease genes. There already have been several multi-marker methods proposed for case-control studies. In this article, we propose a multi-marker test called a Multi-marker Pedigree Disequilibrium Test (MPDT) to analyze family data from genome-wide association studies. If the parental phenotypes are available, we also propose a two-stage test in which a genomic screening test is used to select SNPs, and then the MPDT is used to test the association of the selected SNPs.
We use simulation studies to evaluate the performance of the MPDT and the two-stage approach. The results show that the MPDT constantly outperforms the single marker transmission/disequilibrium test (TDT) . Comparing the power of the two-stage approach with that of the one-stage approach, which approach is more powerful depends on the value of the prevalence; when the prevalence is no less than 10%, the two-stage approach may be more powerful than the one-stage approach. Otherwise, the one-stage approach is more powerful.
The proposed MPDT, is more powerful than the single marker TDT. When the parental phenotypes are available and the prevalence is no less than 10%, the proposed two-stage approach is more powerful than the one-stage approach.
Several family-based approaches have been previously proposed to enhance the power for testing genetic association when the traits are measured longitudinally or repeatedly. In this paper, we show that some of these FBAT approaches can be easily extended to accommodate incomplete data and remain unbiased tests. We also show that because of the nature of FBAT approaches, we can impute the missing phenotypes without biasing our tests and achieve higher power. We propose two imputation techniques based on E-M algorithm and the conditional mean model, respectively. Through simulation studies, these two imputation techniques are shown to have correct false positive rate and generally achieve higher power than complete case analysis or simple mean-imputation. Application of these approaches for testing an association between Body Mass Index and a previously reported candidate SNP confirms our results.
FBAT; Longitudinal Phenotype; Missing Data
SNAP25 occurs on chromosome 20p12.2, which has been linked to schizophrenia in some samples, and recently linked to latent classes of psychotic illness in our sample. SNAP25 is crucial to synaptic functioning, may be involved in axonal growth and dendritic sprouting, and its expression may be decreased in schizophrenia. We genotyped 18 haplotype-tagging SNPs in SNAP25 in a sample of 270 Irish high-density families. Single marker and haplotype analyses were performed in FBAT and PDT. We adjusted for multiple testing by computing q values. Association was followed up in an independent sample of 657 cases and 411 controls. We tested for allelic effects on the clinical phenotype by using the method of sequential addition and 5 factor-derived scores of the OPCRIT. Nine of 18 SNPs had P values <0.05 in either FBAT or PDT for one or more definitions of illness. Several two-marker haplotypes were also associated. Subjects inheriting the risk alleles of the most significantly associated two-marker haplotype were likely to have higher levels of hallucinations and delusions. The most significantly associated marker, rs6039820, was observed to perturb 12 transcription-factor binding sites in in silico analyses. An attempt to replicate association findings in the case–control sample resulted in no SNPs being significantly associated. We observed robust association in both single marker and haplotype-based analyses between SNAP25 and schizophrenia in an Irish family sample. Although we failed to replicate this in an independent sample, this gene should be further tested in other samples.
schizophrenia; SNAP25; polymorphism; association; clinical features
Alcoholism is a complex disease. As with other common diseases, genetic variants underlying alcoholism have been illusive, possibly due to the small effect from each individual susceptible variant, gene × environment and gene × gene interactions and complications in phenotype definition. We conducted association tests, the family-based association tests (FBAT) and the backward haplotype transmission association (BHTA), on the Collaborative Study of the Genetics of Alcoholism (COGA) data provided by Genetic Analysis Workshop (GAW) 14. Efron's local false discovery rate method was applied to control the proportion of false discoveries. For FBAT, we compared the results based on different types of genetic markers (single-nucleotide polymorphisms (SNPs) versus microsatellites) and different phenotype definitions (clinical diagnoses versus electrophysiological phenotypes). Significant association results were found only between SNPs and clinical diagnoses. In contrast, significant results were found only between microsatellites and electrophysiological phenotypes. In addition, we obtained the association results for SNPs and microsatellites using COGA diagnosis as phenotype based on BHTA. In this case, the results for SNPs and microsatellites are more consistent. Compared to FBAT, more significant markers are detected with BHTA.
Genome-wide association studies have been extensively conducted, searching for markers for biologically meaningful outcomes and phenotypes. Penalization methods have been adopted in the analysis of the joint effects of a large number of SNPs (single nucleotide polymorphisms) and marker identification. This study is partly motivated by the analysis of heterogeneous stock mice dataset, in which multiple correlated phenotypes and a large number of SNPs are available. Existing penalization methods designed to analyze a single response variable cannot accommodate the correlation among multiple response variables. With multiple response variables sharing the same set of markers, joint modeling is first employed to accommodate the correlation. The group Lasso approach is adopted to select markers associated with all the outcome variables. An efficient computational algorithm is developed. Simulation study and analysis of the heterogeneous stock mice dataset show that the proposed method can outperform existing penalization methods.
Testing multiple markers simultaneously not only can capture the linkage disequilibrium patterns but also can decrease the number of tests and thus alleviate the multiple-testing penalty. If a gene is associated with a phenotype, subjects with similar genotypes in this gene should also have similar phenotypes. Based on this concept, we have developed a general framework that is applicable to continuous traits. Two similarity-based tests (namely, SIMc and SIMp tests) were derived as special cases of the general framework. In our simulation study, we compared the power of the two tests with that of the single-marker analysis, a standard haplotype regression, and a popular and powerful kernel machine regression. Our SIMc test outperforms other tests when the average r-square (a measure of linkage disequilibrium) between the causal variant and the surrounding markers is larger than 0.3 or when the causal allele is common (say, frequency = 0.3). Our SIMp test outperforms other tests when the causal variant was introduced at common haplotypes (the maximum frequency of risk haplotypes > 0.4). We also applied our two tests to an adiposity data set to show their utility.
Haplotype; Similarity; Genomic distance; Linkage disequilibrium; Multi-marker test; Body-mass index; CPE gene
Expression QTL mapping by integrating genome-wide gene expression and genotype data is a promising approach for identifying functional genetic variation, but is hampered by the large number of multiple comparisons inherent to such studies. A novel approach for addressing multiple testing problems in genome-wide family-based association studies is screening candidate markers using heritability or conditional power. We apply these methods for the setting in which microarray gene expression data are used as phenotypes, screening for SNPs near the expressed genes. We perform association analyses for phenotypes using a univariate approach. Simulations were also performed on trios with large numbers of causal SNPs to determine the optimal number of markers to use in a screen. We demonstrate that our family-based screening approach performs well in the analysis of integrative genomic datasets, and that screening using either heritability or conditional power produce similar, though not identical, results.
gene expression; association study; SNP; power; heritability; screening; multiple testing
Recently, a genomic distance-based regression for multilocus associations was proposed (Wessel and Schork  Am. J. Hum. Genet. 79:792–806) in which either locus or haplotype scoring can be used to measure genetic distance. Although it allows various measures of genomic similarity and simultaneous analyses of multiple phenotypes, its power relative to other methods for case-control analyses is not well known. We compare the power of traditional methods with this new distance-based approach, for both locus-scoring and haplotype-scoring strategies. We discuss the relative power of these association methods with respect to five properties: (1) the marker informativity; (2) the number of markers; (3) the causal allele frequency; (4) the preponderance of the most common high-risk haplotype; (5) the correlation between the causal single-nucleotide polymorphism (SNP) and its flanking markers. We found that locus-based logistic regression and the global score test for haplotypes suffered from power loss when many markers were included in the analyses, due to many degrees of freedom. In contrast, the distance-based approach was not as vulnerable to more markers or more haplotypes. A genotype counting measure was more sensitive to the marker informativity and the correlation between the causal SNP and its flanking markers. After examining the impact of the five properties on power, we found that on average, the genomic distance-based regression that uses a matching measure for diplotypes was the most powerful and robust method among the seven methods we compared.
canonical correlation; genomic distance; genotyping errors; linkage disequilibrium
The availability of a large number of dense SNPs, high-throughput genotyping and computation methods promotes the application of family-based association tests. While most of the current family-based analyses focus only on individual traits, joint analyses of correlated traits can extract more information and potentially improve the statistical power. However, current TDT-based methods are low-powered. Here, we develop a method for tests of association for bivariate quantitative traits in families. In particular, we correct for population stratification by the use of an integration of principal component analysis and TDT. A score test statistic in the variance-components model is proposed. Extensive simulation studies indicate that the proposed method not only outperforms approaches limited to individual traits when pleiotropic effect is present, but also surpasses the power of two popular bivariate association tests termed FBAT-GEE and FBAT-PC, respectively, while correcting for population stratification. When applied to the GAW16 datasets, the proposed method successfully identifies at the genome-wide level the two SNPs that present pleiotropic effects to HDL and TG traits.
In this paper, we develop a powerful test for identifying SNP-sets that are predictive of survival with data from genome-wide association studies (GWAS). We first group typed SNPs into SNP-sets based on genomic features and then apply a score test to assess the overall effect of each SNP-set on the survival outcome through a kernel machine Cox regression framework. This approach uses genetic information from all SNPs in the SNP-set simultaneously and accounts for linkage disequilibrium (LD), leading to a powerful test with reduced degrees of freedom when the typed SNPs are in LD with each other. This type of test also has the advantage of capturing the potentially non-linear effects of the SNPs, SNP-SNP interactions (epistasis), and the joint effects of multiple causal variants. By simulating SNP data based on the LD structure of real genes from the HapMap project, we demonstrate that our proposed test is more powerful than the standard single SNP minimum p-value based test for association studies with censored survival outcomes. We illustrate the proposed test with a real data application.
cox model; genetic studies; gene-based analysis; kernel machine; multi-locus test; score test; single nucleotide polymorphism
Linkage disequilibrium (LD)-based association mapping is often performed by analyzing either individual SNPs or block-based multi-SNP haplotypes. Sliding windows of several fixed sizes (in terms of SNP numbers) were also applied to a few simulated or real data sets. In comparison, exhaustively testing based on variable sized sliding windows (VSW) of all possible sizes of SNPs over a genomic region has the best chance to capture the optimum markers (single SNPs or haplotypes) that are most significantly associated with the traits under study. However, the cost is the increased number of multiple tests and computation. Here a strategy of VSW of all possible sizes is proposed and its power is examined, in comparison with those using only haplotype blocks (BLK) or single SNP loci (SGL) tests. Critical values for statistical significance testing that account for multiple testing are simulated. We demonstrated that, over a wide range of parameters simulated, VSW increased power for the detection of disease variants by ∼1-15% over the BLK and SGL approaches. The improved performance was more significant in regions with high recombination rates. In an empirical data set, VSW obtained the most significant signal and identified the LRP5 gene as strongly associated with osteoporosis. With the use of computational techniques such as parallel algorithms and clustering computing, it is feasible to apply VSW to large genomic regions or those regions preliminarily identified by traditional SGL/BLK methods.
sliding window; association mapping; statistical power
Linkage disequilibrium (LD)-based association mapping is often performed by analyzing either individual SNPs or block-based multi-SNP haplotypes. Sliding windows of several fixed sizes (in terms of SNP numbers) were also applied to a few simulated or real data sets. In comparison, exhaustively testing based on variable-sized sliding windows (VSW) of all possible sizes of SNPs over a genomic region has the best chance to capture the optimum markers (single SNPs or haplotypes) that are most significantly associated with the traits under study. However, the cost is the increased number of multiple tests and computation. Here, a strategy of VSW of all possible sizes is proposed and its power is examined, in comparison with those using only haplotype blocks (BLK) or single SNP loci (SGL) tests. Critical values for statistical significance testing that account for multiple testing are simulated. We demonstrated that, over a wide range of parameters simulated, VSW increased power for the detection of disease variants by ∼1–15% over the BLK and SGL approaches. The improved performance was more significant in regions with high recombination rates. In an empirical data set, VSW obtained the most significant signal and identified the LRP5 gene as strongly associated with osteoporosis. With the use of computational techniques such as parallel algorithms and clustering computing, it is feasible to apply VSW to large genomic regions or those regions preliminarily identified by traditional SGL/BLK methods.
sliding window; association mapping; statistical power
Motivation: Most complex diseases involve multiple genes and their interactions. Although genome-wide association studies (GWAS) have shown some success for identifying genetic variants underlying complex diseases, most existing studies are based on limited single-locus approaches, which detect single nucleotide polymorphisms (SNPs) essentially based on their marginal associations with phenotypes.
Results: In this article, we propose an ensemble approach based on boosting to study gene–gene interactions. We extend the basic AdaBoost algorithm by incorporating an intuitive importance score based on Gini impurity to select candidate SNPs. Permutation tests are used to control the statistical significance. We have performed extensive simulation studies using three interaction models to evaluate the efficacy of our approach at realistic GWAS sizes, and have compared it with existing epistatic detection algorithms. Our results indicate that our approach is valid, efficient for GWAS and on disease models with epistasis has more power than existing programs.
Genome-wide association studies raise study-design and analytical issues that are still being debated. Among them, stands the issue of reducing the number of markers to be genotyped without loss of efficiency in identifying trait loci, which can reduce the cost of studies and minimize the multiple testing problem. With this aim, we proposed a two-step strategy based on two analytical methods suited to examine sets of markers rather than single markers: the local score, which screens the genome to select candidate regions in Step 1, and FBAT-LC, a multiple-marker family-based association test used to obtain significance levels of regions at step 2. The performance of this strategy was evaluated on all replicates of Genetic Analysis Workshop 15 Problem 3 simulated data, using the answers to that problem. Overall, seven of the nine generated trait loci were detected in at least 87% of the replicates using a framework designed to handle either association with the disease or association with the severity of disease. This multiple-marker strategy was compared to the single-marker approach. By considering regions instead of single markers, this strategy minimizes the multiple testing problem and the number of false-positive results.
Multiple testing is a problem in genome-wide or region-wide association studies. In this report, we consider a study design given by the Genetic Analysis Workshop 15 (GAW15) Problem 3 – nuclear families (parents with their affected children) and unrelated controls. Based on this design, we propose three two-stage approaches to deal with the problem of multiple testing. The tests in the first stage, statistically independent of the association test used in the second stage, are used to screen or select single-nucleotide polymorphisms (SNPs). Then, in the second stage, a family-based association test is performed on a much smaller set of selected SNPs. Thus, the problem of multiple testing is much less severe. Our simulation studies and application to the dense SNP data of chromosome 6 in the GAW15 Problem 3 show that the two-stage methods are more powerful than the one-stage method (using the family-based association test only).
Recent studies have shown that quantitative phenotypes may be influenced not only by multiple single nucleotide polymorphisms (SNPs) within a gene but also by the interaction between SNPs at unlinked genes. We propose a new statistical approach that can detect gene-gene interactions at the allelic level which contribute to the phenotypic variation in a quantitative trait. By testing for the association of allelic combinations at multiple unlinked loci with a quantitative trait, we can detect the SNP allelic interaction whether or not it can be detected as a main effect. Our proposed method assigns a score to unrelated subjects according to their allelic combination inferred from observed genotypes at two or more unlinked SNPs, and then tests for the association of the allelic score with a quantitative trait. To investigate the statistical properties of the proposed method, we performed a simulation study to estimate type I error rates and power and demonstrated that this allelic approach achieves greater power than the more commonly used genotypic approach to test for gene-gene interaction. As an example, the proposed method was applied to data obtained as part of a candidate gene study of sodium retention by the kidney. We found that this method detects an interaction between the calcium-sensing receptor gene (CaSR), the chloride channel gene (CLCNKB) and the Na, K, 2Cl cotransporter gene (CLC12A1) that contributes to variation in diastolic blood pressure.
quantitative trait loci; allelic test; interaction effect; blood pressure
We introduce a stepwise approach for family-based designs for selecting a set of markers in a gene that are independently associated with the disease. The approach is based on testing the effect of a set of markers conditional on another set of markers. Several likelihood-based approaches have been proposed for special cases, but no model-free based tests have been proposed. We propose two types of tests in a family-based framework that are applicable to arbitrary family structures and completely robust to population stratification. We propose methods for ascertained dichotomous traits and unascertained quantitative traits. We first propose a completely model-free extension of the FBAT main genetic effect test. Then, for power issues, we introduce two model-based tests, one for dichotomous traits and one for continuous traits. Lastly, we utilize these tests to analyze a continuous lung function phenotype as a proxy for asthma in the Childhood Asthma Management Program. The methods are implemented in the free R package fbati.
Binary trait; Candidate gene analysis; Family-based association tests; FBAT-C; Linkage disequilibrium (LD); Model-based test; Model-free test; Nuclear families; Quantitative trait
There has been a growing interest in developing strategies for identifying single-nucleotide polymorphisms (SNPs) that explain a linkage signal by joint modeling of linkage and association. We compare several existing methods and propose a new method called the homozygote sharing transmission-disequilibrium test (HSTDT) to detect linkage and association or to identify SNPs explaining the linkage signal on chromosome 6 for rheumatoid arthritis using 100 replicates of the Genetic Analysis Workshop (GAW) 15 simulated affected sib-pair data. Existing methods considered included the family-based tests of association implemented in FBAT, a transmission-disequilibrium test, a conditional logistic regression approach, a likelihood-based approach implemented in LAMP, and the homozygote sharing test (HST). We compared the type I error rates and power for tests classified into three categories according to their null hypotheses: 1) no association in the presence of linkage (i.e., a SNP explains none of the linkage evidence), 2) no linkage adjusting for the association (i.e., a SNP explains all linkage evidence), and 3) no linkage and no association. For testing association in the presence of linkage, we found similar power among all tests except for the homozygote sharing test that had lower power. When testing linkage adjusting for association, similar power was observed between LAMP and HST, but lower power for the conditional logistic regression method. When testing linkage or association, the conditional logistic regression method was more powerful than FBAT.
Complex diseases or phenotypes may involve multiple genetic variants and interactions between genetic, environmental and other factors. Current genome-wide association studies (GWAS) mostly used single-locus analysis and had identified genetic effects with multiple confirmations. Such confirmed single-nucleotide polymorphism (SNP) effects were likely to be true genetic effects and ignoring this information in testing new effects of the same phenotype results in decreased statistical power due to increased residual variance that has a component of the omitted effects. In this study, a multi-locus association test (MLT) was proposed for GWAS analysis conditional on SNPs with confirmed effects to improve statistical power. Analytical formulae for statistical power were derived and were verified by simulation for MLT accounting for confirmed SNPs and for single-locus test (SLT) without accounting for confirmed SNPs. Statistical power of the two methods was compared by case studies with simulated and the Framingham Heart Study (FHS) GWAS data. Results showed that the MLT method had increased statistical power over SLT. In the GWAS case study on four cholesterol phenotypes and serum metabolites, the MLT method improved statistical power by 5% to 38% depending on the number and effect sizes of the conditional SNPs. For the analysis of HDL cholesterol (HDL-C) and total cholesterol (TC) of the FHS data, the MLT method conditional on confirmed SNPs from GWAS catalog and NCBI had considerably more significant results than SLT.
Several family-based approaches for testing genetic association with traits obtained from longitudinal or repeated measurement studies have been previously proposed. These approaches utilize the multivariate data more efficiently by using estimated optimal weights to combine univariate tests. We show that these FBAT approaches are still robust against hidden population stratification, but their power can be heavily affected since the estimated weights might provide poor approximation of the true theoretical optimal weights with the presence of population stratification. We introduce a permutation-based approach FBAT-MinP and an equal combination approach FBAT-EW, both of which do not involve the use of estimated weights. Through simulation studies, FBAT-MinP and FBAT-EW are shown to be powerful even in the presence of population stratification, when other approaches may substantially lose their power. An application of these approaches to the Childhood Asthma Management Program (CAMP) study data for testing an association between body mass index and a previously reported candidate SNP is given as an example.
Genome-wide association (GWA) studies that use population-based association approaches may identify spurious associations in the presence of population admixture. In this paper, we propose a novel three-stage approach that is computationally efficient and robust to population admixture and more powerful than the family-based association test (FBAT) for GWA studies with family data.
We propose a three-stage approach for GWA studies with family data. The first stage is to perform linear regression ignoring phenotypic correlations among family members. SNPs with a first stage p-value below a liberal cut-off (e.g. 0.1) are then analyzed in the second stage that employs a linear mixed effects (LME) model that accounts for within family correlations. Next, SNPs that reach genome-wide significance (e.g. 10-6 for 34,625 genotyped SNPs in this paper) are analyzed in the third stage using FBAT, with correction of multiple testing only for SNPs that enter the third stage. Simulations are performed to evaluate type I error and power of the proposed method compared to LME adjusting for 10 principal components (PC) of the genotype data. We also apply the three-stage approach to the GWA analyses of uric acid in Framingham Heart Study's SNP Health Association Resource (SHARe) project.
Our simulations show that whether or not population admixture is present, the three-stage approach has no inflated type I error. In terms of power, using LME adjusting PC is only slightly more powerful than the three-stage approach. When applied to the GWA analyses of uric acid in the SHARe project of FHS, the three-stage approach successfully identified and confirmed three SNPs previously reported as genome-wide significant signals.
For GWA analyses of quantitative traits with family data, our three-stage approach provides another appealing solution to population admixture, in addition to LME adjusting for genetic PC.
Genome-wide association studies (GWAS) aim to identify causal variants and genes for complex disease by independently testing a large number of SNP markers for disease association. Although genes have been implicated in these studies, few utilise the multiple-hit model of complex disease to identify causal candidates. A major benefit of multi-locus comparison is that it compensates for some shortcomings of current statistical analyses that test the frequency of each SNP in isolation for the phenotype population versus control.
Here we developed and benchmarked several protocols for GWAS data analysis using different in-silico gene prediction and prioritisation methodologies. We adopted a high sensitivity approach to the data, using less conservative statistical SNP associations. Multiple gene search spaces, either of fixed-widths or proximity-based, were generated around each SNP marker. We used the candidate disease gene prediction system Gentrepid to identify candidates based on shared biomolecular pathways or domain-based protein homology. Predictions were made either with phenotype-specific known disease genes as input; or without a priori knowledge, by exhaustive comparison of genes in distinct loci. Because Gentrepid uses biomolecular data to find interactions and common features between genes in distinct loci of the search spaces, it takes advantage of the multi-locus aspect of the data.
Results suggest testing multiple SNP-to-gene search spaces compensates for differences in phenotypes, populations and SNP platforms. Surprisingly, domain-based homology information was more informative when benchmarked against gene candidates reported by GWA studies compared to previously determined disease genes, possibly suggesting a larger contribution of gene homologs to complex diseases than Mendelian diseases.
Detecting the patterns of DNA sequence variants across the human genome is a crucial step for unraveling the genetic basis of complex human diseases. The human HapMap constructed by single nucleotide polymorphisms (SNPs) provides efficient sequence variation information that can speed up the discovery of genes related to common diseases. In this article, we present a generalized linear model for identifying specific nucleotide variants that encode complex human diseases. A novel approach is derived to group haplotypes to form composite diplotypes, which largely reduces the model degrees of freedom for an association test and hence increases the power when multiple SNP markers are involved. An efficient two-stage estimation procedure based on the expectation-maximization (EM) algorithm is derived to estimate parameters. Non-genetic environmental or clinical risk factors can also be fitted into the model. Computer simulations show that our model has reasonable power and type I error rate with appropriate sample size. It is also suggested through simulations that a balanced design with approximately equal number of cases and controls should be preferred to maintain small estimation bias and reasonable testing power. To illustrate the utility, we apply the method to a genetic association study of large for gestational age (LGA) neonates. The model provides a powerful tool for elucidating the genetic basis of complex binary diseases.
Nucleotide sequence; complex disease; EM algorithm; logistic regression; haplotype.