The presence of linkage disequilibrium violates the assumption of linkage equilibrium that underlies most traditional multipoint linkage approaches. Studies have shown that this violation leads to bias in qualitative trait linkage analysis when parental genotypes are unavailable. Appropriate handling of marker linkage disequilibrium can avoid such false-positive evidence. Using the rheumatoid arthritis simulated data from Genetic Analysis Workshop 15, we examined and compared three approaches to handling linkage disequilibrium among dense markers in both qualitative and quantitative trait linkage analyses: a simple algorithm and SNPLINK, both methods for marker selection, and MERLIN-LD, a method for modeling linkage disequilibrium by creating marker clusters. In analyses ignoring linkage disequilibrium between markers, we observed LOD score inflation only in the affected sib-pair linkage analysis without parental genotypes; no such inflation was present in the quantitative trait locus linkage analysis with severity as the phenotype, with or without parental genotypes. Using methods to model or adjust for linkage disequilibrium, we found a substantial reduction in LOD score inflation in the affected sib-pair linkage analysis. Greater LOD score reduction was observed when the amount of linkage disequilibrium tolerated among the selected markers, or within marker clusters in MERLIN-LD, was decreased; the latter approach showed the greatest reduction. SNPLINK performed better when markers were selected using the D' measure of linkage disequilibrium rather than the r² measure, and it outperformed the simple algorithm. Our findings reiterate the necessity of properly handling dense markers in linkage analysis, especially when parental genotypes are unavailable.
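The two linkage disequilibrium measures named above, D' and r², can be computed from haplotype and allele frequencies. The following is a minimal sketch using the standard textbook definitions; it is not taken from SNPLINK or MERLIN-LD, and the function name `ld_measures` and its arguments are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic markers.

    p_ab : frequency of the haplotype carrying allele A (marker 1) and B (marker 2)
    p_a  : frequency of allele A at marker 1
    p_b  : frequency of allele B at marker 2
    All names are illustrative assumptions, not part of any cited program.
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the two allele indicators.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # A pair of markers can have D' close to 1 while r^2 stays modest,
    # which is one reason the two selection criteria can keep different markers.
    print(ld_measures(0.3, 0.3, 0.6))
```

A marker-selection rule would then drop one marker of any pair whose D' or r² exceeds a chosen threshold; the threshold values are a design choice not specified here.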
It is well known that conventional association tests can lead to excessive false positives when there is population stratification. We propose a new test for detecting genetic association with a case-control study design. Unlike some other methods for handling population stratification, we treat the cases as one population and the controls as another, even though each may be a mixture of several sub-populations. A likelihood-ratio test is used to test whether the allele frequency of a tested single-nucleotide polymorphism in the case population is the same as that in the control population. This new test is applied to the Genetic Analysis Workshop 16 Problem 1 data on rheumatoid arthritis. Compared with the Pearson chi-square genotype test, the association strength of many single-nucleotide polymorphisms is decreased while the signal at the HLA region on 6p21 is maintained.
The central question for Genetic Analysis Workshop 14 (GAW14) is which strategy is better for linkage analysis: the use of single-nucleotide polymorphisms (SNPs) or of microsatellite markers. To answer this question, we analyzed the simulated data using Duffy's SIB-PAIR program, which can incorporate parental genotypes, and our identity-by-state – identity-by-descent (IBS-IBD) transformation method of affected sib-pair linkage analysis, which uses the matrix transformation between IBS and IBD. The advantages of our method are as follows: the assumption of Hardy-Weinberg equilibrium is not necessary; the parental genotype information may all be unknown; both IBS and its related IBD transformation can be used in the linkage analysis; and the determinant of the IBS-IBD transformation matrix provides a quantitative measure of the quality of the marker in linkage analysis. With the originally distributed simulated data, we found that 1) for microsatellite markers there are virtually no differences in type I and type II error rates whether or not parental genotypes are used; 2) on average, a microsatellite marker has more power than a SNP marker does in linkage detection; 3) if parental genotype information is used, SNP markers show lower type I error rates than microsatellite markers; and 4) if parental genotypes are not available, SNP markers show considerable variation in type I error rates for different methods.
The inference of the hidden structure of a population is an essential issue in population genetics, and several methods have recently been proposed for this purpose.
In this study, a new method to infer the number of clusters and to assign individuals to the inferred populations is proposed. This approach does not make any assumption on Hardy-Weinberg and linkage equilibrium. The implemented criterion is the maximisation (via a simulated annealing algorithm) of the averaged genetic distance between a predefined number of clusters. The performance of this method is compared with two Bayesian approaches: STRUCTURE and BAPS, using simulated data and also a real human data set.
The simulations show that with a reduced number of markers, BAPS overestimates the number of clusters and presents a reduced proportion of correct groupings. The accuracy of the new method is approximately the same as for STRUCTURE. Also, in Hardy-Weinberg and linkage disequilibrium cases, BAPS performs incorrectly. In these situations, STRUCTURE and the new method show an equivalent behaviour with respect to the number of inferred clusters, although the proportion of correct groupings is slightly better with the new method. Re-establishing equilibrium with the randomisation procedures improves the precision of the Bayesian approaches. All methods have a good precision for FST ≥ 0.03, but only STRUCTURE estimates the correct number of clusters for FST as low as 0.01. In situations with a high number of clusters or a more complex population structure, the new method (MGD) performs better than STRUCTURE and BAPS. The results for a human data set analysed with the new method are congruent with the geographical regions previously found.
This new method for inferring the hidden structure in a population, based on the maximisation of the genetic distance and making no assumption about Hardy-Weinberg and linkage equilibrium, performs well under different simulated scenarios and with real data. Therefore, it could be a useful tool to determine genetically homogeneous groups, especially in those situations where the number of clusters is high, the population structure is complex, and Hardy-Weinberg and/or linkage disequilibrium are present.
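The FST thresholds quoted in the simulation results measure how differentiated the subpopulations are. As a rough illustration only, the sketch below computes Wright's FST for a single biallelic locus as (HT − HS)/HT, assuming equally sized subpopulations and known allele frequencies. It is not the estimator used in the study, and the function name `fst_biallelic` is hypothetical.

```python
def fst_biallelic(freqs):
    """Wright's FST for one biallelic locus.

    freqs : list of allele frequencies, one per subpopulation.
    Assumes equal subpopulation sizes (an assumption made for this sketch).
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    # Expected heterozygosity in the pooled population.
    h_t = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within subpopulations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / n
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


if __name__ == "__main__":
    # Two subpopulations with allele frequencies 0.4 and 0.6.
    print(fst_biallelic([0.4, 0.6]))
```

Under these assumptions, an FST near 0.03 corresponds to allele frequencies that differ only modestly between subpopulations, which is why clustering methods find such cases hard.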
Data for Problem 3 of the Genetic Analysis Workshop 15 were generated by computer simulation in an attempt to mimic some of the genetic and epidemiological features of rheumatoid arthritis (RA) such as its population prevalence, sex ratio, risk to siblings of affected individuals, association with cigarette smoking, the strong effect of genotype in the HLA region and other genetic effects. A complex genetic model including epistasis and genotype-by-environment interaction was applied to a population of 1.9 million nuclear families of size four from which we selected 1500 families with both offspring affected and 2000 unrelated, unaffected individuals all of whose first-degree relatives were unaffected. This process was repeated to produce 100 replicate data sets. In addition, we generated marker data for 22 autosomes consisting of a genome-wide set of 730 simulated STRP markers, 9187 SNP markers and an additional 17,820 SNP markers on chromosome 6. Appropriate linkage disequilibrium between markers and between trait loci and markers was modelled using HapMap Phase 1 data. The code base for this project was written primarily in the Octave programming language, but it is being ported to the R language and developed into a larger project for general genetic simulation called GenetSim. All of the source code that was used to generate the GAW 15 Problem 3 data is freely available for download at .
Testing Hardy-Weinberg Equilibrium (HWE) in the control group is commonly used to detect genotyping errors in genetic association studies. We propose a likelihood ratio test for testing HWE in the study population using both case and control samples. This test incorporates underlying association models. Another feature is that, when we infer the disease-genotype association, we explicitly incorporate HWE or a possible departure from Hardy-Weinberg Equilibrium (DHWE) into the model. Our unified framework enables us to infer the disease-genotype association when a detected DHWE needs to be part of the model after causes for the DHWE are explored. Real datasets are used to illustrate the application of the methodology and its implication in genetic association studies. Our analysis and interpretation touch on genotyping errors, population selection, population stratification, or the study sampling plan, all delicate issues that could be the cause of DHWE.
Likelihood ratio test; Hardy Weinberg equilibrium; case-control study; genotype-disease association
The century-old Hardy–Weinberg law remains fundamental to population genetics. Typically Hardy–Weinberg equilibrium is tested in unrelated individuals using a χ2 goodness-of-fit test that compares expected and observed numbers of heterozygotes and homozygotes. In this report, we propose a likelihood ratio test for Hardy–Weinberg equilibrium that accommodates a mixture of pedigree and random sample data. The underlying statistical model depends on a parameter γ determining the ratio of heterozygous genotypes to homozygous genotypes among pedigree founders. As our heterozygous–homozygous test accommodates markers with dominant and recessive alleles, it can handle the phase ambiguities encountered in combining several linked single nucleotide polymorphisms into a single supermarker. No prior haplotyping is necessary. Our experience on real and simulated data suggests that the heterozygous–homozygous test has good type-one error and power.
genetic equilibrium; population genetics; likelihood estimation; pedigree analysis
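For reference, the conventional χ² goodness-of-fit test for Hardy–Weinberg equilibrium in unrelated individuals, which the abstract above contrasts with its likelihood ratio test, can be sketched as follows. This is the textbook one-degree-of-freedom test for a biallelic marker, not the proposed heterozygous–homozygous test; the function name `hwe_chi_square` is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb : observed genotype counts for a biallelic marker.
    Returns (chi2, p_value) using a chi-square distribution with 1 df.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))  # counts in exact HWE proportions
    print(hwe_chi_square(50, 0, 50))   # complete heterozygote deficit
```

The χ² approximation is unreliable when expected counts are small; exact tests are commonly preferred in that situation.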
Accounting for interactions with environmental factors in association studies may improve the power to detect genetic effects and may help identify important environmental effect modifiers. The power of unphased genotype-based versus haplotype-based methods in regions with high linkage disequilibrium (LD), as measured by D', for analyzing gene × environment (gene × sex) interactions was compared using the Genetic Analysis Workshop 15 (GAW15) simulated data on rheumatoid arthritis with prior knowledge of the answers. Stepwise and regular conditional logistic regression (CLR) was performed using a matched case-control sample for an HLA region interacting with sex. Haplotype-based analyses were performed using a haplotype-sharing-based Mantel statistic and a test for haplotype-trait association in a general linear model framework. A step-down minP algorithm was applied to derive adjusted p-values and to allow for power comparisons. These methods were also applied to the GAW15 real data set for PTPN22.
For markers in strong LD, stepwise CLR performed poorly because of the correlation/collinearity between the predictors in the model. The power was high for detecting genetic main effects using simple CLR models and haplotype-based methods and for detecting joint effects using CLR and Mantel statistics. Only the haplotype-trait association test had high power to detect the gene × sex interaction.
In the PTPN22 region with markers characterized by strong LD, all methods indicated a significant genotype × sex interaction in a sample of about 1000 subjects. The previously reported R620W single-nucleotide polymorphism was identified using logistic regression, but the haplotype-based methods did not provide any precise location information.
We present a new method for fine-mapping a disease susceptibility locus using a case-control design. The new method, termed the weighted average (WA) statistic, averages the Cochran-Armitage (CA) trend test statistic and the difference between the Hardy-Weinberg disequilibrium test statistic for cases and controls (the HWD trend). The main characteristics of the WA statistic are that it improves on the weaknesses, and maintains the strengths, of both the CA trend test and the HWD trend test. Data from three different populations in the Genetic Analysis Workshop 14 (GAW14) simulated dataset (Aipotu, Karangar, and Danacaa) were first subjected to model-free linkage analysis to find regions exhibiting linkage. Then, for fine-scale mapping, 140 SNPs within the significant linkage regions were analyzed with the WA test statistic on replicates of the three populations, both separately and combined. The regions that were significant in the multipoint linkage analysis were also significant in this fine-scale mapping. The most significant regions that were obtained using the WA statistic were regions in chromosome 3 (B03T3056–B03T3058, p-value < 1 × 10⁻¹⁰) and chromosome 9 (B09T8332–B09T8334, p-value 1 × 10⁻⁶). Based on the results of the simulated GAW14 data, the WA test statistic showed good performance and could narrow down the region containing the susceptibility locus. However, the strength of the signal depends on both the strength of the linkage disequilibrium and the heterozygosity of the linked marker.
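One of the two components of the WA statistic, the Cochran-Armitage trend test, can be sketched as below. The example uses the standard additive scores (0, 1, 2) for the three genotypes, which is an assumption, as the abstract does not state the scores. The HWD trend component and the weighting scheme are not specified in the abstract and are not reproduced here. The function name `cochran_armitage_trend` is hypothetical.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases, controls : genotype counts ordered by number of risk alleles (0, 1, 2).
    weights         : genotype scores; (0, 1, 2) is the additive choice assumed here.
    Returns (z, two_sided_p).
    """
    r_total = sum(cases)
    s_total = sum(controls)
    n_total = r_total + s_total
    col = [r + s for r, s in zip(cases, controls)]  # genotype column totals
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * n for w, n in zip(weights, col))
    sum_w2n = sum(w * w * n for w, n in zip(weights, col))
    # Trend statistic and its variance under the null hypothesis of no association.
    t = n_total * sum_wr - r_total * sum_wn
    var = r_total * s_total * (n_total * sum_w2n - sum_wn ** 2) / n_total
    if var <= 0:
        raise ValueError("variance is zero; the table has no genotype variation")
    z = t / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value


if __name__ == "__main__":
    print(cochran_armitage_trend((10, 20, 30), (30, 20, 10)))
```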
While genetic and environmental factors and their interactions influence susceptibility to rheumatoid arthritis (RA), causative genetic variants have not been identified. The purpose of the present study was to assess the effects of covariates and genotype × sex interactions on the genome-wide association analysis (GWAA) of RA using Genetic Analysis Workshop 16 Problem 1 data and a logistic regression approach as implemented in PLINK. After accounting for the effects of population stratification, effects of covariates and genotype × sex interactions on the GWAA of RA were assessed by conducting association and interaction analyses. We found significant allelic associations, covariate, and genotype × sex interaction effects on RA. Several top single-nucleotide polymorphisms (SNPs) (~22 SNPs) showed significant associations with strong p-values (p < 1 × 10⁻⁴ to p < 1 × 10⁻²⁴). Only three SNPs on chromosomes 4, 13, and 20 were significant after Bonferroni correction, and none of these three SNPs showed significant genotype × sex interactions. Of the 30 top SNPs with significant (p < 1 × 10⁻⁴ to p < 1 × 10⁻⁶) interactions, ~23 SNPs showed additive interactions and ~5 SNPs showed only dominance interactions. Those SNPs showing significant associations in the regular logistic regression failed to show significant interactions. In contrast, the SNPs that showed significant interactions failed to show significant associations in models that did not incorporate interactions. It is important to consider interactions of genotype × sex in addition to associations in a GWAA of RA. Furthermore, the association between SNPs and RA susceptibility varies significantly between men and women.
Related cases may be included in case-control association studies if correlations between related individuals due to identity-by-descent (IBD) sharing are taken into account. We derived a framework to test for association in a case-control design including affected sibships and unrelated controls. First, a corrected variance for the allele frequency difference between cases and controls was directly calculated or estimated in two ways on the basis of the fixation index FST and the inbreeding coefficient. Then the correlation-corrected association test including controls and affected sibs was carried out. We applied the three strategies to 20 candidate genes on the Genetic Analysis Workshop 15 rheumatoid arthritis data and to 9187 single-nucleotide polymorphisms of replicate one of the Genetic Analysis Workshop 15 simulated data with knowledge of the "answers". The three strategies used to correct for correlation give only minor differences in the variance estimates and yield an almost correct type I error rate for the association tests. Thus, all strategies considered to correct the variance performed quite well.
Traditional transmission disequilibrium test (TDT) based methods for genetic association analyses are robust to population stratification at the cost of a substantial loss of power. Here we describe a novel method for family-based association studies that corrects for population stratification with the use of an extension of principal component analysis (PCA). Specifically, we adopt PCA on unrelated parents in each family. We then infer principal components for children from those for their parents through a TDT-like strategy. Two test statistics within a variance-components model are proposed for association tests. Simulation results show that the proposed tests have correct type I error rates regardless of population stratification, and have greatly improved power over two popular TDT-based methods: QTDT and FBAT. The application to the Genetic Analysis Workshop 16 (GAW16) data sets attests to the feasibility of the proposed method.
Family Based Association Tests (FBATs); Transmission Disequilibrium Test (TDT); Principal Component Analysis (PCA); Variance-Components
The incorporation of disease-associated covariates into studies aiming to identify susceptibility genes for complex human traits is a challenging problem. Accounting for such covariates in genetic linkage and association analyses may help reduce the genetic heterogeneity inherent in these complex phenotypes. For Genetic Analysis Workshop 15 (GAW15) Problem 3 simulated data, our goal was to compare the power of several two-stage study designs to identify rheumatoid arthritis-related genes on chromosome 9 (disease severity), 11 (IgM), and 18 (anti-cyclic citrullinated protein), with knowledge of the answers. Five study designs incorporating an initial linkage step, followed by a case-selection scheme and case-control association analysis by logistic regression, were considered. The linkage step was either qualitative-trait linkage analysis as implemented in MERLIN-nonparametric linkage (NPL), or quantitative-trait locus analysis as implemented in MERLIN-REGRESS. A set of cases representing either one case from each available family, one case per linked family (NPL ≥ 0), or one case from each family identified by ordered-subset analysis was chosen for comparison with the full set of 2000 simulated controls. As expected, the performance of these study designs depended on the disease model used to generate the data, especially the simulated allele frequency difference between cases and controls. The quantitative trait loci analysis performed well in identifying these loci, and the power to identify disease-associated alleles was increased by using ordered-subset analysis as a case selection tool.
In this report, we compared haplotyping approaches using families and unrelated individuals on the simulated rheumatoid arthritis (RA) data in Problem 3 from Genetic Analysis Workshop (GAW) 15. To investigate these two approaches, we picked one representative program for each: PedPhase and fastPHASE, respectively. PedPhase is a rule-based method focusing on the haplotyping constraints within each pedigree and solving them using integer linear programming. fastPHASE is a statistical method based on the clustering property of haplotypes in a population over short regions. It is believed that with family information, one can obtain more accurate phasing results, at considerably more cost for genotyping additional family members. Our results indicate that, though relying only on the constraints within each family (with four members) individually, PedPhase has better phasing accuracy than fastPHASE, even when the total numbers of genotyped individuals are the same. For missing genotype imputation, however, fastPHASE performs better than PedPhase by taking population information into consideration. The relative influence of family constraints and population information on haplotyping accuracy shown in this report provides an empirical basis for assessing the trade-off of genotyping family data under different settings.
Using the Genetic Analysis Workshop 14 simulated datasets we carried out nonparametric linkage analyses and applied a log-linear method for analysis of case-parent-triad data with stratification on parental mating type. We proposed and applied a random effect modelling approach to explore the impact of population heterogeneity on tests of association between genetic markers and disease status. The estimated genetic effect may appear to be strongly significant in one population but nonsignificant in another population, leading to confusion about interpretation. However, when results are interpreted in the light of a random effects model, both studies may be making similar statements about a genetic effect that varies depending on environment and background.
Population structure analysis is important to genetic association studies and evolutionary investigations. Parametric approaches, e.g. STRUCTURE and L-POP, usually assume Hardy-Weinberg equilibrium (HWE) and linkage equilibrium among loci in sample population individuals. However, the assumptions may not hold and allele frequency estimation may not be accurate in some data sets. The improved version of STRUCTURE (version 2.1) can incorporate linkage information among loci but is still sensitive to high background linkage disequilibrium. Nowadays, large-scale single nucleotide polymorphisms (SNPs) are becoming popular in genetic studies. Therefore, it is imperative to have software that makes full use of these genetic data to generate inference even when model assumptions do not hold or allele frequency estimation suffers from high variation.
We have developed point-and-click software for non-parametric population structure analysis distributed as an R package. The software takes advantage of the large number of SNPs available to categorize individuals into ethnically similar clusters and it does not require assumptions about population models. Nor does it estimate allele frequencies. Moreover, this software can also infer the optimal number of populations.
Our software tool employs non-parametric approaches to assign individuals to clusters using SNPs. It provides efficient computation and an intuitive way for researchers to explore ethnic relationships among individuals. It can be complementary to parametric approaches in population structure analysis.
We studied rheumatoid arthritis (RA) in the North American Rheumatoid Arthritis Consortium (NARAC) data (1499 subjects; 757 families). Identical methods were applied for studying RA in the Genetic Analysis Workshop 15 (GAW15) simulated data (with prior knowledge of the simulation answers). Fifty replications of GAW15 simulated data had 3497 ± 20 subjects in 1500 nuclear families. Two new statistical methods were applied to transform the original phenotypes on these data: item response theory (IRT), to create a latent variable from nine classifying predictors, and a Blom transformation of the anti-CCP (anti-cyclic citrullinated protein) variable. We fitted linear mixed-effects (LME) models to study the additive associations of 404 Illumina-genotyped single-nucleotide polymorphisms (SNPs) on the NARAC data, and of 17,820 SNPs of the GAW15 simulated data. In the GAW15 simulated data, the association with anti-CCP Blom transformation showed a 100% sensitivity for SNP1 located in the major histocompatibility complex gene. In contrast, the association of SNP1 with the IRT latent variable showed only 24% sensitivity. From the simulated data, we conclude that the Blom transformation of the anti-CCP variable produced more reliable results than the latent variable from the qualitative combination of a group of RA risk factors. In the NARAC data, the significant RA-SNPs associations found with both phenotype-transformation methods provided a trend that may point toward dynein and energy control genes. Finer genotyping in the NARAC data would grant more exact evidence for the contributions of chromosome 6 to RA.
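The Blom transformation mentioned above maps ranks to normal scores through Φ⁻¹((r − 3/8)/(n + 1/4)), where r is the rank of an observation among n values. A minimal sketch follows; assigning tied values their average rank is an assumption of this sketch, since the abstract does not describe tie handling, and the function name `blom_transform` is hypothetical.

```python
from statistics import NormalDist


def blom_transform(values):
    """Rank-based inverse normal (Blom) transformation.

    Each value with rank r among n observations is mapped to
    Phi^{-1}((r - 3/8) / (n + 1/4)). Ties receive their average rank
    (an assumption made for this sketch).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


if __name__ == "__main__":
    print(blom_transform([3.0, 1.0, 2.0]))
```

The transformed values are approximately standard normal whatever the shape of the original distribution, which is useful for a skewed measure before fitting a linear model.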
The evaluation of associations between genotypes and diseases in a case-control framework plays an important role in genetic epidemiology. This paper focuses on the evaluation of the homogeneity of both genotypic and allelic frequencies. The traditional test that is used to check allelic homogeneity is known to be valid only under Hardy-Weinberg equilibrium, a property that may not hold in practice.
We first describe the flaws of the traditional (chi-squared) tests for both allelic and genotypic homogeneity. Besides the known problem of the allelic procedure, we show that whenever these tests are used, an incoherence may arise: sometimes the genotypic homogeneity hypothesis is not rejected, but the allelic hypothesis is. As we argue, this is logically impossible. Some methods that were recently proposed implicitly rely on the idea that this does not happen. In an attempt to correct this incoherence, we describe an alternative frequentist approach that is appropriate even when Hardy-Weinberg equilibrium does not hold. It is then shown that the problem remains and is intrinsic to frequentist procedures. Finally, we introduce the Full Bayesian Significance Test to test both hypotheses and prove that the incoherence cannot happen with these new tests. To illustrate this, all five tests are applied to real and simulated datasets. Using the celebrated power analysis, we show that the Bayesian method is comparable to the frequentist one and has the advantage of being coherent.
Contrary to more traditional approaches, the Full Bayesian Significance Test for association studies provides a simple, coherent and powerful tool for detecting associations.
Allelic homogeneity test; Bayesian methods; Chi-squared test; Hardy-Weinberg equilibrium; FBST; Monotonicity
We used the simulated data set from Genetic Analysis Workshop 15 Problem 3 to assess a two-stage approach for identifying single-nucleotide polymorphisms (SNPs) associated with rheumatoid arthritis (RA). In the first stage, we used random forests (RF) to screen large amounts of genetic data using the variable importance measure, which takes into account SNP interaction effects as well as main effects without requiring model specification. We used the simulated 9187 SNPs mimicking a 10 K SNP chip, along with covariates DR (the simulated DRB1 genotype), smoking, and sex as input to the RF analyses with a training set consisting of 750 unrelated RA cases and 750 controls. We used an iterative RF screening procedure to identify a smaller set of variables for further analysis. In the second stage, we used the software program CaMML for producing Bayesian networks, and developed complex etiologic models for RA risk using the variables identified by our RF screening procedure. We evaluated the performance of this method using independent test data sets for up to 100 replicates.
A new type of test is presented for genome-wide association studies using a case-control design. It is referred to as the adaptive two-stage (ATS) analysis, being based on both the Hardy-Weinberg disequilibrium trend test (HWDTT) and the Cochran-Armitage trend test (CATT). The procedure for the ATS is to screen single-nucleotide polymorphisms (SNPs) using the HWDTT in a first stage, and then test a reduced number of SNPs that pass the screening step in a second stage using the CATT. In the Genetic Analysis Workshop 15 simulated data set, this ATS analysis captured, after Bonferroni correction, the region from 32447.149 kb to 32859.819 kb and the region around 37363.880 kb that are close to the actual trait loci on chromosome 6. We compared the ATS with other ways of combining the p-values of the HWDTT and the CATT, the classical form of Fisher's test and a weighted form of Fisher's test. Results showed that the proposed ATS has good performance and could detect the regions containing a susceptibility locus.
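The classical form of Fisher's test, used above as a comparison for combining the HWDTT and CATT p-values, sums −2 ln p over the k tests and refers the result to a χ² distribution with 2k degrees of freedom. The sketch below implements only this classical form; the weighted form and the ATS procedure itself are not described in enough detail in the abstract to reproduce. The method assumes the combined p-values are independent, and the function name `fisher_combined_p` is hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    Each p-value must lie in (0, 1].
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    # Two moderately small p-values combine into a smaller one.
    print(fisher_combined_p([0.05, 0.05]))
```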
We have used the genome-wide marker genotypes from Genetic Analysis Workshop 15 Problem 2 to explore joint evidence for genetic linkage to rheumatoid arthritis across several samples. The data consisted of four high-density genome scans on samples selected for rheumatoid arthritis. We cleaned the data, removed intermarker linkage disequilibrium, and assembled the samples onto a common genetic map using genome sequence positions as a reference for map interpolation. The individual studies were combined first at the genotype level (mega-analysis) prior to a multipoint linkage analysis on the combined sample, and second using the genome scan meta-analysis method after linkage analysis of each sample. The two approaches were compared, and give strong support to the HLA locus on chromosome 6 as a susceptibility locus. Other regions of interest include loci on chromosomes 11, 2, and 12.
Here we present two new computer tools, PREMIM and EMIM, for the estimation of parental and child genetic effects, based on genotype data from a variety of different child-parent configurations. PREMIM allows the extraction of child-parent genotype data from standard-format pedigree data files, while EMIM uses the extracted genotype data to perform subsequent statistical analysis. The use of genotype data from the parents as well as from the child in question allows the estimation of complex genetic effects such as maternal genotype effects, maternal-foetal interactions and parent-of-origin (imprinting) effects. These effects are estimated by EMIM, incorporating chosen assumptions such as Hardy-Weinberg equilibrium or exchangeability of parental matings as required.
In application to simulated data, we show that the inference provided by EMIM is essentially equivalent to that provided by alternative (competing) software packages such as MENDEL and LEM. However, PREMIM and EMIM (used in combination) considerably outperform MENDEL and LEM in terms of speed and ease of execution.
Together, EMIM and PREMIM provide easy-to-use command-line tools for the analysis of pedigree data, giving unbiased estimates of parental and child genotype relative risks.
Case/parent trio; Maternal-fetal interaction; Parent-of-origin; Genome-wide association study
Accurate inference of genetic discontinuities between populations is an essential component of intraspecific biodiversity and evolution studies, as well as associative genetics. The most widely-used methods to infer population structure are model-based, Bayesian MCMC procedures that minimize Hardy-Weinberg and linkage disequilibrium within subpopulations. These methods are useful, but suffer from large computational requirements and a dependence on modeling assumptions that may not be met in real data sets. Here we describe the development of a new approach, PCO-MC, which couples principal coordinate analysis to a clustering procedure for the inference of population structure from multilocus genotype data.
PCO-MC uses data from all principal coordinate axes simultaneously to calculate a multidimensional “density landscape”, from which the number of subpopulations, and the membership within subpopulations, is determined using a valley-seeking algorithm. Using extensive simulations, we show that this approach outperforms a Bayesian MCMC procedure when many loci (e.g. 100) are sampled, but that the Bayesian procedure is marginally superior with few loci (e.g. 10). When presented with sufficient data, PCO-MC accurately delineated subpopulations with population Fst values as low as 0.03 (G'st>0.2), whereas the limit of resolution of the Bayesian approach was Fst = 0.05 (G'st>0.35).
We draw a distinction between population structure inference for describing biodiversity as opposed to Type I error control in associative genetics. We suggest that discrete assignments, like those produced by PCO-MC, are appropriate for circumscribing units of biodiversity whereas expression of population structure as a continuous variable is more useful for case-control correction in structured association studies.
Inference of population structure from genetic markers is helpful in diverse situations, such as association and evolutionary studies. In this paper, we describe a two-stage strategy in inferring population structure using multilocus genotype data. In the first stage, we use dimension reduction methods such as singular value decomposition to reduce the dimension of the data, and in the second stage, we use clustering methods on the reduced data to identify population structure. The strategy has the ability to identify population structure and assign each individual to its corresponding subpopulation. The strategy does not depend on any population genetics assumptions (such as Hardy-Weinberg equilibrium and linkage equilibrium between loci within populations) and can be used with any genotype data. When applied to real and simulated data, the strategy is found to have similar or better performance compared with STRUCTURE, the most popular method in current use. Therefore, the proposed strategy provides a useful alternative to analyse population data.
population structure; subpopulation; singular value decomposition; dimension reduction; clustering
Parametric linkage methods for quantitative trait locus mapping require explicit specification of the probability model of the quantitative trait and hence can lead to misleading linkage inferences when the model assumptions are not valid. Ghosh and Majumder developed a nonparametric regression method based on kernel smoothing for linkage mapping of quantitative trait loci using squared differences in trait values of independent sib pairs, which is more robust than parametric methods to violations of distributional assumptions. In this study, we modify this nonparametric regression method by using local linear polynomials instead of the Nadaraya-Watson estimator, and squared sums of sib-pair trait values in addition to squared differences, to perform a genome-wide scan of rheumatoid factor-IgM levels on sib pairs in the Genetic Analysis Workshop 15 simulated data set. We obtain significant evidence of linkage very close to the quantitative trait locus controlling RF-IgM. We find that the simultaneous use of squared differences and squared sums increases the power to detect linkage compared to using only squared differences. However, because all the sib pairs are selected for rheumatoid arthritis, there is reduced variance of RF-IgM values, and empirical power to detect linkage is not very high. We also compare the performance of our method with two linear regression approaches: the classical Haseman-Elston method using squared sib-pair trait differences and its extension proposed by Elston et al. using mean-corrected sib-pair cross-products. We find that the proposed nonparametric method yields more power than the linear regression approaches.
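The classical Haseman-Elston method used as a comparison above regresses the squared sib-pair trait difference on the estimated proportion of alleles shared identical by descent (IBD) at a marker; a significantly negative slope suggests linkage. The following is a minimal sketch of that linear regression only, not of the kernel-smoothing or local linear method proposed in the abstract. The function name `haseman_elston_slope` and its arguments are hypothetical.

```python
def haseman_elston_slope(trait_pairs, ibd_proportions):
    """Classical Haseman-Elston regression for independent sib pairs.

    trait_pairs     : list of (trait_sib1, trait_sib2) tuples.
    ibd_proportions : estimated IBD sharing proportion (0 to 1) per pair.
    Returns (slope, intercept) of the least-squares line relating the
    squared trait difference to IBD sharing.
    """
    y = [(a - b) ** 2 for a, b in trait_pairs]
    x = list(ibd_proportions)
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("IBD proportions do not vary; slope is undefined")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


if __name__ == "__main__":
    pairs = [(0.0, 2.0), (1.0, 2.0), (3.0, 3.0)]
    ibd = [0.0, 0.5, 1.0]
    # Pairs sharing more alleles IBD are more alike, giving a negative slope.
    print(haseman_elston_slope(pairs, ibd))
```

A formal test would compare the slope with its standard error using a one-sided test; that step is omitted here for brevity.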