In human case-control association studies, population heterogeneity is often present and can lead to increased false-positive results. Various methods have been proposed and are in current use to remedy this situation.
We assume that heterogeneity is due to a relatively small number of individuals whose allele frequencies differ from those of the remainder of the sample. For this situation, we propose a new method of handling heterogeneity by removing outliers in a controlled manner. In a coordinate system of the c largest principal components in multidimensional scaling (MDS), we systematically remove one after another of the most extreme outlying individuals and each time recompute the largest association test statistic. The smallest p value obtained within M removals serves as our test statistic whose significance level is assessed in randomization samples.
In power simulations of our method and three methods in current use, averaged over several different scenarios, the best method turned out to be logistic regression analysis (based on all individuals) with MDS components as covariates.
Our proposed method ranked closely behind logistic regression analysis with MDS components but ahead of other commonly used approaches. In analyses of real datasets our method performed best.
Population admixture (cryptic heterogeneity) represents a potentially serious problem in case-control association studies. Allele frequencies tend to differ between countries and even between different regions in a single country [2,3]. Disregarding such differences tends to inflate the χ² association statistic, which ‘sees’ heterogeneity as a deviation from the null hypothesis of homogeneity. One of the first methods to deal with the deleterious effects of heterogeneity is genomic control (GC), which assesses the extent of inflation of χ² in terms of the GC factor, λ, and then divides each χ² by λ. Additional methods have since been introduced, notably principal components analysis and logistic regression with components of multidimensional scaling (MDS) as covariates. Here, we propose a novel approach based on deleting individuals that appear as outliers. This approach specifically addresses the situation of a relatively small number of individuals that do not belong to the main portion of the study sample.
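The GC correction described above can be illustrated with a minimal sketch. The text does not say how λ is estimated; the sketch assumes the widely used median-based estimator (median of the observed 1-d.f. χ² values divided by the median of the χ² distribution with 1 d.f., about 0.4549). The function name `genomic_control` and the example values are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 d.f. (approximately 0.4549).
CHI2_1DF_MEDIAN = 0.4549


def genomic_control(chi2_values):
    """Return (lambda, corrected statistics) for a list of 1-d.f. chi-square values.

    Assumption: lambda is estimated from the median; the article only states
    that each chi-square is divided by lambda.
    """
    lam = median(chi2_values) / CHI2_1DF_MEDIAN
    # Common convention (also an assumption here): never inflate the statistics.
    lam = max(lam, 1.0)
    return lam, [x / lam for x in chi2_values]


# Made-up statistics, for illustration only.
chi2 = [0.2, 0.5, 0.91, 1.3, 6.0]
lam, corrected = genomic_control(chi2)
print(f"lambda = {lam:.3f}")
```

Because every statistic is divided by the same λ, the ranking of SNPs is unchanged; only the evidence attached to each SNP is scaled down.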
It is intuitive that one way to deal with heterogeneity is to remove individuals not belonging to a sample. Such an approach might be seen as more appropriate than ‘punishing’ all individuals by rolling back all test statistics, as is done in the GC method. However, removing outliers has to be carried out in a statistically satisfactory manner. To decide how many and which individuals to remove, we proceed as follows: based on the commonly used identity by state (IBS) metric, similarity between two individuals is defined as the IBS between the two individuals, averaged over all SNPs. In the coordinate system of the c largest MDS components (here we use c = 4 throughout), each individual is at some distance from the center. The individual with the largest distance from the center is considered a potential outlier.
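As a rough illustration of the two ingredients named above, the following sketch computes an averaged IBS similarity and picks the individual farthest from the center. Several details are assumptions rather than statements from the text: genotypes are coded as 0, 1 or 2 copies of one allele, IBS is scaled to the interval [0, 1], the center is taken to be the centroid of the coordinates, and the MDS coordinates themselves are assumed to be already available. The function names are hypothetical.

```python
import math


def ibs_similarity(g1, g2):
    """Mean number of alleles shared identical by state, scaled to [0, 1].

    g1, g2: genotype lists coded as 0, 1 or 2 copies of one allele (assumed coding).
    """
    shared = [2 - abs(a - b) for a, b in zip(g1, g2)]
    return sum(shared) / (2 * len(shared))


def most_extreme(coords):
    """Index of the individual farthest from the centroid of the MDS coordinates."""
    dims = len(coords[0])
    center = [sum(p[d] for p in coords) / len(coords) for d in range(dims)]
    dist = [math.dist(p, center) for p in coords]
    return max(range(len(coords)), key=dist.__getitem__)


# Toy example with two MDS components instead of c = 4.
coords = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (5.0, 5.0)]
print(most_extreme(coords))
```

In an actual analysis the pairwise similarities would be turned into distances and passed to an MDS routine; that step is omitted here.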
Initially, the Pearson χ² is computed in a 2 × 2 contingency table for each SNP, where the two rows correspond to cases and controls, and the two columns represent the SNP alleles. After retaining the p value, p0, for the largest χ² over all the SNPs, the first potential outlier is removed and another largest χ² is computed (at whatever SNP in the genome it occurs) leading to p1, and so on. We proceed until a predefined maximum number, M, of individuals has been removed. The sequence of p values (p0, p1, …, pM) initially either decreases (p0 > p1) or increases (p0 < p1). In the first case, assume that the smallest p value, pmin, among the M + 1 values occurs at step k, that is, after k outliers have been removed. We then take T = pmin as our overall test statistic. In the latter case, we search for the first (local) minimum p value, T, or, if none occurs, we retain T = pM, with T again being our test statistic. In each of a sufficiently large set of randomization samples (labels case and control are randomly permuted), the whole approach is repeated, and we obtain the significance level associated with T as the proportion of randomization samples with T values at least as small as the observed T. Note that there may be a different SNP with largest χ² in different steps of outlier removal.
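The building blocks of this procedure can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the function names are hypothetical, the tie p0 = p1 (not covered in the text) is treated like the increasing case, and a local minimum is taken to be a value smaller than its predecessor and no larger than its successor.

```python
import math


def allelic_chi2(case_counts, control_counts):
    """Pearson chi-square for a 2 x 2 table of allele counts (rows: cases, controls)."""
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def chi2_1df_pvalue(x):
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


def select_T(p):
    """Pick the test statistic T from the sequence p = [p0, p1, ..., pM]."""
    if p[0] > p[1]:
        # Sequence initially decreases: take the overall minimum.
        return min(p)
    # Sequence initially increases (ties handled here by assumption):
    # take the first local minimum, or pM if there is none.
    for k in range(1, len(p) - 1):
        if p[k] < p[k - 1] and p[k] <= p[k + 1]:
            return p[k]
    return p[-1]


def empirical_level(t_obs, t_perm):
    """Proportion of randomization samples with T at least as small as observed."""
    return sum(t <= t_obs for t in t_perm) / len(t_perm)


# Made-up sequence of p values after 0, 1, 2, 3 removals.
print(select_T([0.12, 0.10, 0.076, 0.09]))
```

In a full analysis, `select_T` would be applied once to the observed data and once to each label-permuted dataset, with the outlier removal repeated inside every permutation so that the selection of the smallest p value is accounted for.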
The technique of finding the smallest p value among several model assumptions and obtaining the (genome-wide) significance level associated with this smallest p value is not new. We previously applied this principle in comparing disease association of sets of SNPs, where each set contains different numbers of SNPs. This has led to our Set Association method, which is more powerful than SNP by SNP analysis [8,9] and has successfully been applied in various studies [10,11,12].
By design, our approach always removes at least one individual. In this sense, it furnishes trimmed results. Trimming is well known in classical statistics as a procedure for eliminating outliers [13,14]. In particular, such methods have been developed for small numbers of outlying observations. Here we apply this principle to case-control association studies.
For a simple power comparison, we assume a total of 1,000 independent SNPs, with the last SNP conferring disease susceptibility. We further assume a total sample size of 200 individuals, of which 10 are outliers. The 190 non-outliers are equally divided into cases and controls, while we consider 3 scenarios for the 10 outliers: (1) 5 cases and 5 controls, (2) 2 cases and 8 controls, and (3) no cases and 10 controls, where the last scenario represents the (perhaps common) situation that controls tend to be chosen from a different population segment than that furnishing cases. For the 999 non-disease SNPs with alleles A and B, allele frequencies P(A) are randomly picked between 0.10 and 0.50 for non-outliers, and between 0.10 and 0.90 for outliers (for details, see online suppl. material; for all online suppl. material, see www.karger.com/doi/10.1159/000320422).
The disease (functional) SNP has alleles D and d, with the former conferring disease susceptibility. Its allele frequency, P(D), is set to 0.30 in non-outliers and is chosen randomly from 0.10 through 0.90 in outliers. Genotype frequencies are given according to Hardy-Weinberg equilibrium. We consider dominant and recessive inheritance, with h denoting the penetrance for non-susceptibility genotypes, while the penetrance for disease-conferring genotypes is given by rh. Disease prevalence is taken to be 1%. Power to detect the disease SNP is computed as a function of the penetrance ratio, r = rh/h, where r = 1 represents the null hypothesis of no genetic effect. The maximum number of outliers to be removed is set at M = 20 (10% of the sample size of 200).
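To make the disease model concrete, the sketch below solves for the two penetrances from the stated prevalence, allele frequency and penetrance ratio. It assumes that the 1% prevalence constraint is applied at P(D) = 0.30 under Hardy-Weinberg proportions, so that prevalence = h(1 − q) + rh·q, where q is the frequency of disease-conferring genotypes; how the constraint is handled for outliers is not stated in the text. The function name `penetrances` is hypothetical.

```python
def penetrances(r, p_D=0.30, prevalence=0.01, mode="dominant"):
    """Return (h, r*h): penetrances of non-susceptibility and susceptibility genotypes.

    Assumes Hardy-Weinberg genotype frequencies and that
    prevalence = h * (1 - q) + r * h * q, with q the frequency of
    disease-conferring genotypes.
    """
    if mode == "dominant":
        q = 1.0 - (1.0 - p_D) ** 2  # genotypes DD or Dd
    elif mode == "recessive":
        q = p_D ** 2  # genotype DD only
    else:
        raise ValueError("mode must be 'dominant' or 'recessive'")
    h = prevalence / (1.0 - q + r * q)
    return h, r * h


for mode in ("dominant", "recessive"):
    h, rh = penetrances(2.0, mode=mode)
    print(mode, round(h, 5), round(rh, 5))
```

At r = 1 both penetrances equal the prevalence, which corresponds to the null hypothesis of no genetic effect.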
We compare the following 4 test procedures, where each is applied to the disease SNP. The remaining SNPs are independent of the disease SNP.
At r = 1, for each of the 4 methods, 5,000 datasets are generated under dominant and recessive inheritance, and critical thresholds for the test statistics are chosen such that the resulting significance level (proportion of significant results) is exactly equal to 0.05. Resulting thresholds are then used to estimate power at penetrance ratios r > 1.
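The calibration step can be sketched as taking an empirical quantile of the null test statistics. This is one plausible reading, with assumptions: the statistic is significant when small (as for a p value), ties among null statistics are ignored, and the function name `critical_value` is hypothetical.

```python
def critical_value(null_stats, alpha=0.05):
    """Threshold c such that a proportion alpha of null statistics is <= c.

    Assumes small values are significant (as for a p value) and that
    there are no ties at the threshold.
    """
    ordered = sorted(null_stats)
    k = max(1, round(alpha * len(ordered)))  # number of null rejections allowed
    return ordered[k - 1]


# Toy null distribution: 100 evenly spaced values standing in for 5,000 datasets.
null_stats = [i / 100 for i in range(1, 101)]
c = critical_value(null_stats)
level = sum(s <= c for s in null_stats) / len(null_stats)
print(c, level)
```

For a statistic that is significant when large, such as χ², the same idea applies with the ordering reversed. Power at r > 1 is then the proportion of simulated datasets whose statistic passes the threshold.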
Power of the different analysis methods was somewhat dependent on model assumptions, but the Logistic-MDS method overall did best, followed by our Outliers method. Table 1 shows results for dominant inheritance and outliers consisting of 2 cases and 8 controls (all results of power simulations are given in online suppl. table S1); these results are fairly typical of the overall picture. Figure 1 shows power figures in graphical form.
We combined results for each value of r and 6 model assumptions (dominant/recessive, 3 splits of cases versus controls in outliers) and computed average power over these 36 conditions (online suppl. table S1). As the last row of table 1 shows, this averaging makes the Logistic-MDS method the winner, closely followed by our Outliers method. This power simulation is rather simple and is mainly designed to demonstrate that our Outliers method is competitive. In particular, only one disease SNP was assumed and any significant result is a true positive. Additional power simulations are provided in the online supplementary material, for example, for a trait influenced by two susceptibility loci and for different population structures. The Pearson-GC method presumably suffers from its potentially severe overprotection against false-positive results. In fact, computing p values from χ² tables for the Pearson-GC method leads to type I errors much smaller than 0.05 (details not shown) but, as mentioned, in our simulations the type I error was held constant for all methods.
To demonstrate our Outliers method, we applied it and the 3 other approaches discussed here to a published dataset on Parkinson disease with approximately 540 case and control individuals and approximately 408,000 SNPs genome wide. To make results comparable and allow for genome-wide correction for multiple testing, p values were estimated in permutation samples. In this analysis, we applied the standard Pearson χ² test without GC correction.
As table 2 shows, the Outliers method furnished the smallest p value of 0.076, which is not formally significant, although nearly so. The smallest nominal p value in the Outliers method was obtained after 3 individuals had been removed as outliers (fig. 2). The significance level associated with this smallest p value is estimated to be 0.076. Without removing outliers, the p value of the largest test statistic (χ²) is equal to 0.120. Thus, the Outliers method resulted in a considerable improvement, although it did not furnish a significant result. If, for argument's sake, we transform p values into χ² with 2 d.f., we find χ² of c1 = 5.15 for p = 0.076, and c2 = 4.24 for p = 0.120. As χ² is proportional to sample size, the ratio, c1/c2 = 1.22, reflects a virtual gain of 22% in sample size obtained by our method. Of course, this argument is artificial since we do not know whether these p values reflect true or false positives.
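The transformation used in this argument can be checked with a few lines of code. For a χ² variable with 2 d.f. the upper-tail probability is p = exp(−x/2), so the statistic corresponding to a given p value is x = −2 ln p. The function name `chi2_2df_from_p` is hypothetical.

```python
import math


def chi2_2df_from_p(p):
    """Chi-square value with 2 d.f. whose upper-tail probability is p."""
    return -2.0 * math.log(p)


c1 = chi2_2df_from_p(0.076)  # with outlier removal
c2 = chi2_2df_from_p(0.120)  # without outlier removal
print(round(c1, 2), round(c2, 2), round(c1 / c2, 2))  # 5.15 4.24 1.22
```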
So-called ‘obvious’ outliers are often removed in an ad-hoc manner, and there may not be good statistical justifications for doing so. In particular, if outliers are removed by trial and error, that is, if they are removed only when this leads to a reduction in p value, then such a procedure clearly tends to increase the false-positive rate of results. Here, we developed a statistically rigorous procedure for removing outliers while maintaining correct type I error.
We carried out additional power simulations under various conditions and also analyzed one more real dataset. All these results may be found in the online supplementary material. These simulations confirm our conclusions based on results shown in table 1; they also show that the Outliers method often does best with recessive modes of inheritance. In addition, at least in the two real datasets analyzed here, for the best SNPs, the Outliers method yields the smallest p values.
As is well known, an alternative to removing outliers is to allow for them in the analysis, which may be done by including principal components as covariates in logistic regression analysis. The two approaches may do equally well in practice, although our power calculations have given the logistic regression approach (with MDS components) a slight advantage.
This work has been supported by NSFC grants from the Chinese government (project numbers 30730057 and 30700442) and by U.S. NIH grants AG026916 and HL084410. This study used data from the SNP Database at the NINDS Human Genetics Resource Center DNA and Cell Line Repository (http://ccr.coriell.org/ninds), as well as clinical data. The original genotyping was performed in the laboratories of Drs. Singleton and Hardy (NIA, LNG), Bethesda, Md., USA.