Home | About | Journals | Submit | Contact Us | Français |

**|**Hum Hered**|**PMC2880722

Formats

Article sections

- Abstract
- Introduction
- Description of Bias
- Simulation Studies Demonstrating Bias
- Ranking Bias vs. Significance Bias
- Discussion
- Conclusions
- References

Authors

Related links

Hum Hered. 2009 March; 67(4): 267–275.

Published online 2009 January 27. doi: 10.1159/000194979

PMCID: PMC2880722

Neal O. Jeffries^{*}

Office of Biostatistics Research, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Md., USA

*Neal O. Jeffries, Office of Biostatistics Research, National Heart, Lung, and Blood Institute, MSC 7913, 6701 Rockledge Drive, National Institutes of Health, Bethesda, MD 20892 (USA), Tel. +1 301 451 2178, Fax +1 301 480 1862, E-Mail vog.hin.iblhn@ffejlaen

Received 2008 April 24; Accepted 2008 August 18.

Copyright © 2009 by S. Karger AG, Basel

This article has been cited by other articles in PMC.

It is widely appreciated that genomewide association studies often yield overestimates of the association of a marker with disease when attention focuses upon the marker showing the strongest relationship. For example, in a case-control setting the largest (in absolute value) estimated odds ratio has been found to typically overstate the association as measured in a second, independent set of data. The most common reason given for this observation is that the choice of the most extreme test statistic is often conditional upon first observing a significant *p* value associated with the marker. A second, less appreciated reason is described here. Under common circumstances it is the multiple testing of many markers and subsequent focus upon those with most extreme test statistics (i.e. highly ranked results) that leads to bias in the estimated effect sizes.

This bias, termed ranking bias, is separate from that arising from conditioning on a significant *p* value and may often be a more important factor in generating bias. An analytic description of this bias, simulations demonstrating its extent, and identification of some factors leading to its exacerbation are presented.

As high dimensional assay technology has become more available for genomic investigations there has been a corresponding increase in methodology to limit false positive findings [1, 2]. Until recently, less attention has been given to the effect sizes of those comparisons that are determined to reach statistical significance. In many instances this may be appropriate as effect sizes may not be of great importance per se; however in other circumstances it may be important to accurately assess the magnitude of the associated effects.

As an example, there has been increased recognition that some form of replication is desirable to support a claim of association found as a single result among a large panel of genetic markers [3]. However, using estimates of effect size derived from an initial study of as many as a million markers can lead to an underpowered evaluation as the initial estimate is likely biased toward more extreme results than are true.

A common rationale given for this bias is that it results from conditioning on the initial estimate meeting *p* value criteria for declaring significance – often referred to as selection or truncation bias. The term ‘significance bias’ will be used here to stress the conditioning upon a significant *p* value. Garner [4] describes this in the context of genomewide association scans; Allison et al. [5] describe it in the context of quantitative trait loci of paired siblings. This conditioning on a significant result has the effect of truncating the distribution of the marker's sampling distribution and creating an associated bias. For example, given a normally distributed variable *X* with mean μ and standard deviation ν, one can show the expected value of *X* conditional upon *X* > *c* (where *c* may be chosen to generate a significant *p* value) is given by

$$E\left[X|X>c\right]=\mu +\frac{{\nu}^{2}\xb7\phi \left(c;\mu ,{\nu}^{2}\right)}{\text{Prob}\left[X>c\right]}$$

(1.1)

where (·) denotes the probability density function of normally distributed variable with mean μ and standard deviation ν. The second term on the right hand side of (1.1) captures the bias from conditioning. This description of bias does not recognize the effect multiple testing may exert on creating bias beyond its role in requiring a *p* value threshold less than 0.05. In this paper multiple testing and focusing on extreme results (i.e. highly ranked results) is shown to generate overestimation in situations where significance bias may have little effect. The terms ‘significance bias’ and ‘ranking bias’ are not in common use – the phrases ‘truncation bias’ and ‘selection bias’ are used more often but are insufficiently specific and fail to draw the distinction made in this paper.

Though a detailed description is given in the next section the idea of ranking bias is briefly described here. Given a number of test statistics, each one is composed of a trend term (related to true underlying effect size) and some random variation. When looking at the test statistic with the largest positive observed effect size it is likely (under many circumstances) that the random variation component is positive as well – because the large, positive observed effect is the sum of a trend and the random component. This random component captures the difference between the true and observed effect size and positive values correspond to a type of overestimation bias – called ranking bias here.

Allison et al. [5] briefly describe ranking bias but focus the work on significance bias. Göring et al. [6] discuss both significance bias and ranking bias. They make an analogy between the ranking bias discussed here with that corresponding to bias in model selection procedures.

Though they discuss ranking bias their modeling and analytical results are derived solely from models of significance bias. Zöllner and Pritchard [7] use the term ‘Winner's Curse’ to describe significance bias in a single marker and present an analytical approach toward correcting this bias. Sun and Bull [8] explicitly describe significance bias and more implicitly describe ranking bias in the context of a mathematical model with a number of non-informative markers and 1 informative marker. In the present work we present more general models with a number of informative markers and the relationship between bias (both significance and ranking) and the pattern of the markers' effect sizes is explicitly explored.

The modeling proposed here is relatively simple both because it is easy to see how to apply this situation to others and because it allows for a transparent partitioning of a statistic into trend and random error terms that facilitates understanding ranking bias. Let *N* denote the common number of cases and controls drawn in population-based study, *G* denote the number of biallelic markers under consideration, *G*_{1} the number of markers with different allele frequencies between cases and controls, and *G*_{0} the number with no true difference between cases and controls, i.e. *G = G*_{1} + *G*_{0}. For simplicity's sake a recessive pattern is assumed to influence the likelihood of disease for the informative markers. For each of the *G* markers the log odds ratio will be computed from an associated 2 × 2 table (table (table11).

For the *j*-th marker let θ_{j} denote the log odds ratio and ${\stackrel{\u02c6}{\theta}}_{j}$ its estimate, log(*ad/bc*). Then

$${\mathcal{Z}}_{j}=\frac{{\stackrel{\u02c6}{\theta}}_{j}-{\theta}_{j}}{{\stackrel{\u02c6}{se}}_{j}}\sim N\left(0,1\right)$$

(2.1)

where

$${\stackrel{\u02c6}{se}}_{j}=\sqrt{1/a+1/c+1/b+1/d}$$

and ~ *N*(0,1) indicates an approximate Gaussian distribution with mean 0 and variance 1. Given the observed log odds and associated test statistic

$${T}_{j}=\frac{{\stackrel{\u02c6}{\theta}}_{j}}{{\stackrel{\u02c6}{se}}_{j}}$$

(2.2)

one can use the relation in (2.1) to obtain *p* values for the hypothesis that θ_{j} = 0 as well as to generate confidence intervals

$${\stackrel{\u02c6}{\theta}}_{j}\pm {\Phi}_{1-\alpha /2}^{-1}\xb7{\stackrel{\u02c6}{se}}_{j}$$

where 1 – α is confidence associated with the interval and ${\Phi}_{1-\alpha /2}^{-1}$ is the associated percentile from a *N*(0,1) distribution.

To investigate how bias arises it is useful to rewrite the relationship between (2.1) and (2.2). First, write

$$\stackrel{\u02c6}{sd}=\sqrt{\frac{1}{{\stackrel{\u02c6}{p}}_{2/\text{Case}}}+\frac{1}{1-{\stackrel{\u02c6}{p}}_{2/\text{Case}}}+\frac{1}{{\stackrel{\u02c6}{p}}_{2/\text{Control}}}+\frac{1}{1-{\stackrel{\u02c6}{p}}_{2/\text{Control}}}}$$

(2.3)

where

$${\stackrel{\u02c6}{p}}_{2/\text{Case}}=a/N\hspace{0.17em}\text{and}\hspace{0.17em}{\stackrel{\u02c6}{p}}_{2/\text{Control}}=b/N$$

(2.4)

denote the empirically observed probability of two minor alleles in the case and control populations, respectively. The term $\stackrel{\u02c6}{sd}$ is an estimate of σ, which is defined by equation (2.3) after replacing the estimated probabilities of having both minor alleles by the true unknown proportions of having both minor alleles. Then we can decompose the test statistic *T*_{j} as

$${T}_{j}=\frac{\sqrt{N{\stackrel{\u02c6}{\theta}}_{j}}}{{\stackrel{\u02c6}{sd}}_{j}}=\frac{\sqrt{N\left({\stackrel{\u02c6}{\theta}}_{j}-\theta j\right)}}{{\stackrel{\u02c6}{sd}}_{j}}+\frac{\sqrt{N{\theta}_{j}}}{{\stackrel{\u02c6}{sd}}_{j}}$$

(2.5)

$$={\mathcal{Z}}_{j}+\frac{\sqrt{N{\theta}_{j}}}{{\stackrel{\u02c6}{sd}}_{j}}\hspace{0.17em}\text{where}\hspace{0.17em}{\mathcal{Z}}_{j}=\frac{\sqrt{N}\left({\stackrel{\u02c6}{\theta}}_{j}-{\theta}_{j}\right)}{{\stackrel{\u02c6}{sd}}_{j}}.$$

(2.6)

The decomposition shows the test statistic *T*_{j} is composed of a trend term,

$$\frac{\sqrt{N}{\theta}_{j}}{{\stackrel{\u02c6}{sd}}_{j}}$$

and a mean 0 random fluctuation term, ${\mathcal{Z}}_{j}$. Note that ${\mathcal{Z}}_{j}$ and *T*_{j} differ in that ${\mathcal{Z}}_{j}$ has an approximately normal distribution centered about 0 while for *T*_{j} this is only true when θ_{j} = 0. The ${\mathcal{Z}}_{j}$ term captures the difference between estimated log odds and true log odds and its distribution is of primary interest in assessing overestimation bias.

For each *j*, ${\mathcal{Z}}_{j}$ has approximately an unconditional *N*(0,1) distribution. A critical question concerns the distribution of ${\mathcal{Z}}_{j}$ when *j* corresponds to most extreme observed *T*_{j} statistic. We will focus upon the case of largest *T*_{j} – the case for smallest, i.e. most negative, statistic is similar and generalizations involving it and other extreme statistics are discussed below.

Let *r*_{1},*r*_{2}, …, *r*_{G} order the *T* test statistics where

$${T}_{j}=\frac{\sqrt{N}{\stackrel{\u02c6}{\theta}}_{j}}{{\stackrel{\u02c6}{sd}}_{j}}\hspace{0.17em}\text{and}\hspace{0.17em}{T}_{{r}_{1}}\le {T}_{{r}_{2}}\le \dots \le {T}_{{r}_{G}}.$$

Typically, the distribution of

$${\mathcal{Z}}_{{r}_{G}}=\frac{\sqrt{N}\left({\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}\right)}{{\stackrel{\u02c6}{sd}}_{{r}_{G}}}$$

(2.7)

is not that of a normal distribution with mean 0 – this will be demonstrated with simulations below. The reason for the bias is that if a number of markers have similar sized trend components

$$\frac{\sqrt{N}{\theta}_{j}}{{\stackrel{\u02c6}{sd}}_{j}}$$

then the random components ${\mathcal{Z}}_{j}$ are likely to play a larger role in determining which marker has the maximal observed value, ${T}_{{r}_{G}}$. Under these conditions of multiple markers with similar trend values the contribution of ${\mathcal{Z}}_{j}$ for that with largest ${T}_{j}$ value is likely to be positive. From the definition of ${\mathcal{Z}}_{{r}_{G}}$ in (2.7) we see a positive value corresponds to overestimation of the underlying log odds ratio. The expected value of the difference between ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$ and ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$ is what is meant by ranking bias, i.e.

$$\text{Ranking Bias}=E\left[{\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}\right].$$

As discussed further below, it is important to note that *r*_{G}, the index of the marker with greatest observed effect size, is random and can change in repeated sampling from the relevant populations.

To illustrate the ranking bias a simple set of simulations are presented. In this case-control study there are *G* = 500 markers and there are *N* = 500 cases and 500 controls. In these simulation studies the markers are generated as independent variables. Populations of cases and controls were generated with 480 of the markers having no differences in allele frequencies. For the remaining 20 markers the odds ratios for diseased relative to healthy were 1.05, 1.10, …, 1.95, and 2.0. The probability of having 2 minor alleles in the control group was randomly chosen for each marker to lie between 0.05 and 0.25 (i.e. the minor allele frequencies varied between 0.224 and 0.5).

Given a simulated sample of cases and controls the following steps were performed:

For each of *G* = 500 markers find

$${T}_{j}=\frac{\sqrt{N}{\stackrel{\u02c6}{\theta}}_{j}}{{\stackrel{\u02c6}{sd}}_{j}},j\u220a\{1,\dots ,G\}.$$

Because the true log odds ratios (θ_{j}) are known, one can calculate a realization of

$${\mathcal{Z}}_{{r}_{G}}=\frac{\sqrt{N}\left({\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}\right)}{{\stackrel{\u02c6}{sd}}_{{r}_{G}}}.$$

After making many such samples (1000 simulations) the resulting distribution of ${\mathcal{Z}}_{{r}_{G}}$ is shown along with the standard normal distribution in figure figure1.1. Of note are the facts that the ${\mathcal{Z}}_{{r}_{G}}$ distribution is not centered about 0 (indicating ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$ typically overestimates ${\theta}_{{r}_{G}}$) and that the distribution appears bimodal. Also, the distribution is relatively narrow compared to the standard normal density – this is perhaps surprising given asymptotic efficiency properties of normal approximations. However, the standard normal distribution is not relevant here as it neglects to take into account that *r*_{G}, the index of the maximal *T* statistic, is random and changes in different samples drawn from the underlying populations. For example, in the 1000 simulations approximately 34% of the time *r*_{G} corresponded to that marker with the odds ratio of 2.0, 32% of the time *r*_{G} was associated with the 1.95 odds ratio, and 8% of the time with that associated with 1.90, and so forth. Because different markers are chosen as the simulated datasets change a mixture distribution arises as suggested by the non-unimodal character of the distribution.

Distribution of ${\mathcal{Z}}_{{r}_{G}}=N\left({\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}\right)/{\stackrel{\u02c6}{sd}}_{{r}_{G}}$ and a standard normal distributions; maximum odds ratio of 2.0.

The observed and true log-odds values associated with the largest test statistic (i.e. ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$ and ${\theta}_{{r}_{G}}$) for each of the 1000 simulations are shown in figure figure2.2. The discrete vertical bands correspond to the 12 distinct values of ${\theta}_{{r}_{G}}$ that were obtained: {0, log(1.45), log(1.50), …, log(2.0) = 0.69}. The degree to which the points lie above the diagonal 45 degree line indicate the amount of bias in each simulation (fig. (fig.22).

Thus far there has been no reliance upon a *p* value threshold as a means for creating the observed bias. Truncation bias related to conditioning on a significant *p* value (i.e. significance bias) is a common rationale given for overestimation bias in marker studies and is an independent contributing factor. The bias demonstrated thus far is not conditioned on the largest *T* statistic first exceeding some threshold, e.g. one that corresponds to a *p* value = 0.025/*G*. Hence it cannot be significance bias. To clarify the difference significance bias is now further discussed.

Recent work involving significance bias focuses upon the effect in a single marker [4, 7]. In our setting we illustrate the idea using the marker with largest true effect size. Let *r*^{0}_{G} designate this marker, i.e.

$$\frac{{\theta}_{{r}_{G}^{0}}}{{\sigma}_{{r}_{G}^{0}}}>\frac{{\theta}_{j}}{{\sigma}_{j}}\hspace{0.17em}\text{for all}\hspace{0.17em}j\ne {r}_{G}^{0},j\u220a\left\{1,\dots ,G\right\}.$$

Because *r*^{0}_{G} is fixed (though unknown) it is the case that for large sample size *N*

$${\mathcal{Z}}_{{r}_{G}^{0}}=\frac{\sqrt{N}\left({\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}-{\theta}_{{r}_{G}^{0}}\right)}{{\stackrel{\u02c6}{sd}}_{{r}_{G}^{0}}}\sim \hspace{0.17em}\text{Normal}\hspace{0.17em}(0,1)$$

(4.1)

and hence $E\left[{\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}\right]={\theta}_{{r}_{G}^{0}}.$ However, a bias is introduced if one conditions on ${T}_{{r}_{G}^{0}}=\sqrt{N}{\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}/{\stackrel{\u02c6}{sd}}_{{r}_{G}^{0}}$ exceeding a thres hold value so that the associated *p* value is sufficiently small. For instance, if ${T}_{{r}_{G}^{0}}>3.90$ then the associated *p* value will be less than 0.025/500 and in our simulations the marker would be judged showing significant association using a Bonferroni criterion. However, using standard derivations concerning conditional expectations of normally distributed random variables one can show that

$$E\left[{\mathcal{Z}}_{{r}_{G}^{0}}|{T}_{{r}_{G}^{0}}>3.90\right]\approx E\left[{\mathcal{Z}}_{{r}_{G}^{0}}>3.90-\sqrt{N}{\theta}_{{r}_{G}^{0}}/{\sigma}_{{r}_{G}^{0}}\right]$$

(4.2)

$$=\frac{\phi \left(3.90-\sqrt{N}{\theta}_{{r}_{G}^{0}}/{\sigma}_{{r}_{G}^{0}}\right)}{1-\Phi \left(3.90-\sqrt{N}{\theta}_{{r}_{G}^{0}}/{\sigma}_{{r}_{G}^{0}}\right)}$$

(4.3)

where and Φ denote the density and cumulative distribution function of a standard normal distribution. Consequently it follows from (4.1) that

$$E\left[{\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}|{T}_{{r}_{G}^{0}}>3.90\right]={\theta}_{{r}_{G}^{0}}+\text{significance bias}$$

(4.4)

$$\approx \frac{{\sigma}_{{r}_{G}^{0}}}{\sqrt{N}}.\left(\frac{\phi \left(3.90-\sqrt{N}{\theta}_{{r}_{G}^{0}}/{\sigma}_{{r}_{G}^{0}}\right)}{1-\Phi \left(3.90-\sqrt{N}{\theta}_{{r}_{G}^{0}}/{\sigma}_{{r}_{G}^{0}}\right)}\right).$$

(4.5)

The approximation arises because $\stackrel{\u02c6}{s}{d}_{{r}_{G}^{0}}$ has been replaced by ${\sigma}_{{r}_{G}^{0}}$ and the asymptotic normal distribution was used. From the result in (4.5) one can show how changes in sample size, odds ratio, allele frequency (affecting the σ term), and *p* value threshold change the significance bias. Such an assessment was performed by Garner [4].

However, in most circumstances *r*^{0}_{G} is not known, i.e. in a study with a large number of markers and little a priori knowledge of which markers have the strongest effects only *r*_{G} is observed and the preceding development based upon *r*^{0}_{G} is not really applicable in practice. What is of practical interest is $E\left[{\stackrel{\u02c6}{\theta}}_{{r}_{G}}|{T}_{{r}_{G}}>3.90\right].$ However analytical evaluation of this expression is more complicated than assessing bias for *r*^{0}_{G} because *r*_{G} is random. Consequently, simulations may be used to explore the following types of overestimation bias
significance bias related to ${\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}:E[{\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}|T{r}_{G}^{0}>3.90]-{\theta}_{{r}_{G}^{0}},$ ranking bias related to ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}:E[{\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}],$ and the combination of ranking bias and significance bias related to ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}:E[{\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}|{T}_{{r}_{G}}>3.90].$

Sets of simulations were performed to examine the relative size of biases across varying conditions. Each design has 500 markers with a smaller number of informative markers (*G*_{1}) ranging between 20 and 1; the results are shown in table table2.2. Scenario 1 corresponds to the simulations generating figure figure1,1, the maximum true odds ratio is 2 (log odds ratio of 0.69) and the maximum effect size is given by θ*r*^{0}_{G}/σ*r*^{0}_{G} = 0.69/3.465 where 3.465 is the associated σ value derived from the underlying allele probabilities (the probability of 2 minor alleles is 0.1685 in the control group) and odds ratio. An estimate of σ is given by $\stackrel{\u02c6}{sd}$ in equation (2.3). From equation (4.5) an analytical estimate of the significance bias applied to this particular marker is 0.072 as shown in the first column of table table22 (all biases in this section are reported on a log odds ratio scale). In the 1000 simulations the observed bias for this marker arising from conditioning on its *T* statistic exceeding 3.90 was 0.075 – in good agreement with the analytic result. The simulations' estimate of ranking bias related to ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$ is about 0.191 in this case – approximately 2.5 times larger than the significance bias of ${\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}$. As an aside it is worth noting that other markers with smaller true effect sizes will have larger significance bias, but smaller likelihood of reaching the threshold. Finally we can get an estimate of the combined effects of ranking bias and significance bias for ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$ by examining just those simulations in which *T*_{rG} > 3.90. In these 1000 simulations *T*_{rG} > 3.90 in 99% of the occasions so the additional significance bias induced by conditioning on this common event is essentially negligible as it raises the bias only from 0.191 to 0.192 (table (table22).

Next we investigate how results change when the number of informative markers change from the design in Scenario 1. Scenario 2 has a similar structure but has 40 informative markers taking values 1.025, 1.05, …, 1.975, and 2.0. Because the sample size, standard deviations, and maximum odds ratio of 2 stay the same the significance bias associated with ${\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}$ remains the same at 0.072. However, the changes in informative effect sizes induce changes in the ranking bias and combined ranking and significance bias associated with ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$. These figures have increased to 0.230 for both bias measures. Conversely, when the of informative markers are reduced to only 5 as in Scenario 3, taking values 1.20, …, 1.80, 2.0, then the two bias measures associated with ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$ are reduced to 0.113 and 0.129 though the significance bias associated with ${\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}$ remains unchanged. These three scenarios show how bias in ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}$ depends upon the distribution of all markers, not just the conditions associated with the single marker having the largest true effect size.

In Scenario 4 the more extreme case of a single informative marker with an odds ratio of 2 is considered. Here the ranking bias of ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}=0.069$ is considerably less because there is a relatively low probability that any other marker will be associated with *r _{G}* – in 92.7% of the simulations the marker with an odds ratio of 2 generated the largest

In Scenario 7 more modest effect sizes are considered. Here there are 10 informative ORs between 1.04 and 1.40. In this case 10,000 simulations were considered as there was low power (7.1%) for any of the 10 markers with informative ORs to exceed the Bonferroni threshold. Because the power is low, one might expect higher significance bias and this is in fact the case. The marker with OR = 1.4 had significance bias of 0.330 – indicating the observed log-odds ratio was typically twice as high as the true effect (log(1.4) = 0.336) when the criterion was met. However, in this case the ranking bias was extensive as well with values of 0.448 (unconditional) and 0.471 (conditional upon a significant statistic). Ranking bias remains high because those conditions that create large significance bias (e.g. a large Bonferroni threshold relative to true effect size) also tend to create large ranking bias (many different markers could potentially be highest ranked, with large bias for any one marker being highest ranked). In general it is difficult to find conditions in which significance bias will appreciably exceed the rank bias – either unconditional or conditional upon the highest ranked test statistic exceeding a threshold. Indeed, as it is usually the case that ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}\ge {\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}$ and ${\theta}_{{r}_{G}}\le {\theta}_{{r}_{G}^{0}}$ one would not expect significance bias, $E\left[{\stackrel{\u02c6}{\theta}}_{{r}_{G}^{0}}-{\theta}_{{r}_{G}^{0}}|{T}_{{r}_{G}^{0}}>3.90\right],$ to exceed the conditional ranking bias, $E\left[{\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}|{T}_{{r}_{G}}>3.90\right].$

Finally, Scenario 8 shows the case when all the informative markers have the same effect size. As described in the Discussion and Appendix sections one should expect high ranking bias and this is in fact the case.

Among the conclusions to be drawn from these comparisons of different types of bias are (1) significance bias associated with a single, fixed marker may not be particularly relevant; (2) ranking bias is distinct from significance bias; (3) ranking bias is likely the greater of the two sources when many variables are considered. Finally, as shown by comparing the first 4 scenarios in table table22 that have many common design elements, the effect sizes of all markers play a role in determining ranking bias – efforts to understand bias associated with highly ranked markers must take this into account.

Appendix A shows that ranking bias (scaled by the marker's standard deviation and sample size) for independent markers is given approximately by

$$E\left[\frac{\sqrt{N}\left({\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}\right)}{{\stackrel{\u02c6}{sd}}_{{r}_{G}}}\right]=E\left[{\mathcal{Z}}_{{r}_{G}}\right]\approx \frac{1}{\sqrt{2\pi}}\sum _{j=1}^{G}E\left|{e}^{\frac{-{M}_{-j}^{2}}{2}}\right|$$

(5.1)

where

$${M}_{-j}=\underset{k\ne j}{max}{T}_{k}-\sqrt{N}\frac{{\theta}_{j}}{{\sigma}_{j}}=\underset{k\ne j}{max}\left({\mathcal{Z}}_{k}+\sqrt{N}\left(\frac{{\theta}_{k}}{{\sigma}_{k}}-\frac{{\theta}_{j}}{{\sigma}_{j}}\right)\right)$$

(5.2)

for *j* in 1, …, *G*. In words, *M*_{–j} represents the largest of all *T _{k}* (besides

As the sample size increases then each of the *M _{–j}* increase toward infinity in absolute value (with probability 1 under common conditions) so that the bias tends to zero as we expect assuming there are some informative markers and one is more informative than the others. On the other hand, bias is considerably worse when the effect sizes, θ

$$E\left[{\mathcal{Z}}_{{r}_{G}}\right]\approx \frac{G}{\sqrt{2\pi}}E\left[{e}^{-\frac{1}{2}{\left(max\left\{{X}_{1},\dots ,{X}_{G-1}\right\}\right)}^{2}}\right]$$

An important related question that underlies equation (5.1) is: given repeated samples of size *N* from the underlying populations, how many different markers could reasonably be selected as *r _{G}*? If one marker has an effect size that is much larger than all the rest then in repeated samples we would expect

The results thus far presented only focus upon the marker with largest test statistic. A natural question is how much bias may be present if attention is paid to all markers that exceed a significance threshold. This corresponds to the common strategy of following-up on all markers that meet a stringent significance threshold. For each simulation under Scenario 1 in table table22 the number of markers with *T* scores > 13.90 was recorded and the difference between the observed δ^{^} and the true underlying θ was recorded. Table Table33 provides information about all highly ranked markers that exceed the threshold. In 99% of the simulations at least one marker met the threshold requirement. The bias associated with the highest ranking marker exceeding the threshold is 0.192 when measured on a log odds scale – this is essentially the same information as reported in the first row of table table2.2. Again it is worth noting that this is not reporting the results for the marker with true log-odds ratio of 2.0, but rather the average bias for the marker with the highest observed *T* statistic – in 34% of the simulations this did correspond to the marker with odds ratio of 2.0, in 32% of occasions the marker with odds ratio of 1.95, etc…. The other rows of table table33 provide new information: in 92.5% of the simulations the second highest ranked marker met the Bonferroni criteria and the bias for these instances was estimated as 0.143 log-odds units. Table Table33 gives information for the 3rd, 4th, 5th, 6th, and 7th mostly highly ranked markers (no more than 7 markers ever met the criteria) and we see the bias conditional upon exceeding the threshold remains considerable. This table indicates that the ranking bias is not just a problem for the most highly ranked marker – it can be present and substantial for all those markers meeting significance criteria.

In this report the focus has been on measuring bias associated with the largest effect size. Ranking bias is potentially present in any study with multiple tests where attention is drawn to outcomes associated with the most extreme observed effect sizes. Here we have seen that focus upon highly ranked results, rather than conditioning on significant *p* values, can be responsible for much of overestimation bias. Some authors have put forth methods to correct for both ranking and significance bias. The bootstrap/cross-validation approach put forth by Sun and Bull [8] can accommodate both types of bias. A similar bootstrap approach by Jeffries [9, 10] has been employed to correct for ranking bias in microarray and diagnostic modeling contexts. Analytic approaches have been put forth by Zöllner and Pritchard [7], Ghosh et al. [11], and Zhong and Prentice [12] however these appear to address significance bias but not ranking bias. The more extensive dependence of rank bias on potentially all effect sizes (as opposed to significance bias that depends only upon one marker's effect size) complicates analytical solutions. It is likely that analytic approaches to biased overestimation from genomic studies that ignore the need to consider all markers' effect sizes may be missing an important aspect of the problem, i.e. the distribution of the most extreme test statistic is a mixture distribution depending, in principle, upon all markers' parameters.

The modeling in this paper was very simple: a single stage study (i.e. no two-stage or higher stage designs) with a recessive genetic model and Wald test approach to evaluating significance. The simple approach allowed for a simple analytic description of the problem – more complicated designs are likely prone to the same types of biases and work remains for examining the role of ranking bias in these circumstances.

Here an effort is made to sketch the degree of bias that may be expected and link this magnitude to some factors such as sample size, distribution of true effect sizes, and the number of markers. Simplifying assumptions will be employed as necessary. Recall that the *j*-th test statistic *T _{j}* may be decomposed into a trend and random fluctuation component as

$${T}_{j}={\mathcal{Z}}_{j}+\sqrt{N}\frac{{\theta}_{j}}{{\stackrel{\u02c6}{sd}}_{j}}$$

where ${\mathcal{Z}}_{j}=\sqrt{N}\left({\stackrel{\u02c6}{\theta}}_{j}-{\theta}_{j}\right)/{\stackrel{\u02c6}{sd}}_{j}$ and ${\mathcal{Z}}_{j}$ is approximated by a standard normal distribution. Then we may calculate the expected value of ${\mathcal{Z}}_{rG}$(and hence obtain a measure of the bias of ${\stackrel{\u02c6}{\theta}}_{{r}_{G}}-{\theta}_{{r}_{G}}$) as

$$E\left[{\mathcal{Z}}_{{r}_{G}}\right]=\sum _{j=1}^{G}E\left[{\mathcal{Z}}_{{r}_{G}}|{r}_{G}=j\right]P\left[{r}_{G}=j\right]$$

(A.1)

and

$$E\left[{\mathcal{Z}}_{{r}_{G}}|{r}_{G}=j\right]P\left[{r}_{G}=j\right]=\left(\int {z}_{j}{f}_{\mathcal{Z}j|{r}_{G}=j}\left({z}_{j}\right)d{z}_{j}\right)P\left[{r}_{G}=j\right]$$

(A.2)

$$=\frac{\int {Z}_{j}{f}_{\mathcal{Z}j}\left({z}_{j},{r}_{G}=j\right)d{z}_{j}}{P\left[{r}_{G}=j\right]}P\left[{r}_{G}=j\right]$$

(A.3)

$$=\int {Z}_{j}{f}_{{\mathcal{Z}}_{j}}\left({z}_{j},{r}_{G}=j\right)d{z}_{j}$$

(A.4)

where ${f}_{{\mathcal{Z}}_{j|{r}_{G}=j}}$ is the conditional distribution of ${\mathcal{Z}}_{j}$ given *r _{G}* =

$$\hspace{0.17em}\text{Now}\hspace{0.17em}{r}_{G}=j\hspace{0.17em}\text{if and only if}\hspace{0.17em}\hspace{0.17em}{T}_{j}>\underset{k\ne j}{max{T}_{k}}$$

(A.5)

$$\text{if and only if}{\mathcal{Z}}_{j}>\underset{k\ne j}{max}\left({\mathcal{z}}_{k}+\sqrt{N}\left(\frac{{\theta}_{k}}{{\stackrel{\u02c6}{sd}}_{k}}-\frac{{\theta}_{j}}{{\stackrel{\u02c6}{sd}}_{j}}\right)\right).$$

(A.6)

Here we have used the decomposition in (2.5) to show how the bias depends on the random fluctuation terms, ${\mathcal{Z}}_{j}$ and ${\mathcal{Z}}_{k}$. To simplify we will approximate the *sd*^{^}_{j} and *sd*^{^}_{k} terms by σ_{j} and σ_{k} (the associated true population values). Then we obtain

$$E\left[{\mathcal{Z}}_{{r}_{G}}\right]=\sum _{j=1}^{G}\int {z}_{j}{f}_{{\mathcal{Z}}_{j}}\left({z}_{j},{r}_{G}=j\right)d{z}_{j}$$

(A.7)

$$\approx \sum _{j=1}^{G}E\left[{\mathcal{Z}}_{j}\xb7{I}_{\left[{\mathcal{Z}}_{j}>{M}_{-j}\right]}\right]$$

(A.8)

where

$${M}_{-j}=\underset{k\ne j}{max}\left({\mathcal{Z}}_{k}+\sqrt{N}\left(\frac{{\theta}_{k}}{{\sigma}_{k}}-\frac{{\theta}_{j}}{{\sigma}_{j}}\right)\right)$$

(A.9)

and

$$={I}_{\left[A\right]}\hspace{0.17em}\text{is a function}=1\hspace{0.17em}\text{if event A is true},0\hspace{0.17em}\text{otherwise}.$$

(A.10)

To proceed further we condition the expectation on *M*_{–j} and recall that ${\mathcal{Z}}_{j}$ is marginally distributed as a standard Gaussian random variable and thus, for independent markers,

$$E\left[{\mathcal{Z}}_{{r}_{G}}\right]\approx \sum _{j=1}^{G}E\left[{\mathcal{Z}}_{j}\xb7{I}_{\left[{\mathcal{Z}}_{j}>{M}_{-j}\right]}\right]$$

(A.11)

$$=\sum _{j=1}^{G}E\left[E\left[{\mathcal{Z}}_{j}\xb7{I}_{\left[{\mathcal{Z}}_{j}>{M}_{-j}\right]}|{M}_{-j}\right]\right]$$

(A.12)

$$=E\left[\sum _{j=1}^{G}{\int}_{{M}_{-j}}^{\infty}zf\mathcal{Z}(z)d\right]$$

(A.13)

*fz* denotes a Gaussian distribution and the expectation in (A.13) is necessary because *M _{–j}* contains random elements ${\mathcal{Z}}_{k}$. The integral may be explicitly rewritten as

$$E\left[{\mathcal{Z}}_{{r}_{G}}\right]\approx \frac{1}{\sqrt{2\pi}}\sum _{j=1}^{G}E\left[{e}^{\frac{-{M}_{-j}^{2}}{2}}\right].$$

(A.14)

From (A.14) one sees that bias is inversely related to the absolute value of the *M _{–j}* terms. Some consequences of this derivation are as follows.

Consider the effect of increasing the sample size holding all else constant. We focus upon the case in which there exists a single true maximum effect size and designate that marker's index by r^{0}_{G}, i.e.

$$\frac{{\theta}_{{r}_{G}^{0}}}{{\sigma}_{{r}_{G}^{0}}}>\frac{{\theta}_{j}}{{\sigma}_{j}}\hspace{0.17em}\text{for all}\hspace{0.17em}j\ne {r}_{G}^{0}.$$

It is worthwhile to examine *M _{–j}* for the case when

$$\underset{N\to \infty}{lim{M}_{-j}}=\underset{N}{lim}\underset{k\ne j}{max}\left({\mathcal{Z}}_{k}+\sqrt{N}\left(\frac{{\theta}_{k}}{{\sigma}_{k}}-\frac{{\theta}_{j}}{{\sigma}_{j}}\right)\right)$$

(A.15)

$$=-\infty \hspace{0.17em}\text{with probability}1\hspace{0.17em}\text{if}\hspace{0.17em}j={r}_{G}^{0}$$

(A.16)

$$=-\infty \hspace{0.17em}\text{if}\hspace{0.17em}j={r}_{G}^{0}.$$

(A.17)

In either case we have that *M*^{2}_{–j} → ∞ so from (A.14) one sees $limN\to \infty E\left[{\mathcal{Z}}_{{r}_{G}}\right]=0.$

The case for expanding the differences among effect sizes is similar – at least for the simplified example below. For a given pattern of effect sizes among the *G* variables (again with no ties for the largest effect size), consider a new pattern of effect sizes given by multiplying each original effect by a constant γ > 0. Then if *r*^{0}_{G} designates the marker with the most positive true effect size

$$\underset{\gamma \to \infty}{lim}{M}_{-j}=\underset{\gamma \to \infty}{lim}\underset{k\ne j}{max}\left({\mathcal{Z}}_{k}+\gamma \sqrt{N}\left(\frac{{\theta}_{k}}{{\sigma}_{k}}-\frac{{\theta}_{j}}{{\sigma}_{j}}\right)\right)$$

(A.18)

$$=-\infty \hspace{0.17em}\text{if}\hspace{0.17em}j={r}_{G}^{0}$$

(A.19)

$$=\infty \hspace{0.17em}\text{if}\hspace{0.17em}j\ne {r}_{G}^{0}.$$

(A.20)

Consequently the same conclusion of no bias follows. If one reverses the limiting action of γ so that γ → 0 from above then

$$\underset{\gamma \downarrow 0}{lim}{M}_{-j}=\underset{k\ne j}{max{\mathcal{Z}}_{k}}$$

(A.21)

where the ${\mathcal{Z}}_{k}$are standard normal statistics and the bias is then positive. This situation corresponds to the situation of no variables showing differential expression.

The case for increasing *G*, the number of variables is less clear cut as it depends upon the combination of effect sizes. Empirically it seems that adding variables with effect sizes at or near the size of the largest preexisting effect sizes exaggerates the bias effects for θ_{rG} In terms of figuring the change of *M _{–j}* terms as above there is more ambiguity as some terms

1. Benjamini Y, Hochberg Y. Controlling the false discovery rate – a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57:289–300.

2. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003;100:9440–9445. [PubMed]

3. Editorial Freely associating. Nat Genet. 1999;22:1–2. [PubMed]

4. Garner C. Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol. 2007;31:288–295. [PubMed]

5. Allison D, Fernandez JR, Heo M, Zhu S, Etzel C, Beasley TM, Amos CI. Bias in estimates of quantitative-trait locus effect in genome scans: demonstration of the phenomenon and a methodof-moments procedure for reducing bias. Am J Hum Genet. 2002;70:575–585. [PubMed]

6. Göring H, Terwilliger JD, Blanger J. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet. 2001;69:1357–1363. [PubMed]

7. Zöllner S, Pritchard JK. Overcoming the winner's curse: estimating penetrance parameters from casecontrol data. Am J Hum Genet. 2007;80:605–615. [PubMed]

8. Sun L, Bull SB. Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol. 2005;28:352–367. [PubMed]

9. Jeffries N. Multiple comparisons distortions of parameter estimates. Biostatistics. 2007;8:500–504. [PubMed]

10. Jeffries N. Performance of a genetic algorithm for mass spectrometry proteomics. BMC Bioinformatics. 2004;5:180. [PMC free article] [PubMed]

11. Ghosh A, Zou F, Wright FA. Estimating odds ratios in genome scans: an approximate conditional likelihood approach. Am J Hum Genet. 2008;82:1064–1074. [PubMed]

12. Zhong H, Prentice RL: Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics 2008, in press. [PMC free article] [PubMed]

Articles from Human Heredity are provided here courtesy of **Karger Publishers**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |