Article sections

- Summary
- 1 Introduction
- 2 Ranking Methods
- 3 Pre-posterior Distributions
- 4 Performance Comparisons
- 5 Trade-offs when Simulations are Needed
- 6 Discussion
- References

Biom J. Author manuscript; available in PMC 2010 December 1.


PMCID: PMC2827664

NIHMSID: NIHMS148730

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore MD

Simulation-based assessment is a popular and frequently necessary approach to evaluation of statistical procedures. Sometimes overlooked is the ability to take advantage of underlying mathematical relations and we focus on this aspect. We show how to take advantage of large-sample theory when conducting a simulation using the analysis of genomic data as a motivating example. The approach uses convergence results to provide an approximation to smaller-sample results, results that are available only by simulation. We consider evaluating and comparing a variety of ranking-based methods for identifying the most highly associated SNPs in a genome-wide association study, derive integral equation representations of the pre-posterior distribution of percentiles produced by three ranking methods, and provide examples comparing performance. These results are of interest in their own right and set the framework for a more extensive set of comparisons.

Hans van Houwelingen’s impact on statistical theory and practice is both broad and deep. His many contributions are detailed elsewhere in this volume and we mention only a few. Hans has made signature methodological contributions to non- and semi-parametric empirical Bayes, to large sample theory in several contexts, to cross-validation and other sample reuse methods. His applied collaborations are extensive and, as with his methods work, his impact is “zeer groot.”

Simulation-based assessment is a popular and frequently necessary approach to evaluation of statistical procedures. Approaches range from the most straightforward, but generally inefficient (prepare a straightforward program and just blast off), to more sophisticated approaches using importance sampling (to enable generation of the appropriate stochastic sequences or to improve efficiency), rejection sampling or Markov chain Monte Carlo (to enable generation of the appropriate stochastic sequences). Carlin and Louis (2009) provide details on these and others. Sometimes overlooked is the ability to take advantage of underlying mathematical relations, and we focus on this aspect. We show how to take advantage of large-sample convergence of empirical distributions to develop mathematical relations that provide an excellent approximation to the results of a large-scale simulation. When components of these relations are available analytically, the approximations can replace simulations. If some components are themselves available only by simulation, then in some cases the mathematical relations can be used to improve efficiency. This use of two types of CPUs (one biologic and one in silico) most definitely represents a strategic approach that Hans would take. We trust that this article, at least to a small degree, captures the spirit of his innovation, creativity and eclectic use of statistical theory and simulation-based evaluations.

Hans has made important contributions to genomics analysis, in particular with regard to statistical methodology for dimension reduction in association studies (Goeman et al., 2006; Heidema et al., 2007; Uh et al., 2007), and efficient statistical methods for linkage and association studies that comprehensively employ all information available (Houwing-Duistermaat et al., 2007; Lebrec et al., 2008). We motivate our approach and provide proof of concept in the setting of Genome-Wide Association studies (GWAs). Specifically, we consider evaluating and comparing performance of ranking-based methods that can be used to identify the most highly associated SNPs in a GWAs.

A single-nucleotide polymorphism (SNP) is a locus in the DNA where appreciable variation in the nucleotide sequence occurs between members of a species. In humans, for example, each person has one pair of each of the autosomes (chromosomes 1–22), with one copy inherited from each parent. While almost all loci have identical nucleotides in each DNA strand for each person, differences between people exist at roughly one in every 500 base-pair positions. Interestingly, the vast majority of all SNPs are bi-allelic, that is, only 2 of the 4 nucleotides occur in the population. For a locus with variation to be defined as a SNP, the frequency with which the minor allele occurs in the population is usually required to be above 1%. Since there are substantial differences in genetic background between ethnic groups or geographically separated populations, the actual minor allele frequencies (MAFs) can differ significantly between sub-populations. Bi-allelic SNPs are typically encoded as nucleotide pairs within a person’s matched chromosome pairs (e.g. (CC, CT, TT)), or, specifically for SNPs typed on genomic arrays, as a generic factor with three levels (e.g. (AA, AB, BB)). In statistical modeling of SNP data, the SNPs are typically encoded as the number of variant alleles (0, 1, or 2), where the variant allele (as opposed to the common allele) is defined as the allele that occurs less frequently in the population. In a “homogeneous” population (and under certain assumptions such as random mating), the frequencies of the nucleotide pairs that define a SNP are governed by Hardy-Weinberg equilibrium, producing prevalences of (*p*^{2}, 2*pq*, *q*^{2}), with *p* the major and *q* = 1 − *p* the minor allele frequency. A typical GWAs assesses association between candidate genotypes and a disorder or a quantitative trait (the phenotype), often via Z-scores testing the hypothesis of no association between the genotype (AA, AB, BB) and the phenotype. These association measures are then ranked and the most highly associated SNPs identified. See Ruczinski et al. (2009) for additional details. Genome-wide association studies have become the method of choice for many researchers around the globe for detecting the genetic causes of complex human disease, as witnessed by the sheer abundance of GWAs papers published in the last couple of years.
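The Hardy-Weinberg genotype prevalences mentioned above can be computed directly from the minor allele frequency; a minimal sketch (the MAF value is illustrative):

```python
def hardy_weinberg(q):
    """Genotype prevalences (AA, AB, BB) under Hardy-Weinberg
    equilibrium, with q the minor allele frequency and p = 1 - q."""
    p = 1.0 - q
    return p * p, 2.0 * p * q, q * q

# e.g. a SNP with MAF 0.2:
aa, ab, bb = hardy_weinberg(0.2)
# prevalences (p^2, 2pq, q^2); they sum to 1
```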

The predominant strategy for analyzing GWAs data is to carry out SNP-specific (marginal) tests such as the Cochran-Armitage trend test, sort the results from the smallest to the largest p-value, and declare significance based on a Bonferroni correction to limit the family-wise error rate at a fixed level (typically, 5%). The benefit of this approach is its straightforward and reproducible implementation, and it has led to the identification of important genes for many diseases (Manolio et al., 2008, provide an overview). However, from both scientific and statistical perspectives, this approach is sub-optimal for various reasons.
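The marginal-testing pipeline just described can be sketched as follows, using the standard Cochran-Armitage trend statistic; the genotype counts and the number of SNPs *K* are illustrative:

```python
from statistics import NormalDist

def trend_test_z(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test Z-score from per-genotype
    (AA, AB, BB) counts of cases and controls."""
    n = [a + b for a, b in zip(cases, controls)]
    N, R = sum(n), sum(cases)
    pbar = R / N                                   # overall case fraction
    T = sum(t * (r - ni * pbar) for t, r, ni in zip(scores, cases, n))
    varT = pbar * (1 - pbar) * (
        sum(t * t * ni for t, ni in zip(scores, n))
        - sum(t * ni for t, ni in zip(scores, n)) ** 2 / N
    )
    return T / varT ** 0.5

def bonferroni_significant(z, K, alpha=0.05):
    """Two-sided p-value compared against the Bonferroni threshold alpha/K."""
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p, p < alpha / K

# an illustrative SNP with a real but modest association:
z = trend_test_z(cases=(10, 15, 15), controls=(40, 15, 5))
p, sig = bonferroni_significant(z, K=500_000)
# with half a million tests, even a strong marginal signal can miss
# the "genome-wide" threshold
```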

Enforcing a tight control on the family-wise error rate, i.e. the probability of declaring at least one SNP significant that is not associated with the phenotype, comes at the expense of the truly associated SNPs which do not achieve “genome wide” significance. It appears that this strategy works in strong opposition to the aims of a GWAs, typically declared to be *hypothesis generating* and *discovery* oriented. This stringent type I error control is particularly worrisome in the investigation of truly complex diseases such as asthma or schizophrenia, since many SNPs, each typically with a very modest effect size, contribute to the disease risk. Unless the sample size is huge, none of the SNPs might reach overall significance after multiple comparisons corrections. From a Bayesian perspective, a Bonferroni adjustment with very small *α* can be justified if one has a strong prior belief that all the nulls are true (Westfall et al., 1997); however, the opposite is the case in GWAs. Moreover, controlling the family-wise error rate via Bonferroni completely ignores type II errors, and employs a uniform threshold for significance that ignores power. Using such a constant threshold in a frequentist setting leads to an inconsistent procedure, and viewed from a decision theory perspective, leads to an inadmissible procedure (Wakefield, 2007, 2008a,b).

Employing the family-wise error rate is certainly reasonable if the objective is the confirmation of a list of candidate SNPs believed to be associated with the phenotype, for example in sequencing or fine mapping of genomic regions. However, for many complex diseases the a priori information about the genetic basis is very limited, and GWAs are usually carried out as truly hypothesis generating. Thus, the objective is to generate a list of SNPs for further investigation of their role in the genetic underpinning of the disease, or of their correlation with truly disease-causing variants via fine mapping. The generation of such a SNP list obviously relies on proper ranking methods, and should take the conditional power of each variant into account. In this manuscript, we focus on such ranking procedures.

We focus on SNP selection based on ranks, with no direct attention to controlling type I error, which usually is the procedure of choice (see for example Manolio et al., 2008). We consider ranks based on hypothesis test Z-scores and those based on Bayes optimal ranks (Lin et al., 2006). Ultimately, we are interested in simulation-based comparison of the performance of these with the current approaches. However, a GWAs can include more than one million candidate SNPs (and the number is growing), so computational demands in a simulation-based approach can be substantial. Fortunately, as we show, when the number of candidate SNPs is large, asymptotic consistency of empirical distributions can be used to speed up the assessment. These results will enable future complete assessments.

The scope of this paper is to lay the foundation and to compare different ranking procedures in a setting where only one of many variables is associated with an outcome of interest. We motivate this problem within the context of SNP association studies, which will be our main future application. Thus, we use a *spiked* SNP to demonstrate the differences in ranking procedures, that is, a single genetic locus with simulated non-zero effect size. In section 2 we outline candidate ranking methods, including Z-score based and Bayes optimal approaches. See Lin et al. (2006) for full details. Sections 2.3 and 3 derive distributions of percentile ranks for a spiked SNP when *K* is large. Section 4 reports on comparisons of candidate ranking methods for mathematically evaluable models. Section 5 presents some cautions and section 6 sums up.

A variety of ranking methods are in use (see Lin et al., 2006). We focus on ranks/percentiles based on hypothesis test Z-scores and two Bayes optimal approaches. We rank (*K* +1) SNPs, where *k* = 1, …, *K* are “null” SNPs (with null or moderate association with the phenotype), and *k* = 0 is the spiked SNP, a SNP with a stochastically larger association than for the null SNPs.

For each SNP, compute a Z-score which is used to test the hypothesis of no association, producing (*Z*_{0}, *Z*_{1}, …, *Z _{K}*). Rank these and divide the ranks by (*K* + 2) to produce percentiles.
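The Z-score ranking step can be sketched as follows (a minimal illustration; ties are ignored):

```python
def zscore_percentiles(z):
    """Percentiles from ranks of (K+1) Z-scores: the smallest score
    gets rank 1, and percentiles are ranks divided by (K + 2)."""
    K = len(z) - 1
    order = sorted(range(len(z)), key=lambda i: z[i])
    ranks = [0] * len(z)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return [r / (K + 2) for r in ranks]

p = zscore_percentiles([0.5, -1.2, 2.1, 0.0])   # K = 3
# the largest Z-score (2.1) receives the top percentile 4/5
```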

As detailed by Carlin and Louis (2009) and many other authors, Bayesian analysis requires a full probability model consisting of a prior distribution for parameters and a data likelihood (sampling distribution), conditional on parameters. Observed data map the prior to the posterior distribution and the posterior is used to make inferences, sometimes guided by a loss function (e.g., decision theory). In applications, especially with a large *K*, empirical Bayes approaches are very attractive (Schwender and Ickstadt, 2008).

Lin et al. (2006) show that Bayesian ranking/percentiling procedures structured by a ranking-relevant loss function can outperform those produced by ranking Z-scores or MLEs. We report on two ranking procedures, one that is “general purpose” and one targeted at identifying a specific number of SNPs. We assume that *θ* is the true magnitude of the genotype/phenotype association and construct a model so that it is also the conditional mean function for observed data. For clarity, we assume that other parameters in the sampling model are known or well-estimated, that the sampling distribution is Gaussian and so the maximum likelihood estimate of *θ* (subsequently denoted by *Y _{k}*) is sufficient for computing the posterior distribution, and that we are interested in detecting only the large, rather than both the large and small, values of *θ*.

For the null SNPs we have, for *k* = 1, …, *K*:

$$[\theta_k, \sigma_k^2] \overset{iid}{\sim} H = H_1(\theta_k)\,H_2(\sigma_k^2)$$

(1)

We will use in the sequel that

$$\text{E}\left(H_{1k}(t \mid Y_k)\right) = H_1(t),$$

(2)

the prior distribution. The expectation is with respect to *f _{k}*(*y _{k}*), the marginal distribution of *Y _{k}*.

In a likelihood-based analysis, the inputs are the *Y _{k}* (= the maximum likelihood estimates of the *θ _{k}*) and their standard errors.

For Bayes optimal ranks use the following representation for the *K* null SNPs in the context of ranking a total of (*K* + 1) SNPs. The percentile for the spiked SNP is computed along with those for the null SNPs. The Bayesian approach uses as input the computation that would be made, if one were to know the *θ _{k}* without error. Specifically, for **θ** = (θ_{0}, θ_{1}, …, θ_{K}),

$$R_k(\boldsymbol{\theta}) = \sum_{\nu=0}^{K} I_{\{\theta_k \ge \theta_\nu\}}; \qquad P_k = R_k/(K+2).$$

The smallest *θ* has rank 1. Inferences depend on the posterior distribution of the *R _{k}*(**θ**) (equivalently, of the *P _{k}*).

First, compute the posterior expected ranks,

$$\overline{R}_k(\mathbf{Y}, \boldsymbol{\sigma}^2) = \text{E}\left[R_k(\boldsymbol{\theta}) \mid \mathbf{Y}, \boldsymbol{\sigma}^2\right] = \sum_{\nu} \text{pr}\left[\theta_k \ge \theta_\nu \mid \mathbf{Y}, \boldsymbol{\sigma}^2\right].$$

Then, rank the *R̄ _{k}* to produce the optimal ranks and the corresponding percentiles (the ranks divided by *K* + 2).
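The posterior expected ranks can be approximated by Monte Carlo for the conjugate Gaussian model used later in the paper; the scheme and all parameter values below are illustrative:

```python
import random

def posterior_expected_ranks(y, mu=0.0, tau2=1.0, sigma2=1.0,
                             draws=2000, seed=7):
    """Monte Carlo estimate of the posterior expected ranks R̄_k for
    θ_k ~ N(mu, tau2), Y_k | θ_k ~ N(θ_k, sigma2) (a sketch)."""
    rng = random.Random(seed)
    B = sigma2 / (sigma2 + tau2)                 # shrinkage factor
    post_mean = [mu + (1 - B) * (yk - mu) for yk in y]
    post_sd = (sigma2 * (1 - B)) ** 0.5
    K1 = len(y)
    rbar = [0.0] * K1
    for _ in range(draws):
        theta = [rng.gauss(m, post_sd) for m in post_mean]
        # rank within this posterior draw: R_k(θ) = Σ_ν I{θ_k ≥ θ_ν}
        for k in range(K1):
            rbar[k] += sum(theta[k] >= theta[v] for v in range(K1)) / draws
    return rbar

rbar = posterior_expected_ranks([2.5, -0.3, 0.1, 1.0])
# the largest Y receives the largest posterior expected rank
```

Ranking the returned `rbar` values then yields the optimal ranks.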

Select 0 < *γ* < 1, and let

$$\mathit{EDF}_K(t \mid \boldsymbol{\theta}) = \frac{1}{K}\sum I_{\{\theta_k \le t\}}, \;\text{the edf of the } \theta_k;$$

$$\overline{\mathit{EDF}}_K(t \mid \mathbf{Y}, \boldsymbol{\sigma}^2) = \text{E}_H\left[\mathit{EDF}_K(t \mid \boldsymbol{\theta}) \mid \mathbf{Y}, \boldsymbol{\sigma}^2\right] = \frac{1}{K}\sum P_H\left[\theta_k \le t \mid \mathbf{Y}, \boldsymbol{\sigma}^2\right]$$

Then, with *t _{γ}* = $\overline{\mathit{EDF}}_K^{-1}(\gamma \mid \mathbf{Y}, \boldsymbol{\sigma}^2)$, rank the posterior probabilities pr(θ_{k} ≥ *t _{γ}* | **Y**, **σ**²) to produce ranks $R_k^{*}(\gamma)$ and percentiles $P_k^{*}(\gamma) = R_k^{*}(\gamma)/(K+2)$. The $P_k^{*}(\gamma)$ order the SNPs according to the posterior probability of θ_{k} being among the top (1 − *γ*) fraction.

We need to map the trend test Z-score numerator and denominator into an estimate of the implicit *θ* (the log odds ratio for risk) and an estimate of its conditional variance. Using the model in section 2.1 and equation (21) in the appendix, we have for *θ* near 0 (omitting the subscript *k*),

$$\text{E}(Z \mid \theta, \mathbf{w}) = m(\theta, \mathbf{w}) \approx \theta\sqrt{n\,\text{Var}(\mathbf{w})\,\widehat{\pi}(1-\widehat{\pi})}, \qquad \text{Var}(Z \mid \theta, \mathbf{w}) \approx 1.$$

(3)

So,

$$\widehat{\theta} = Y = \frac{Z}{\sqrt{n\,\text{Var}(\mathbf{w})\,\widehat{\pi}(1-\widehat{\pi})}}, \qquad \widehat{\sigma}^2 = \widehat{\text{Var}}(\widehat{\theta} \mid \theta) = \widehat{\text{Var}}(Y \mid \theta) = \frac{1}{n\,\text{Var}(\mathbf{w})\,\widehat{\pi}(1-\widehat{\pi})}.$$

(4)

We also need the relation,

$$\text{E}(Z \mid \theta, \sigma) = m(\theta, \sigma) = \frac{\theta}{\sigma}.$$

Therefore, for a given sample size (*n*), the power of a test or the performance of a ranking procedure depends on the MAF and on the fraction of cases in the sample (*π*).
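Equation (4) is a direct mapping from a trend-test Z-score to an effect estimate and its variance; a minimal helper (function and parameter names are hypothetical, and the inputs are illustrative):

```python
import math

def z_to_theta(z, n, var_w, pi_hat):
    """Map a trend-test Z-score to the approximate effect estimate
    theta_hat and its sampling variance, following equation (4)."""
    denom = n * var_w * pi_hat * (1.0 - pi_hat)
    theta_hat = z / math.sqrt(denom)
    sigma2_hat = 1.0 / denom
    return theta_hat, sigma2_hat

# n = 2000 subjects, half cases (pi_hat = 0.5); Var(w) = 0.32 is the
# genotype-score variance of a MAF 0.2 SNP under Hardy-Weinberg
theta_hat, sigma2_hat = z_to_theta(z=3.0, n=2000, var_w=0.32, pi_hat=0.5)
# dividing theta_hat by its standard error recovers the Z-score
```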

The ranking method specific, cumulative distribution of the percentile for the spiked SNP provides a complete description of performance. A fully simulation-based approach to estimating this cdf would proceed by generating data from the assumed scenario for all (*K* + 1) SNPs, computing the percentiles and saving the percentile for the spiked SNP. After a sufficient number of simulation replications, the empirical distribution function (edf) of these percentiles can be used as the estimate, and edfs from competing methods can be compared. For clarity, in the sequel we drop ${\sigma}_{k}^{2}$ from the notation, and so “*Y _{k}*” stands for “(*Y _{k}*, ${\sigma}_{k}^{2}$)”.
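The fully simulation-based approach can be sketched with Z-scores, *N*(0, 1) nulls and a Gaussian spiked SNP; all sizes and parameter values are illustrative:

```python
import random

def spiked_percentile_edf(K=1000, reps=500, theta0=3.0, seed=11):
    """Brute-force simulation: per replication, draw Z-scores for K
    null SNPs (N(0,1)) and one spiked SNP (N(theta0, 1)), and record
    the spiked SNP's percentile rank/(K+2)."""
    rng = random.Random(seed)
    percentiles = []
    for _ in range(reps):
        z0 = rng.gauss(theta0, 1.0)
        rank = 1 + sum(rng.gauss(0.0, 1.0) <= z0 for _ in range(K))
        percentiles.append(rank / (K + 2))
    return sorted(percentiles)

edf = spiked_percentile_edf()
# the sorted percentiles define the estimated cdf of the spiked
# SNP's percentile; with theta0 = 3 it concentrates near 1
```

Each replication costs *O*(*K*) draws, which is the expense the next sections avoid.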

Our goal is to compute pr(*P _{K}*(*Y*_{0}) > *s*), the pre-posterior probability that the spiked SNP’s percentile exceeds *s*. For finite *K*,

$$1-\frac{[s(K+2)]+1}{K+1}\le \text{pr}({P}_{K}({Y}_{0})>s)\le 1-\frac{[s(K+2)]}{K+1}.$$

(5)

The foregoing holds for a fixed *s*. For *s* = *s _{K}* approaching 1 with *K* (so that (1 − *s _{K}*)*K* remains bounded), the bounds in (5) need not pin down the limit, and such extreme percentiles require separate care (see section 5).

In the following sections we approximate performance for large *K* by developing mathematical results for pr(*P*_{∞}(*Y*_{0}) > *s*).

In real association studies, there is usually a fair amount of information available that can be employed in the specification of the prior distribution *H*. While the vast majority of all SNPs will not be associated (in a scientifically meaningful way) with a particular phenotype, the number of truly associated SNPs and the respective effect sizes can differ substantially between various outcomes of interest (see for example the results published by the Wellcome Trust Case Control Consortium 2007 and Manolio et al. 2008). Moreover, a certain degree of controversy exists in the community about two distinct genetic models which can be described as “common disease - common variant” and “common disease - rare variant”. The former, assuming that a few common SNPs explain a substantial proportion of the phenotype variability in a population, is the basis of the GWAs. However, in particular for some psychiatric diseases such as schizophrenia and bipolar disorder, this paradigm has been challenged (McClellan et al., 2007). The “common disease - rare variant” model asserts that for certain heritable traits only one rare variant of large effect explains the phenotype in a particular family, but different rare variants with large effects (which may or may not be in the same gene) are present in other families (Walsh et al., 2008). Thus, according to this view, the disease status of the subjects in a population could largely be explained by a collection of individually rare variants, each with a large effect.

For the null SNPs we assume that the working and actual priors both equal *H*. Therefore, using standard Bayesian computations, the pre-posterior expectation of the posterior distribution is the prior *H* and it can be used as the ensemble posterior distribution for the null SNPs. It is the *K* = ∞ approximation to the finite *K* distribution. To see that this produces a good approximation for all but the most extreme percentiles, consider simulating *K* “null” SNPs (these are not necessarily unassociated with phenotype, but have generally low association) and a small number (we focus on 1) highly associated, “spiked” SNPs. If *K* is very large, for each simulation replication the empirical distribution function (edf) of the *K* simulated values used for ranking will be very close to the theoretical (the generating) distribution because this edf is the average of *iid* realizations with expectation the distribution of interest. So, for a fixed percentile, the average converges almost surely and uniformly to the distribution of interest and for *K* large, the edf is, for all practical purposes, equal to the induced distribution. For example, in the Z-score approach, if all *K* SNPs are truly null, then the edf of the Z-scores is essentially *N*(0, 1) for each replication; more generally the edf is essentially equal to the distribution induced by the sampling model.

The ensemble posterior for the null SNPs is

$$\overline{H}_1(t \mid \mathbf{Y}) = \frac{1}{K}\sum_k H_{1k}(t \mid Y_k).$$

(6)

The summands in (6) are *iid* bounded random variables with mean *H*_{1} (see equation (2)). So, the average converges almost surely and uniformly to *H*_{1}. For very large *K*, *H̄*_{1}(*t* | **Y**) ≈ *H*_{1}(*t*). Note that if *H*_{1} is a point mass at 0, then so is *H̄*_{1}.

The foregoing assumes that *H* (and so *H*_{1}) is known, but if *K* is large, an empirical Bayes or Bayes empirical Bayes approach produces the same result so long as the true *H* is in the allowable space of estimated priors. For empirical Bayes, it is sufficient that the parametric family includes *H* or that a (smoothed) non-parametric approach is used. For Bayes empirical Bayes, it is sufficient that *H* is in the support of the hyper-prior. In this case, for all but the extreme percentiles, the ensemble posterior will be virtually identical to *H*. For example, if *H*_{1} is Gaussian and the assumed prior is also Gaussian with (*μ*, *τ*^{2}) estimated from the data, the result holds as it does for the (smoothed) non-parametric maximum likelihood approach.

By taking advantage of this near-constancy of the edf, the stochastic positioning of the spiked SNP among the “null” SNPs can be related either analytically or with minimal computation to this distribution, thereby efficiently computing the edf for its percentile. Evaluation proceeds by one-time generation of *K* realizations for the null SNPs and then repeated generation of realizations for the spiked SNP(s). Ranks and percentiles are computed by comparing these values to the relevant edf induced by the null SNPs. In some cases, for example when the null SNPs have mean 0 and Z-scores are used, the induced distribution is *N*(0, 1) and no simulation of the null SNPs is needed. Similarly, the spiked SNP may have a mathematically tractable distribution (for example, Gaussian) and a full mathematical treatment is possible. In any case, to approximate performance for a large *K*, at least for percentiles that are not too extreme a one-time simulation (generally generating realizations for substantially more than *K* SNPs) is needed for the null SNPs to produce the comparison distribution and repeated simulation for the spiked. However, see section 5 for a warning about situations wherein there is little to no saving in a simulation. Taking advantage of this near-constancy of the relevant edf for the *K* SNPs produces considerable efficiency gains and insights for the Z-score approach, the Bayesian approach and other candidate approaches to ranking.
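The efficient scheme just described can be sketched as follows: a one-time simulation of the null comparison distribution, then repeated draws of only the spiked SNP, with percentiles read off the stored null edf (all sizes and parameters are illustrative):

```python
import bisect
import random

def fast_percentile_edf(null_draws=200_000, reps=5000, theta0=3.0, seed=13):
    """Large-K shortcut: build the null comparison edf once, then
    repeatedly draw only the spiked SNP's Z-score and locate it."""
    rng = random.Random(seed)
    # one-time generation of the null comparison distribution
    null = sorted(rng.gauss(0.0, 1.0) for _ in range(null_draws))
    out = []
    for _ in range(reps):
        z0 = rng.gauss(theta0, 1.0)
        # percentile of z0 within the stored null edf (approx. Φ(z0))
        out.append(bisect.bisect_right(null, z0) / null_draws)
    return out

pcts = fast_percentile_edf()
```

Each replication now costs *O*(log null_draws) instead of *O*(*K*); when the null distribution is known analytically (here *N*(0, 1)), even the one-time simulation can be skipped.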

For the spiked SNP (*k* = 0), we denote a generic prior by *G* rather than *H*. Further, we use *A* for the actual prior and *W* for the working (assumed) prior. As for the null SNPs, we assume a Gaussian sampling model and denote the sampling variance by
${\sigma}_{a}^{2}$ (it can be a random variable),

$$[Y_0 \mid \theta_0, \sigma_a^2] \sim f(y_0 \mid \theta_0, \sigma_a^2) = N(\theta_0, \sigma_a^2)$$

(7)

with *actual prior A* and marginal distribution *f _{A}*(*y*_{0}):

$$[\theta_0, \sigma_a^2] \sim A, \qquad f_A(y_0) = \int f(y_0 \mid \theta_0, \sigma_a^2)\, dA(\theta_0, \sigma_a^2).$$

The *working prior W* can be different from the actual prior *A*, producing,

$$[\theta_0, \sigma_a^2] \sim W, \qquad f_W(y_0) = \int f(y_0 \mid \theta_0, \sigma_a^2)\, dW(\theta_0, \sigma_a^2).$$

The actual and the working priors (*A* and *W*) may be different, and one or both may equal *H*. We can set *dW* = *dθ*_{0} for a “frequentist” analysis. Setting *A* to a point mass at a single *θ*-value allows assessment of the frequentist performance of the Bayesian analysis that assumes *θ* ~ *W*.

Conditioning on *Y*_{0} generates the working posterior distribution,

$$\theta_0 \mid Y_0 \sim G_W(t \mid Y_0)$$

with pre-posterior expectation, taken with respect to the actual prior *A*,

$$\overline{G}_W(t \mid A) = \text{E}\left(G_W(t \mid Y_0)\right).$$

(8)

If *W* = *A*, *Ḡ _{W}* in equation (8) equals the prior distribution of *θ*_{0} (see equation (2)).

We present in the context of the spiked SNP, but similar relations hold for the null SNPs. A “frequentist” (likelihood-based) analysis results from use of a flat, improper working prior, *dW*(*θ*) = *dθ*_{0}. The posterior is the normalized likelihood (the likelihood is integrable for the Gaussian sampling distribution); that is, $[\theta_0 \mid Y_0, \sigma_w^2] \sim N(Y_0, \sigma_w^2)$. The marginal posterior for *θ*_{0} is produced by mixing over the distribution of ${\sigma}_{w}^{2}$, and the pre-posterior distribution by integrating with respect to the marginal distribution of *Y*_{0}.

We are interested in comparing the performance of the Z-score percentiles, the percentiles based on posterior expected ranks, and the $P_k^{*}$. For each, we want the pre-posterior tail distribution, and to this end we also need the pre-posterior distribution of Z-scores and of Bayesian posterior distributions computed from a working model.

We first consider the situation wherein all “null” SNPs have *θ* ≡ 0, and then the more general case.

All Z-scores are *N*(0, 1), and so for large *K* their edf is *N*(0, 1), as is the induced comparison distribution. In equations (15) and (19), *μ _{h}* = 0 and ${\tau}_{h}^{2} = 1$. For the spiked SNP, set its generating parameters as desired.

Assume that the *θ* for the null SNPs come from *H*. Conditional on (*θ*, *σ*), *Z* ~ *N*(*m*, 1) with *m* = *θ*/*σ*. *H* induces a distribution on *m*, call it *H̃* (recall that the prior delivers both the *θ _{k}* and the ${\sigma}_{k}^{2}$). So,

$$H(u)=\int \mathrm{\Phi}(u-m)d\stackrel{~}{H}(m)$$

(9)

In general, *H* needs to be evaluated numerically, but a few special cases are available.

Case I: If ${\sigma}_{k}^{2} \equiv {\sigma}^{2}$ and *θ _{k}* ~ *iid N*(*μ*, *τ*^{2}), then

$$H(u)=N\left(\frac{\mu}{\sigma},\frac{{\sigma}^{2}+{\tau}^{2}}{{\sigma}^{2}}\right)$$

(10)

Case II: If *σ* is inverse gamma, *H* will be a t-distribution.

Here we want the posterior percentile of the spiked SNP relative to all others, with the percentile defined as the expected proportion of competing SNPs that the spiked SNP beats. With *θ _{k}* denoting a typical null SNP, and using the working posterior distribution of *θ*_{0} given *Y*_{0},

$$\widehat{P}(Y_0) = \text{pr}(\theta_0 > \theta_k \mid Y_0).$$

(11)

With *G _{W}*(*ξ* | *Y*_{0}) the working posterior cdf of *θ*_{0} and *h*(*ξ*) the density of *H*,

$$\text{pr}(\theta_0 > \theta_k \mid Y_0) = 1 - \int G_W(\xi \mid Y_0)\,h(\xi)\,d\xi = \int H(\xi)\,g_W(\xi \mid Y_0)\,d\xi.$$

(12)

For each *Y*_{0}, this is a value in [0, 1] and we want its distribution. In general it won’t have a closed form, but we can simulate *Y*_{0} from *f _{A}*(*y*_{0}) to estimate it. Moreover,

$$\text{pr}(\widehat{P}(Y_0) > s) = 1 - \overline{G}_W(H^{-1}(s) \mid A).$$

(13)

Note that when *W* = *A* = *H* (the spiked SNP is also null), equation (13) equals 1 − *H*(*H*^{−1}(*s*)) = 1 − *s*, the right tail of the *U*(0, 1) distribution.

We need to “locate” the random variable that has distribution *G _{W}*(· | *Y*_{0}) relative to the null distribution *H*.

Equation (13) can be evaluated numerically or, in some situations (e.g., both *Ḡ _{W}* and *H* Gaussian), in “closed” form. In any case, only *H* and *Ḡ _{W}*(· | *A*) are needed.

The expected percentile is,

$$\text{E}(\widehat{P}(Y_0) \mid A) = 1 - \int \overline{G}_W(\xi \mid A)\,h(\xi)\,d\xi$$

(14)

and if *W* = *A* = *H* (spiked is also null)

$$=1-\int H(\xi )h(\xi )d\xi =\frac{1}{2}.$$

For the Gaussian sampling distribution with a variance that does not depend on $k\ (\sigma_k^2 \equiv \sigma_h^2)$, we can obtain closed form when all of (*W*, *A*, *H*) are also Gaussian, but not necessarily the same Gaussian. In this case, all marginal and conditional distributions are Gaussian. If *θ* ~ *N*(*μ*, *τ*^{2}) and [*Y* | *θ*] ~ *N*(*θ*, *σ*^{2}), then [*θ* | *Y*] ~ *N*(*μ* + (1 − *B*)(*Y* − *μ*), (1 − *B*) *σ*^{2}) with *B* = *σ*^{2}/(*σ*^{2} + *τ*^{2}), and the marginal distribution of *Y* is *N*(*μ*, *σ*^{2} + *τ*^{2}) = *N*(*μ*, *σ*^{2}/*B*) = *N*(*μ*, *τ*^{2}/(1 − *B*)).

We use subscripts *w*, *a*, *h* to indicate each distribution and find,

$$\overline{G}_W(\cdot \mid A) = N(\mu_{*}, \tau_{*}^2), \qquad \mu_{*} = \mu_w + (1 - B_w)(\mu_a - \mu_w), \qquad \tau_{*}^2 = (1 - B_w)^2(\sigma_a^2 + \tau_a^2) + (1 - B_w)\sigma_w^2$$

(15)

Generally, ${\sigma}_{w}^{2}={\sigma}_{a}^{2}$, producing

$$\tau_{*}^2 = (1 - B_w)\left[(1 - B_w)(\sigma_a^2 + \tau_a^2) + \sigma_a^2\right].$$

Now, *H*^{−1}(*s*) = *τ _{h}*Φ^{−1}(*s*) + *μ _{h}*, and so

$$\text{pr}(\widehat{P}(Y) \le s) = \overline{G}_W(H^{-1}(s) \mid A) = \Phi\left(\frac{\tau_h \Phi^{-1}(s) + (\mu_h - \mu_{*})}{\tau_{*}}\right)$$

(16)
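As a numerical sanity check of equation (13) in the all-Gaussian case, draw the “located” random variable with distribution *Ḡ _{W}*(· | *A*) (first *Y*_{0} from its marginal under *A*, then *θ** from the working posterior); with *W* = *A* = *H*, *H*(*θ**) should be *U*(0, 1), so the percentile distribution is uniform. A sketch with illustrative parameters:

```python
import random
from statistics import NormalDist

def located_percentile_draws(n=50_000, mu=0.0, tau2=1.0, sigma2=1.0, seed=17):
    """Draws of H(θ*) where θ* has distribution Ḡ_W(· | A), in the
    all-Gaussian case with W = A = H; these should be U(0, 1)."""
    rng = random.Random(seed)
    B = sigma2 / (sigma2 + tau2)
    H = NormalDist(mu, tau2 ** 0.5)
    draws = []
    for _ in range(n):
        y0 = rng.gauss(mu, (sigma2 + tau2) ** 0.5)   # Y_0 ~ marginal under A
        m = mu + (1 - B) * (y0 - mu)                 # working posterior mean
        theta_star = rng.gauss(m, (sigma2 * (1 - B)) ** 0.5)
        draws.append(H.cdf(theta_star))
    return draws

u = located_percentile_draws()
# approximately uniform: pr(percentile ≤ s) ≈ s
```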

For a frequentist evaluation of the (*W*, *H*) combination, set *τ _{a}* = 0 (so that the actual prior *A* is a point mass at *μ _{a}*).

Let *t _{γ}* = *H*^{−1}(*γ*), the *γ*-quantile of the null distribution (the large-*K* limit of $\overline{\mathit{EDF}}_K^{-1}(\gamma)$). Then

$$P^{*}(Y_0) = S\left(G_W(t_\gamma \mid Y_0) \mid t_\gamma\right),$$

(17)

where *S*(*u* | *t*) = pr(*H*(*t* | *Y _{k}*) > *u*), the pre-posterior survival function of the null SNPs’ posterior cdf values.

To compute *P*^{*}(*Y*_{0}), we need pr(*θ*_{0} > *t _{γ}* | *Y*_{0}) = 1 − *G _{W}*(*t _{γ}* | *Y*_{0}) for the spiked SNP and the analogous quantities for the null SNPs. For *K* = ∞,

$$P^{*}(Y_0) = \text{pr}_A\left\{H(t_\gamma \mid Y_k) \ge G_W(t_\gamma \mid Y_0)\right\},$$

(18)

where the *H*(*t _{γ}* | *Y _{k}*) are *iid* with survival function *S*(· | *t _{γ}*).

We have that pr(*H*(*t* | *Y _{k}*) > *u*) = *S*(*u* | *t*), and that *S*(*H*(*t* | *Y _{k}*) | *t*) ~ *U*(0, 1). Then, conditional on *Y*_{0},

$$\text{pr}_A\{H(t \mid Y_k) \ge G_W(t \mid Y_0)\} = \text{pr}_A\{S(H(t \mid Y_k) \mid t) \le S(G_W(t \mid Y_0) \mid t)\} = \text{pr}_A\{U(0,1) \le S(G_W(t \mid Y_0) \mid t)\} = S(G_W(t \mid Y_0) \mid t).$$

Equation 17 follows directly.

Note that if *W* = *A* = *H*, then *S*(*G _{H}*(*t* | *Y*_{0}) | *t*) ~ *U*(0, 1), so the percentile *P*^{*}(*Y*_{0}) is uniformly distributed, as it must be when the spiked SNP is exchangeable with the nulls.

We use the notation as in section 3.2 and first find *S*. Under *H* (dropping the subscript *h* until the end) and letting *Z _{u}* = Φ^{−1}(*u*),

$$[\theta_0 \mid Y_0] \sim N\left(B\mu + (1-B)Y_0,\; (1-B)\sigma^2\right), \qquad H(t \mid Y_0) = \Phi\left(\frac{t - (B\mu + (1-B)Y_0)}{\sigma\sqrt{1-B}}\right)$$

For this to be > *u*, we need

$${Y}_{0}<\frac{t-B\mu}{1-B}-\frac{\sigma {Z}_{u}}{\sqrt{1-B}}.$$

But,

$${Y}_{0}~N(\mu ,{\sigma}^{2}/B).$$

So,

$$S(u \mid t) = \Phi\left(\left\{\frac{t - \mu_h}{1 - B_h} - \frac{\sigma_h Z_u}{\sqrt{1 - B_h}}\right\}\frac{\sqrt{B_h}}{\sigma_h}\right).$$

(19)

Use (15) and (19) to evaluate (17).

In this all-Gaussian situation, if *W* = *H* and ${\sigma}_{w}^{2} = {\sigma}_{h}^{2}$, then the (apparently) optimal *P*^{*}(*Y*_{0}) percentiles (those based on the working model) do not depend on *γ* and are equal to the *P̂*(*Y*_{0}).

To compute (17) we need to find *S* and then sample *Y*_{0} from *f _{A}*(*y*_{0}), or use the closed forms when all distributions are Gaussian.

We provide performance evaluations of the Z-score approach for detecting a spiked SNP, comparisons with fully Bayesian ranking, and assessments of the frequentist performance of the Bayesian procedures.

We compute the performance for Z-score based percentiles when *θ _{k}* ≡ 0, for various values of the spiked SNP’s generating parameters.

For these comparisons, generically *θ* ~*N*(*μ*, *τ*^{2}), [*Y* | *θ*] ~ *N*(*θ*, *σ*^{2}) with the appropriate subscripts on (*μ*, *τ*^{2}, *σ*^{2}) to denote null SNPs or the spiked SNP. For the null SNPs, *μ* = 0 and we study performance as a function of *μ* for the spiked SNP.

The generic Z-score is *Z* = *Y*/*σ* and so the marginal distribution is *N*(*μ*/*σ*, 1/*B*) (recall that *τ*^{2} = *σ*^{2}(1 − *B*)/*B*). For ranking, this distribution is equivalent to using $N(\mu, \sigma^2/B)$ in comparisons. For the null SNPs, *μ* = 0.

Using the fully Bayes approach, the marginal posterior is the prior and so is $N(\mu ,{\tau}^{2})=N(\mu ,{\sigma}^{2}{\scriptstyle \frac{1-B}{B}})$.

The pre-posterior distributions for the Z-score and fully Bayesian approaches are Gaussian with the same mean. Therefore, comparison of the two sets of percentiles depends only on the variances; the Bayesian variance is (1 − *B*) times that of the Z-score approach.

We investigate performance of the foregoing working analyses for fixed values of *θ*. We assume that *θ*_{1} = *θ*_{2} = … = *θ _{K}* = 0 and study performance for a fixed value *θ*_{0} = *θ* for the spiked SNP.

*Z* = *Y*/*σ* and so the marginal distribution is *N*(*θ*/*σ*, 1). This distribution is equivalent to using *N*(*θ*, *σ*^{2}) in comparisons. For the null SNPs, *θ* = 0.

Using the fully Bayes approach as the working model, the pre-posterior for a fixed *θ* is,

$$N[\theta +B(\mu -\theta ),{\sigma}^{2}(1-B)(2-B)].$$

For the null SNPs, *θ* = *μ* = 0 and the mean is 0.

The variance for the Bayesian approach is (1 − *B*)(2 − *B*) times that for the Z-score approach. However, the mean for the Bayesian approach is biased away from *θ* by *B*(*μ* − *θ*). Therefore, the “frequentist performance” (performance for specific *θ*-values) will be better than the Z-score approach for *θ* near *μ* and less good for *θ* far from *μ*. Of course, “on average” over the prior for *θ*, Bayes performance dominates.

Here, the working model for the Bayesian analysis is flat-prior, producing [*θ* | *Y*] = *N*(*Y*, *σ*^{2}). Conditioning on *θ* and mixing on *Y* produces *N*(*θ*, 2*σ*^{2}) and is inferior to the Z-score approach which produces *N*(*θ*, *σ*^{2}). Mixing on both *Y* and *θ* produces
$N(\mu ,2{\sigma}^{2}+{\tau}^{2})=N[\mu ,{\scriptstyle \frac{{\sigma}^{2}}{B}}(1+B)]$, with a variance (1+*B*) times that for the Z-score approach. Thus, frequentist/Bayes is never better than the Z-score approach, providing another example where use of an incorrect working model fails to achieve the Bayes advantage. We illustrate this point in Figure 3.
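The flat-prior variance inflation can likewise be checked numerically (a sketch with illustrative values, not from the paper): mixing over both *θ* and *Y*, a draw from the flat-prior working posterior *N*(*Y*, *σ*^{2}) has marginal variance 2*σ*^{2} + *τ*^{2} = (*σ*^{2}/*B*)(1 + *B*), versus *σ*^{2}/*B* for the Z-score approach.

```python
import numpy as np

# Sketch checking the flat-prior working model: a draw from
# [theta | Y] = N(Y, sigma^2), mixed over theta ~ N(mu, tau^2) and
# Y | theta ~ N(theta, sigma^2), has marginal variance 2 sigma^2 + tau^2.
rng = np.random.default_rng(2)
mu, sigma2, B = 0.0, 1.0, 0.4
tau2 = sigma2 * (1 - B) / B   # 1.5

theta = rng.normal(mu, np.sqrt(tau2), size=500_000)
Y = rng.normal(theta, np.sqrt(sigma2))
flat_draw = rng.normal(Y, np.sqrt(sigma2))

print(flat_draw.var())          # close to 2 * sigma2 + tau2 = 3.5
print((sigma2 / B) * (1 + B))   # 3.5, the same quantity
```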

The *K* = ∞ mathematical representations provide an excellent approximation to performance for large *K* so long as *s* is not too close to 1. However, simulation will still be needed if, for example, *H* or other distributions are not mathematically evaluable, and the number of draws (*R*) from the distribution being estimated needs to be specified. If one wants a good approximation for an extreme percentile, *R* may need to be sufficiently large that there is only a modest savings in computation relative to a standard simulation. As is the case in most statistical contexts, “there is no free lunch.” However, as shown below, there may be a reduced-price lunch.

This need for a large *R* is most apparent if one wants to estimate the far right-hand tail of the distribution of the percentile for the spiked SNP. Accuracy for this requires an accurate estimate of the right-hand tail of the relevant distribution for the null SNPs. To see this, consider estimating an extreme percentile of a U(0,1) distribution. This case is sufficient for the ranking context because the probability integral transform for the null SNPs can be used to produce the distribution. In a simulation of size *R*, the variance of the empirical *s*^{th} percentile is approximately *s*(1 − *s*)/*R*.
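A small simulation (illustrative, not from the paper) shows that the variance of the empirical *s*^{th} percentile of a U(0,1) sample of size *R* is approximately *s*(1 − *s*)/*R*, illustrating why extreme percentiles require large *R*:

```python
import numpy as np

# Sketch of the approximation Var(empirical s-th percentile) ~ s(1 - s)/R
# for U(0,1) samples: repeat the simulation many times and compare the
# observed variance of the empirical percentile to the formula.
rng = np.random.default_rng(3)
s, R, reps = 0.99, 2_000, 2_000

qs = np.quantile(rng.uniform(size=(reps, R)), s, axis=1)
print(qs.var())          # close to s * (1 - s) / R
print(s * (1 - s) / R)   # 4.95e-06
```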

Simulation is a very effective tool for assessing the properties of statistical procedures. However, in many situations use of large-sample statistical properties can facilitate evaluations. In this context, we have shown how to use large-sample convergence of empirical distribution functions to speed up assessments of ranking procedures in some large sample size situations. This approach facilitates rapid evaluation and comparison of approaches for a broad array of scenarios and we provide a few examples as a proof of concept. More extensive comparisons and inclusion of the *P*^{*} approach will build on the current results. In structuring the assessments, we have developed integral representations of the stochastic distribution for the percentile rank of a single “spiked” SNP in competition with a large number of “null” SNPs. In some settings the integral representations will require some numerical evaluation, but these are one-time operations. Generalizations include developing representations for the maximum percentile of more than one spiked SNP.

The large *K* context is ideal for the use of flexible, empirical priors including the non-parametric maximum likelihood (another area in which Hans has worked) or a smooth version of it. Use of such priors should be very efficient and robust and our integral representations will facilitate rapid assessments.

Though an attractive approach, the approximations are not a panacea. For extreme percentiles they provide a lower bound for the tail of the distribution of the spiked SNP and this bound can be very close to zero. However, as shown in section 5, if simulations are needed to evaluate components of the mathematical expressions, they may not need to be very large. Hans, for your many contributions to statistics and to science, for your educational and mentoring success, and for your continued contributions, we close with “dank u zeer.”

Support provided by grants R01 DK061662 from the U.S. NIH, National Institute of Diabetes and Digestive and Kidney Diseases, R01 HL090577 from the U.S. NIH, National Heart, Lung, and Blood Institute, and a CTSA grant to the Johns Hopkins Medical Institutions.

We outline the score test approach from Ruczinski et al. (2009) that accommodates uncertain (“fuzzy”) genotype calls. We consider only the dominant model and so use *g* ∈ {0, 1} to indicate the true genotype and in the trend test use scores *d _{g}* = *g*. Define:

*X _{i}* = the disease status indicator for the *i*^{th} participant.

*g _{i}* = the true genotype.

*G _{i}* = the random variable taking on values in {0, 1}, the genotype call for the *i*^{th} participant.

*w _{i}* = pr(*g _{i}* = 1), the working probability of the *i*^{th} genotype given the calling data.

**w** = the *n*-dimensional vector of the *w _{i}*.

To test the hypothesis of no genotype/phenotype association, *H*_{0}: *π*_{0} = *π*_{1} = *π* or, equivalently, *H*_{0}: *θ* = 0, use the statistic:

$$Z(\mathbf{w})=\frac{{\sum}_{i}{w}_{i}({X}_{i}-\widehat{\pi})}{\sqrt{n\text{Var}(\mathbf{w})\widehat{\pi}(1-\widehat{\pi})}}$$

(20)

with
$\widehat{\pi}=\overline{X}={\scriptstyle \frac{1}{n}}{\sum}_{i}{X}_{i}$ and *n*Var(**w**) = Σ* _{i}*(*w _{i}* − *w̄*)^{2}.
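As a concrete sketch, the statistic in equation (20) can be computed directly; the data below are simulated and purely illustrative (not from the paper):

```python
import numpy as np

# Sketch of the score statistic in equation (20): X is the 0/1 disease
# indicator and w the working genotype probabilities from fuzzy calls.
def z_score(X, w):
    X, w = np.asarray(X, float), np.asarray(w, float)
    pi_hat = X.mean()
    n_var_w = np.sum((w - w.mean()) ** 2)   # n Var(w)
    return np.sum(w * (X - pi_hat)) / np.sqrt(n_var_w * pi_hat * (1 - pi_hat))

rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=200)            # illustrative disease status
w = rng.uniform(size=200)                   # illustrative working probabilities
z = z_score(X, w)
print(z)   # under H0 (X independent of w), approximately N(0, 1)
```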

Note that under *H*_{0}, *Z*(**w**) ~ *N*(0, 1), showing that the null distribution is correct irrespective of the working probabilities **w**. Using a local alternative (*θ* = *θ*_{0} close to 0), and a logistic form for *π* we obtain:

$$Z \sim N(m({\theta}_{0},\mathbf{w}),1)$$

with

$$m({\theta}_{0},\mathbf{w})={\theta}_{0}\sqrt{n\text{Var}(\mathbf{w})\widehat{\pi}(1-\widehat{\pi})}.$$

(21)

Squaring the Z-statistic produces a one degree of freedom, chi-square statistic with non-centrality *λ* = *m*^{2}(*θ*_{0}*,***w**)/2.

**No conflict of interest:** The authors have declared no conflict of interest.

- Carlin BP, Louis TA. Bayesian Methods for Data Analysis. 3rd ed. Chapman and Hall/CRC Press; Boca Raton, FL: 2009.
- Goeman JJ, van de Geer SA, van Houwelingen HC. Testing against a high dimensional alternative. Journal of the Royal Statistical Society: Series B. 2006;68:477–493.
- Heidema AG, Feskens EJM, Doevendans PAFM, Ruven HJT, van Houwelingen HC, Mariman ECM, Boer JMA. Analysis of multiple SNPs in genetic association studies: comparison of three multi-locus methods to prioritize and select SNPs. Genetic Epidemiology. 2007;31:910–921.
- Houwing-Duistermaat JJ, Uh HW, van Houwelingen HC. A new score statistic to test for association given linkage in affected sibling pair-control designs. BMC Proceedings. 2007;1(Suppl 1):S39.
- Lebrec JJP, Putter H, Houwing-Duistermaat JJ, van Houwelingen HC. Influence of genotyping error in linkage mapping for complex traits – an analytic study. BMC Genetics. 2008;9:57.
- Lin R, Louis TA, Paddock SM, Ridgeway G. Loss function based ranking in two-stage, hierarchical models. Bayesian Analysis. 2006;1:915–946.
- Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. Journal of Clinical Investigation. 2008;118:1590–1605.
- McClellan JM, Susser E, King MC. Schizophrenia: a common disease caused by multiple rare alleles. British Journal of Psychiatry. 2007;190:194–199.
- Ruczinski I, Li Q, Carvalho B, Fallin M, Irizarry RA, Louis TA. Association tests that accommodate genotyping errors. Johns Hopkins University, Dept. of Biostatistics Working Papers; 2009. Working Paper 181.
- Schwender H, Ickstadt K. Empirical Bayes analysis of single nucleotide polymorphisms. BMC Bioinformatics. 2008;9:144.
- Uh HW, Mertens BJ, van der Wijk HJ, Putter H, van Houwelingen HC, Houwing-Duistermaat JJ. Model selection based on logistic regression in a highly correlated candidate gene region. BMC Proceedings. 2007;1(Suppl 1):S114.
- Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. American Journal of Human Genetics. 2007;81:208–227.
- Wakefield J. Bayes factors for genome-wide association studies: comparison with P-values. Genetic Epidemiology. 2008a;33:79–86.
- Wakefield J. Reporting and interpretation in genome-wide association studies. International Journal of Epidemiology. 2008b;37:641–653.
- Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, Stray SM, Rippey CF, Roccanova P, Makarov V, Lakshmi B, Findling RL, Sikich L, Stromberg T, Merriman B, Gogtay N, Butler P, Eckstrand K, Noory L, Gochman P, Long R, Chen Z, Davis S, Baker C, Eichler EE, Meltzer PS, Nelson SF, Singleton AB, Lee MK, Rapoport JL, King MC, Sebat J. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008;320:539–543.
- Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678.
- Westfall PH, Johnson WO, Utts JM. A Bayesian perspective on the Bonferroni adjustment. Biometrika. 1997;84:419–427.
