Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC4474746

Formats

Article sections

- Abstract
- Introduction
- Materials and Methods
- Simulations
- Results
- Application to the top 25 SNPs of MRD in ALL
- Application to the Mini-Exome Data of Genetic Analysis Workshop 17
- Discussion
- Supplementary Material
- References

Authors

Related links

Ann Hum Genet. Author manuscript; available in PMC 2016 July 1.

Published in final edited form as:

PMCID: PMC4474746

NIHMSID: NIHMS682347

Wenjian Bi,^{#}^{1} Guolian Kang,^{#}^{2,}^{#} Yanlong Zhao,^{1} Yuehua Cui,^{3} Song Yan,^{4} Yun Li,^{4,}^{5} Cheng Cheng,^{2} Stanley B. Pounds,^{2} Michael J. Borowitz,^{6} Mary V. Relling,^{7} Jun J. Yang,^{7} Zhifa Liu,^{2} Ching-Hon Pui,^{8,}^{11} Stephen P. Hunger,^{9} Christine M Hartford,^{10} Wing Leung,^{10,}^{11} and Ji-Feng Zhang^{1,}^{#}

The publisher's final edited version of this article is available free at Ann Hum Genet

For genetic association studies that involve an ordered categorical phenotype, we usually either regroup multiple categories of the phenotype into two categories (“cases” and “controls”) and then apply the standard logistic regression **(LG)**, or apply ordered logistic (**oLG**) or ordered probit (oPRB) regression which accounts for the ordinal nature of the phenotype. However, these approaches may lose statistical power or may not control type I error rate due to their model assumption and/or instable parameter estimation algorithm when the genetic variant is rare or sample size is limited. Here to solve this problem, we propose a set-valued (**SV**) system model, which assumes that an underlying continuous phenotype follows a normal distribution, to identify genetic variants associated with an ordinal categorical phenotype. We couple this model with a set-valued system identification algorithm to identify all the key system parameters. Simulations and two real data analyses show that **SV** and **LG** accurately controlled the Type I error rate even at a significance level of 10^{−6} but not **oLG** and **oPRB** in some cases. **LG** had significantly smaller power than the other three methods due to disregarding of the ordinal nature of the phenotype, and **SV** had similar or greater power than **oLG** and **oPRB**. For instance, in a simulation with data generated from an additive **SV** model with odds ratio of 7.4 for a phenotype with three categories, a single nucleotide polymorphism with minor allele frequency of 0.75% and sample size of 999 (333 per category), the power of **SV, oLG and LG** models were 70%, 40% and <1%, respectively, at a significance level of 10^{−6}. Thus, **SV** should be employed in genetic association studies for ordered categorical phenotype.

Genome-wide association studies (GWAS) have successfully identified many genetic variants that are associated with complex diseases over the past decades (Sladek et al., 2007; Welter et al., 2014). Many phenotypes studied in GWAS are either binary or continuous. The logistic regression (**LG**) and linear regression models are widely used to analyze the binary and continuous phenotype while adjusting for the effects of confounding covariates such as ancestry, age and sex. In cancer GWAS, considerable portion of phenotypes are survival (Innocentiet al., 2012) or relapse (Yang et al., 2012). The Cox proportional hazard regression model(Cox, 1972) and the Fine and Gray hazard rate regression(Fine and Gray, 1999) are the standard methods to analyze survival and relapse outcomes with adjusting for some confounding factors such as ancestry scores, treatment arms, clinical risk or prognostic factors, respectively.

In cancer pharmacogenetics/pharmacogenomics, we are interested in detecting genetic variations influencing drug toxicity or efficacy. The key phenotype referred to as the outcome could be multiple ordinal categories such as dosing of drugs, adverse events scored on scales using ordinal values (1-5) according to Common Terminology Criteria for Adverse Events developed by the US National Cancer Institute (Ingle et al., 2010), and effect of treatment on disease such as tumor response in which the metrics of tumor size is categorized as complete response, partial response, stable disease or progressive disease (Wheeler et al., 2013). Furthermore, some ordered phenotype may be defined by splitting a measured continuous variable such as four categories of underweight, normal weight, overweight and obese, based on body index mass, but most of them may be generated due to complicated unobservable or unobserved continuous variables such as the expression level of RNAs or proteins involved in an unknown biological process or stimulated by external environments.

For these ordered phenotypes, researchers often regroup multiple categories into two categories of “cases” and “controls” and then apply the standard **LG** model (Treviño et al., 2009; Ingle et al., 2010). However, this method may lose substantial power in that re-categorizing the phenotype does not take the ordinal nature of the phenotype into consideration (see Simulation Results section below). The non-parametric method of Spearman rank correlation (Yang et al., 2009) or the Jonckheere**–**Terpstra tests (Han et al., 2013) which accounts for the ordinal nature of the phenotype can be an attractive method. However, these methods cannot adjust for confounding factors. The parametric method of ordered/ordinal logistic regression (**oLG**) model (Png et al., 2011) borrows the basic idea of standard **LG** regression model to avoid these pitfalls. As the most popular model, generalized linear models (GLM), logistic approaches adopt link function of logit form, which brings many advantages. For example, the first derivative and the second derivative of the corresponding log-likelihood function are easy to compute, and the estimated parameter can explain the odds ratio directly. Nevertheless, we still think the logistic approach sometimes is overused. Above all, fitting the response data with the logit link function cannot be justified in many practical applications. The doubt has been confirmed in the case of binary outcome for which probit method has better performance than LG method under non-asymptotic situations (low MAF and small sample size) (Kang et al., 2014). All these two methods will lose statistical power or cannot maintain the type I error rate if the marker is rare and sample size is small due to their model assumptions and/or unstable parameter estimation algorithm. Another parametric method of the ordered probit regression method can also be used but like**oLG**, its performance is problematic when the sample size is small and the number of categories is large.

As for traditional system identification, the system input and continuous system output are usually assumed accessible/known. But in some cases, we can only know which set the system output lies in but not the exact continuous output information, which is called set-valued information (Kang et al., 2014). To model the relationship between system input and system output mathematically, a quantization process is adopted to generate the set-valued system from the corresponding continuous latent or unknown variable. Set-valued system identification (**SVSI**) was first investigated for sensor systems (Wang, et al., 2003). In contrast to the traditional system identification method, **SVSI** can estimate the model parameters by set-valued information rather than precise output information. It is technically more challenging, but appears in a wide range of applications such as sensor networks and telecommunications (Nair, et al., 2007). Many more motivating examples can be found in Wang et al. (2010). Finite impulse response model is a class of typical linear system model and can be used to approximate many actual physical systems. As an important research branch of **SVSI**, the identification of finite impulse response model with set-valued data attracts the attention of many researchers and some related results have been obtained (Godoy, et al., 2011; Chen, et al., 2012; Bi, et al., 2014).

In this study, we propose a specific set-valued (**SV**) system model, which can be considered as a finite impulse response system with set-valued output. The model considers the categorizing process of continuous phenotypes to model the relationship between the ordered outcome and possible genetic or non-genetic explanatory factors in GWAS or next-generation sequencing (NGS) studies. We estimate the parameter of interest by a **SVSI** approach and use a Wald test statistic for testing the null hypothesis of no association between genetic variant and ordinal phenotype. We perform extensive simulation studies to compare the type I error rate, the power and the computational cost of **SV** with those of **LG**, **oLG** and **oPRB** methods. Finally, we apply the **SV** method to the data about minimal residual disease (MRD) in acute lymphoblastic leukemia (ALL) (Yang et al., 2009) and the Genetic Analysis Workshop 17 (GAW17) data.

Assume that we have a cohort of *N* individuals and that the genetic polymorphism of interest is biallelic [e.g., single nucleotide polymorphism (SNP)]. The 2 alleles at a SNP are denoted as A and a, where A is the minor allele and together they form three genotypes denoted as AA, Aa, and aa. Suppose that observations (*s _{i}*,

We propose a novel set-valued (SV) model in which the phenotype information can be regarded as the set-valued observation of a continuous latent variable:

$$\{\begin{array}{c}{y}_{i}=f({G}_{i},{X}_{i})+{e}_{i},\hfill \\ {s}_{i}=\sum _{k=0}^{r}k\cdot {I}_{{A}_{k}}\left({y}_{i}\right),\phantom{\rule{1em}{0ex}}i=1,2,\dots N\hfill \end{array}\phantom{\}}$$

(1)

where *G _{i}* and

The most common simplified treatment of the set-valued process is to introduce thresholds {*c*_{1},*c*_{2},…,*c _{r}*} such that [

$$\{\begin{array}{c}{y}_{i}={\alpha}_{0}+\theta \cdot {G}_{i}+{\gamma}^{T}\cdot {X}_{i}+{e}_{i},\hfill \\ {s}_{i}=\sum _{k=0}^{r}k\cdot {I}_{[{c}_{k},{c}_{k+1}}\left(y\right),\phantom{\rule{1em}{0ex}}i=1,2,\dots N\hfill \end{array}\phantom{\}}$$

(2)

where *e _{i}* is the random noise which follows a normal distribution with a mean of 0 and a variance of

In equation (2), if *c*_{1}=0, then the **SV** model is the usual ordered probit model. If the *e _{i}* in equation (2) follows a logistic distribution in equation (2), then the

The system parameters in equation (1) can be estimated by maximizing the likelihood function through the EM algorithm. The estimation process is similar to (Chen, et al., 2012). Denote (*θ*, *γ*^{T}, *α*_{0})^{T} by an overall parameter Θ, ${({G}_{i},{X}_{i}^{T},1)}^{\mathrm{T}}$ by an overall input * _{i}*. The core iteration process is as following:

$${\widehat{\u03f4}}^{k+1}={\widehat{\u03f4}}^{k}-{\left(\sum _{i=1}^{N}{\phi}_{i}\cdot {\phi}_{i}^{T}\right)}^{-1}\left[\sum _{i=1}^{N}{\sigma}^{2}{\phi}_{i}\left(\sum _{j=0}^{r}{I}_{\{{s}_{j}=j\}}\cdot \frac{f(i,j+1)-f(i,j)}{F(i,j+1)-F(i,j)}\right)\right]$$

(3)

where $f(i,j)=f({c}_{j}-{\phi}_{i}^{T}\cdot {\widehat{\u03f4}}^{k})$ is the density function and $F(i,j)=\Phi ({c}_{j}-{\phi}_{i}^{T}\cdot {\widehat{\u03f4}}^{k})$ is the cumulative distribution function for a normal distribution with mean 0 and variance *σ*^{2} evaluated at ${c}_{j}-{\phi}_{i}^{T}\cdot {\widehat{\u03f4}}^{k}$. For more details of MLE, see section 1 in the supplementary material.

Suppose the iteration estimator converges to the MLE $\widehat{\u03f4}$, the observed Fisher information matrix of $\widehat{\u03f4}$ (denoted by $\mathrm{I}\left(\widehat{\u03f4}\right)$) can be obtained according to the following formula (see section 1 in the supplementary material for details)

$$\mathrm{I}\left(\widehat{\u03f4}\right)=-\mathrm{E}\left[\phantom{\mid}\frac{{\partial}^{2}}{\partial {\u03f4}^{2}}\mathrm{log}\phantom{\rule{thinmathspace}{0ex}}L\left(\u03f4\right)\mid \widehat{\u03f4}\right]=\sum _{i=1}^{N}\left(\sum _{j=0}^{N}\frac{{[f(i,j+1)-f(i,j)]}^{2}}{F(i,j+1)-F(i,j)}\right)\cdot {\phi}_{i}\cdot {\phi}_{i}^{T}$$

(4)

where *L(Θ)* is the likelihood function given *Θ*. Testing for no genetic effect of SNP on the phenotype, that is, *H*_{0}: *θ* = 0, can be constructed for the **SV** method from the Wald statistic

$$W=\frac{{\widehat{\theta}}^{2}}{\mathrm{I}{\left(\widehat{\u03f4}\right)}^{-1}[1,1]},$$

(5)

where $\mathrm{I}{\left(\widehat{\u03f4}\right)}^{-1}[1,1]$, the element at the first row and the first column of the inverse Fisher information matrix, represents the estimated variance of $\widehat{\theta}$. Asymptotically, *W* is distributed approximately as a central *χ*^{2} distribution with 1 degree of freedom under the null hypothesis of no association.

The estimation of parameters needs the knowledge of threshold vector *c*=(*c*_{1},*c*_{2},...,*c _{r}*).In some situations, the thresholds are available. For example, in leukemia, minimal residual disease (an assessment of decreasing leukemic burden in response to therapy such as chemotherapy for cancer treatment) can be categorized as negative (<0.01%), positive (≥0.01% but <1%) and high-positive (≥1%) using two thresholds of 0.01% and 1% (Yang et al., 2009). In other cases, the latent variable is unobserved and the thresholds are also unknown to us. In the case of binary phenotype, it is very easy to estimate the threshold along with other parameters by dealing with the threshold as a parameter (Kang et al., 2014). But in case of ordered categorical phenotype, we have to estimate them with some techniques. Fortunately, if we presume model parameters as fixed values and threshold as variable, the Hessian matrix of likelihood function is positive definite, which means the likelihood function has a unique maximum point. Here we adopt a switching operation for estimating parameters and thresholds. As for one iteration step, we first estimate model parameters (

$${\widehat{c}}_{j}^{k+1}={\widehat{c}}_{j}^{k}+\frac{1}{N}\left[\sum _{i=1}^{N}{I}_{\{{s}_{i}=j-1\}}\cdot \frac{f(i,j)}{F(i,j)-F(i,j-1)}-{I}_{\{{s}_{i}=j\}}\cdot \frac{f(i,j)}{F(i,j+1)-F(i,j)}\right],$$

(6)

The detailed algorithm implementation of the SVSI method is in Supplementary Section 2 and the proposed new **SV** method has been implemented in an R package which is available for free download from http://www.stjuderesearch.org/site/depts/biostats/software. The simulations adopting **SV** model and unbiased sampling show that the estimation of parameters and thresholds can converge close to the true value within 10 iterations and complete the convergence process within 100 iterations (see Table S1 and Figure S1).

We performed extensive simulation studies to evaluate the performance of the proposed **SV** method against the three competing alternatives including **LG** for the regrouped binary phenotype (recoding as 0 or greater than 0), **oLG**, and **oPRB**. We only considered an ordered phenotype with three categories (*s _{i}*= 0, 1 and 2) in our simulations.

Given the minor allele frequency(MAF) *p _{A}*, the genotype frequencies

The phenotype status was determined from the generated genotype and covariates data according to two models below similar to that for the binary phenotype simulation method by Kang et al., (2014) and Wu et al., (2011):

**LG**-based simulation method**(LGsimu)**:We controlled the proportions of individuals with the ordinal disease outcome$$\begin{array}{cc}\hfill & \mathrm{P}({s}_{i}=2\mid {G}_{i},{x}_{i1},{x}_{i2})=\frac{\mathrm{exp}({\alpha}_{1}+\theta {G}_{i}+0.5{x}_{i1}+0.5{x}_{i2})}{1+\mathrm{exp}({\alpha}_{1}+\theta {G}_{i}+0.5{x}_{i1}+0.5{x}_{i2})};\hfill \\ \hfill & \mathrm{P}({s}_{i}=0\mid {G}_{i},{x}_{i1},{x}_{i2})=1-\frac{\mathrm{exp}({\alpha}_{2}+\theta {G}_{i}+0.5{x}_{i1}+{0.5}_{i2})}{1+\mathrm{exp}({\alpha}_{2}+\theta {G}_{i}+0.5{x}_{i1}+0.5{x}_{i2})}.\hfill \end{array}$$*s*= 2, 1, 0 by α_{1}and α_{2}and set it to1:3:6, that is, 10% of individuals have*s*_{2}, 30% of those have*s*_{1}and 60% of those have*s*_{0}, in the case that all three regression coefficients for SNP,*x*_{i}_{1}, and*x*_{i}_{2}are 0.**SV**-based simulation method (**SVsimu**): First a continuous variable was generated from*y*0.5_{i}=θG_{i}+*x*0.5_{i1}+*x*, where_{i2}+e_{i}*e*follows a standard normal distribution. Given thresholds_{i}*(c*the individuals with a value of_{1},c_{2}),*y*higher than_{i}*c*_{2}have phenotype of 2 and ones with a value of*y*lower than_{i}*c*_{1}have phenotype of 0, the remaining have phenotype of 1. We controlled the proportions of individuals with the ordinal disease outcome*s*= 2, 1, 0 and set it to 1:3:6, that is, 10% of individuals have*s*_{2}, 30% of those have*s*_{1}and 60% of those have*s*_{0}, in the case that all three regression coefficients for SNP,*x*_{i}_{1}, and*x*_{i}_{2}are 0.

We select a cohort of *N* individuals to conduct further association analysis based on the following 2 sampling strategies to mimic two different designs for retrospective and prospective studies:

- Randomly sample
*N*/3 individuals per each category**(Same)**: we sample a fixed sample size of*N*/3 individuals from each category in the population of 2,000,000 individuals to mimic a retrospective design to maximize the power of association testing. Note that this strategy ensures that the sample size must be a multiple of 3, so that for example we may compare results obtained by sampling 999 subjects with the**Same**strategy to those obtained by sampling 1000 subjects with the**Rand**strategy described below. - Randomly sample
*N*individuals**(Rand)**: we randomly choose*N*individuals from the population of 2,000,000 individuals simulated above to mimic a prospective design.

Once the data is generated, for **LG**, we used *glm* function in R and fit the *glm* on the regrouped binary phenotype (new *s _{i}*= 0 if the original

Eight values for MAFs of SNPs were considered: 0.0025, 0.0075, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4 and 0.5. The ordered phenotype was determined from the generated genotype and covariate data by using the two models mentioned above, with *θ*= 0. To estimate the type I error rate of the **SV** method, 10,000,000 replicated datasets were simulated for each study, with a small sample size of 1000 (2500) and a large sample size of 2000 (5000) for the **Rand** sampling method for variants with MAF≥0.0075 (MAF=0.0025) and the corresponding numbers of 999 (2499) and 1998 (5001) for the **Same** sampling method, respectively. We considered larger significance levels *α* = 0.05 or 0.01 and stringent genome-wide levels *α* = 10^{−5} or 10^{−6} under the null hypothesis of *H*_{0}: *θ* = 0.

Three genetic disease models were considered: additive, dominant, and recessive with their respective genotype coding *G* (0, 1, 2), (0, 1, 1) and (0, 0, 1) when we simulated the phenotype. The ordered phenotype was determined from the generated genotype and covariate data according to the simulation methods given above, with *θ* varying from 0.3 to 2 at an increment of 0.1. Datasets were generated 10,000 times for each configuration. The three methods used for the type I error simulations were applied to each data-set, and power was estimated as the proportions of *p*-values less than *α* = 10^{−6}.

To mimic a phase II clinical trial, a small sample size of 150 was also used for common variants with MAFs of 0.2 and 0.05 to estimate the power of **SV** at a significance level of 1×10^{−4}.

Table 1 shows empirical type I error rates estimated for four methods. Regardless of significance levels, **SV** correctly maintained type I error control at the given levels for both common and rare variants. **LG** was conservative for stringent genome-wide levels if SNPs were rare because of large variance of parameter estimate (Table 2) (Kang et al., 2014). **oLG** and **oPRB** correctly controlled type I error rate at larger significance levels but did not control type I error rate at stringent genome-wide levels for rare variants when sample size was small because of instability of **oLG** and **oPRB** when there are some empty or small cells. Since **oPRB** cannot control type I error rate at *α* = 10^{−6} for rare SNP with MAF 0.0075 and the power of **SV** is almost identical to that of **oPRB** in most cases, the power of **oPRB** was omitted and was not included in the section below.

The ratio of the observed type I error rates of the set-valued (SV), logistic regression (LG and oLG), and the usual ordered Probit (oPRB) methods over the given significance levels *α* using SVsimu data generation method and random sampling scheme. **...**

Figures 1–2 show the power of the three methods as a function of effect size (*θ*) for an additive disease model. As expected, the power of **SV** and **oLG** increased with the increase in effect size regardless of distributions of noise, the genetic disease model and sampling methods. The power of three methods was generally higher for **Same** sampling method than that for **Rand** sampling method for the same parameter setup. This suggests that for a retrospective design, sampling all individuals with more extreme phenotype is preferred for assessing genetic effect. In some settings, both **SV** and **oLG** based on ranked sets performed better than **LG** based on the regrouped sets. The power difference between them could be more than 50% at a significance level of 10^{−6} depending on the scale of the sample size. As expected, for a SNP with MAF of 0.05, given a sample size of 1000 with **Rand** and 999 with **Same**, the power of **LG** for the regrouped binary outcome first increased to 100%, then decreased with increase in effect size (Figure 2A-2B). The drop of the power of **LG** method for the very large effect size given a small fixed sample size and a SNP with small MAF is due to the high probability of absence of individuals with phenotype 0 and carrying minor alleles (see population 3×3 tables in Supplementary matrix 1 for *θ* = 1 and 2, respectively), which leads to a very large estimated standard error of $\widehat{\theta}$ by **LG**. For example, given *N* = 999, for *θ* = 1, 0 out of 1000 simulated datasets had absence of individuals with phenotype 0 and carrying minor alleles so that the mean and the standard deviation of $\widehat{\theta}$ were 2.024 and 0.258 which leaded to a standardized effect size of ${\scriptstyle \frac{\overline{\widehat{\theta}}}{sd\left(\widehat{\theta}\right)}}=7.84$. However, for *θ*
*θ* = 2, 58 out of 1000 simulated datasets had absence of individuals with phenotype 0 and carrying minor alleles so that the mean and the standard deviation of $\widehat{\theta}$ were 4.50 and 3.288 which leaded to a standardized effect size of ${\scriptstyle \frac{\overline{\widehat{\theta}}}{sd\left(\widehat{\theta}\right)}}=1.37$ which is much smaller than that for *θ* = 1. Below we will focus on power comparison between **SV** and **oLG**. The power gain for the new **SV** method was noticeable in detecting rare variants especially when the individuals was sampled using **Same** sampling method from the population generated using SVsimu (Figures 1--33).

For a common SNP with an MAF of 0.2 or 0.05, the power of **SV** appeared to be similar to or higher than that of **oLG** depending on the scale of sample size, regardless of the genetic disease models, sampling methods and distributions of noise (Figures 1--2,2, and and4).4). Surprisingly, with a small sample size of *N*=150, for a SNP with a MAF of 0.2 and *θ* = 0.7, the power difference between **SV** and **oLG** was 8% (Figure 4B and 4D). But for a SNP with a MAF of 0.05 and *θ =* 1.5, the power difference between **SV** and **oLG** was 15% (Figure 4B and 4D).

For a rare SNP with an MAF of 0.0075 or 0.0025, if the noise follows a logistic distribution, with **Rand** sampling method, the power of **oLG** was almost identical to or slightly higher than that of **SV,** regardless of genetic disease models (Figure 1A-1C). However, interestingly, with **Same** sampling method, the power of **SV** was slightly or much higher than that of **oLG,** regardless of genetic disease models (Figure 1B-1D). For example, for a SNP with MAF of 0.0075, *θ* = 2 (equivalent to OR=e* ^{θ}=*7.4), and

Figure 3 displays the power of the **SV** and **oLG** methods as a function of sample size for the additive disease model. As expected, the power of two methods increased with an increase in sample size. For a common SNP with an MAF of 0.2 or 0.05 and an effect size of 0.4 or 0.8, respectively, the power of **SV** was almost identical to that of **oLG**, regardless of the distributions of noise, sample size, disease models and sampling methods (Figure 3). For a rare SNP with a MAF of 0.0075 or 0.0025 and an effect size of 2 or 2.4, if the noise follows a logistic distribution and a **Rand** sampling method is used, the power of **SV** appeared to be similar to that of **oLG**, regardless of the disease models (Figure 3A) but the power of **SV** was much larger than that of **oLG** for a **Same** sampling method (Figure 3B). The power difference became larger with moderate sample sizes. If the noise follows a normal distribution, regardless of sampling method, the power of **SV** was much greater than that of **oLG** but depending on the sample size (Figure 3C-3D).

Table 2 and Supplementary Table 2S gives the mean of $\widehat{\theta}$, mean of estimated standard errors of $\widehat{\theta}$, and standard deviations of $\widehat{\theta}$ across simulation repetitions for the **LG**, **SV** and **oLG** methods based on 1,000 simulation repetitions. Data were generated using the same parameter setup as given in Table 1 and Figures 1*–*4.

The mean of estimated standard error of $\widehat{\theta}$ appeared to be close to its standard deviation for the **SV** method in all simulation setups but not for **LG**, **oLG** and **oPRB** (Table 2 and Supplementary Table 2S). Interestingly, when SNP is rare *(p _{A}=*0.0075) and association parameter is large (

We also calculated the ratio of the mean of $\widehat{\theta}$ over the mean of estimated standard error of $\widehat{\theta}$, i.e, ${\scriptstyle \frac{\overline{\widehat{\theta}}}{\overline{\overline{\widehat{se}\left(\widehat{\theta}\right)}}}}$, and the ratio of the mean of $\widehat{\theta}$ to the standard deviation of $\widehat{\theta}$, i.e, ${\scriptstyle \frac{\overline{\widehat{\theta}}}{sd\left(\widehat{\theta}\right)}}$, which were used to mimic the standardized effect sizes to make the estimates a comparable scale and was used to compare different models (Table 2 and supplementary Table 2S). Under the null hypothesis, no matter what the phenotype simulation model, sampling method, MAF and sample size, both standardized effect sizes with **SV** were very close and both were close 0 which showed that **SV** could control type I error rate but not for **oLG**. Under some extreme situations such as small sample size and rare SNP, ${\scriptstyle \frac{\overline{\widehat{\theta}}}{\overline{\overline{\widehat{se}\left(\widehat{\theta}\right)}}}}$ was higher than ${\scriptstyle \frac{\overline{\widehat{\theta}}}{sd\left(\widehat{\theta}\right)}}$ for **oLG** but both would be close to 0 as sample size increased. Under the alternative hypothesis, in most cases **SV** had the “standardized effect sizes” similar to **oLG** both were much larger than **LG** which further demonstrates that **SV** had the power similar to **oLG** both had larger power than **LG** in most cases. Under some extreme situations, rare SNP, small sample size or large effect size, **SV** had higher “standardized effect sizes” than **oLG**, which clearly demonstrated the power gain of **SV** compared with **LG** and **oLG** for these settings. All these simulation results obviously demonstrate that **SV** can give more efficient, more robust and much less variable $\widehat{\theta}$ than **oLG**. In particular, it dominates others under small sample sizes and rare variants.

We also recorded the computing time for each of the four methods above as implemented in R and Matlab for the simulated data. In Matlab, **SV** was typically about twice as fast as **oPRB** and **oLG** but was similar to **LG**. In R, **SV**, **oPRB**, and **oLG** had similar run times with **SV** tending to be slightly slower than **oLG** but all was slower than **LG** (Supplementary materials section 3 and Table 3S). These are consistent with the results reported by Bi et al. (2014).

ALL is the most common type of cancer in children and the cure rate is more than 80% but there exists considerable inter-individual variability in therapy response (Yang et al., 2009). Genetic variants of SNPs in the interleukin 15 (*IL15*) gene and other SNPs associated with risk of MRD at the end of induction therapy have been reported recently (Yang et al., 2009). We analyzed the top 25 SNPs identified by Spearman rank correlation test in childhood ALL in two independent populations: 318 patients in St Jude Total Therapy protocols XIIIB and XV (Pui et al., 2004, 2009), and 169 patients in Children's Oncology Group (COG) trial P9906 (Borowitz et al., 2003). For St Jude patients, MRD status was categorized as negative (<0.01%), positive (≥0.01% but <1%), and highpositive (≥1%). For COG patients, MRD status was similarly categorized as: negative (≤0.01%), positive (>0.01%, but ≤1%), and high-positive (>1%).

Table 3 shows association results for the top 25 SNPs in both and combined cohort of St Jude and COG. At a significance level of 0.05/25 = 0.002, in the combined cohorts, 24 SNPs were found statistically significant by **LG**, **oLG** and **oPRB** but 23 of them were detected by **SV**; for St. Jude cohorts, **LG**, **oLG**, **oPRB** and **SV** found 10, 9, 9, and 8 SNPs statistically significant, respectively, where 5 were detected by all four methods; for COG cohorts, **LG**, **oLG**, **oPRB** and **SV** found 8, 8, 7 and 8 SNPs statistically significant, respectively, where 6 were detected by all four methods. There were only one SNP (SNP_A-1794325) detected by all our methods in both SJ and COG cohorts. Overall, the p-values for all four methods were comparable. Based on these results it seems that all four methods perform similarly. However, we know that the distribution of the continuous MRD measure at the end of induction therapy was right-skewed and definitely not following a normal distribution especially for ALL (Moppett et al., 2003). More importantly, we do not know what are the true SNPs associated with MRD in ALL.

To further evaluate the performance of the proposed **SV** method, we analyzed data from the Genetic Analysis Workshop 17 (GAW17) which contained “mini-exome” sequence genotype data of 24,487 SNPs in 3,205 genomic regions of 697 unrelated individuals provided by the 1000 Genome Project [27]. Three quantitative phenotypes (*Q*_{1}, *Q*_{2}, and *Q*_{4}) were simulated from the normal distribution. *Q*_{1} was influenced not only by genetic variant, but also environmental variables, and gene-environmental interactions. *Q*_{2} was only influenced by genetic variants not environmental variables. *Q*_{4} was influenced only by the environments and not genetic variants. Here we only analyzed *Q*_{2} since there were no environments and gene-environments interactions associated with *Q*_{2}. *Q*_{2} was influenced by 72 SNPs in 13 genes. Furthermore, 200 replicate datasets were generated for each phenotype, using one fixed genotype data. To apply our methods to GAW17 data, we classified *Q*_{2} to the ordered categorical phenotype using Φ^{−1}(0.9) and Φ^{−1}(0.6) as two thresholds and then analyzed them by mimicking we do not know *Q*_{2} which is the same as our SV model. First, quality control analysis was performed on the SNPs and SNPs with MAFs less than 0.0086 or HWE test p-values less than 0.00001 were excluded. There were 8387 SNPs remaining in the association analysis of *Q*_{2}. The reclassified ordered categorical phenotype for the 1^{st}, 10^{th}, 100^{th} and 200^{th} replicate data was used as our outcomes (supplementary Table S3 for frequency table and Figure S2 for the histograms) and included age, gender, and smoking status as covariates in all four methods above.

Table 4 shows the association analyses results for *Q*_{2}. At a significance level of 0.00001, for the 1^{st} replicate data, there was no SNP found statistically significant by using **SV** and **LG** but there were 112 no-causal SNPs found statistically significant by using **oLG** and **oPRB**, which was similar to the 10^{th} replicate data. For the 200^{th} replicate data, at a level of 0.00001, no SNP was found statistically significant by any of the four methods. For the 100^{th} replicate data, **SV** and **LG** only found one true causal SNP but did not detect no-causal SNPs. **oLG** and **oPRB** also found the same true causal SNP but simultaneously found 99 no-causal SNPs whose p-values were 0. At a significance level of 0.0005, **SV** found more true causal SNPs than and similar no-causal SNPs to **LG**. **SV** found similar true causal SNPs to but much less no-causal SNPs than **oPRB** and **oLG**. GAW17 data analyses showed that **SV** had similar or higher power than **oLG** and **oPRB** but the latter cannot maintain the type I error rate. They were consistent with and further supported our extensive simulation results above.

With the availability of data from whole-genome sequencing and whole-exome sequencing studies in which small or moderate sample sizes are used due to the high cost of sequencing technology (Lanktree et al., 2010; Emond et al., 2012) and/or the rare diseases in cancer pharmacogenomics studies such as pediatric cancers of retinoblastoma and Ewing's (Gurney et al., 1995; Wheeler et al., 2013), there is an increasing demand for the development of powerful and robust association testing procedures for identifying genetic variations associated with an ordered multiple responses phenotype of interest. In this study, we propose a new **SV** system that models the relationship between an ordered phenotype and genetic variants and introduce an **SVSI** approach to testing the genotype-ordered categorical phenotype association. To be more detailed, the simplified **SV** model assumed the system noise following a normal distribution. The normal distribution assumption is considered reasonable because it is in accordance with the classical central limit theory. And after a simple transformation, we find the logistic approach is also a specified form of **SV** model, and the diversity is that the system noise is slightly different from the normal distribution. The diversity is so subtle that the corresponding results show tiny difference under asymptotic situations, i.e. common MAF and/or large sample size. Under non asymptotic situations, i.e. low MAF and/or small sample size, it is inevitable to suffer power loss for every statistical method. And the degree of power loss depends largely on the underlying assumptions. Through simulations, we found that both **LG** and **oLG** methods suffered obvious power loss because of high variance of estimated parameter and that oPRB and oLG could not control type I error at a stringent significance level. While the **SV** method sustained a better performance in these situations due to the normal distribution of the noise term compared to the logistic distribution with heavier tails as well as the updated computationally efficient and robust EM algorithm.

The statistical methods based on model are the most effective when the model is in accordance with actual data. Invalid model assumption will bias the results more or less. Hence, we think it is very important to compare two methods under their own model assumptions. Simulations and real data applications show that the proposed **SV** method has a robust performance for testing association between ordered phenotypes and genetic variations regardless of the logistic or normal distributions of noise and genetic disease models, and that generally outperforms the commonly used **LG** model, and **oLG** especially when the SNP is rare and when the sample size is limited. Thus, we recommend the use of the **SV** approach instead of the **LG** or **oLG** model, to identify genetic variants in genetic association studies for ordered phenotypes. Although not reported here, simulation studies showed similar results for the dominant and recessive disease models and for a common SNP with MAFs such as 0.1, 0.3, 0.4 or 0.5.

When we estimate the parameters using the system identification method, we suppose that the variance of noise is known as 1 because we are interested in testing genotypephenotype associations not estimating the effect size of association. In real data analysis, the true variance of noise is usually unknown and also may not be equal to 1 which will definitely affect the power of the **LG, oLG** and **SV**. By simulations, not surprisingly, as the true variance is bigger (smaller) than 1, the power of all three methods will decrease (increase). However, as expected, the power of the **SV** method is still identical to or higher than that of **oLG** and both are much higher than that of **LG** (data not shown). If the distribution of underlying noise is neither normal distribution nor logistic distribution, for example, t-distribution, simulation results show the same conclusion. Thus, conclusions about the power gain of the **SV** compared to **LG** and **oLG** is robust to the logistic, normal and *t* distribution of the underlying noise. In addition, if we are interested in estimating the association effect size of SNP on the phenotype, the noise variance parameter can also be estimated along with other parameters using generalized expectation maximization algorithm (Godoy et al., 2011). We have implemented the proposed new **SV** method in an R package which is available for free download from http://www.stjuderesearch.org/site/depts/biostats/software. The method can be easily applied to candidate gene association analysis, GWAS or NGS studies with hundreds or thousands of individuals for ordered categorical phenotypes.

This research is supported by the American Lebanese and Syrian Associated Charities (ALSAC), grants from National Natural Science Foundation of China (11171333 and 61134013), National Science Foundation (DMS-1209112) and National Institutes of Health (R01 HG006292). The Genetic Analysis Workshop is supported by the NIH grant R01 GM031575. We thank Dr. Xueyuan Cao for providing the ALL data and Dr. Dario Campana for collecting the original MRD data at SJ. We thank two anonymous reviewers for their insightful comments which have significantly improved the manuscript.

Conflict of Interest

The authors have no conflict of interests to declare.

- Bi W, Zhao Y. Iterative parameter estimate with batched binary-valued observations: convergence with an exponential rate. Proceedings of the 19th International Federation of Automatic Control World Congress. 2014;19:3220–3225.
- Borowitz MJ, Pullen DJ, Shuster JJ, et al. Children's Oncology Group study. Minimal residual disease detection in childhood precursor-B-cell acute lymphoblastic leukemia: relation to other risk factors: a Children's Oncology Group study. Leukemia. 2003;17(8):1566–1572. [PubMed]
- Chen T, Zhao Y, Ljung L. Impulse response estimation with binary measurements: are gularized FIR model approach system identification. 16th IFAC Symposium on System Identification. 2012;16(1):113–118.
- Cox DR. Regression models and life tables. J R Stat Soc. 1972;4:187–220.
- Emond MJ, Louie T, Emerson J, Zhao W, Mathias RA, Knowles MR, Wright FA, Rieder MJ, Tabor HK, Nickerson DA. Exome sequencing of extremephenotypesidentifiesDCTN4 as a modifier of chronic Pseudomonasaeruginosa infection in cystic fibrosis. Nat Genet. 2012;44(8):886–889. others. [PMC free article] [PubMed]
- Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94(446):496–509.
- Godoy B, Goodwin G, Aguero J, Marelli D, Wigren T. On identification of FIR systems having quantized output data. Automatica. 2011;47:1905–1915.
- Greene William H. Econometric Analysis. fifth edition Prentice Hall; 2003. pp. 736–740.
- Gurney JG, Severson RK, Davis S, Robison LL. Incidence of cancer in children in the United States. Sex-, race-, and 1-year age-specific rates by histologic type. Cancer. 1995;75:2186–95. [PubMed]
- Han JY, Shin ES, Lee YS, Ghang HY, Kim SY, Hwang JA, Kim JY, Lee JS. A genome-wide association study for irinotecan-related severe toxicities in patients with advanced non-small-cell lung cancer. Pharmacogenomics J. 2013;13:417–22. [PubMed]
- Ingle JN, Schaid DJ, Goss PE, Liu M, Mushiroda T, Chapman JA, Kubo M, Jenkins GD, Batzler A, Shepherd L. Genome-wide associations and functional genomic studies of musculoskeletal adverse events in women receiving aromatase inhibitors. J Clin Oncol. 2010;28:4674–4682. others. [PMC free article] [PubMed]
- Innocenti F, Innocenti F, Owzar K, Cox NL, Evans P, Kubo M, Zembutsu H, Jiang C, Hollis D, Mushiroda T, Li L. A genome-wide association study of overall survival in pancreatic cancer patients treated with gemcitabine in CALGB 80303. Clin. Cancer Res. 2012;18:577–584. others. [PMC free article] [PubMed]
- Kang G, Bi W, Zhao Y, Zhang JF, Yang JJ, Xu H, Loh ML, Hunger SP, Relling MV, Pounds S, Cheng C. A new system identification approach to identifying genetic variants in sequencing studies for a binary phenotype. Human Heredity. 2014;78:104–116. [PMC free article] [PubMed]
- Lanktree MB, Hegele RA, Schork NJ, Spence JD. Extremes of unexplained variation as a phenotype: an efficient approach for genome-wide association studies of cardiovascular disease. CircCardiovasc Genet. 2010;3:215–221. [PMC free article] [PubMed]
- Moppett J, Burke GAA, Steward CG, Oakhill A, Goulden NJ. The clinical relevance of detection of minimal residual disease in childhood acute lymphoblastic leukaemia. J Clin Pathol. 2003;56:249–253. [PMC free article] [PubMed]
- Nair GN, Fagnani F, Zampieri S, Ecans RJ. Feedback control under data rate constraints: an overview. Proceedings of the IEEE. 2007;95(1):108–137.
- Pui CH, Sandlund JT, Pei D, et al. Total Therapy Study XIIIB at St Jude Children's Research Hospital. Improved outcome for children with acute lymphoblastic leukemia: results of Total Therapy Study XIIIB at St Jude Children's Research Hospital. Blood. 2004;104(9):2690–2696. [PubMed]
- Pui CH, Campana D, Pei D, Bowman WP, Sandlund JT, Kaste SC, Ribeiro RC, Rubnitz JE, Raimondi SC, Onciu M. Treating childhood acute lymphoblastic leukemia without cranial irradiation. N Engl J Med. 2009;360(26):2730–2741. others. [PMC free article] [PubMed]
- Png E, Thalamuthu A, Ong RT, Snippe H, Boland GJ, Seielstad M. A genome-wide association study of hepatitis B vaccine response in an Indonesian population reveals multiple independent risk variants in the HLA region. Hum Mol Genet. 2011;20(19):3893–3898. [PubMed]
- Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature. 2007;445:881–885. others. [PubMed]
- Treviño LR, Shimasaki N, Yang W, Panetta JC, Cheng C, Pei D, Chan D, Sparreboom A, Giacomini KM, Pui CH. Germline genetic variation in an organic anion transporter polypeptide associated with methotrexate pharmacokinetics and clinical effects. J ClinOncol. 2009;27(35):5972–5978. others. [PMC free article] [PubMed]
- Wang L, Yin G, Zhang J, Zhao Y. System identification with quantized observations. Birkhauser. 2010
- Wang L, Zhang J, Yin G. System identification using binary sensors. IEEE TAC. 2003;48(11):1892–1907.
- Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research. 2014;42(Database issue):D1001–D1006. [PMC free article] [PubMed]
- Wheeler HE, Maitland ML, Dolan ME, Cox NJ, Ratain MJ. Cancer pharmacogenomics: strategies and challenges. Nat Rev Genet. 2013;14(1):23–34. [PMC free article] [PubMed]
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. [PubMed]
- Yang JJ, Cheng C, Devidas M, Cao X, Campana D, Yang W, Fan Y, Neale G, Cox N, Scheet P. Genome-wide association study identifies germline polymorphisms associated with relapse of childhood acute lymphoblastic leukemia. Blood. 2012;120(20):4197–4204. others. [PubMed]
- Yang JJ, Cheng C, Yang W, Pei D, Cao X, Fan Y, Pounds S, Treviño LR, French D, Campana D. Genome-wide interrogation of germline genetic variation associated with treatment response in childhood acute lymphoblastic leukemia. JAMA. 2009;301(4):393–403. others. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |