The two-stage design is a well-known, cost-effective way to conduct biomedical studies when the exposure variable is expensive or difficult to measure. Recent research developments have further allowed sampling at one or both stages of the two-stage design to depend on a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gains in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics 66, 502–511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67, 194–202). In this paper, we develop a semiparametric mixed-effects regression model for data from a two-stage design where the second-stage data are sampled with an outcome-auxiliary-dependent sampling (OADS) scheme. Our method allows the cluster or center effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. Simulation studies indicate that greater efficiency gains can be achieved under the proposed two-stage OADS design with center effects than under alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
Center effect; Mixed model; Outcome-auxiliary-dependent sampling; Validation sample
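A hedged sketch of the kind of model involved may help fix ideas; the notation below is assumed for illustration and is not taken from the paper.

```latex
\[
Y_{ij} = \beta_0 + \beta_1 X_{ij} + \boldsymbol{\gamma}^{\top}\mathbf{Z}_{ij} + b_i + \varepsilon_{ij},
\qquad b_i \sim N(0, \sigma_b^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2),
\]
```

Here Y_ij is the continuous outcome of subject j in center i, X_ij is the expensive exposure observed only in the second-stage OADS sample, Z_ij collects fully observed covariates, and b_i is the random center effect. The estimated likelihood replaces the unknown conditional distribution of X given the auxiliary variable with an empirical estimate obtained from the validation sample.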
With the development of massively parallel sequencing technologies, there is a substantial need for powerful rare-variant association tests. Common approaches include burden and non-burden tests. Burden tests assume all rare variants in the target region have effects on the phenotype in the same direction and of similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82–93), an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161–165), provides a robust test that is particularly powerful in the presence of protective, deleterious, and null variants, but is less powerful than burden tests when a large number of variants in a region are causal and act in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that includes burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we derive sample size/power calculation formulas for SKAT with a new family of kernels to facilitate the design of new sequence association studies.
Burden tests; Correlated effects; Kernel association test; Rare variants; Score test
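The class of statistics can be sketched as a one-parameter family bridging the two extremes (a standard way to write such a family; the notation is assumed for illustration):

```latex
\[
Q_{\rho} = (1-\rho)\, Q_{\mathrm{SKAT}} + \rho\, Q_{\mathrm{burden}}, \qquad 0 \le \rho \le 1,
\]
```

with rho = 0 recovering SKAT and rho = 1 recovering a burden test; an optimal test can then be based on the minimum p-value over a grid of rho values.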
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations are related to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers assembled based on biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with the traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology 23, 429–435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079–1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics 9, 292; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society 44, 50–57). The advantages of these EB-based tests are most apparent when the markers are correlated, due to the reduction in the degrees of freedom. In this paper, we propose an adaptive score test that up- or down-weights the contributions from each member of the marker set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
Adaptive procedures; Empirical Bayes; GWAS; Pathway analysis; Score test; SNP sets
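As a hedged, simulated illustration of up- or down-weighting marker contributions by their Z-scores (not the authors' exact statistic; a permutation null stands in here for their perturbation and distributional approximations):

```r
# Adaptive marker-set statistic: markers with small |Z| are zeroed out,
# markers with large |Z| are up-weighted. All names are illustrative.
set.seed(1)
n <- 200; p <- 10
G <- matrix(rbinom(n * p, 2, 0.3), n, p)   # SNP genotypes (0/1/2)
y <- rbinom(n, 1, 0.5)                     # outcome

adaptive_stat <- function(y, G, cut = 1) {
  z <- apply(G, 2, function(g) {
    fit <- summary(lm(y ~ g))$coefficients # linear score, for simplicity
    if (nrow(fit) < 2) 0 else fit[2, "t value"]
  })
  w <- ifelse(abs(z) > cut, abs(z), 0)     # adaptive weights from Z-scores
  sum(w * z^2)
}

obs  <- adaptive_stat(y, G)
perm <- replicate(500, adaptive_stat(sample(y), G))
mean(perm >= obs)                          # permutation p-value
```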
With the growing availability of omics data generated to describe different cells and tissues, the modeling and interpretation of such data have become increasingly important. Pathways are sets of reactions, involving genes, metabolites, and proteins, that highlight functional modules in the cell. Therefore, to discover activated or perturbed pathways when comparing two conditions, for example two different tissues, it is beneficial to use several types of omics data. We present a model that integrates transcriptomic and metabolomic data in order to make an informed pathway-level decision. Since metabolites can be seen as end points of perturbations happening at the gene level, the gene expression data constitute the explanatory variables in a sparse regression model for the metabolite data. Sophisticated model selection procedures are developed to determine an appropriate model. We demonstrate that the transcript profiles can be used to informatively explain the metabolite data from cancer cell lines. Simulation studies further show that the proposed model offers better performance in identifying active pathways than, for example, enrichment methods performed separately on the transcript and metabolite data.
Enrichment; Integrated modeling; Metabolomics; Pathways; Transcriptomics
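A minimal sketch of the regression backbone, assuming simulated data; the plain lasso below stands in for the paper's more sophisticated model selection procedures:

```r
library(glmnet)
set.seed(1)
n <- 60; p <- 100
X <- matrix(rnorm(n * p), n, p)            # transcript profiles
colnames(X) <- paste0("gene", seq_len(p))
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)      # one metabolite (end point)

fit <- cv.glmnet(X, y, alpha = 1)          # sparse (lasso) regression
cf  <- coef(fit, s = "lambda.min")
rownames(cf)[as.vector(cf) != 0]           # selected transcripts
```

Pathway-level decisions would then aggregate such selections over the genes and metabolites annotated to each pathway.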
The cross-odds ratio is defined as the ratio of the conditional odds of the occurrence of one cause-specific event for one subject, given the occurrence of the same or a different cause-specific event for another subject in the same cluster, to the unconditional odds of occurrence of the cause-specific event. It is a measure of the association between the correlated cause-specific failure times within a cluster. The joint cumulative incidence function can be expressed as a function of the marginal cumulative incidence functions and the cross-odds ratio. Assuming that the marginal cumulative incidence functions follow a generalized semiparametric model, this paper studies parametric regression modeling of the cross-odds ratio. A set of estimating equations is proposed for the unknown parameters, and the asymptotic properties of the estimators are explored. Non-parametric estimation of the cross-odds ratio is also discussed. The proposed procedures are applied to the Danish twin data to model the associations between twins in their times to natural menopause, and to investigate whether the association differs between monozygotic and dizygotic twins and how it has changed over time.
Binomial modeling; Correlated cause-specific failure times; Danish twin data; Estimating equation; Generalized semiparametric additive model; Inverse censoring probability weighting; Joint cumulative incidence function; Large sample properties; Marginal cumulative incidence function; Parametric regression model
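In symbols, the verbal definition above reads as follows for cause-k and cause-l events of subjects 1 and 2 in the same cluster (notation assumed for illustration):

```latex
\[
\pi(t_1, t_2) =
\frac{\operatorname{odds}\left\{ T_1 \le t_1, \epsilon_1 = k \mid T_2 \le t_2, \epsilon_2 = l \right\}}
     {\operatorname{odds}\left\{ T_1 \le t_1, \epsilon_1 = k \right\}},
\qquad
\operatorname{odds}(A \mid B) = \frac{P(A \mid B)}{1 - P(A \mid B)}.
\]
```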
RNA-Seq is widely used in biological and biomedical studies. Methods for estimating transcript abundance from RNA-Seq data have been intensively studied, many of which are based on the assumption that short reads are uniformly distributed along the transcript. However, short reads are found to be nonuniformly distributed along transcripts, which can greatly reduce the accuracy of methods based on the uniformity assumption. Several methods have been developed to adjust for the biases induced by this nonuniformity, utilizing the empirical distribution of short reads along the transcript. As an alternative, we found that RNA degradation plays a major role in the formation of the nonuniform read distribution and thus developed a new approach that quantifies this nonuniformity by precisely modeling RNA degradation. Our model of RNA degradation fits RNA-Seq data quite well, and based on this model, a new statistical method was further developed to estimate the transcript expression level, as well as the RNA degradation rate, for individual genes and their isoforms. We show that our method can improve the accuracy of transcript isoform expression estimation. The estimated RNA degradation rates of individual transcripts are consistent across samples, experiments, and platforms. In addition, the RNA degradation rate from our model is independent of RNA length, consistent with previous studies of RNA decay rates.
EM algorithm; Gene expression; Next generation sequencing; RNA degradation; RNA-Seq
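One toy formalization of degradation-induced nonuniformity, offered purely as an assumption for intuition rather than as the authors' model: if each base survives degradation independently with probability 1 − θ, the expected read coverage at distance d from the 3′ end decays geometrically,

```latex
\[
\mathrm{coverage}(d) \propto (1-\theta)^{d} \approx e^{-\theta d},
\]
```

so the per-base rate θ is free of transcript length, in line with the abstract's observation that the estimated degradation rate does not depend on RNA length.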
Advances in human genetics have led to epidemiological investigations not only of the effects of genes alone but also of gene–environment (G–E) interaction. A widely accepted design strategy for studying how genes and environment relate to disease risk is the population-based case–control study (PBCCS). For simple random samples, semiparametric methods for testing G–E interaction were developed by Chatterjee and Carroll in 2005. The use of complex sampling in PBCCS, involving differential probabilities of sample selection for cases and controls and possibly cluster sampling, is becoming more common. Complex sampling induces two complications: weighting for selection probabilities and intracluster correlation of observations. We develop pseudo-semiparametric maximum likelihood estimators (pseudo-SPMLE) that apply to PBCCS with complex sampling. We study the finite-sample performance of the pseudo-SPMLE using simulations and illustrate the method with a US case–control study of kidney cancer.
Hardy–Weinberg equilibrium; Population weights; Selection probability; Stratified multistage cluster sampling; Taylor linearization
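The weighting component can be sketched generically (notation assumed for illustration): with w_i the inverse of subject i's selection probability, a pseudo-log-likelihood

```latex
\[
\ell_w(\theta) = \sum_{i=1}^{n} w_i \, \log L_i(\theta)
\]
```

is maximized, and the variance of the resulting pseudo-SPMLE is obtained by Taylor linearization with score contributions accumulated at the cluster level, which is how the intracluster correlation is accommodated.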
The meta-analytic approach to evaluating surrogate end points assesses, across multiple clinical trials, how well the treatment effect on the surrogate predicts the treatment effect on the clinical end point. Definition and estimation of the correlation of treatment effects were developed in linear mixed models and later extended to binary or failure-time outcomes on a case-by-case basis. In a general regression setting that covers nonnormal outcomes, we discuss in this paper several metrics that are useful in the meta-analytic evaluation of surrogacy. We propose a unified 3-step procedure to assess these metrics in settings with binary end points, time-to-event outcomes, or repeated measures. First, the joint distribution of the estimated treatment effects is ascertained by an estimating equation approach; second, the restricted maximum likelihood method is used to estimate the means and the variance components of the random treatment effects; finally, confidence intervals are constructed by a parametric bootstrap procedure. The proposed method is evaluated by simulations and applications to 2 clinical trials.
Causal inference; Meta-analysis; Surrogacy
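A hedged sketch of the middle (REML) step using the mvmeta package as a stand-in, with simulated per-trial inputs; in the paper's 3-step procedure these inputs would come from the estimating-equation step, and intervals from a parametric bootstrap rather than the model summary:

```r
library(mvmeta)
set.seed(1)
k <- 12
alpha <- rnorm(k, 0.5, 0.3)              # trial-level effects on surrogate
beta  <- 0.8 * alpha + rnorm(k, 0, 0.1)  # trial-level effects on clinical end point
# within-trial (co)variances: columns are var(alpha), cov(alpha,beta), var(beta)
S <- matrix(rep(c(0.02, 0.005, 0.02), each = k), nrow = k)

fit <- mvmeta(cbind(alpha, beta) ~ 1, S = S, method = "reml")
summary(fit)
cov2cor(fit$Psi)                         # trial-level correlation of effects
```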
A noninferiority (NI) trial is sometimes employed to show efficacy of a new treatment when it is unethical to randomize current patients to placebo because of the established efficacy of a standard treatment. Under this framework, if the NI trial determines that the treatment advantage of the standard over the new drug (i.e. S−N) is less than the historic advantage of the standard over placebo (S−P), then the efficacy of the new treatment (N−P) is established indirectly. We explicitly combine information from the NI trial with estimates from a random effects model, allowing study-to-study variability in the k historic trials. Existing methods under random effects, such as the synthesis method, fail to account for the variability of the true standard-versus-placebo effect in the NI trial. Our method effectively uses a prediction interval for the missing standard-versus-placebo effect rather than a confidence interval for its mean. The consequences are to increase the variance of the synthesis method by incorporating a prediction variance term and to approximate the null distribution of the new statistic with a t-distribution with k−1 degrees of freedom instead of the standard normal. Thus, it is harder to conclude noninferiority of the new treatment to (predicted) placebo compared with traditional methods, especially when k is small or when between-study variability is large. When the between-study variances are nonzero, we demonstrate substantial Type I error rate inflation with conventional approaches; simulations suggest that the new procedure has only modest inflation, and it is very conservative when the between-study variances are zero. An example is used to illustrate practical issues.
Active control trial; Clinical trial; Meta-analysis; Noninferiority trial; Random effects; Synthesis method
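A hedged numerical sketch of the prediction-interval idea; function and argument names are illustrative, and DerSimonian–Laird is used here for the between-study variance:

```r
# d_sp, v_sp: standard-vs-placebo estimates and variances from k historic trials
# d_sn, v_sn: standard-vs-new estimate and variance from the NI trial
ni_pred_test <- function(d_sp, v_sp, d_sn, v_sn) {
  k  <- length(d_sp)
  w  <- 1 / v_sp
  mu <- sum(w * d_sp) / sum(w)
  Q  <- sum(w * (d_sp - mu)^2)
  tau2 <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))  # DL estimate
  w2 <- 1 / (v_sp + tau2)
  mu <- sum(w2 * d_sp) / sum(w2)               # random-effects mean of S - P
  pred_var <- 1 / sum(w2) + tau2               # prediction variance for S - P
  stat <- (mu - d_sn) / sqrt(pred_var + v_sn)  # evidence that N - P > 0
  pt(stat, df = k - 1, lower.tail = FALSE)     # t with k - 1 df, not normal
}

ni_pred_test(d_sp = c(0.4, 0.5, 0.3, 0.6), v_sp = rep(0.02, 4),
             d_sn = 0.1, v_sn = 0.02)
```

The prediction variance term and the t reference distribution are what make the procedure more conservative than the usual synthesis method when k is small or the between-study variance is large.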
We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
Differential expression; FDR; Overdispersion; Poisson log-linear model; RNA-Seq; Score statistic
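A minimal per-gene sketch, assuming simulated counts: a Poisson log-linear model with a crude normalization offset and a likelihood-ratio comparison for a two-class outcome. The paper's normalization and FDR estimation are more refined than this stand-in:

```r
set.seed(1)
n <- 10                                          # samples
counts <- matrix(rpois(100 * n, 20), nrow = 100) # genes x samples
depth  <- colSums(counts)                        # crude sequencing depths
group  <- rep(0:1, each = n / 2)                 # two-class outcome

p_one_gene <- function(y) {
  full <- glm(y ~ group + offset(log(depth)), family = poisson)
  null <- glm(y ~ offset(log(depth)), family = poisson)
  pchisq(deviance(null) - deviance(full), df = 1, lower.tail = FALSE)
}
pvals <- apply(counts, 1, p_one_gene)
head(pvals)
```

Overdispersion, quantitative and multi-class outcomes, and FDR estimation are handled in the paper beyond this basic building block.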
In many case–control genetic association studies, a set of correlated secondary phenotypes that may share common genetic factors with disease status is collected. Examination of these secondary phenotypes can yield valuable insights about disease etiology and supplement the main studies. However, due to unequal sampling probabilities between cases and controls, standard regression analysis that assesses the effect of SNPs (single nucleotide polymorphisms) on secondary phenotypes using cases only, controls only, or the combined sample of cases and controls can yield inflated type I error rates when the test SNP is associated with the disease. To address this issue, we propose a Gaussian copula-based approach that efficiently models the dependence between disease status and secondary phenotypes. Through simulations, we show that our method yields correct type I error rates for the analysis of secondary phenotypes under a wide range of situations. To illustrate the effectiveness of our method on real data, we applied it to a genome-wide association study of high-density lipoprotein cholesterol (HDL-C), where “cases” are defined as individuals with extremely high HDL-C levels and “controls” are defined as those with low HDL-C levels. We treated 4 quantitative traits with varying degrees of correlation with HDL-C as secondary phenotypes and tested for association with SNPs in LIPG, a gene well known to be associated with HDL-C. We show that when the correlation between the primary and secondary phenotypes is >0.2, the P values from the unadjusted combined case–control analysis are much more significant than those from methods that correct for ascertainment bias. Our results suggest that to avoid false-positive associations, it is important to appropriately model secondary phenotypes in case–control genetic association studies.
Case–control studies; Statistical genetics; Statistical methods in epidemiology
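The copula ingredient can be sketched as follows (notation assumed for illustration): with u and v the probability-integral transforms of disease status D and a secondary phenotype Y, a Gaussian copula

```latex
\[
C(u, v; \rho) = \Phi_2\!\left( \Phi^{-1}(u), \Phi^{-1}(v); \rho \right)
\]
```

links the two margins, where Phi_2 is the bivariate standard normal CDF with correlation rho; modeling this dependence is what corrects the secondary-phenotype analysis for the case–control sampling on D.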
A population-average regression model is proposed to assess the marginal effects of covariates on the cumulative incidence function when there is dependence across individuals within a cluster in the competing risks setting. This method extends the Fine–Gray proportional hazards model for the subdistribution to situations where individuals within a cluster may be correlated due to unobserved shared factors. Estimators of the regression parameters in the marginal model are developed under an independence working assumption, where the correlation across individuals within a cluster is completely unspecified. The estimators are consistent and asymptotically normal, and variance estimation may be achieved without specifying the form of the dependence across individuals. A simulation study shows that the inferential procedures perform well with realistic sample sizes. The practical utility of the methods is illustrated with data from the European Bone Marrow Transplant Registry.
Clustered; Competing risk; Hazard of subdistribution; Marginal model; Martingale; Multivariate; Partial likelihood
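The variance construction is of the familiar sandwich form under an independence working assumption, with score contributions summed within clusters (notation assumed for illustration):

```latex
\[
\widehat{\operatorname{Var}}(\hat{\beta}) = A_n^{-1} B_n A_n^{-1},
\qquad
B_n = \sum_{i=1}^{n} \left( \sum_{j=1}^{n_i} U_{ij}(\hat{\beta}) \right)^{\otimes 2},
\]
```

where A_n is minus the derivative of the estimating function and U_ij is the contribution of member j of cluster i; no model for the within-cluster dependence is required.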
In a family-based genetic study such as the Framingham Heart Study (FHS), longitudinal trait measurements are recorded on subjects collected from families. Observations on subjects from the same family are correlated due to shared genetic composition or environmental factors such as diet. The data have a 3-level structure, with measurements nested in subjects and subjects nested in families. We propose a semiparametric variance components model to describe the phenotype observed at a time point as the sum of a nonparametric population mean function, a nonparametric random quantitative trait locus (QTL) effect, a shared environmental effect, a residual random polygenic effect, and measurement error. One feature of the model is that we do not assume a parametric functional form for the age-dependent QTL effect; instead, we use a penalized-spline-based method to fit the model. We obtain a nonparametric estimate of the QTL heritability, defined as the ratio of the QTL variance to the total phenotypic variance. We use simulation studies to investigate the performance of the proposed methods and apply them to the FHS systolic blood pressure data to estimate the age-specific QTL effect at 62 cM on chromosome 17.
Genome-wide linkage study; Multivariate longitudinal data; Penalized splines; Quantitative trait locus
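The decomposition described above can be sketched as follows (notation assumed for illustration): for subject j in family i at age t,

```latex
\[
y_{ij}(t) = \mu(t) + q_{ij}(t) + c_i + g_{ij} + e_{ij}(t),
\qquad
h_q^2(t) = \frac{\sigma_q^2(t)}{\sigma_q^2(t) + \sigma_c^2 + \sigma_g^2 + \sigma_e^2},
\]
```

with nonparametric mean mu(t), random age-dependent QTL effect q_ij(t), shared environmental effect c_i, residual polygenic effect g_ij, and measurement error e_ij(t); h_q^2(t) is the age-specific QTL heritability, and both mu(t) and q_ij(t) are fitted with penalized splines.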
We propose a method for testing gene–environment (G × E) interactions on a complex trait in family-based studies in which a phenotypic ascertainment criterion has been imposed. This novel approach employs G-estimation, a semiparametric estimation technique from the causal inference literature, to avoid modeling of the association between the environmental exposure and the phenotype, to gain robustness against unmeasured confounding due to population substructure, and to acknowledge the ascertainment conditions. The proposed test allows for incomplete parental genotypes. It is compared by simulation studies to an analogous conditional likelihood–based approach and to the QBAT-I test, which also invokes the G-estimation principle but ignores ascertainment. We apply our approach to a study of chronic obstructive pulmonary disease.
Causal inference; COPD; Family-based association; G-estimation; Gene–environment interaction
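A heavily hedged sketch of what a G-estimating equation for a G × E parameter psi can look like (illustrative only, not the authors' exact equation): for offspring genotype G, exposure E, trait Y, and parental genotypes P,

```latex
\[
\sum_{i} \left\{ Y_i - \psi\, G_i E_i \right\} E_i \left\{ G_i - E(G_i \mid P_i) \right\} = 0,
\]
```

where E(G_i | P_i) follows from Mendelian transmission. Because the genotype is centered on its expectation given the parents, no model for the E–Y association is needed and confounding by population substructure is avoided, which is the appeal of the G-estimation principle described above.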
In high-throughput cancer genomic studies, markers identified from the analysis of single data sets often suffer a lack of reproducibility because of the small sample sizes. An ideal solution is to conduct large-scale prospective studies, which are extremely expensive and time consuming. A cost-effective remedy is to pool data from multiple comparable studies and conduct integrative analysis. Integrative analysis of multiple data sets is challenging because of the high dimensionality of genomic measurements and heterogeneity among studies. In this article, we propose a sparse boosting approach for marker identification in integrative analysis of multiple heterogeneous cancer diagnosis studies with gene expression measurements. The proposed approach can effectively accommodate the heterogeneity among multiple studies and identify markers with consistent effects across studies. Simulation shows that the proposed approach has satisfactory identification results and outperforms alternatives including an intensity approach and meta-analysis. The proposed approach is used to identify markers of pancreatic cancer and liver cancer.
Cancer genomics; Marker identification; Sparse boosting
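The generic building block can be illustrated with componentwise L2 boosting (a hedged sketch; the paper's sparse boosting for multiple heterogeneous studies adds a sparsity-inducing selection criterion and study-specific structure):

```r
set.seed(1)
n <- 100; p <- 50
X <- scale(matrix(rnorm(n * p), n, p))      # gene expression
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)       # outcome

beta <- rep(0, p); nu <- 0.1                # nu: step size (shrinkage)
r <- y - mean(y)                            # current residuals
for (m in 1:200) {
  b <- colSums(X * r) / colSums(X^2)        # per-covariate LS coefficient
  j <- which.max(abs(b) * sqrt(colSums(X^2)))  # best-fitting covariate
  beta[j] <- beta[j] + nu * b[j]            # weak update of that covariate
  r <- r - nu * b[j] * X[, j]
}
which(beta != 0)                            # identified markers
```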
Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty of discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method “Remove Unwanted Variation, 2-step” (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as ComBat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than the other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that, while there may be promise, substantial challenges remain.
Batch effect; Control gene; Differential expression; Factor analysis; SVA; Unwanted variation
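A hedged two-step sketch of the idea: factor-analyze the negative control genes (plain SVD here), then include the estimated unwanted-variation factor as a covariate in the gene-wise regressions. The data and the control-gene index below are simulated stand-ins:

```r
set.seed(1)
n <- 20; p <- 500
batch <- rep(c(-1, 1), each = n / 2)          # unwanted variation
sex   <- rep(c(0, 1), n / 2)                  # biological factor of interest
Y <- matrix(rnorm(n * p), n, p) + outer(batch, rnorm(p))
Y[, 1:50] <- Y[, 1:50] + outer(sex, rep(1, 50))  # truly DE genes
ctl <- 101:200                                # negative control genes

W1 <- svd(scale(Y[, ctl], scale = FALSE))$u[, 1]  # 1 unwanted factor
fit <- lm(Y ~ sex + W1)                       # adjusted gene-wise fits
tt  <- sapply(summary(fit), function(s) s$coefficients["sex", "t value"])
summary(abs(tt[1:50]))                        # signal genes stand out
```

With more unwanted factors one would keep several singular vectors; choosing that number is a key practical issue.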
Frailty models are useful for measuring unobserved heterogeneity in the risk of failure across clusters and for providing cluster-specific risk prediction. In a frailty model, the latent frailties shared by members within a cluster are assumed to act multiplicatively on the hazard function. To obtain parameter and frailty variate estimates, we consider the hierarchical likelihood (H-likelihood) approach (Ha, Lee and Song, 2001. Hierarchical-likelihood approach for frailty models. Biometrika 88, 233–243), in which the latent frailties are treated as “parameters” and estimated jointly with the other parameters of interest. We find that the H-likelihood estimators perform well when the censoring rate is low; however, they are substantially biased when the censoring rate is moderate to high. In this paper, we propose a simple and easy-to-implement bias-correction method for the H-likelihood estimators under a shared frailty model. We also extend the method to a multivariate frailty model, which incorporates a complex dependence structure within clusters. We conduct an extensive simulation study and show that the proposed approach performs very well for censoring rates as high as 80%. We also illustrate the method with a breast cancer data set. Since the H-likelihood is the same as the penalized likelihood function, the proposed bias-correction method is also applicable to penalized likelihood estimators.
Frailty model; Hierarchical likelihood; Multivariate survival; NPMLE; Penalized likelihood; Semiparametric
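For orientation, a shared gamma-frailty model can be fitted via the penalized likelihood route in the survival package (illustrative; the paper's bias correction is a separate step applied on top of such estimates):

```r
library(survival)
set.seed(1)
cl <- rep(1:50, each = 4)                  # 50 clusters of size 4
u  <- rep(rgamma(50, 2, 2), each = 4)      # shared frailties
x  <- rnorm(200)
t0 <- rexp(200, rate = u * exp(0.5 * x))   # event times
c0 <- rexp(200, rate = 0.2)                # censoring times
y  <- pmin(t0, c0); d <- as.integer(t0 <= c0)

fit <- coxph(Surv(y, d) ~ x + frailty(cl, distribution = "gamma"))
summary(fit)
```

Raising the censoring rate (e.g. rate = 2 for c0) is a simple way to reproduce the regime in which the abstract reports substantial bias for uncorrected estimators.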
To estimate an overall treatment difference with data from a randomized comparative clinical study, baseline covariates are often utilized to increase estimation precision. Using the standard analysis of covariance technique for making inferences about such an average treatment difference may not be appropriate, especially when the fitted model is nonlinear. On the other hand, the novel augmentation procedure recently studied, for example, by Zhang and others (2008. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics 64, 707–715) is quite flexible. However, it is generally unclear how to select covariates for augmentation effectively. An overly adjusted estimator may inflate the variance and in some cases be biased. Furthermore, the results from standard inference procedures that ignore the sampling variation from the variable selection process may not be valid. In this paper, we first propose an estimation procedure that augments the simple treatment contrast estimator directly with covariates. The new proposal is asymptotically equivalent to the aforementioned augmentation method. To select covariates, we utilize the standard lasso procedure. Furthermore, to make valid inferences from the resulting lasso-type estimator, a cross-validation method is used. The validity of the new proposal is justified theoretically and empirically. We illustrate the procedure extensively with a well-known primary biliary cirrhosis clinical trial data set.
ANCOVA; Cross validation; Efficiency augmentation; Mayo PBC data; Semi-parametric efficiency
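A hedged sketch of directly augmenting the simple treatment contrast with lasso-selected covariates (simulated data; the paper's cross-validation-based inference is more involved than this):

```r
library(glmnet)
set.seed(1)
n <- 300; p <- 20
Z <- matrix(rnorm(n * p), n, p)                # baseline covariates
A <- rbinom(n, 1, 0.5)                         # randomized arm
y <- 0.3 * A + Z[, 1] + 0.5 * Z[, 2] + rnorm(n)

delta <- mean(y[A == 1]) - mean(y[A == 0])     # simple contrast
cf  <- cv.glmnet(Z, y)                         # lasso for covariate selection
gam <- as.vector(coef(cf, s = "lambda.min"))[-1]
imb <- colMeans(Z[A == 1, , drop = FALSE]) -
       colMeans(Z[A == 0, , drop = FALSE])     # chance covariate imbalance
c(simple = delta, augmented = delta - sum(imb * gam))
```

Subtracting the imbalance term removes chance covariate imbalance between arms, which is the source of the efficiency gain.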
Accommodating general patterns of confounding in sample size/power calculations for observational studies is extremely challenging, both technically and scientifically. While employing previously implemented sample size/power tools is appealing, they typically ignore important aspects of the design/data structure. In this paper, we show that sample size/power calculations that ignore confounding can be much more unreliable than is conventionally thought; using real data from the US state of North Carolina, naive calculations yield sample size estimates that are half those obtained when confounding is appropriately acknowledged. Unfortunately, eliciting realistic design parameters for confounding mechanisms is difficult. To overcome this, we propose a novel two-stage strategy for observational study design that can accommodate arbitrary patterns of confounding. At the first stage, researchers establish bounds for power that facilitate the decision of whether or not to initiate the study. At the second stage, internal pilot data are used to estimate the key scientific inputs needed to obtain realistic sample size/power calculations. Our results indicate that the strategy is effective at replicating gold-standard calculations based on knowing the true confounding mechanism. Finally, we show that consideration of the nature of confounding is a crucial aspect of the elicitation process; depending on whether the confounder is positively or negatively associated with the exposure of interest and the outcome, naive power calculations can either under- or overestimate the required sample size. Throughout, simulation is advocated as the only general means of obtaining realistic estimates of statistical power; we describe, and provide in an R package, a simple algorithm for estimating power for a case–control study.
Case–control study; Observational study design; Power; Sample size; Simulation
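A minimal simulation-based power calculation for a case–control study with a single confounder (an illustrative stand-in for the algorithm provided in the paper's R package):

```r
power_sim <- function(n_case = 250, n_ctrl = 250, or_x = 1.5,
                      nsim = 300, alpha = 0.05) {
  mean(replicate(nsim, {
    N  <- 20000                                # source population
    c1 <- rbinom(N, 1, 0.4)                    # confounder
    x  <- rbinom(N, 1, plogis(-1 + c1))        # exposure, tied to confounder
    y  <- rbinom(N, 1, plogis(-3 + log(or_x) * x + 0.7 * c1))
    s  <- c(sample(which(y == 1), n_case), sample(which(y == 0), n_ctrl))
    fit <- glm(y[s] ~ x[s] + c1[s], family = binomial)
    summary(fit)$coefficients[2, "Pr(>|z|)"] < alpha
  }))
}
power_sim()
```

Flipping the sign of the confounder's association with exposure (e.g. plogis(-1 - c1)) illustrates the under- versus overestimation phenomenon noted above.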
Methods for causal inference regarding health effects of air quality regulations are met with unique challenges because (1) changes in air quality are intermediates on the causal pathway between regulation and health, (2) regulations typically affect multiple pollutants on the causal pathway towards health, and (3) regulating a given location can affect pollution at other locations, that is, there is interference between observations. We propose a principal stratification method designed to examine causal effects of a regulation on health that are and are not associated with causal effects of the regulation on air quality. A novel feature of our approach is the accommodation of a continuously scaled multivariate intermediate response vector representing multiple pollutants. Furthermore, we use a spatial hierarchical model for potential pollution concentrations and ultimately use estimates from this model to assess validity of assumptions regarding interference. We apply our method to estimate causal effects of the 1990 Clean Air Act Amendments among approximately 7 million Medicare enrollees living within 6 miles of a pollution monitor.
Air pollution; Bayesian statistics; Causal inference; Principal stratification; Spatial data
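The principal-stratification contrast at the heart of the approach can be sketched in standard potential-outcomes notation (assumed here for illustration): with regulation indicator a, potential pollution vector S(a), and potential health outcome Y(a),

```latex
\[
\mathrm{CE}(s_1, s_0) = E\left[ Y(1) - Y(0) \mid S(1) = s_1,\; S(0) = s_0 \right],
\]
```

separating health effects that accompany a causal change in air quality (s_1 ≠ s_0) from those that occur without one (s_1 = s_0); the continuously scaled multivariate S is what distinguishes this setting from standard binary principal stratification.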
A measure of explained risk is developed for application with the proportional hazards model. The statistic, which is called the estimated explained relative risk, has a simple analytical form and is unaffected by censoring that is independent of survival time conditional on the covariates. An asymptotic confidence interval for the limiting value of the estimated explained relative risk is derived, and the role of individual factors in the computation of its estimate is established. Simulations are performed to compare the results of the estimated explained relative risk to other known explained risk measures with censored data. Prostate cancer data are used to demonstrate an analysis incorporating the proposed approach.
Censored data; Entropy; Explained risk; Proportional hazards; Relative risk
Randomized trials with dropouts or censored data and discrete time-to-event outcomes are frequently analyzed using the Kaplan–Meier or product-limit (PL) estimation method. However, the PL method assumes that the censoring mechanism is noninformative, and when this assumption is violated, the inferences may not be valid. We propose an expanded PL method within a Bayesian framework to incorporate an informative censoring mechanism and perform sensitivity analysis on estimates of the cumulative incidence curves. The expanded method uses a model, which can be viewed as a pattern-mixture model, where the odds of having an event during the follow-up interval (tk−1,tk], conditional on being at risk at tk−1, differ across the patterns of missing data. For each interval, the sensitivity parameters relate the odds of an event for subjects from a missing-data pattern to the odds for the observed subjects. The large number of sensitivity parameters is reduced by treating them as random, assumed to follow a log-normal distribution with prespecified mean and variance. We then vary the mean and variance to explore the sensitivity of inferences. The missing at random (MAR) mechanism is a special case of the expanded model, thus allowing exploration of the sensitivity of inferences to departures from the MAR assumption. The proposed approach is applied to data from the TRial Of Preventing HYpertension.
Clinical trials; Hypertension; Ignorability index; Missing data; Pattern-mixture model; TROPHY trial
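The sensitivity structure can be sketched as follows (notation assumed for illustration): for interval (t_{k-1}, t_k] and missing-data pattern p,

```latex
\[
\operatorname{odds}_{k}^{(p)} = \xi_{k}^{(p)} \times \operatorname{odds}_{k}^{(\mathrm{obs})},
\qquad
\xi_{k}^{(p)} \sim \mathrm{LogNormal}(\mu, \sigma^2),
\]
```

so xi = 1 (mu = 0, sigma^2 = 0) recovers MAR, and varying (mu, sigma^2) traces out departures from it.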
Family-based association studies have been widely used to identify associations between diseases and genetic markers. It is known that genotyping uncertainty is inherent both in directly genotyped or sequenced DNA variations and in data imputed in silico. This uncertainty can lead to genotyping errors and missingness, and can negatively impact the power and Type I error rates of family-based association studies even when it is independent of disease status. Compared with studies using unrelated subjects, there are very few methods that address the issue of genotyping uncertainty for family-based designs. The few existing attempts have mostly aimed to correct the bias caused by genotyping errors. Without properly addressing the issue, the conventional testing strategy, i.e. family-based association tests using called genotypes, can yield invalid statistical inferences. Here, we propose a new test to address the challenges of analyzing case-parents data by using calls with high accuracy and modeling genotype-specific call rates. Our simulations show that, compared with the conventional strategy and an alternative test, our new test has improved performance in the presence of substantial uncertainty and similar performance when the uncertainty level is low. We also demonstrate the advantages of our new method by applying it to imputed markers from a genome-wide case-parents association study.
Case-parents design; Family-based association tests; Genotype-specific missingness; Genotyping uncertainty; Imputed genotypes
We investigate methods for regression analysis when covariates are measured with error. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies the classical measurement error model but may not have repeated measurements. In addition to the surrogate variables available for subjects in the calibration sample, we assume that an instrumental variable (IV) is available for all study subjects. An IV is correlated with the unobserved true exposure variable and hence is useful in the estimation of the regression coefficients. We propose a robust best linear estimator that uses all the available data and is the most efficient among a class of consistent estimators. The proposed estimator is shown to be consistent and asymptotically normal under very weak distributional assumptions. For Poisson or linear regression, the proposed estimator is consistent even if the measurement error from the surrogate or IV is heteroscedastic. Finite-sample performance of the proposed estimator is examined and compared with other estimators via intensive simulation studies. The proposed method and other methods are applied to a bladder cancer case–control study.
Calibration sample; Estimating equation; Heteroscedastic measurement error; Nonparametric correction
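The model ingredients described above can be written compactly (notation assumed for illustration): for true exposure X, surrogate W observed in the calibration sample, and instrument Z observed for everyone,

```latex
\[
W = X + U, \quad E(U \mid X) = 0;
\qquad
\operatorname{Cov}(Z, X) \ne 0, \quad \operatorname{Cov}(Z, U) = 0,
\]
```

so Z carries information about X without sharing its measurement error, which is what makes the IV useful for estimating the regression coefficients.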
Resampling-based expression pathway analysis techniques have been shown to preserve type I error rates, in contrast to simple gene-list approaches that implicitly assume the independence of genes in ranked lists. However, resampling is intensive in computation time and memory requirements. We describe accurate analytic approximations to permutations of score statistics, including novel approaches for Pearson's correlation and summed score statistics, that perform well even for relatively small sample sizes. Our approach preserves the essence of permutation pathway analysis, but with greatly reduced computation. Extensions for the inclusion of covariates and censored data are described, and we test the performance of our procedures using simulations based on real datasets. These approaches have been implemented in the new R package safeExpress.
Gene sets; Multiple hypothesis testing; Permutation approximation
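The flavor of the analytic approximations can be seen in a standard identity for a simple linear statistic: for T = sum_i a_i b_{pi(i)} under a uniformly random permutation pi of {1, ..., n},

```latex
\[
E_{\pi}(T) = n\,\bar{a}\,\bar{b},
\qquad
\operatorname{Var}_{\pi}(T) = \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2 \sum_{j=1}^{n} (b_j - \bar{b})^2,
\]
```

so permutation p-values can be approximated from moments without any resampling; the paper extends such moment-based results to score statistics, Pearson correlations, covariates, and censored data.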