In modern cancer epidemiology, diseases are classified based on pathologic and molecular traits, and different combinations of these traits give rise to many disease subtypes. The effect of predictor variables can be measured by fitting a polytomous logistic model to such data. The differences (heterogeneity) among the relative risk parameters associated with subtypes are of great interest to better understand disease etiology. Due to the heterogeneity of the relative risk parameters, when a risk factor is changed, the prevalence of one subtype may change more than that of another subtype does. Estimation of the heterogeneity parameters is difficult when disease trait information is only partially observed and the number of disease subtypes is large. We consider a robust semiparametric approach based on the pseudo-conditional likelihood for estimating these heterogeneity parameters. Through simulation studies, we compare the robustness and efficiency of our approach with that of the maximum likelihood approach. The method is then applied to analyze the associations of weight gain with risk of breast cancer subtypes using data from the American Cancer Society Cancer Prevention Study II Nutrition Cohort.
doi:10.4172/2155-6180.1000197
PMCID: PMC4270282
PMID: 25530913
Dimension reduction; Disease classification; Estimating equations; Etiologic heterogeneity; Missing disease trait; Polytomous logistic model
Summary
Missing covariate data often arise in biomedical studies, and analysis of such data that ignores subjects with incomplete information may lead to inefficient and possibly biased estimates. A great deal of attention has been paid to handling a single missing covariate or a monotone pattern of missing data when the missingness mechanism is missing at random. In this paper, we propose a semiparametric method for handling non-monotone patterns of missing data. The proposed method relies on the assumption that the missingness mechanism of a variable does not depend on the missing variable itself but may depend on the other missing variables. This mechanism is somewhat less general than the completely non-ignorable mechanism but is sometimes more flexible than the missing at random mechanism where the missingness mechansim is allowed to depend only on the completely observed variables. The proposed approach is robust to misspecification of the distribution of the missing covariates, and the proposed mechanism helps to nullify (or reduce) the problems due to non-identifiability that result from the non-ignorable missingness mechanism. The asymptotic properties of the proposed estimator are derived. Finite sample performance is assessed through simulation studies. Finally, for the purpose of illustration we analyze an endometrial cancer dataset and a hip fracture dataset.
doi:10.1111/biom.12159
PMCID: PMC4061254
PMID: 24571224
Dimension reduction; Estimating equations; Missing at random; Non-ignorable missing data; Robust method
We take a semiparametric approach in fitting a linear transformation model to a right censored data when predictive variables are subject to measurement errors. We construct consistent estimating equations when repeated measurements of a surrogate of the unobserved true predictor are available. The proposed approach applies under minimal assumptions on the distributions of the true covariate or the measurement errors. We derive the asymptotic properties of the estimator and illustrate the characteristics of the estimator in finite sample performance via simulation studies. We apply the method to analyze an AIDS clinical trial data set that motivated the work.
doi:10.1111/biom.12119
PMCID: PMC3954407
PMID: 24350758
Counting process; Estimating equation; Induced hazard; Kernel density; Non-differential measurement errors; U-statistics
Summary
With advances in modern medicine and clinical diagnosis, case-control data with characterization of finer subtypes of cases are often available. In matched case-control studies, missingness in exposure values often leads to deletion of entire stratum, and thus entails a significant loss in information. When subtypes of cases are treated as categorical outcomes, the data are further stratified and deletion of observations becomes even more expensive in terms of precision of the category-specific odds-ratio parameters, especially using the multinomial logit model. The stereotype regression model for categorical responses lies intermediate between the proportional odds and the multinomial or baseline category logit model. The use of this class of models has been limited as the structure of the model implies certain inferential challenges with non-identifiability and non-linearity in the parameters. We illustrate how to handle missing data in matched case-control studies with finer disease sub-classification within the cases under a stereotype regression model. We present both a Monte Carlo based full Bayesian approach as well as an expectation/conditional maximization algorithm for estimation of model parameters in presence of a completely general missingness mechanism. We illustrate our methods by using data from an ongoing matched case-control study of colorectal cancer. Simulation results are presented under various missing data mechanisms and departures from modeling assumptions.
doi:10.1111/j.1541-0420.2010.01453.x
PMCID: PMC3119773
PMID: 20560931
Conditional likelihood; Non-ignorable missingness; Proportional odds; Stages of cancer; Vector generalized linear model
Summary
Complex diseases like cancers can often be classified into subtypes using various pathological and molecular traits of the disease. In this article, we develop methods for analysis of disease incidence in cohort studies incorporating data on multiple disease traits using a two-stage semiparametric Cox proportional hazards regression model that allows one to examine the heterogeneity in the effect of the covariates by the levels of the different disease traits. For inference in the presence of missing disease traits, we propose a generalization of an estimating equation approach for handling missing cause of failure in competing-risk data. We prove asymptotic unbiasedness of the estimating equation method under a general missing-at-random assumption and propose a novel influence-function-based sandwich variance estimator. The methods are illustrated using simulation studies and a real data application involving the Cancer Prevention Study II nutrition cohort.
doi:10.1093/biomet/asq036
PMCID: PMC3372244
PMID: 22822252
Competing-risk; Etiologic heterogeneity; Influence function; Missing cause of failure; Partial likelihood; Proportional hazard regression; Two-stage model
Summary
We propose a semiparametric Bayesian method for handling measurement error in nutritional epidemiological data. Our goal is to estimate nonparametrically the form of association between a disease and exposure variable while the true values of the exposure are never observed. Motivated by nutritional epidemiological data we consider the setting where a surrogate covariate is recorded in the primary data, and a calibration data set contains information on the surrogate variable and repeated measurements of an unbiased instrumental variable of the true exposure. We develop a flexible Bayesian method where not only is the relationship between the disease and exposure variable treated semiparametrically, but also the relationship between the surrogate and the true exposure is modeled semiparametrically. The two nonparametric functions are modeled simultaneously via B-splines. In addition, we model the distribution of the exposure variable as a Dirichlet process mixture of normal distributions, thus making its modeling essentially nonparametric and placing this work into the context of functional measurement error modeling. We apply our method to the NIH-AARP Diet and Health Study and examine its performance in a simulation study.
doi:10.1111/j.1541-0420.2009.01309.x
PMCID: PMC2888615
PMID: 19673858
B-splines; Dirichlet process prior; Gibbs sampling; Measurement error; Metropolis-Hastings algorithm; Partly linear model
Typically locus specific genotype data do not contain information regarding the gametic phase of haplotypes, especially when an individual is heterozygous at more than one locus among a large number of linked polymorphic loci. Thus, studying disease-haplotype association using unphased genotype data is essentially a problem of handling a missing covariate in a case-control design. There are several methods for estimating a disease-haplotype association parameter in a matched case-control study. Here we propose a conditional likelihood approach for inference regarding the disease-haplotype association using unphased genotype data arising from a matched case-control study design. The proposed method relies on a logistic disease risk model and a Hardy-Weinberg equilibrium (HWE) among the control population only. We develop an expectation and conditional maximization (ECM) algorithm for jointly estimating the haplotype frequency and the disease-haplotype association parameter(s). We apply the proposed method to analyze the data from the Alpha-Tocopherol, Beta-Carotene Cancer prevention study, and a matched case-control study of breast cancer patients conducted in Israel. The performance of the proposed method is evaluated via simulation studies.
doi:10.2202/1557-4679.1079
PMCID: PMC2835450
PMID: 20231916