This paper extends single-level missing data methods to efficient estimation of a Q-level nested hierarchical general linear model given ignorable missing data with a general missing pattern at any of the Q levels. The key idea is to reexpress a desired hierarchical model as the joint distribution of all variables that are subject to missingness, including the outcome, conditional on all of the covariates that are completely observed, and to estimate the joint model under normal theory. The unconstrained joint model, however, identifies extraneous parameters that are not of interest in subsequent analysis of the hierarchical model and that rapidly multiply as the number of levels, the number of variables subject to missingness, and the number of random coefficients grow. Therefore, the joint model may be extremely high dimensional and difficult to estimate well unless constraints are imposed to avoid the proliferation of extraneous covariance components at each level. Furthermore, the over-identified hierarchical model may produce considerably biased inferences. The challenge is to represent the constraints within the framework of the Q-level model in a way that is uniform without regard to Q, that facilitates efficient computation for any number of levels, and that produces unbiased and efficient analysis of the hierarchical model. Our approach yields Q-step recursive estimation and imputation procedures whose qth-step computation involves only level-q data given higher-level computation components. We illustrate the approach with a study of growth in body mass index in a national sample of elementary school children.
doi:10.1515/ijb-2012-0048
PMCID: PMC3898356
PMID: 24077621
Child Health; Hierarchical General Linear Model; Ignorable Missing Data; Maximum Likelihood; Multiple Imputation
doi:10.2202/1557-4679.1355
PMCID: PMC3608098
PMID: 22499727
Exact analytic expressions are developed for the average power of the Benjamini and Hochberg false discovery control procedure. The result is based on explicit computation of the joint probability distribution of the total number of rejections and the number of false rejections, and is expressed in terms of the cumulative distribution functions of the p-values of the hypotheses. An example of analytic evaluation of the average power is given. The result is confirmed by numerical experiments and applied to a meta-analysis of three clinical studies in mammography.
doi:10.2202/1557-4679.1103
PMCID: PMC3020656
PMID: 21243075
hypothesis testing; multiple comparisons; false discovery; distribution of rejections; meta-analysis
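The two quantities at the heart of this analysis, the total number of rejections and the number of false rejections, are determined by the Benjamini-Hochberg step-up rule itself. As a point of reference, here is a minimal sketch of that standard procedure (the example p-values are illustrative, not data from the paper):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns the (sorted) indices of hypotheses rejected at FDR level q.
    """
    m = len(pvalues)
    # Sort p-values, remembering their original indices.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    # Reject the hypotheses with the k_max smallest p-values.
    return sorted(order[:k_max])

# Illustrative mixture of small (signal-like) and larger (null-like) p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, q=0.05))  # → [0, 1]
```

Note that the step-up search keeps the *largest* qualifying rank, so a p-value may be rejected even if it exceeds its own threshold, as long as some larger-ranked p-value qualifies.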
CpG islands are genome subsequences with an unexpectedly high number of CG dinucleotides. They are typically identified using filtering criteria (e.g., G+C%, expected vs. observed CpG ratio, and length) computed with sliding-window methods. Most such studies implicitly assume that an exhaustive search for CpG islands has been achieved on the genome sequence of interest. We devise a Lexis diagram and explicitly show that filtering-criteria-based definitions of CpG islands are mathematically incomplete and non-operational. These facts imply that sliding-window methods frequently fail to identify a large percentage of subsequences that meet the filtering criteria. We also demonstrate that an exhaustive search is computationally expensive. We develop the Hierarchical Factor Segmentation (HFS) algorithm, a pattern recognition technique with an adaptive model selection device, to overcome the incompleteness and non-operational drawbacks and to achieve effective computation for identifying CpG islands. The concept of a CpG island “core” is introduced and computed using the HFS algorithm, independent of any specific filtering criteria. Upon such a CpG island “core,” a CpG island is constructed using a Lexis diagram. This two-step computational approach provides a nearly exhaustive search for CpG islands that can be practically implemented on whole chromosomes. In a simulation study that realistically mimics CpG-island dynamics through a hidden Markov model, we demonstrate that this approach retains very high sensitivity and specificity, that is, very low rates of false positives and false negatives. Finally, we apply the HFS algorithm to identify CpG island cores on human chromosome 21.
doi:10.2202/1557-4679.1158
PMCID: PMC2818740
PMID: 20148132
AIC and BIC model selection criteria; non-parametric decoding; filtering criteria; hierarchical factor segmentation; human chromosome 21; mathematical incompleteness; methylation
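The filtering criteria the abstract critiques are typically of the Gardiner-Garden and Frommer type: G+C fraction above 50% and an observed/expected CpG ratio above 0.6, evaluated on windows of at least 200 bp. A minimal sketch of those per-window statistics (thresholds are the commonly cited values, and the length criterion would also be checked on real windows):

```python
def window_stats(seq):
    """G+C fraction and observed/expected CpG ratio for one window."""
    n = len(seq)
    g, c = seq.count("G"), seq.count("C")
    cpg = seq.count("CG")
    gc_frac = (g + c) / n
    # Expected CpG count if C and G occurred independently along the window.
    expected = c * g / n
    oe_ratio = cpg / expected if expected > 0 else 0.0
    return gc_frac, oe_ratio

def passes_filter(seq, min_gc=0.5, min_oe=0.6):
    """Does one window meet the G+C% and obs/exp CpG filtering criteria?"""
    gc, oe = window_stats(seq)
    return gc > min_gc and oe > min_oe

# A CpG-rich window passes; an AT-rich window does not.
print(passes_filter("CGCGCGGCGCGCCGCGTACGCG"))  # → True
print(passes_filter("ATATATGCATATTAATATATAT"))  # → False
```

The incompleteness argument in the abstract is about the *definition*, not these statistics: sliding a fixed window and merging hits does not enumerate every subsequence satisfying the criteria, which is what motivates the HFS core-plus-Lexis-diagram construction.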
For both clinical and research purposes, biopsies are used to classify liver damage known as fibrosis on an ordinal multi-state scale ranging from no damage to cirrhosis. Misclassification can arise from reading error (misreading of a specimen) or sampling error (the specimen does not accurately represent the liver). Studies of biopsy accuracy have not attempted to synthesize these two sources of error or to estimate actual misclassification rates from either source. Using data from two studies of reading error and two of sampling error, we find surprisingly large possible misclassification rates, including a greater than 50% chance of misclassification for one intermediate stage of fibrosis. We find that some readers tend to misclassify consistently low or consistently high, and some specimens tend to be misclassified low while others tend to be misclassified high. Non-invasive measures of liver fibrosis have generally been evaluated by comparison to simultaneous biopsy results, but biopsy appears to be too unreliable to be considered a gold standard. Non-invasive measures may therefore be more useful than such comparisons suggest. Both stochastic uncertainty and uncertainty about our model assumptions appear to be substantial. Improved studies of biopsy accuracy would include large numbers of both readers and specimens, greater effort to reduce or eliminate reading error in studies of sampling error, and careful estimation of misclassification rates rather than less useful quantities such as kappa statistics.
doi:10.2202/1557-4679.1139
PMCID: PMC2810974
PMID: 20104258
fibrosis; hepatitis C; kappa statistic; latent variables; misclassification
Epidemiologic research focuses on estimating exposure-disease associations. In some applications the exposure may be dichotomized, for instance when threshold levels of the exposure are of primary public health interest (e.g., consuming 5 or more fruits and vegetables per day may reduce cancer risk). Errors in exposure variables are known to yield biased regression coefficients in exposure-disease models. Methods for bias-correction with continuous mismeasured exposures have been extensively discussed, and are often based on validation substudies, where the “true” and imprecise exposures are observed on a small subsample. In this paper, we focus on biases associated with dichotomization of a mismeasured continuous exposure. The amount of bias, in relation to measurement error in the imprecise continuous predictor, and choice of dichotomization cut point are discussed. Measurement error correction via regression calibration is developed for this scenario, and compared to naïvely using the dichotomized mismeasured predictor in linear exposure-disease models. Properties of the measurement error correction method (i.e., bias, mean-squared error) are assessed via simulations.
doi:10.2202/1557-4679.1143
PMCID: PMC2743435
PMID: 20046953
measurement error correction; dichotomizing covariates; regression calibration
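The bias the abstract studies is easy to reproduce by simulation: dichotomizing a mismeasured continuous exposure attenuates the estimated group difference in outcome. A stylized sketch under assumed values (standard normal true exposure, classical measurement error, cut point at the mean; none of these settings are taken from the paper):

```python
import random

random.seed(0)

n = 200_000
beta = 1.0       # effect of the continuous exposure on the outcome (assumed)
sigma_u = 1.0    # measurement-error SD (assumed)
cut = 0.0        # dichotomization cut point, here at the exposure mean

x = [random.gauss(0, 1) for _ in range(n)]        # true exposure
w = [xi + random.gauss(0, sigma_u) for xi in x]   # mismeasured exposure
y = [beta * xi + random.gauss(0, 1) for xi in x]  # outcome

def group_diff(exposure):
    """Mean outcome above vs. below the cut point."""
    hi = [yi for yi, e in zip(y, exposure) if e > cut]
    lo = [yi for yi, e in zip(y, exposure) if e <= cut]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

true_diff = group_diff(x)   # dichotomizing the true exposure
naive_diff = group_diff(w)  # dichotomizing the mismeasured exposure
print(round(true_diff, 2), round(naive_diff, 2))
```

Under these joint-normality assumptions with the cut at the mean, the naive group difference is attenuated by roughly the square root of the reliability ratio λ = σ²ₓ/(σ²ₓ + σ²ᵤ), here √0.5 ≈ 0.71; regression-calibration-style corrections exploit exactly this kind of relationship, estimated from a validation substudy.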
Observational studies of drugs and medical procedures based on administrative data are increasingly used to inform regulatory and clinical decisions. However, the validity of such studies is often questioned because available data may not contain measurements of many important prognostic variables that guide treatment decisions. Recently, approaches to this problem have been proposed that use instrumental variables (IV) defined at the level of an individual health care provider or aggregation of providers. Implicitly, these approaches attempt to estimate causal effects by using differences in medical practice patterns as a quasi-experiment. Although preference-based IV methods may usefully complement standard statistical approaches, they make assumptions that are unfamiliar to most biomedical researchers and therefore the validity of such analyses can be hard to evaluate. Here, we propose a simple framework based on a single unobserved dichotomous variable that can be used to explore how violations of IV assumptions and treatment effect heterogeneity may bias the standard IV estimator with respect to the average treatment effect in the population. This framework suggests various ways to anticipate the likely direction of bias using both empirical data and commonly available subject matter knowledge, such as whether medications or medical procedures tend to be overused, underused, or often misused. This approach is described in the context of a study comparing the gastrointestinal bleeding risk attributable to different non-steroidal anti-inflammatory drugs.
PMCID: PMC2719903
PMID: 19655038
pharmacoepidemiology; health services research; causal inference; outcomes research; unmeasured confounding; instrumental variables
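The logic of a preference-based IV analysis can be illustrated with a stylized simulation: a binary provider-preference instrument shifts treatment uptake, an unobserved prognostic variable confounds the treated-versus-untreated comparison, and the Wald (ratio) estimator recovers the treatment effect. This is only a sketch of the standard IV estimator with invented data-generating values, not the paper's bias framework:

```python
import random

random.seed(1)

n = 200_000
effect = 1.0    # true treatment effect (assumed)

z = [random.random() < 0.5 for _ in range(n)]   # instrument: provider preference
u = [random.gauss(0, 1) for _ in range(n)]      # unmeasured prognostic variable
# Treatment uptake rises with a favorable preference and with high U.
t = [random.random() < 0.2 + 0.4 * zi + 0.2 * (ui > 0) for zi, ui in zip(z, u)]
y = [effect * ti + ui + random.gauss(0, 1) for ti, ui in zip(t, u)]

def mean_diff(group):
    """Difference in mean outcome between group == True and group == False."""
    a = [yi for yi, g in zip(y, group) if g]
    b = [yi for yi, g in zip(y, group) if not g]
    return sum(a) / len(a) - sum(b) / len(b)

naive = mean_diff(t)   # confounded treated-vs-untreated comparison
# Wald/IV estimator: outcome difference by instrument over uptake difference.
uptake = (sum(ti for ti, zi in zip(t, z) if zi) / sum(z)
          - sum(ti for ti, zi in zip(t, z) if not zi) / (n - sum(z)))
wald = mean_diff(z) / uptake
print(round(naive, 2), round(wald, 2))
```

Here the naive comparison overstates the effect (high-U patients are both sicker and more likely treated), while the Wald estimate is close to 1.0; the paper's framework concerns how violations of the underlying assumptions (instrument independent of U, no direct effect, effect homogeneity) bias this same estimator.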
We consider a method for extending instrumental variables methods in order to estimate the overall effect of a treatment or exposure. The approach is designed for settings in which the instrument influences both the treatment of interest and a secondary treatment also influenced by the primary treatment. We demonstrate that, while instrumental variables methods may be used to estimate the joint effects of the primary and secondary treatments, they cannot by themselves be used to estimate the overall effect of the primary treatment. However, instrumental variables methods may be used in conjunction with approaches for estimating the effect of the primary on the secondary treatment to estimate the overall effect of the primary treatment. We consider extending the proposed methods to deal with confounding of the effect of the instrument, mediation of the effect of the instrument by other variables, failure-time outcomes, and time-varying secondary treatments. We motivate our discussion by considering estimation of the overall effect of the type of vascular access among hemodialysis patients.
PMCID: PMC2669310
PMID: 19381345
instrumental variables; causal inference
Marginal structural models (MSM) are an important class of models in causal inference. Given a longitudinal data structure observed on a sample of n independent and identically distributed experimental units, MSM model the counterfactual outcome distribution corresponding with a static treatment intervention, conditional on user-supplied baseline covariates. Identification of a static treatment regimen-specific outcome distribution based on observational data requires, beyond the standard sequential randomization assumption, the assumption that each experimental unit has positive probability of following the static treatment regimen. The latter assumption is called the experimental treatment assignment (ETA) assumption, and is parameter-specific. In many studies the ETA is violated because some of the static treatment interventions to be compared cannot be followed by all experimental units, due either to baseline characteristics or to the occurrence of certain events over time. For example, the development of adverse effects or contraindications can force a subject to stop an assigned treatment regimen.
In this article we propose causal effect models for a user-supplied set of realistic individualized treatment rules. Realistic individualized treatment rules are defined as treatment rules which always map into the set of possible treatment options. Thus, causal effect models for realistic treatment rules do not rely on the ETA assumption and are fully identifiable from the data. Further, these models can be chosen to generalize marginal structural models for static treatment interventions. The estimating function methodology of Robins and Rotnitzky (1992) (analogous to its application in Murphy et al. (2001) for a single treatment rule) provides us with the corresponding locally efficient double robust inverse probability of treatment weighted estimator.
In addition, we define causal effect models for “intention-to-treat” regimens. The proposed intention-to-treat interventions enforce a static intervention until the time point at which the next treatment does not belong to the set of possible treatment options, at which point the intervention is stopped. We provide locally efficient estimators of such intention-to-treat causal effects.
PMCID: PMC2613338
PMID: 19122793
counterfactual; causal effect; causal inference; double robust estimating function; dynamic treatment regimen; estimating function; individualized stopped treatment regimen; individualized treatment rule; inverse probability of treatment weighted estimating functions; locally efficient estimation; static treatment intervention
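The estimators discussed above are longitudinal and double robust, but the ETA (positivity) issue already appears in the simplest single-time-point analogue of an IPTW estimator for an MSM parameter. A minimal sketch with invented data-generating values (not the article's estimator):

```python
import random

random.seed(2)

n = 100_000
w = [random.random() < 0.5 for _ in range(n)]   # baseline covariate (confounder)
# Treatment probability given W; known here, estimated in practice.
p_a = [0.8 if wi else 0.2 for wi in w]
a = [random.random() < p for p in p_a]          # treatment actually received
y = [2.0 * ai + 1.0 * wi + random.gauss(0, 1) for ai, wi in zip(a, w)]

# Unadjusted mean outcome among the treated is confounded by W.
treated = [yi for yi, ai in zip(y, a) if ai]
naive = sum(treated) / len(treated)

# IPTW estimate of the counterfactual mean under the static intervention
# "treat everyone": weight each treated outcome by 1 / P(A=1 | W).
iptw = sum(yi / p for yi, ai, p in zip(y, a, p_a) if ai) / n

print(round(naive, 2), round(iptw, 2))  # truth is E[Y(1)] = 2.0 + E[W] = 2.5
```

If P(A=1 | W) were zero for some covariate stratum, the weight 1/P(A=1 | W) would be undefined for that stratum: this is exactly the ETA violation motivating the move from static interventions to realistic individualized rules.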
Consider a longitudinal observational or controlled study in which one collects chronological data over time on a random sample of subjects. The time-dependent process one observes on each subject contains time-dependent covariates, time-dependent treatment actions, and an outcome process or single final outcome of interest. A statically optimal individualized treatment rule (as introduced in van der Laan et al. (2005) and Petersen et al. (2007)) is a treatment rule which at any point in time conditions on a user-supplied subset of the past, computes the future static treatment regimen that maximizes a (conditional) mean future outcome of interest, and applies the first treatment action of the latter regimen. In particular, Petersen et al. (2007) clarified that, in order to be statically optimal, an individualized treatment rule should not depend on the observed treatment mechanism. Petersen et al. (2007) further developed estimators of statically optimal individualized treatment rules based on a past capturing all confounding of past treatment history on outcome. In practice, however, one typically wishes to find individualized treatment rules responding to a user-supplied subset of the complete observed history, which may not be sufficient to capture all confounding. The current article provides an important advance on Petersen et al. (2007) by developing locally efficient double robust estimators of statically optimal individualized treatment rules responding to such a user-supplied subset of the past. However, failure to capture all confounding comes at a price: the static optimality of the resulting rules becomes origin-specific. We explain origin-specific static optimality and discuss the practical importance of the proposed methodology. We further present the results of a data analysis in which we estimate a statically optimal rule for switching antiretroviral therapy among patients infected with resistant HIV.
PMCID: PMC2613337
PMID: 19122792
counterfactual; causal inference; double robust estimating function; dynamic treatment regime; history-adjusted marginal structural model; inverse probability weighting