Exact analytic expressions are developed for the average power of the Benjamini and Hochberg false discovery control procedure. The result is based on explicit computation of the joint probability distribution of the total number of rejections and the number of false rejections, and is expressed in terms of the cumulative distribution functions of the p-values of the hypotheses. An example of analytic evaluation of the average power is given. The result is confirmed by numerical experiments and applied to a meta-analysis of three clinical studies in mammography.
doi:10.2202/1557-4679.1103
PMCID: PMC3020656
PMID: 21243075
hypothesis testing; multiple comparisons; false discovery; distribution of rejections; meta-analysis
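For readers who want the procedure itself in front of them, the standard Benjamini and Hochberg step-up rule can be sketched as below. This is a minimal illustration, not the analytic power computation described in the abstract: the function names `benjamini_hochberg` and `realized_power`, the level `q`, and the example p-values and truth labels are all assumptions made for the sketch. `realized_power` is only the fraction of false null hypotheses rejected in a single data set; the average power studied in the abstract is an expectation of a quantity of this kind.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Standard step-up rule: find the largest rank k such that the k-th
    smallest p-value is <= k * q / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def realized_power(rejected, is_false_null):
    """Fraction of false null hypotheses rejected in this one data set."""
    n_false = sum(is_false_null)
    if n_false == 0:
        return 0.0
    hits = sum(1 for r, f in zip(rejected, is_false_null) if r and f)
    return hits / n_false


if __name__ == "__main__":
    # Hypothetical p-values and truth labels, for illustration only.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    truth = [True, True, True, True, False, False, False, False, False, False]
    rej = benjamini_hochberg(p, q=0.05)
    print("rejections:", sum(rej))
    print("realized power:", realized_power(rej, truth))
```

Because the rule is step-up, a p-value that fails its own threshold can still be rejected when a larger p-value passes a later threshold, which is why the loop keeps the largest passing rank rather than stopping at the first failure.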
We describe analytic approaches for study designs that, like large simple trials, can be better characterized as longitudinal studies with baseline randomization than as either a pure randomized experiment or a purely observational study. We (i) discuss the intention-to-treat effect as an effect measure for randomized studies, (ii) provide a formal definition of causal effect for longitudinal studies, (iii) describe several methods -- based on inverse probability weighting and g-estimation -- to estimate such effects, (iv) present an application of these methods to a naturalistic trial of antipsychotics on symptom severity of schizophrenia, and (v) discuss the relative advantages and disadvantages of each method.
doi:10.2202/1557-4679.1117
PMCID: PMC2835458
PMID: 20231914
This paper considers joint analysis of current status and marker data using a threshold model based on first hitting times. A failure time is defined as the time at which a subject's latent health status process first decreases to zero. We extend the bivariate Wiener process model in Whitmore et al. (1998) to the case when only current status data are available. We develop maximum likelihood estimation procedures and provide simulation studies. We apply our methods to a motivating example involving liver tumors in mice.
doi:10.2202/1557-4679.1122
PMCID: PMC2835457
PMID: 20231913
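To make the first-hitting-time idea in the preceding abstract concrete, the sketch below evaluates the well-known distribution of the time at which a one-dimensional Wiener process with drift, started at a positive level, first reaches zero, and uses it in a current status log-likelihood. It is a univariate simplification that ignores the marker process, so it is not the bivariate model of the paper. The names `hitting_time_cdf` and `current_status_loglik`, the parameters `x0`, `mu`, `sigma`, and the example data are assumptions for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def hitting_time_cdf(t, x0, mu, sigma):
    """P(T <= t), where T is the first time x0 + mu*s + sigma*W(s) hits zero.

    Assumes x0 > 0 and sigma > 0. A negative mu means health declines
    on average. The exponential factor can overflow for extreme
    parameter values; this sketch does not guard against that.
    """
    if t <= 0:
        return 0.0
    s = sigma * math.sqrt(t)
    first = normal_cdf((-x0 - mu * t) / s)
    second = math.exp(-2.0 * mu * x0 / sigma ** 2) * normal_cdf((-x0 + mu * t) / s)
    return first + second


def current_status_loglik(check_times, failed, x0, mu, sigma):
    """Log-likelihood when only 'failed by the check time or not' is seen."""
    loglik = 0.0
    for c, d in zip(check_times, failed):
        p = hitting_time_cdf(c, x0, mu, sigma)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
        loglik += math.log(p) if d else math.log(1.0 - p)
    return loglik


if __name__ == "__main__":
    # Hypothetical examination times and failure indicators.
    times = [0.5, 1.0, 2.0, 4.0]
    status = [False, False, True, True]
    print(current_status_loglik(times, status, x0=1.0, mu=-0.5, sigma=1.0))
```

With current status data each subject contributes only a Bernoulli term, so the likelihood depends on the hitting-time distribution solely through its value at the examination time.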
Bayesian hierarchical models that characterize the distributions of (transformed) gene profiles have proven very useful and flexible in selecting differentially expressed genes across different types of tissue samples (e.g. Lo and Gottardo, 2007). However, the marginal mean and variance of these models are assumed to be the same for different gene clusters and for different tissue types. Moreover, it is not easy to determine which of the many competing Bayesian hierarchical models provides the best fit for a specific microarray data set. To address these two issues, we propose a marginal mixture model that directly models the marginal distribution of transformed gene profiles. Specifically, we approximate the marginal distributions of transformed gene profiles via a three-component mixture of multivariate Normal distributions, each component of which has the same structure of marginal mean vector and covariance matrix as in the Bayesian hierarchical models, but the values can differ. Based on the proposed model, a method is derived to select genes differentially expressed across two types of tissue samples. The derived gene selection method performs well on a real microarray data set and consistently has the best performance (based on class agreement indices) compared with several other gene selection methods on simulated microarray data sets generated from three different mixture models.
doi:10.2202/1557-4679.1093
PMCID: PMC2835454
PMID: 20231912
Researchers of uncommon diseases are often interested in assessing potential risk factors. Given the low incidence of disease, these studies are frequently case-control in design. Such a design allows a sufficient number of cases to be obtained without extensive sampling and can increase efficiency; however, these case-control samples are then biased since the proportion of cases in the sample is not the same as in the population of interest. Methods for analyzing case-control studies have focused on utilizing logistic regression models that provide conditional and not causal estimates of the odds ratio. This article will demonstrate the use of the prevalence probability and case-control weighted targeted maximum likelihood estimation (MLE), as described by van der Laan (2008), in order to obtain causal estimates of the parameters of interest (risk difference, relative risk, and odds ratio). It is meant to be used as a guide for researchers, with step-by-step directions to implement this methodology. We will also present simulation studies that show the improved efficiency of the case-control weighted targeted MLE compared to other techniques.
doi:10.2202/1557-4679.1115
PMCID: PMC2835459
PMID: 20231910
Understanding how long-term clinical outcomes relate to short-term response to therapy is an important topic of research with a variety of applications. In HIV, early measures of viral RNA levels are known to be a strong prognostic indicator of future viral load response. However, mutations observed in the high-dimensional viral genotype at an early time point may change this prognosis. Unfortunately, some subjects may not have a viral genetic sequence measured at the early time point, and the sequence may be missing for reasons related to the outcome. Complete-case analyses of missing data are generally biased when the assumption that data are missing completely at random is not met, and methods incorporating multiple imputation may not be well-suited for the analysis of high-dimensional data. We propose a semiparametric multiple testing approach to the problem of identifying associations between potentially missing high-dimensional covariates and response. Following the recent exposition by Tsiatis, unbiased nonparametric summary statistics are constructed by inversely weighting the complete cases according to the conditional probability of being observed, given the data that are observed for each subject. The resulting summary statistics are unbiased under the assumption of missing at random. We illustrate our approach through an application to data from a recent AIDS clinical trial, and demonstrate finite sample properties with simulations.
doi:10.2202/1557-4679.1102
PMCID: PMC2835453
PMID: 20231909
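The inverse weighting of complete cases mentioned in the preceding abstract can be illustrated with the simplest possible summary statistic, a mean. The sketch below is a generic inverse-probability-weighted estimator, not the multiple testing procedure of the paper. The function name `ipw_mean` and the example numbers are assumptions, and the observation probabilities are taken as known, whereas in practice they would typically be estimated from the data observed on every subject.

```python
def ipw_mean(values, observed, prob_observed):
    """Inverse-probability-weighted mean using complete cases only.

    values[i] is used only when observed[i] is True. prob_observed[i] is
    the (assumed known) probability that subject i is observed given the
    data available on that subject, and must be positive.
    """
    n = len(values)
    total = 0.0
    for y, r, pi in zip(values, observed, prob_observed):
        if r:
            # Each complete case stands in for 1/pi subjects like it.
            total += y / pi
    return total / n


if __name__ == "__main__":
    # Hypothetical data: the third subject's value is missing.
    y = [1.0, 2.0, None, 4.0]
    r = [True, True, False, True]
    pi = [0.5, 1.0, 0.5, 1.0]
    print(ipw_mean(y, r, pi))
```

Dividing by the full sample size rather than by the number of complete cases is what lets the up-weighted complete cases account for the subjects whose values are missing.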
Translational research studies often involve a central study (e.g. a clinical trial or a cohort of patients) and multiple investigators who are each interested in addressing different research questions using the same patient population. However, it is often impossible for the investigators to include all patients in all of the ancillary translational research substudies that are part of the main study. This arises due to time and budgetary constraints and other logistical considerations. In this paper, we propose a prospective Systematic Missing-At-Random study design (SMAR) with planned partially missing covariates collected using a nested random sampling scheme that allows an integrated statistical analysis across all domains of data. We propose an algorithm for data analysis that incorporates the features of the design. We show that the SMAR design is computationally and statistically efficient as well as cost-effective using simulation studies and a published data example. An extension to a two-stage prospective-retrospective design is discussed.
doi:10.2202/1557-4679.1046
PMCID: PMC2835456
PMID: 20231908
Typically, locus-specific genotype data do not contain information regarding the gametic phase of haplotypes, especially when an individual is heterozygous at more than one locus among a large number of linked polymorphic loci. Thus, studying disease-haplotype association using unphased genotype data is essentially a problem of handling a missing covariate in a case-control design. There are several methods for estimating a disease-haplotype association parameter in a matched case-control study. Here we propose a conditional likelihood approach for inference regarding the disease-haplotype association using unphased genotype data arising from a matched case-control study design. The proposed method relies on a logistic disease risk model and a Hardy-Weinberg equilibrium (HWE) among the control population only. We develop an expectation and conditional maximization (ECM) algorithm for jointly estimating the haplotype frequency and the disease-haplotype association parameter(s). We apply the proposed method to analyze the data from the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study, and a matched case-control study of breast cancer patients conducted in Israel. The performance of the proposed method is evaluated via simulation studies.
doi:10.2202/1557-4679.1079
PMCID: PMC2835450
PMID: 20231916
We consider a method for extending instrumental variables methods in order to estimate the overall effect of a treatment or exposure. The approach is designed for settings in which the instrument influences both the treatment of interest and a secondary treatment also influenced by the primary treatment. We demonstrate that, while instrumental variables methods may be used to estimate the joint effects of the primary and secondary treatments, they cannot by themselves be used to estimate the overall effect of the primary treatment. However, instrumental variables methods may be used in conjunction with approaches for estimating the effect of the primary on the secondary treatment to estimate the overall effect of the primary treatment. We consider extending the proposed methods to deal with confounding of the effect of the instrument, mediation of the effect of the instrument by other variables, failure-time outcomes, and time-varying secondary treatments. We motivate our discussion by considering estimation of the overall effect of the type of vascular access among hemodialysis patients.
doi:10.2202/1557-4679.1082
PMCID: PMC2835455
PMID: 20231915
We propose an interaction tree (IT) procedure to optimize the subgroup analysis in comparative studies that involve censored survival times. The proposed method recursively partitions the data into two subsets that show the greatest interaction with the treatment, which results in a number of objectively defined subgroups: in some of them the treatment effect is prominent while in others the treatment may have a negligible or even negative effect. The resultant tree structure can be used to explore the overall interaction between treatment and other covariates and help identify and describe possible target populations on which an experimental treatment demonstrates desired efficacy. We follow the standard CART (Breiman et al., 1984) methodology to develop the interaction tree structure. Variable importance information is extracted via random forests of interaction trees. Both simulated experiments and an analysis of the primary biliary cirrhosis (PBC) data are provided for evaluation and illustration of the proposed procedure.
doi:10.2202/1557-4679.1071
PMCID: PMC2835451
PMID: 20231911
The commonly used two-sample tests of equal area-under-the-curve (AUC), where AUC is based on the linear trapezoidal rule, may have poor properties when observations are missing, even if they are missing completely at random (MCAR). We propose two tests: one that has good properties when data are MCAR and another that has good properties when the data are missing at random (MAR), provided that the pattern of missingness is monotonic. In addition, we discuss other non-parametric tests of hypotheses that are similar, but not identical, to the hypothesis of equal AUCs, but that often have better statistical properties than do AUC tests and may be more scientifically appropriate for many settings.
doi:10.2202/1557-4679.1068
PMCID: PMC2835452
PMID: 20231907
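The linear trapezoidal rule behind the AUC in the preceding abstract, and the way a missing observation can change the computed area, can be sketched as follows. The function names `trapezoid_auc` and `auc_dropping_missing` and the example measurements are assumptions for illustration. Dropping the missing time point and connecting its neighbours is only one simple way to handle missingness, not the tests proposed in the paper.

```python
def trapezoid_auc(times, values):
    """Area under the piecewise-linear curve through (times[i], values[i]).

    times must be in increasing order.
    """
    auc = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        auc += width * (values[i] + values[i - 1]) / 2.0
    return auc


def auc_dropping_missing(times, values):
    """Trapezoidal AUC after discarding time points whose value is None."""
    kept = [(t, v) for t, v in zip(times, values) if v is not None]
    return trapezoid_auc([t for t, _ in kept], [v for _, v in kept])


if __name__ == "__main__":
    # Hypothetical measurements on one subject.
    t = [0.0, 1.0, 2.0, 4.0]
    full = [0.0, 2.0, 2.0, 0.0]
    with_gap = [0.0, None, 2.0, 0.0]  # same subject, visit at t=1 missed
    print(trapezoid_auc(t, full))
    print(auc_dropping_missing(t, with_gap))
```

In this example the missed visit alters the subject's computed area even though the underlying curve is unchanged, which gives some intuition for why missingness can affect AUC-based comparisons.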