The Rao–Blackwell theorem offers a procedure for converting a crude unbiased estimator of a parameter θ into a “better” one, in fact unique and optimal if the improvement is based on a minimal sufficient statistic that is complete. In contrast, behind every minimal sufficient statistic that is not complete, there is an improvable Rao–Blackwell improvement. This is illustrated via a simple example based on the uniform distribution, in which a rather natural Rao–Blackwell improvement is uniformly improvable. Furthermore, in this example the maximum likelihood estimator is inefficient, and an unbiased generalized Bayes estimator performs exceptionally well. Counterexamples of this sort can be useful didactic tools for explaining the true nature of a methodology and possible consequences when some of the assumptions are violated.
[Received December 2014. Revised September 2015.]
Improper prior; Minimal sufficiency; Uniform distribution; Uniformly minimum-variance unbiased estimator.
There is growing frustration with the concept of the p-value. Besides having an ambiguous interpretation, the p-value can be made as small as desired by increasing the sample size, n. The p-value is outdated and does not make sense with big data: everything becomes statistically significant. The root of the problem with the p-value lies in the comparison of group means. We argue that statistical uncertainty should be measured on the individual, not the group, level. Consequently, standard deviation (SD), not standard error (SE), error bars should be used to graphically present the data on two groups. We introduce a new measure based on the discrimination of individuals/objects from two groups, and call it the D-value. The D-value can be viewed as the n-of-1 p-value because it is computed in the same way as p while letting n equal 1. We show how the D-value is related to discrimination probability and the area above the receiver operating characteristic (ROC) curve. The D-value has a clear interpretation as the proportion of patients who get worse after the treatment, and as such makes it easier to weigh the likelihood of events under different scenarios.
[Received January 2015. Revised June 2015.]
Discrimination error; Effect size; ROC curve; Significance testing.
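The discrimination probability that the D-value is related to can be sketched numerically. The function below (a hypothetical name, assuming higher values indicate a better outcome) computes the proportion of (control, treated) pairs in which the treated value is worse, counting ties as half as in the Mann–Whitney statistic; it is a sketch of the discrimination idea rather than the authors' exact n-of-1 computation:

```python
import itertools

def d_value(control, treatment):
    """Proportion of (control, treated) pairs in which the treated value is
    worse than the control value (ties count as half), i.e., one minus the
    Mann-Whitney discrimination probability."""
    pairs = list(itertools.product(control, treatment))
    worse = sum(1.0 if t < c else 0.5 if t == c else 0.0 for c, t in pairs)
    return worse / len(pairs)

print(round(d_value([1, 2, 3], [2, 3, 4]), 4))  # 0.2222
```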
Bootstrapping has enormous potential in statistics education and practice, but there are subtle issues and ways to go wrong. For example, the common combination of nonparametric bootstrapping and bootstrap percentile confidence intervals is less accurate than using t-intervals for small samples, though more accurate for larger samples. My goals in this article are to provide a deeper understanding of bootstrap methods—how they work, when they work or not, and which methods work better—and to highlight pedagogical issues. Supplementary materials for this article are available online.
[Received December 2014. Revised August 2015.]
Bias; Confidence intervals; Sampling distribution; Standard error; Statistical concepts; Teaching.
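The bootstrap percentile interval discussed above can be sketched in a few lines of standard-library Python (function and variable names are illustrative; the replication count and sample are arbitrary choices):

```python
import random
import statistics

def percentile_ci(data, stat=statistics.mean, reps=10_000, alpha=0.05, seed=1):
    """Nonparametric bootstrap percentile confidence interval: resample the
    data with replacement, recompute the statistic, take empirical quantiles."""
    rng = random.Random(seed)
    n = len(data)
    boots = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(reps))
    return boots[int(reps * alpha / 2)], boots[int(reps * (1 - alpha / 2)) - 1]

# For small samples like this one, percentile intervals tend to undercover
# relative to t-intervals, which is part of the accuracy issue noted above.
sample = [2.1, 3.4, 1.8, 2.9, 4.2, 2.5, 3.1, 2.0]
print(percentile_ci(sample))
```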
Statistical principles and methods are critical to the success of biomedical and translational research. However, it is difficult to track and evaluate the monetary value of a biostatistician to a school of medicine (SoM). Limited published data on this topic are available, especially comparing across SoMs. Using National Institutes of Health (NIH) awards and Association of American Medical Colleges (AAMC) faculty counts data (2010–2013), together with online information on biostatistics faculty from 119 institutions across the country, we demonstrated that the number of biostatistics faculty was significantly positively associated with the amount of NIH awards, both as a school total and on a per-faculty basis, across various sizes of U.S. SoMs. Biostatisticians, as a profession, need to be proactive in communicating and advocating the value of their work and their unique contribution to the long-term success of a biomedical research enterprise.
Biomedical and translational research; Biostatistical collaboration; Extramural funding
This paper presents a counterintuitive result regarding the estimation of a regression slope coefficient. Paradoxically, the precision of the slope estimator can deteriorate when additional information is used to estimate its value. In a randomized experiment, the distribution of baseline variables should be identical across treatments due to randomization. The motivation for this paper came from noting that the precision of slope estimators deteriorated when pooling baseline predictors across treatment groups.
Jensen’s inequality; Moderator analysis; Pooled variance estimator; Ratio estimator
We develop a novel nonparametric likelihood ratio test for independence between two random variables using a technique that is free of the common constraints of defining a given set of specific dependence structures. Our methodology revolves around an exact density-based empirical likelihood ratio test statistic that approximates in a distribution-free fashion the corresponding most powerful parametric likelihood ratio test. We demonstrate that the proposed test is very powerful in detecting general structures of dependence between two random variables, including non-linear and/or random-effect dependence structures. An extensive Monte Carlo study confirms that the proposed test is superior to the classical nonparametric procedures across a variety of settings. The real-world applicability of the proposed test is illustrated using data from a study of biomarkers associated with myocardial infarction.
Density-based empirical likelihood; Empirical likelihood; Independence test; Likelihood; Nonlinear dependence; Nonparametric test; Random effect
The incorrect notion that kurtosis somehow measures “peakedness” (flatness, pointiness, or modality) of a distribution is remarkably persistent, despite attempts by statisticians to set the record straight. This article puts the notion to rest once and for all. Kurtosis tells you virtually nothing about the shape of the peak: its only unambiguous interpretation is in terms of tail extremity, i.e., either existing outliers (for the sample kurtosis) or the propensity to produce outliers (for the kurtosis of a probability distribution). To clarify this point, relevant literature is reviewed, counterexample distributions are given, and it is shown that the proportion of the kurtosis that is determined by the central μ ± σ range is usually quite small.
Fourth Moment; Inequality; Mesokurtic; Leptokurtic; Platykurtic
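The claim about the central μ ± σ range can be checked directly. The sketch below (illustrative names; the 5% scale-3 contamination is an arbitrary choice) computes the sample kurtosis and the share of the fourth-moment sum contributed by observations within one SD of the mean; in both samples the central range contributes only a small fraction, and the extra kurtosis of the mixture comes from its tails:

```python
import math
import random

def kurtosis(xs):
    """Sample kurtosis: fourth central moment over the squared variance."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * s2 ** 2)

def central_share(xs):
    """Share of the fourth-moment sum contributed by the mu +/- sigma range."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    total = sum((x - m) ** 4 for x in xs)
    return sum((x - m) ** 4 for x in xs if abs(x - m) <= s) / total

rng = random.Random(0)
normal = [rng.gauss(0, 1) for _ in range(100_000)]
# 5% contamination with tripled scale: heavier tails.
heavy = [rng.gauss(0, 1) * (3 if rng.random() < 0.05 else 1) for _ in range(100_000)]
print(kurtosis(normal), central_share(normal))  # kurtosis near 3, small central share
print(kurtosis(heavy), central_share(heavy))    # much larger kurtosis, tails dominate
```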
Matching for several nominal covariates with many levels has usually been thought to be difficult because these covariates combine to form an enormous number of interaction categories with few if any people in most such categories. Moreover, because nominal variables are not ordered, there is often no notion of a “close substitute” when an exact match is unavailable. In a case-control study of the risk factors for readmission within 30 days of surgery in the Medicare population, we wished to match for 47 hospitals, 15 surgical procedures grouped or nested within 5 procedure groups, and two genders, that is, 47 × 15 × 2 = 1410 categories. In addition, we wished to match as closely as possible for the continuous variable age (65–80 years). There were 1380 readmitted patients or cases. A fractional factorial experiment may balance main effects and low-order interactions without achieving balance for high-order interactions. In an analogous fashion, we balance certain main effects and low-order interactions among the covariates; moreover, we use as many exactly matched pairs as possible. This is done by creating a match that is exact for several variables, with a close match for age, and both a “near-exact match” and a “finely balanced match” for another nominal variable, in this case a 47 × 5 = 235 category variable representing the interaction of the 47 hospitals and the five surgical procedure groups. The method is easily implemented in R.
Assignment algorithm; Caliper match; Finely balanced match; Near-exact match
The aim of this paper is to address issues in research that may be missing from statistics classes yet important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty of systematically addressing the many challenges that can arise in the course of a study. This article helps to bridge the gap between research questions and formal statistical inference through a discussion built around an illustrative case study. We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and promote the optimal conduct, quality control, analysis, and interpretation of a study.
Statistical education; Applied statistics courses; Quality control; Data collection; Data cleaning; Data code book; Data dictionary
Several statistical packages are capable of estimating generalized linear mixed models, and these packages provide one or more of three estimation methods: penalized quasi-likelihood, Laplace, and Gauss–Hermite. Many studies have investigated these methods' performance for the mixed-effects logistic regression model. However, these studies focused on models with one or two random effects and assumed a simple covariance structure between them, which may not be realistic. When there are multiple correlated random effects in a model, the computation becomes intensive, and often an algorithm fails to converge. Moreover, in our analysis of smoking status and exposure to anti-tobacco advertisements, we have observed that when a model included multiple random effects, parameter estimates varied considerably from one statistical package to another even when using the same estimation method. This article presents a comprehensive review of the advantages and disadvantages of each estimation method. In addition, we compare the performances of the three methods across statistical packages via simulation, which involves two- and three-level logistic regression models with at least three correlated random effects. We apply our findings to a real dataset. Our results suggest that two packages (SAS GLIMMIX Laplace and SuperMix Gaussian quadrature) perform well in terms of accuracy, precision, convergence rates, and computing speed. We also discuss the strengths and weaknesses of the two packages in regard to sample sizes.
Adaptive Gauss-Hermite integration; Antismoking advertising; Laplace approximation; Mixed-effects logistic regression; Penalized quasi-likelihood
Assuming a binary outcome, logistic regression is the most common approach to estimating a crude or adjusted odds ratio corresponding to a continuous predictor. We revisit a method termed the discriminant function approach, which leads to closed-form estimators and corresponding standard errors. In its most appealing application, we show that the approach suggests a multiple linear regression of the continuous predictor of interest on the outcome and other covariates, in place of the traditional logistic regression model. If standard diagnostics support the assumptions (including normality of errors) accompanying this linear regression model, the resulting estimator has demonstrable advantages over the usual maximum likelihood estimator via logistic regression. These include improvements in terms of bias and efficiency based on a minimum variance unbiased estimator of the log odds ratio, as well as the availability of an estimate when logistic regression fails to converge due to a separation of data points. Use of the discriminant function approach as described here for multivariable analysis requires less stringent assumptions than those for which it was historically criticized, and is worth considering when the adjusted odds ratio associated with a particular continuous predictor is of primary interest. Simulation and case studies illustrate these points.
Bias; Efficiency; Logistic regression; Minimum variance unbiased estimator
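In the crude (single-predictor) case, the closed-form discriminant function estimator has a simple form: if X given Y = y is normal with common variance σ², the log odds ratio per unit of X equals (μ₁ − μ₀)/σ². The sketch below (hypothetical names; a simulation check of that classical identity rather than the paper's multivariable linear-regression formulation) recovers a true log odds ratio of 1:

```python
import random

def crude_log_or(x0, x1):
    """Discriminant-function estimate of the log odds ratio for a continuous
    predictor: difference in group means divided by the pooled variance."""
    n0, n1 = len(x0), len(x1)
    m0, m1 = sum(x0) / n0, sum(x1) / n1
    ss0 = sum((v - m0) ** 2 for v in x0)
    ss1 = sum((v - m1) ** 2 for v in x1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / pooled_var

rng = random.Random(11)
# X | Y=0 ~ N(0, 1) and X | Y=1 ~ N(1, 1), so the true log odds ratio is 1.
x0 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x1 = [rng.gauss(1.0, 1.0) for _ in range(5000)]
print(crude_log_or(x0, x1))  # close to 1
```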
We demonstrate the algebraic equivalence of two unbiased variance estimators for the sample grand mean in a random sample of subjects from an infinite population where subjects provide repeated observations following a homoscedastic random effects model.
Intraclass correlation coefficient; Random effects; Ratio-unbiased estimators; Unbiased estimators; Variance estimators; Variance inflation factor
It is an obvious fact that the power of a test statistic depends on the significance (alpha) level at which the test is performed. It is perhaps a less obvious fact that the relative performance of two statistics in terms of power is also a function of the alpha level. Through numerous personal discussions, we have noted that even some competent statisticians have the mistaken intuition that relative power comparisons at traditional levels such as α = 0.05 will be roughly similar to relative power comparisons at very low levels, such as α = 5 × 10⁻⁸, which is commonly used in genome-wide association studies. In this brief note, we demonstrate that this notion is in fact quite wrong, especially with respect to comparing tests with differing degrees of freedom. In fact, at very low alpha levels the cost of additional degrees of freedom is often comparatively low. Thus we recommend that statisticians exercise caution when interpreting the results of power comparison studies which use alpha levels that will not be used in practice.
Power; Small significance levels
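The phenomenon can be explored with a small Monte Carlo sketch (illustrative names; the noncentrality λ = 30 and replication count are arbitrary choices). It compares 1-df and 2-df chi-square tests carrying the same noncentrality, using the exact critical values χ²₁(α) = Φ⁻¹(1 − α/2)² and χ²₂(α) = −2 ln α; at α = 0.05 both powers are near 1 and essentially indistinguishable, while at α = 5 × 10⁻⁸ the comparison looks quite different:

```python
import math
import random

def phi_inv(p):
    """Standard normal quantile via bisection on math.erf (sketch-grade)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def power_mc(df, alpha, lam, reps=100_000, seed=7):
    """Monte Carlo power of a chi-square test with df in {1, 2}; the whole
    noncentrality lam is loaded on the first component."""
    # Exact critical values: chi2_1 from the normal quantile; chi2_2 = -2 ln(alpha).
    crit = phi_inv(1 - alpha / 2) ** 2 if df == 1 else -2 * math.log(alpha)
    rng = random.Random(seed)
    delta = math.sqrt(lam)
    hits = 0
    for _ in range(reps):
        stat = (rng.gauss(0, 1) + delta) ** 2
        if df == 2:
            stat += rng.gauss(0, 1) ** 2
        if stat > crit:
            hits += 1
    return hits / reps

lam = 30.0
for alpha in (0.05, 5e-8):
    p1, p2 = power_mc(1, alpha, lam), power_mc(2, alpha, lam)
    print(f"alpha={alpha:g}: 1-df power={p1:.3f}, 2-df power={p2:.3f}")
```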
The randomized discontinuation trial (RDT) design is an enrichment-type design that has been used in a variety of diseases to evaluate the efficacy of new treatments. The RDT design seeks to select a more homogeneous group of patients, consisting of those who are more likely to show a treatment benefit if one exists. In oncology, the RDT design has been applied to evaluate the effects of cytostatic agents, that is, drugs that act primarily by slowing tumor growth rather than shrinking tumors. In the RDT design, all patients receive treatment during an initial, open-label run-in period of duration T. Patients with objective response (substantial tumor shrinkage) remain on therapy while those with early progressive disease are removed from the trial. Patients with stable disease (SD) are then randomized either to continue active treatment or to switch to placebo. The main analysis compares outcomes, for example, progression-free survival (PFS), between the two randomized arms. As a secondary objective, investigators may seek to estimate PFS for all treated patients, measured from the time of entry into the study, by combining information from the run-in and post run-in periods. For t ≤ T, PFS is estimated by the observed proportion of patients who are progression-free among all patients enrolled. For t > T, the estimate can be expressed as Ŝ(t) = p̂OR × ŜOR(t − T) + p̂SD × ŜSD(t − T), where p̂OR is the estimated probability of response during the run-in period, p̂SD is the estimated probability of SD, and ŜOR(t − T) and ŜSD(t − T) are the Kaplan–Meier estimates of subsequent PFS in the responders and patients with SD randomized to continue treatment, respectively. In this article, we derive the variance of Ŝ(t), enabling the construction of confidence intervals for both S(t) and the median survival time. Simulation results indicate that the method provides accurate coverage rates.
An interesting aspect of the design is that outcomes during the run-in phase have a negative multinomial distribution, something not frequently encountered in practice.
Confidence limits; Enrichment design; Negative multinomial distribution; Phase II clinical trials
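The combined estimator Ŝ(t) for t > T can be sketched as follows (hypothetical names and toy numbers; censoring is ignored, so the Kaplan–Meier estimates reduce to empirical survival fractions):

```python
def surv(times, t):
    """Empirical survival fraction (Kaplan-Meier with no censoring)."""
    return sum(1 for x in times if x > t) / len(times)

def combined_pfs(n_or, n_sd, n_total, pfs_or, pfs_sd, t, T):
    """S_hat(t) = p_or * S_or(t - T) + p_sd * S_sd(t - T) for t > T."""
    return (n_or / n_total) * surv(pfs_or, t - T) + (n_sd / n_total) * surv(pfs_sd, t - T)

# Toy data: 100 enrolled; 20 responders; 50 with stable disease, with
# post-run-in PFS times taken from those randomized to continue treatment.
print(combined_pfs(20, 50, 100, [5, 7, 9, 12], [2, 4, 6, 8], t=15, T=10))  # 0.4
```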
Cox proportional hazards (PH) models are commonly used in medical research to investigate the associations between covariates and time-to-event outcomes. It is frequently noted that with fewer than ten events per covariate, these models produce spurious results and therefore should not be used. The statistical literature contains asymptotic power formulae for the Cox model which can be used to determine the number of events needed to detect an association. Here we investigate via simulations the performance of these formulae in small-sample settings for Cox models with one or two covariates. Our simulations indicate that, when the number of events is small, the power estimate based on the asymptotic formulae is often inflated. The discrepancy between the asymptotic and empirical power is larger for a dichotomous covariate, especially in cases where the allocation of sample size to its levels is unequal. When more than one covariate is included in the same model, the discrepancy between the asymptotic power and the empirical power is even larger, especially when a high positive correlation exists between the two covariates.
Time-to-event data; Cox proportional hazards model; Number of events per covariate; Event size formula
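One widely used asymptotic formula of the kind evaluated above is Schoenfeld's event-count formula for a binary covariate, d = (z_{1−α/2} + z_{1−β})² / (p(1 − p)(log HR)²), where p is the allocation proportion. A sketch (the function name is illustrative; the result is the asymptotic approximation whose small-sample accuracy the article questions):

```python
import math
from statistics import NormalDist

def events_needed(hazard_ratio, p=0.5, alpha=0.05, power=0.80):
    """Schoenfeld's asymptotic number-of-events formula for a two-sided test
    of a binary covariate in a Cox PH model; p is the allocation proportion."""
    z = NormalDist().inv_cdf
    numerator = (z(1 - alpha / 2) + z(power)) ** 2
    return math.ceil(numerator / (p * (1 - p) * math.log(hazard_ratio) ** 2))

print(events_needed(2.0))         # 66 events for HR = 2 with equal allocation
print(events_needed(2.0, p=0.2))  # unequal allocation requires more events
```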
The power of a test, the probability of rejecting the null hypothesis in favor of an alternative, may be computed using estimates of one or more distributional parameters. Statisticians frequently fix mean values and calculate power or sample size using a variance estimate from an existing study. Hence computed power becomes a random variable for a fixed sample size. Likewise, the sample size necessary to achieve a fixed power varies randomly. Standard statistical practice requires reporting uncertainty associated with such point estimates. Previous authors studied an asymptotically unbiased method of obtaining confidence intervals for noncentrality and power of the general linear univariate model in this setting. We provide exact confidence intervals for noncentrality, power, and sample size. Such confidence intervals, particularly one-sided intervals, help in planning a future study and in evaluating existing studies.
Effect size; Meta-analysis; Noncentral F distribution
Plausibility of high variability in treatment effects across individuals has been recognized as an important consideration in clinical studies. Surprisingly, little attention has been given to evaluating this variability in the design of clinical trials or analyses of resulting data. High variation in a treatment’s efficacy or safety across individuals (referred to herein as treatment heterogeneity) may have important consequences because the optimal treatment choice for an individual may be different from that suggested by a study of average effects. We call this an individual qualitative interaction (IQI), borrowing terminology from earlier work in which a qualitative interaction (QI) is present when the optimal treatment varies across “groups” of individuals. At least three techniques have been proposed to investigate treatment heterogeneity: techniques to detect a QI, use of measures such as the density overlap of two outcome variables under different treatments, and use of cross-over designs to observe “individual effects.” We elucidate underlying connections among them, their limitations, and some assumptions that may be required. We do so under a potential outcomes framework that can add insights to results from usual data analyses and to study design features that improve the capability to more directly assess treatment heterogeneity.
Causation; Crossover interaction; Individual effects; Potential outcomes; Probability of similar response; Subject-treatment interaction
At present, there are many software procedures available enabling statisticians to fit linear mixed models (LMMs) to continuous dependent variables in clustered or longitudinal data sets. LMMs are flexible tools for analyzing relationships among variables in these types of data sets, in that a variety of covariance structures can be used depending on the subject matter under study. The explicit random effects in LMMs allow analysts to make inferences about the variability between clusters or subjects in larger hypothetical populations, and examine cluster- or subject-level variables that explain portions of this variability. These models can also be used to analyze longitudinal or clustered data sets with data that are missing at random (MAR), and can accommodate time-varying covariates in longitudinal data sets. While the software procedures currently available have many features in common, more specific analytic aspects of fitting LMMs (e.g., crossed random effects, appropriate hypothesis testing for variance components, diagnostics, incorporating sampling weights) may only be available in selected software procedures. With this article, we aim to perform a comprehensive and up-to-date comparison of the current capabilities of software procedures for fitting LMMs, and provide statisticians with a guide for selecting a software procedure appropriate for their analytic goals.
Models for Clustered Data; Longitudinal Data Analysis; Covariance Structures; Statistical Software
We consider the problem of estimating the correlation in bivariate normal data when the means and variances are assumed known, with emphasis on the small sample case. We consider eight different estimators, several of them considered here for the first time in the literature. In a simulation study, we found that Bayesian estimators using the uniform and arc-sine priors outperformed several empirical and exact or approximate maximum likelihood estimators in small samples. The arc-sine prior did better for large values of the correlation. For testing whether the correlation is zero, we found that Bayesian hypothesis tests outperformed significance tests based on the empirical and exact or approximate maximum likelihood estimators considered in small samples, but that all tests performed similarly for sample size 50. These results lead us to suggest using the posterior mean with the arc-sine prior to estimate the correlation in small samples when the variances are assumed known.
Arc-sine prior; Bayes factor; Bayesian test; Maximum likelihood estimator; Uniform prior; Jeffreys prior
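The Bayesian estimator suggested above can be sketched by direct numerical integration (illustrative names; a grid approximation rather than the authors' implementation). With means and variances known (taken as 0 and 1), the likelihood depends only on the sufficient sums Σx², Σy², and Σxy, and the arc-sine prior on (−1, 1) has density proportional to 1/√(1 − ρ²):

```python
import math
import random

def posterior_mean_rho(pairs, prior="arcsine", grid=2000):
    """Posterior mean of the correlation rho for bivariate normal data with
    known means (0) and variances (1), via numerical integration on a grid."""
    n = len(pairs)
    sxx = sum(x * x for x, _ in pairs)
    syy = sum(y * y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    rs, logws = [], []
    for i in range(1, grid):
        r = -1 + 2 * i / grid
        q = 1 - r * r
        logpost = -0.5 * n * math.log(q) - (sxx - 2 * r * sxy + syy) / (2 * q)
        if prior == "arcsine":
            logpost -= 0.5 * math.log(q)  # arc-sine prior: 1/sqrt(1 - r^2)
        rs.append(r)
        logws.append(logpost)
    m = max(logws)  # normalize before exponentiating to avoid underflow
    ws = [math.exp(lw - m) for lw in logws]
    return sum(r * w for r, w in zip(rs, ws)) / sum(ws)

rng = random.Random(5)
rho = 0.6
pairs = []
for _ in range(200):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    pairs.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
print(posterior_mean_rho(pairs))  # should be close to the true 0.6
```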
P-values are useful statistical measures of evidence against a null hypothesis. In contrast to other statistical estimates, however, their sample-to-sample variability is usually not considered or estimated, and therefore not fully appreciated. Via a systematic study of log-scale p-value standard errors, bootstrap prediction bounds, and reproducibility probabilities for future replicate p-values, we show that p-values exhibit surprisingly large variability in typical data situations. In addition to providing context to discussions about the failure of statistical results to replicate, our findings shed light on the relative value of exact p-values vis-à-vis approximate p-values, and indicate that the use of *, **, and *** to denote levels .05, .01, and .001 of statistical significance in subject-matter journals is about the right level of precision for reporting p-values when judged by widely accepted rules for rounding statistical estimates.
log p-value; measure of evidence; prediction interval; reproducibility probability
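The scale of this variability is easy to see in a toy simulation (illustrative names; a two-sample z-test with known unit variances stands in for the tests studied). Replicate log₁₀ p-values for the same design routinely vary by an order of magnitude or more:

```python
import math
import random

def two_sample_z_pvalue(x, y):
    """Two-sided p-value for a two-sample z-test with known unit variances."""
    z = (sum(x) / len(x) - sum(y) / len(y)) / math.sqrt(1 / len(x) + 1 / len(y))
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

rng = random.Random(3)
n, effect, reps = 30, 0.8, 2000
log10_p = []
for _ in range(reps):
    x = [rng.gauss(effect, 1) for _ in range(n)]
    y = [rng.gauss(0.0, 1) for _ in range(n)]
    log10_p.append(math.log10(two_sample_z_pvalue(x, y)))
mean_lp = sum(log10_p) / reps
sd_lp = math.sqrt(sum((v - mean_lp) ** 2 for v in log10_p) / (reps - 1))
# The SD of log10(p) exceeds 1: replicates differ by more than a factor of 10.
print(f"mean log10(p) = {mean_lp:.2f}, SD = {sd_lp:.2f}")
```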
Matching is a powerful statistical tool in design and analysis. Conventional two-group, or bipartite, matching has been widely used in practice. However, its utility is limited to simpler designs. In contrast, nonbipartite matching is not limited to the two-group case, handling multiparty matching situations. It can be used to find the set of matches that minimize the sum of distances based on a given distance matrix. It brings greater flexibility to the matching design, such as multigroup comparisons. Thanks to improvements in computing power and freely available algorithms to solve nonbipartite problems, the cost in terms of computation time and complexity is low. This article reviews the optimal nonbipartite matching algorithm and its statistical applications, including observational studies with complex designs and an exact distribution-free test comparing two multivariate distributions. We also introduce an R package that performs optimal nonbipartite matching. We present an easily accessible web application to make nonbipartite matching freely available to general researchers.
Bipartite matching; CRAN; Observational studies; Propensity score; R package
Statistical experiments, more commonly referred to as Monte Carlo or simulation studies, are used to study the behavior of statistical methods and measures under controlled situations. Whereas recent computing and methodological advances have permitted increased efficiency in the simulation process, known as variance reduction, such experiments remain limited by their finite nature and hence are subject to uncertainty; when a simulation is run more than once, different results are obtained. However, virtually no emphasis has been placed on reporting the uncertainty, referred to here as Monte Carlo error, associated with simulation results in the published literature, or on justifying the number of replications used. These deserve broader consideration. Here we present a series of simple and practical methods for estimating Monte Carlo error as well as determining the number of replications required to achieve a desired level of accuracy. The issues and methods are demonstrated with two simple examples, one evaluating operating characteristics of the maximum likelihood estimator for the parameters in logistic regression and the other in the context of using the bootstrap to obtain 95% confidence intervals. The results suggest that in many settings, Monte Carlo error may be more substantial than traditionally thought.
Bootstrap; Jackknife; Replication
When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals are large, such cross-validation typically works well. However, in regression modeling of genomic studies involving single nucleotide polymorphisms (SNPs), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
Adaptive Lasso; Lasso; Model selection; Oracle estimation