Boosting is an important tool in classification methodology. It combines the performance of many weak classifiers to produce a powerful committee, and its validity can be explained by additive modeling and maximum likelihood. The method has very general applications, especially for high-dimensional predictors. For example, it can be applied to distinguish cancer samples from healthy control samples by using antibody microarray data. Microarray data are often high-dimensional and many of them are incomplete. One natural idea is to impute a missing variable based on the observed predictors. However, the calculation of imputation for high-dimensional predictors with missing data may be rather tedious. In this paper, we propose 2 conditional mean imputation methods. They can be applied to the situation even when a complete-case subset does not exist. Simulation results indicate that the proposed methods are superior to other naive methods. We apply the methods to a pancreatic cancer study in which serum protein microarrays are used for classification.
Additive model; Classification; Imputation; Nonmonotone missing pattern
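To make the imputation idea concrete, here is a minimal Python sketch of conditional mean imputation, regressing each incomplete predictor on the fully observed predictors and filling missing entries with fitted values. It assumes at least one complete column exists, so it does not cover the harder setting (handled by the paper's proposed methods) where no complete-case subset is available; the function name is illustrative.

```python
import numpy as np

def conditional_mean_impute(X):
    """Fill missing entries of each incomplete column with conditional means
    predicted by least-squares regression on the fully observed columns.
    Sketch only: assumes at least one column has no missing values."""
    X = np.asarray(X, dtype=float)
    complete = [j for j in range(X.shape[1]) if not np.isnan(X[:, j]).any()]
    Z = np.column_stack([np.ones(X.shape[0]), X[:, complete]])
    out = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if miss.any():
            beta, *_ = np.linalg.lstsq(Z[~miss], X[~miss, j], rcond=None)
            out[miss, j] = Z[miss] @ beta
    return out
```

For high-dimensional predictors one would replace the least-squares fit with a regularized regression, which is part of what makes exact conditional imputation tedious in that setting.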
Accurate knowledge of the null distribution of hypothesis tests is important for valid application of the tests. In previous papers and software, the asymptotic null distribution of likelihood ratio tests for detecting genetic linkage in multivariate variance components models has been stated to be a mixture of chi-square distributions with binomial mixing probabilities. For variance components models under the complete pleiotropy assumption, we show by simulation and by theoretical arguments based on the geometry of the parameter space that all aspects of the previously stated asymptotic null distribution are incorrect—both the binomial mixing probabilities and the chi-square components. Correcting the null distribution gives more conservative critical values than previously stated, yielding P values that can easily be 10 times larger. The true mixing probabilities give the highest probability to the case where all variance parameters are estimated positive, and the mixing components show severe departures from chi-square distributions. Thus, the asymptotic null distribution has complex features that raise challenges for the assessment of significance of multivariate linkage findings. We propose a method to generate an asymptotic null distribution that is much faster than other empirical methods such as permutation, enabling us to obtain P values with higher precision more efficiently.
Asymptotic null distribution; Likelihood ratio test; Mixing probabilities; Multivariate linkage; Nonstandard boundary condition; Single-factor model; Variance components
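The previously stated null referred to above is a chi-bar-square law: a mixture of chi-square components (including a point mass at zero) with given weights. The sketch below shows how p-values are computed under such a mixture, using the textbook equal-weight example for a single boundary parameter; the weights and components here are illustrative, and the paper's point is precisely that the binomial-weight version is incorrect under complete pleiotropy.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable; this sketch only
    supports df in {0, 1, 2} (df = 0 is a point mass at zero)."""
    if df == 0:
        return 1.0 if x <= 0 else 0.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("df must be 0, 1, or 2 in this sketch")

def chibar_sf(x, weights, dfs):
    """P-value under a chi-bar-square mixture: sum_k w_k * P(chi2_{df_k} > x)."""
    return sum(w * chi2_sf(x, df) for w, df in zip(weights, dfs))

# One variance parameter on a boundary: the textbook 0.5/0.5 mixture of a
# point mass at zero and chi-square(1); the 5% critical value is about 2.706.
p = chibar_sf(2.706, [0.5, 0.5], [0, 1])
```

Getting the weights wrong shifts every critical value, which is how mis-stated mixing probabilities can change p-values by an order of magnitude.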
When analyzing a 2 × 2 table, the two-sided Fisher's exact test and the usual exact confidence interval (CI) for the odds ratio may give conflicting inferences; for example, the test rejects but the associated CI contains an odds ratio of 1. The problem is that the usual exact CI is the inversion of the test that rejects if either of the one-sided Fisher's exact tests rejects at half the nominal significance level. Further, the confidence set that is the inversion of the usual two-sided Fisher's exact test may not be an interval, so following Blaker (2000, Confidence curves and improved exact confidence intervals for discrete distributions. Canadian Journal of Statistics 28, 783–798), we define the “matching” interval as the smallest interval that contains the confidence set. We explore these 2 versions of Fisher's exact test as well as an exact test suggested by Blaker (2000) and provide the R package exact2x2 which automatically assigns the appropriate matching interval to each of the 3 exact tests.
Conditional exact test; Confidence set; Fisher's exact test; Odds ratio; Two-by-two table
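The construction behind the usual interval can be made concrete. Below is a self-contained sketch of the two one-sided conditional (hypergeometric) p-values for a 2 × 2 table and of the "usual" two-sided rule that rejects when either one-sided test rejects at half the level; it is not the matching-interval machinery of the exact2x2 package, and the function names are ours.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact p-values for the 2x2 table [[a, b], [c, d]]:
    the hypergeometric tails of the a-cell given fixed margins."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    pmf = {k: comb(r1, k) * comb(r2, c1 - k) / denom for k in range(lo, hi + 1)}
    p_less = sum(v for k, v in pmf.items() if k <= a)
    p_greater = sum(v for k, v in pmf.items() if k >= a)
    return p_less, p_greater

def usual_two_sided(a, b, c, d):
    """The 'usual' two-sided rule described above: reject when either
    one-sided test rejects at half the level, i.e. p = 2 * min(tails)."""
    p_less, p_greater = fisher_one_sided(a, b, c, d)
    return min(1.0, 2.0 * min(p_less, p_greater))
```

Inverting `usual_two_sided` over the odds ratio gives the usual exact CI, which is why it can disagree with a different two-sided test inverted over the same parameter.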
In the past decade, several principal stratification–based statistical methods have been developed for testing and estimation of a treatment effect on an outcome measured after a postrandomization event. Two examples are the evaluation of the effect of a cancer treatment on quality of life in subjects who remain alive and the evaluation of the effect of an HIV vaccine on viral load in subjects who acquire HIV infection. However, in general the developed methods have not addressed the issue of missing outcome data, and hence their validity relies on a missing completely at random (MCAR) assumption. Because in many applications the MCAR assumption is untenable, while a missing at random (MAR) assumption is defensible, we extend the semiparametric likelihood sensitivity analysis approach of Gilbert and others (2003) and Jemiai and Rotnitzky (2005) to allow the outcome to be MAR. We combine these methods with the robust likelihood–based method of Little and An (2004) for handling MAR data to provide semiparametric estimation of the average causal effect of treatment on the outcome. The new method, which does not require a monotonicity assumption, is evaluated in a simulation study and is applied to data from the first HIV vaccine efficacy trial.
Causal inference; HIV vaccine trial; Missing at random; Posttreatment selection bias; Principal stratification; Sensitivity analysis
Bandeen-Roche and Liang (2002, Modelling multivariate failure time associations in the presence of a competing risk. Biometrika 89, 299–314) tailored the conditional hazard ratio of Oakes (1989, Bivariate survival models induced by frailties. Journal of the American Statistical Association 84, 487–493) to evaluate cause-specific associations in bivariate competing risks data. In many population-based family studies, one observes complex multivariate competing risks data, where the family sizes may be > 2, certain marginals may be exchangeable, and there may be multiple correlated relative pairs having a given pairwise association. Methods for bivariate competing risks data are inadequate in these settings. We show that the rank correlation estimator of Bandeen-Roche and Liang (2002) extends naturally to general clustered family structures. Consistency, asymptotic normality, and variance estimation are easily obtained with U-statistic theories. A natural by-product is an easily implemented test for constancy of the association over different time regions. In the Cache County Study on Memory in Aging, familial associations in dementia onset are of interest, accounting for death prior to dementia. The proposed methods using all available data suggest attenuation in dementia associations at later ages, which had been somewhat obscured in earlier analyses.
Cause-specific hazard ratio; Concordance estimator; Dependent censoring; Exchangeable clustered data; Time-varying association
Genome-wide association studies (GWAS) are increasingly utilized for identifying novel susceptible genetic variants for complex traits, but there is little consensus on analysis methods for such data. The most commonly used methods analyze 1 single nucleotide polymorphism (SNP) at a time or rely on haplotype analysis, with Bonferroni correction for multiple comparisons. Since the SNPs in typical GWAS are often in linkage disequilibrium (LD), at least locally, Bonferroni correction of multiple comparisons often leads to conservative error control and therefore lower statistical power. In this paper, we propose a hidden Markov random field model (HMRF) for GWAS analysis based on a weighted LD graph built from the prior LD information among the SNPs and an efficient iterative conditional mode algorithm for estimating the model parameters. This model effectively utilizes the LD information in calculating the posterior probability that an SNP is associated with the disease. These posterior probabilities can then be used to define a false discovery controlling procedure in order to select the disease-associated SNPs. Simulation studies demonstrate the potential gain in power over single-SNP analysis. The proposed method is especially effective in identifying SNPs with borderline significance at the single-marker level that nonetheless are in high LD with significant SNPs. In addition, by simultaneously considering the SNPs in LD, the proposed method can also help to reduce the number of false identifications of disease-associated SNPs. We demonstrate the application of the proposed HMRF model using data from a case–control GWAS of neuroblastoma and identify 1 new SNP that is potentially associated with neuroblastoma.
Empirical Bayes; False discovery; Iterative conditional mode; Linkage disequilibrium
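One generic way to turn posterior association probabilities into a false discovery controlling selection is sketched below: admit SNPs in decreasing order of posterior probability while the running average posterior null probability stays below the target level. This is a standard Bayesian FDR device, not necessarily the authors' exact procedure.

```python
import numpy as np

def bayes_fdr_select(post_prob, alpha=0.1):
    """Select markers by thresholding posterior association probabilities so
    that the estimated Bayesian FDR -- the average posterior null probability
    (1 - post_prob) among the selected -- stays below alpha."""
    post_prob = np.asarray(post_prob, float)
    order = np.argsort(-post_prob)          # most likely associated first
    null_prob = 1.0 - post_prob[order]
    running_fdr = np.cumsum(null_prob) / np.arange(1, len(order) + 1)
    k = np.searchsorted(running_fdr, alpha, side="right")  # largest prefix under alpha
    return np.sort(order[:k])
```

Because the running mean of the sorted null probabilities is nondecreasing, the selected set is always a prefix of the ranking.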
Before a comparative diagnostic trial is carried out, maximum sample sizes for the diseased group and the nondiseased group need to be obtained to achieve a nominal power to detect a meaningful difference in diagnostic accuracy. Sample size calculation depends on the variance of the statistic of interest, which is the difference between receiver operating characteristic summary measures of 2 medical diagnostic tests. To obtain an appropriate value for the variance, one often has to assume an arbitrary parametric model and the associated parameter values for the 2 groups of subjects under 2 tests to be compared. It becomes more tedious to do so when the same subject undergoes 2 different tests because the correlation is then involved in modeling the test outcomes. The calculated variance based on incorrectly specified parametric models may be smaller than the true one, which will subsequently result in smaller maximum sample sizes, leaving the study underpowered. In this paper, we develop a nonparametric adaptive method for comparative diagnostic trials to update the sample sizes using interim data, while allowing early stopping during interim analyses. We show that the proposed method maintains the nominal power and type I error rate through theoretical proofs and simulation studies.
Diagnostic accuracy; Error spending function; ROC; Sensitivity; Specificity
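As an illustration of how a variance feeds such sample-size calculations, here is the classical Hanley–McNeil approximation to the variance of an empirical AUC. It is one of the simple approximations whose possible misspecification motivates the paper's nonparametric adaptive approach, not part of the proposed method.

```python
def hanley_mcneil_var(auc, n_diseased, n_healthy):
    """Hanley-McNeil approximation to Var(empirical AUC): the variance that
    drives classical diagnostic-trial sample-size formulas."""
    q1 = auc / (2.0 - auc)                  # P(two diseased > one healthy)
    q2 = 2.0 * auc * auc / (1.0 + auc)      # P(one diseased > two healthy)
    return (auc * (1.0 - auc)
            + (n_diseased - 1) * (q1 - auc * auc)
            + (n_healthy - 1) * (q2 - auc * auc)) / (n_diseased * n_healthy)
```

A planned sample size is then the smallest n making the implied standard error small enough for the target power; the adaptive method replaces the assumed variance with an interim estimate.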
In occupational case–control studies, work-related exposure assessments are often fallible measures of the true underlying exposure. In lieu of a gold standard, often more than 2 imperfect measurements (e.g. triads) are used to assess exposure. While methods exist to assess the diagnostic accuracy in the absence of a gold standard, these methods are infrequently used to correct for measurement error in exposure–disease associations in occupational case–control studies. Here, we present a likelihood-based approach that (a) provides evidence regarding whether the misclassification of tests is differential or nondifferential; (b) provides evidence regarding whether the misclassification of tests is independent or dependent conditional on latent exposure status; and (c) estimates the measurement error–corrected exposure–disease association. These approaches use information from all imperfect assessments simultaneously in a unified manner, which in turn can provide a more accurate estimate of exposure–disease association than that based on individual assessments. The performance of this method is investigated through simulation studies and applied to the National Occupational Hazard Survey, a case–control study assessing the association between asbestos exposure and mesothelioma.
Case–control study; Gold standard; Missing data; Occupational exposure assessment
We describe a new stochastic search algorithm for linear regression models called the bounded mode stochastic search (BMSS). We make use of BMSS to perform variable selection and classification as well as to construct sparse dependency networks. Furthermore, we show how to determine genetic networks from genome-wide data that involve any combination of continuous and discrete variables. We illustrate our methodology with several real-world data sets.
Bayesian regression analysis; Dependency networks; Gene expression; Stochastic search; Variable selection
Network reconstruction is a main goal of many biological endeavors. Graphical Gaussian models (GGMs) are often used since the underlying assumptions are well understood, the graph is readily estimated by calculating the partial correlation (paCor) matrix, and its interpretation is straightforward. In spite of these advantages, GGMs are limited in that interactions are not accommodated as the underlying multivariate normality assumption allows for linear dependencies only. As we show, when applied in the presence of interactions, the GGM framework can lead to incorrect inference regarding dependence. Identifying the exact dependence structure in this context is a difficult problem, largely because an analogue of the paCor matrix is not available and dependencies can involve many nodes. We here present a computationally efficient approach to identify bivariate interactions in networks. A key element is recognizing that interactions have a marginal linear effect and as a result information about their presence can be obtained from the paCor matrix. Theoretical derivations for the exact effect are presented and used to motivate the approach; and simulations suggest that the method works well, even in fairly complicated scenarios. Practical advantages are demonstrated in analyses of data from a breast cancer study.
Gene association networks; Gene expression; Graphical Gaussian models
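The paCor matrix mentioned above is readily computed from the precision (inverse covariance) matrix. A minimal sketch, assuming a well-conditioned covariance input:

```python
import numpy as np

def partial_correlations(S):
    """Partial correlation matrix from a covariance matrix S via the
    precision matrix Omega = inv(S):
    paCor_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)."""
    omega = np.linalg.inv(np.asarray(S, float))
    d = np.sqrt(np.diag(omega))
    pacor = -omega / np.outer(d, d)
    np.fill_diagonal(pacor, 1.0)
    return pacor
```

Under multivariate normality, a zero partial correlation corresponds to a missing edge in the GGM; it is exactly this linear-dependence reading that fails in the presence of interactions.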
We consider estimation and variable selection in the partial linear model for censored data. The partial linear model for censored data is a direct extension of the accelerated failure time model, the latter of which is a very important alternative model to the proportional hazards model. We extend rank-based lasso-type estimators to a model that may contain nonlinear effects. Variable selection in such a partial linear model has direct application to high-dimensional survival analyses that attempt to adjust for clinical predictors. In the microarray setting, previous methods can adjust for other clinical predictors by assuming that clinical and gene expression data enter the model linearly in the same fashion. Here, we select important variables after adjusting for prognostic clinical variables, but the clinical effects are allowed to be nonlinear. Our estimator is based on stratification and can be extended naturally to account for multiple nonlinear effects. We illustrate the utility of our method through simulation studies and application to the Wisconsin prognostic breast cancer data set.
Lasso; Logrank; Penalized least squares; Survival analysis
A novel 3-step random forests methodology involving survival data (survival forests), ordinal data (multiclass forests), and continuous data (regression forests) is introduced for cancer staging. The methodology is illustrated for esophageal cancer using worldwide esophageal cancer collaboration data involving 4627 patients.
Predicted survival; Random forests; Survival curves; TNM
This paper deals with the analysis of recurrent event data subject to censored observation. Using a suitable adaptation of generalized estimating equations for longitudinal data, we propose a straightforward methodology for estimating the parameters indexing the conditional means and variances of the process interevent (i.e. gap) times. The proposed methodology permits the use of both time-fixed and time-varying covariates, as well as transformations of the gap times, creating a flexible and useful class of methods for analyzing gap-time data. Censoring is dealt with by imposing a parametric assumption on the censored gap times, and extensive simulation results demonstrate the relative robustness of parameter estimates even when this parametric assumption is incorrect. A suitable large-sample theory is developed. Finally, we use our methods to analyze data from a randomized trial of asthma prevention in young children.
Asthma; Censoring; Generalized estimating equation; Intensity model; Longitudinal data; Marginal model
Microarray time-course data can be used to explore interactions among genes and infer gene networks. The crucial step in constructing a gene network is to develop an appropriate causality test. In this regard, the expression profile of each gene can be treated as a time series. A typical existing method establishes Granger causality based on a Wald-type test, which relies on the homoscedastic normality assumption of the data distribution. However, this assumption can be seriously violated in real microarray experiments and thus may lead to inconsistent test results and false scientific conclusions. To overcome this drawback, we propose an estimating equation–based method which is robust to both heteroscedasticity and nonnormality of the gene expression data. In fact, it only requires the residuals to be uncorrelated. We use simulation studies and a real-data example to demonstrate the applicability of the proposed method.
Chi-square approximation; Estimating equation; F-test; False-positive rate; Granger causality; Time-course data
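For orientation, the homoscedastic Wald-type construction that the paper improves on starts from a lagged least-squares fit: regress y_t on its own lags and on lags of x, and test whether the x-lag coefficients vanish. A hypothetical helper sketching that first step (the fitting only, not the proposed robust estimating-equation test):

```python
import numpy as np

def granger_lag_coef(x, y, p=1):
    """Least-squares coefficients of y_t on an intercept, p lags of y, and
    p lags of x.  A Wald-type Granger test asks whether the x-lag
    coefficients are zero."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rows = [[1.0]
            + [y[t - k] for k in range(1, p + 1)]
            + [x[t - k] for k in range(1, p + 1)]
            for t in range(p, len(y))]
    beta, *_ = np.linalg.lstsq(np.array(rows), y[p:], rcond=None)
    return beta  # [intercept, y-lag coefs..., x-lag coefs...]
```

The robust alternative keeps the same moment conditions but avoids assuming homoscedastic normal residuals when calibrating the test statistic.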
Mass spectrometry is a powerful tool with much promise in global proteomic studies. The discipline of statistics offers robust methodologies to extract and interpret high-dimensional mass-spectrometry data and will be a valuable contributor to the field. Here, we describe the process by which data are produced, characteristics of the data, and the analytical preprocessing steps that are taken in order to interpret the data and use it in downstream statistical analyses. Because of the complexity of data acquisition, statistical methods developed for gene expression microarray data are not directly applicable to proteomic data. Areas in need of statistical research for proteomic data include alignment, experimental design, abundance normalization, and statistical analysis.
Experimental design; Fourier transform; Mass calibration; Mass spectrometry; Normalization
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix X as X ≈ ∑_{k=1}^{K} d_k u_k v_k^T, where d_k, u_k, and v_k minimize the squared Frobenius norm of the difference between X and this approximation, subject to penalties on u_k and v_k. This results in a regularized version of the singular value decomposition. Of particular interest is the use of L1-penalties on u_k and v_k, which yields a decomposition of X using sparse vectors. We show that when the PMD is applied using an L1-penalty on v_k but not on u_k, a method for sparse principal components results. In fact, this yields an efficient algorithm for the “SCoTLASS” proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.
Canonical correlation analysis; DNA copy number; Integrative genomic analysis; L1; Matrix decomposition; Principal component analysis; Sparse principal component analysis; SVD
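A rank-1 version of the PMD can be sketched as alternating soft-thresholded power iterations. The code below is a simplified illustration: it parameterizes sparsity through fixed soft-thresholds c_u and c_v rather than the paper's L1-norm bounds, and it omits the deflation steps needed for K > 1.

```python
import numpy as np

def soft(a, c):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(a) * np.maximum(np.abs(a) - c, 0.0)

def pmd_rank1(X, c_u=0.0, c_v=0.3, n_iter=100):
    """Rank-1 penalized matrix decomposition sketch: alternate
    u <- soft(X v), v <- soft(X' u), renormalizing each time, to obtain
    sparse analogues of the leading singular vectors."""
    v = np.linalg.svd(X)[2][0]              # warm start at the leading v
    for _ in range(n_iter):
        u = soft(X @ v, c_u)
        u /= np.linalg.norm(u) or 1.0
        v = soft(X.T @ u, c_v)
        v /= np.linalg.norm(v) or 1.0
    d = u @ X @ v                           # analogue of the singular value
    return d, u, v
```

With both thresholds at zero this reduces to the power method for the leading singular triple; increasing c_v zeroes out small entries of v, giving the sparse-principal-components behavior described above.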
Prostate-specific antigen (PSA) is a biomarker routinely and repeatedly measured on prostate cancer patients treated by radiation therapy (RT). It was shown recently that its whole pattern over time rather than just its current level was strongly associated with prostate cancer recurrence. To more accurately guide clinical decision making, monitoring of PSA after RT would be aided by powerful dynamic prognostic tools that incorporate the complete posttreatment PSA evolution. In this work, we propose a dynamic prognostic tool derived from a joint latent class model and provide a measure of variability obtained from the asymptotic distribution of the parameters. To validate this prognostic tool, we consider predictive accuracy measures and provide an empirical estimate of their variability. We also show how to use them in the longitudinal context to compare the dynamic prognostic tool we developed with a proportional hazard model including either baseline covariates or baseline covariates and the expected level of PSA at the time of prediction in a landmark model. Using data from 3 large cohorts of patients treated after the diagnosis of prostate cancer, we show that the dynamic prognostic tool based on the joint model reduces the error of prediction and offers a powerful tool for individual prediction.
Error of prediction; Joint latent class model; Mixed model; Posterior probability; Predictive accuracy; Prostate cancer prognosis
Antiviral agents are an important component in mitigation/containment strategies for pandemic influenza. However, most research for mitigation/containment strategies relies on the antiviral efficacies evaluated from limited data of clinical trials. Which efficacy measures can be reliably estimated from these studies depends on the trial design, the size of the epidemics, and the statistical methods. We propose a Bayesian framework for modeling the influenza transmission dynamics within households. This Bayesian framework takes into account asymptomatic infections and is able to estimate efficacies with respect to protecting against viral infection, infection with clinical disease, and pathogenicity (the probability of disease given infection). We use the method to reanalyze 2 clinical studies of oseltamivir, an influenza antiviral agent, and compare the results with previous analyses. We found significant prophylactic efficacies in reducing the risk of viral infection and infection with disease but no prophylactic efficacy in reducing pathogenicity. We also found significant therapeutic efficacies in reducing pathogenicity and the risk of infection with disease but no therapeutic efficacy in reducing the risk of viral infection in the contacts.
Asymptomatic; Bayesian; Influenza; Markov chain Monte Carlo
This paper presents a robust method to conduct inference in finely stratified familial studies under proband-based sampling. We assume that the interest is in both the marginal effects of subject-specific covariates on a binary response and the familial aggregation of the response, as quantified by intrafamilial pairwise odds ratios. We adopt an estimating function for proband-based family studies originally developed by Zhao and others (1998) in the context of an unstratified design and treat the stratification effects as fixed nuisance parameters. Our method requires modeling only the first 2 joint moments of the observations and reduces by 2 orders of magnitude the bias induced by fitting the stratum-specific nuisance parameters. An analytical standard error estimator for the proposed estimator is also provided. The proposed approach is applied to a matched case–control familial study of sleep apnea. A simulation study confirms the usefulness of the approach.
2-Index asymptotics; Adjusted profile estimating function; Ascertainment bias; Bias reduction; Familial aggregation; Nuisance parameter; Proband; Sparse data; Stratified study
High-throughput oligonucleotide microarrays are commonly employed to investigate genetic disease, including cancer. The algorithms employed to extract genotypes and copy number variation function optimally for diploid genomes usually associated with inherited disease. However, cancer genomes are aneuploid in nature, leading to systematic errors when using these techniques. We introduce a preprocessing transformation and hidden Markov model algorithm bespoke to cancer. This produces genotype classification, specification of regions of loss of heterozygosity, and absolute allelic copy number segmentation. Accurate prediction is demonstrated with a combination of independent experimental techniques. These methods are exemplified with Affymetrix Genome-Wide SNP 6.0 data from 755 cancer cell lines, enabling inference upon a number of features of biological interest. These data and the coded algorithm are freely available for download.
Allelic copy number; Cancer; Somatic variation
Dropout is a common occurrence in longitudinal studies. Building upon the pattern-mixture modeling approach within the Bayesian paradigm, we propose a general framework of varying-coefficient models for longitudinal data with informative dropout, where measurement times can be irregular and dropout can occur at any point in continuous time (not just at observation times) together with administrative censoring. Specifically, we assume that the longitudinal outcome process depends on the dropout process through its model parameters. The unconditional distribution of the repeated measures is a mixture over the dropout (administrative censoring) time distribution, and the continuous dropout time distribution with administrative censoring is left completely unspecified. We use Markov chain Monte Carlo to sample from the posterior distribution of the repeated measures given the dropout (administrative censoring) times; Bayesian bootstrapping on the observed dropout (administrative censoring) times is carried out to obtain marginal covariate effects. We illustrate the proposed framework using data from a longitudinal study of depression in HIV-infected women; the strategy for sensitivity analysis on unverifiable assumptions is also demonstrated.
HIV/AIDS; Missing data; Nonparametric regression; Penalized splines
Interval-censored longitudinal data taken from a Norwegian study of individuals with Parkinson's disease are investigated with respect to the onset of dementia. Of interest are risk factors for dementia and the subdivision of total life expectancy (LE) into LE with and without dementia. To estimate LEs using extrapolation, a parametric continuous-time 3-state illness–death Markov model is presented in a Bayesian framework. The framework is well suited to allow for heterogeneity via random effects and to investigate additional quantities computed from the model parameters. In the estimation of LEs, microsimulation is used to take into account random effects. Intensities of moving between the states are allowed to change in a piecewise-constant fashion by linking them to age as a time-dependent covariate. Possible right censoring at the end of the follow-up can be incorporated. The model is applicable in many situations where individuals are followed over a long time period. In describing how a disease develops over time, the model can help to predict future need for health care.
Dementia; Life expectancy; Microsimulation; Multistate model; Random effects; Right censoring; Survival
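The microsimulation step rests on drawing transition times from intensities that are piecewise-constant in age. A hedged sketch of that building block, sampling an event time by inverting the cumulative hazard (illustrative, not the paper's implementation):

```python
import random

def sample_piecewise_exp(rng, breaks, rates):
    """Draw an event time from a piecewise-constant hazard: rates[i] applies
    on [breaks[i], breaks[i+1]), with the last rate extending to infinity.
    Inversion sampling: find t such that the cumulative hazard H(t) equals
    an exponential(1) draw.  Assumes all rates are positive."""
    target = rng.expovariate(1.0)          # exp(1) draw = target cumulative hazard
    cum = 0.0
    for i, rate in enumerate(rates):
        t0 = breaks[i]
        t1 = breaks[i + 1] if i + 1 < len(breaks) else float("inf")
        seg = rate * (t1 - t0)
        if cum + seg >= target:
            return t0 + (target - cum) / rate
        cum += seg
```

Simulating competing dementia and death transitions this way, repeatedly over draws of the random effects, yields the LE decompositions by averaging time spent in each state.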
Leptospirosis is the most widespread zoonosis throughout the world and human mortality from severe disease forms is high even when optimal treatment is provided. Leptospirosis is also one of the most common causes of reproductive losses in cattle worldwide and is associated with significant economic costs to the dairy farming industry. Herds are tested for exposure to the causal organism either through serum testing of individual animals or through testing bulk milk samples. Using serum results from a commonly used enzyme-linked immunosorbent assay (ELISA) test for Leptospira interrogans serovar Hardjo (L. hardjo) on samples from 979 animals across 12 Scottish dairy herds and the corresponding bulk milk results, we develop a model that predicts the mean proportion of exposed animals in a herd conditional on the bulk milk test result. The data are analyzed through use of a Bayesian latent variable generalized linear mixed model to provide estimates of the true (but unobserved) level of exposure to the causal organism in each herd in addition to estimates of the accuracy of the serum ELISA. We estimate 95% confidence intervals of (0.688, 0.987) for the sensitivity and (0.975, 0.998) for the specificity of the serum ELISA. Using a percentage positivity cutoff in bulk milk of at most 41% ensures that there is at least a 97.5% probability of less than 5% of the herd being exposed to L. hardjo. Our analyses provide strong statistical evidence in support of the validity of interpreting bulk milk samples as a proxy for individual animal serum testing. The combination of validity and cost-effectiveness of bulk milk testing has the potential to reduce the risk of human exposure to leptospirosis in addition to offering significant economic benefits to the dairy industry.
Bayesian; Latent class analysis; Leptospirosis
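A back-of-the-envelope frequentist counterpart to part of the model above is the Rogan–Gladen correction, which converts an apparent (test-positive) proportion into an estimate of the true exposure proportion given sensitivity and specificity. The Bayesian latent variable model does this jointly, while also estimating the accuracies themselves; the sketch below treats them as known.

```python
def true_prevalence(apparent, sensitivity, specificity):
    """Rogan-Gladen correction: estimate the true exposure proportion from
    the apparent (test-positive) proportion, given test sensitivity and
    specificity, clipped to the unit interval."""
    est = (apparent + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(1.0, max(0.0, est))
```

For example, with sensitivity 0.8 and specificity 0.975, a true exposure of 20% produces an apparent positivity of 0.2 × 0.8 + 0.8 × 0.025 = 18%, and the correction recovers 20%.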
Association studies have been widely used to identify genetic liability variants for complex diseases. While scanning the chromosomal region 1 single nucleotide polymorphism (SNP) at a time may not fully explore linkage disequilibrium, haplotype analyses tend to require a fairly large number of parameters, thus potentially losing power. Clustering algorithms, such as the cladistic approach, have been proposed to reduce the dimensionality, yet they have important limitations. We propose a SNP-Haplotype Adaptive REgression (SHARE) algorithm that seeks the most informative set of SNPs for genetic association in a targeted candidate region by growing and shrinking haplotypes by 1 SNP at a time in a stepwise fashion, and comparing prediction errors of different models via cross-validation. Depending on the evolutionary history of the disease mutations and the markers, this set may contain a single SNP or several SNPs that lay a foundation for haplotype analyses. Haplotype phase ambiguity is effectively accounted for by treating haplotype reconstruction as a part of the learning procedure. Simulations and a data application show that our method has improved power over existing methodologies and that the results are informative in the search for disease-causal loci.
Adaptive regression; Haplotype; Multilocus analysis; SNP
The incidence of nasopharyngeal carcinoma (NPC) varies widely according to age at diagnosis, geographic location, and ethnic background. On a global scale, NPC is common among specific populations primarily living in southern and eastern Asia and northern Africa, but in most areas, including almost all western countries, it remains a relatively uncommon malignancy. Specific to these low-risk populations is a general observation of possible bimodality in the observed age-incidence curves. We have developed a multiplicative frailty model that allows for the demonstrated points of inflection at ages 15–24 and 65–74. The bimodal frailty model has 2 independent compound Poisson-distributed frailties and gives a significant improvement in fit over a unimodal frailty model. Applying the model to population-based cancer registry data worldwide, 2 biologically relevant estimates are derived, namely the proportion of susceptible individuals and the number of genetic and epigenetic events required for the tumor to develop. The results are critically compared and discussed in the context of existing knowledge of the epidemiology and pathogenesis of NPC.
Carcinogenesis; Compound Poisson; Frailty; Nasopharyngeal carcinoma; Survival analysis