Results 1-25 (127)
 

1.  Significance analysis and statistical dissection of variably methylated regions 
Biostatistics (Oxford, England)  2011;13(1):166-178.
It has recently been proposed that variation in DNA methylation at specific genomic locations may play an important role in the development of complex diseases such as cancer. Here, we develop 1- and 2-group multiple testing procedures for identifying and quantifying regions of DNA methylation variability. Our method is the first genome-wide statistical significance calculation for increased or differential variability, as opposed to the traditional approach of testing for mean changes. We apply these procedures to genome-wide methylation data obtained from biological and technical replicates and provide the first statistical proof that variably methylated regions exist and are due to interindividual variation. We also show that differentially variable regions in colon tumor and normal tissue show enrichment of genes regulating gene expression, cell morphogenesis, and development, supporting a biological role for DNA methylation variability in cancer.
doi:10.1093/biostatistics/kxr013
PMCID: PMC3276267  PMID: 21685414
Bump finding; Functional data analysis; Multiple testing; Preprocessing; Variably methylated regions (VMRs)
2.  Mixed model analysis of censored longitudinal data with flexible random-effects density 
Mixed models are commonly used to represent longitudinal or repeated measures data. An additional complication arises when the response is censored, for example, due to limits of quantification of the assay used. While Gaussian random effects are routinely assumed, little work has characterized the consequences of misspecifying the random-effects distribution nor has a more flexible distribution been studied for censored longitudinal data. We show that, in general, maximum likelihood estimators will not be consistent when the random-effects density is misspecified, and the effect of misspecification is likely to be greatest when the true random-effects density deviates substantially from normality and the number of noncensored observations on each subject is small. We develop a mixed model framework for censored longitudinal data in which the random effects are represented by the flexible seminonparametric density and show how to obtain estimates in SAS procedure NLMIXED. Simulations show that this approach can lead to reduction in bias and increase in efficiency relative to assuming Gaussian random effects. The methods are demonstrated on data from a study of hepatitis C virus.
doi:10.1093/biostatistics/kxr026
PMCID: PMC3276268  PMID: 21914727
Censoring; HCV; HIV; Limit of quantification; Longitudinal data; Random effects
3.  Evaluating prognostic accuracy of biomarkers in nested case–control studies 
Biostatistics (Oxford, England)  2011;13(1):89-100.
Nested case–control (NCC) design is used frequently in epidemiological studies as a cost-effective subcohort sampling strategy to conduct biomarker research. This sampling strategy, on the other hand, creates challenges for data analysis because of outcome-dependent missingness in biomarker measurements. In this paper, we propose inverse probability weighted (IPW) methods for making inference about the prognostic accuracy of a novel biomarker for predicting future events with data from NCC studies. The consistency and asymptotic normality of these estimators are derived using the empirical process theory and convergence theorems for sequences of weakly dependent random variables. Simulation and analysis using Framingham Offspring Study data suggest that the proposed methods perform well in finite samples.
doi:10.1093/biostatistics/kxr021
PMCID: PMC3276269  PMID: 21856652
Inverse probability weighting; Nested case–control study; Time-dependent accuracy
4.  A robust method using propensity score stratification for correcting verification bias for binary tests 
Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.
doi:10.1093/biostatistics/kxr020
PMCID: PMC3276270  PMID: 21856650
Diagnostic test; Model misspecification; Propensity score; Sensitivity; Specificity
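To make the verification-bias problem in the abstract above concrete, here is a minimal sketch of a generic inverse-probability-weighted correction under MAR. It is not the propensity score stratification estimator proposed in the paper. It assumes the verification probability of each subject is already known or estimated; the function name and the data layout are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def ipw_sensitivity_specificity(
    test: Sequence[int],
    disease: Sequence[Optional[int]],
    p_verify: Sequence[float],
) -> Tuple[float, float]:
    """Inverse-probability-weighted sensitivity and specificity.

    test[i]     : 1 if the diagnostic test is positive, else 0
    disease[i]  : 1 or 0 if disease status was verified, None otherwise
    p_verify[i] : assumed probability that subject i was verified (> 0)
    """
    tp = fn = tn = fp = 0.0
    for t, d, p in zip(test, disease, p_verify):
        if d is None:
            continue  # unverified subjects are represented through the weights
        w = 1.0 / p  # a verified subject stands in for 1/p similar subjects
        if d == 1 and t == 1:
            tp += w
        elif d == 1 and t == 0:
            fn += w
        elif d == 0 and t == 0:
            tn += w
        else:
            fp += w
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (made up): every test-positive subject is verified,
# but only half of the test-negative subjects are.
test = [1, 1, 1, 1, 0, 0, 0, 0]
disease = [1, 1, 1, 0, 1, None, 0, None]
p_verify = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
sens, spec = ipw_sensitivity_specificity(test, disease, p_verify)
```

On this toy data the verified-only estimates would be 3/4 for sensitivity and 1/2 for specificity, while the weighted estimates are 3/5 and 2/3, which illustrates the direction of the bias when test-negative subjects are verified less often.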
5.  A joint latent variable model approach to item reduction and validation 
Many applications of biomedical science involve unobservable constructs, from measurement of health states to severity of complex diseases. The primary aim of measurement is to identify relevant pieces of observable information that thoroughly describe the construct of interest. Validation of the construct is often performed separately. Noting the increasing popularity of latent variable methods in biomedical research, we propose a Multiple Indicator Multiple Cause (MIMIC) latent variable model that combines item reduction and validation. Our joint latent variable model accounts for the bias that occurs in the traditional 2-stage process. The methods are motivated by an example from the Physical Activity and Lymphedema clinical trial in which the objectives were to describe lymphedema severity through self-reported Likert scale symptoms and to determine the relationship between symptom severity and a “gold standard” diagnostic measure of lymphedema. The MIMIC model identified 1 symptom as a potential candidate for removal. We present this paper as an illustration of the advantages of joint latent variable models and as an example of the applicability of these models for biomedical research.
doi:10.1093/biostatistics/kxr018
PMCID: PMC3276271  PMID: 21775486
Factor analysis; Latent variable models; Lymphedema; Multiple Indicator Multiple Cause models
6.  Inference for discretely observed stochastic kinetic networks with applications to epidemic modeling 
Biostatistics (Oxford, England)  2011;13(1):153-165.
We present a new method for Bayesian Markov Chain Monte Carlo–based inference in certain types of stochastic models, suitable for modeling noisy epidemic data. We apply the so-called uniformization representation of a Markov process, in order to efficiently generate appropriate conditional distributions in the Gibbs sampler algorithm. The approach is shown to work well in various data-poor settings, that is, when only partial information about the epidemic process is available, as illustrated on the synthetic data from SIR-type epidemics and the Center for Disease Control and Prevention data from the onset of the H1N1 pandemic in the United States.
doi:10.1093/biostatistics/kxr019
PMCID: PMC3276272  PMID: 21835814
Gibbs sampler; Kinetic constants; Maximum likelihood; SIR model; Stochastic kinetics network
7.  A survival analysis approach to modeling human fecundity 
Understanding conception probabilities is important not only for helping couples to achieve pregnancy but also in identifying acute or chronic reproductive toxicants that affect the highly timed and interrelated processes underlying hormonal profiles, ovulation, libido, and conception during menstrual cycles. Currently, 2 statistical approaches are available for estimating conception probabilities depending upon the research question and extent of data collection during the menstrual cycle: a survival approach when interested in modeling time-to-pregnancy (TTP) in relation to women or couples' purported exposure(s), or a hierarchical Bayesian approach when one is interested in modeling day-specific conception probabilities during the estimated fertile window. We propose a biologically valid discrete survival model that unifies the above 2 approaches while relaxing some assumptions that may not be consistent with human reproduction or behavior. This approach combines both the survival and the hierarchical models allowing investigators to obtain the distribution of TTP and day-specific probabilities during the fertile window in a single model. Our model allows for the consideration of covariate effects at both the cycle and the daily level while accounting for daily variation in conception. We conduct extensive simulations and utilize the New York State Angler Prospective Pregnancy Cohort Study to illustrate our approach. We also provide the code to implement the model in R software in the supplemental section of the supplementary material available at Biostatistics online.
doi:10.1093/biostatistics/kxr015
PMCID: PMC3276273  PMID: 21697247
Censoring; Conception; Discrete survival; Fecundity; Random effects; Time-varying covariates
8.  Latent class models for joint analysis of disease prevalence and high-dimensional semicontinuous biomarker data 
High-dimensional biomarker data are often collected in epidemiological studies when assessing the association between biomarkers and human disease is of interest. We develop a latent class modeling approach for joint analysis of high-dimensional semicontinuous biomarker data and a binary disease outcome. To model the relationship between complex biomarker expression patterns and disease risk, we use latent risk classes to link the 2 modeling components. We characterize complex biomarker-specific differences through biomarker-specific random effects, so that different biomarkers can have different baseline (low-risk) values as well as different between-class differences. The proposed approach also accommodates data features that are common in environmental toxicology and other biomarker exposure data, including a large number of biomarkers, numerous zero values, and complex mean–variance relationship in the biomarker levels. A Monte Carlo EM (MCEM) algorithm is proposed for parameter estimation. Both the MCEM algorithm and model selection procedures are shown to work well in simulations and applications. In applying the proposed approach to an epidemiological study that examined the relationship between environmental polychlorinated biphenyl (PCB) exposure and the risk of endometriosis, we identified a highly significant overall effect of PCB concentrations on the risk of endometriosis.
doi:10.1093/biostatistics/kxr024
PMCID: PMC3276274  PMID: 21908867
Categorical data; Chemical exposure biomarkers; Latent variables; Monte Carlo EM algorithm; Random effects
9.  Efficient design and inference for multistage randomized trials of individualized treatment policies 
Biostatistics (Oxford, England)  2011;13(1):142-152.
Clinical demand for individualized “adaptive” treatment policies in diverse fields has spawned development of clinical trial methodology for their experimental evaluation via multistage designs, building upon methods intended for the analysis of naturalistically observed strategies. Because often there is no need to parametrically smooth multistage trial data (in contrast to observational data for adaptive strategies), it is possible to establish direct connections among different methodological approaches. We show by algebraic proof that the maximum likelihood (ML) and optimal semiparametric (SP) estimators of the population mean of the outcome of a treatment policy and its standard error are equal under certain experimental conditions. This result is used to develop a unified and efficient approach to design and inference for multistage trials of policies that adapt treatment according to discrete responses. We derive a sample size formula expressed in terms of a parametric version of the optimal SP population variance. Nonparametric (sample-based) ML estimation performed well in simulation studies, in terms of achieved power, for scenarios most likely to occur in real studies, even though sample sizes were based on the parametric formula. ML outperformed the SP estimator; differences in achieved power predominately reflected differences in their estimates of the population mean (rather than estimated standard errors). Neither methodology could mitigate the potential for overestimated sample sizes when strong nonlinearity was purposely simulated for certain discrete outcomes; however, such departures from linearity may not be an issue for many clinical contexts that make evaluation of competitive treatment policies meaningful.
doi:10.1093/biostatistics/kxr016
PMCID: PMC3276275  PMID: 21765180
Adaptive treatment strategy; Efficient SP estimation; Maximum likelihood; Multi-stage design; Sample size formula
10.  Checking semiparametric transformation models with censored data 
Semiparametric transformation models provide a very general framework for studying the effects of (possibly time-dependent) covariates on survival time and recurrent event times. Assessing the adequacy of these models is an important task because model misspecification affects the validity of inference and the accuracy of prediction. In this paper, we introduce appropriate time-dependent residuals for these models and consider the cumulative sums of the residuals. Under the assumed model, the cumulative sum processes converge weakly to zero-mean Gaussian processes whose distributions can be approximated through Monte Carlo simulation. These results enable one to assess, both graphically and numerically, how unusual the observed residual patterns are in reference to their null distributions. The residual patterns can also be used to determine the nature of model misspecification. Extensive simulation studies demonstrate that the proposed methods perform well in practical situations. Three medical studies are provided for illustrations.
doi:10.1093/biostatistics/kxr017
PMCID: PMC3276276  PMID: 21785165
Goodness of fit; Martingale residuals; Model checking; Model misspecification; Model selection; Recurrent events; Survival data; Time-dependent covariate
11.  Dirichlet negative multinomial regression for overdispersed correlated count data 
Biostatistics (Oxford, England)  2012;14(2):395-404.
A generic random effects formulation for the Dirichlet negative multinomial distribution is developed together with a convenient regression parameterization. A simulation study indicates that, even when somewhat misspecified, regression models based on the Dirichlet negative multinomial distribution have smaller median absolute error than generalized estimating equations, with a particularly pronounced improvement when correlation between observations in a cluster is high. Estimation of explanatory variable effects and sources of variation is illustrated for a study of clinical trial recruitment.
doi:10.1093/biostatistics/kxs050
PMCID: PMC3590929  PMID: 23221819
Dirichlet negative multinomial; Longitudinal count data; Regression; Sources of variation
12.  Efficient measurement error correction with spatially misaligned data 
Biostatistics (Oxford, England)  2011;12(4):610-623.
Association studies in environmental statistics often involve exposure and outcome data that are misaligned in space. A common strategy is to employ a spatial model such as universal kriging to predict exposures at locations with outcome data and then estimate a regression parameter of interest using the predicted exposures. This results in measurement error because the predicted exposures do not correspond exactly to the true values. We characterize the measurement error by decomposing it into Berkson-like and classical-like components. One correction approach is the parametric bootstrap, which is effective but computationally intensive since it requires solving a nonlinear optimization problem for the exposure model parameters in each bootstrap sample. We propose a less computationally intensive alternative termed the “parameter bootstrap” that only requires solving one nonlinear optimization problem, and we also compare bootstrap methods to other recently proposed methods. We illustrate our methodology in simulations and with publicly available data from the Environmental Protection Agency.
doi:10.1093/biostatistics/kxq083
PMCID: PMC3169665  PMID: 21252080
Environmental epidemiology; Environmental statistics; Exposure modeling; Kriging; Measurement error
13.  Comparing costs associated with risk stratification rules for t-year survival 
Biostatistics (Oxford, England)  2011;12(4):597-609.
Accurate risk prediction is an important step in developing optimal strategies for disease prevention and treatment. Based on the predicted risks, patients can be stratified to different risk categories where each category corresponds to a particular clinical intervention. Incorrect or suboptimal interventions are likely to result in unnecessary financial and medical consequences. It is thus essential to account for the costs associated with the clinical interventions when developing and evaluating risk stratification (RS) rules for clinical use. In this article, we propose to quantify the value of an RS rule based on the total expected cost attributed to incorrect assignment of risk groups due to the rule. We have established the relationship between cost parameters and optimal threshold values used in the stratification rule that minimizes the total expected cost over the entire population of interest. Statistical inference procedures are developed for evaluating and comparing given RS rules and examined through simulation studies. The proposed procedures are illustrated with an example from the Cardiovascular Health Study.
doi:10.1093/biostatistics/kxr001
PMCID: PMC3169667  PMID: 21415016
Disease prognosis; Optimal risk stratification; Risk prediction
14.  Integrative analysis and variable selection with multiple high-dimensional data sets 
Biostatistics (Oxford, England)  2011;12(4):763-775.
In high-throughput -omics studies, markers identified from analysis of single data sets often suffer from a lack of reproducibility because of sample limitation. A cost-effective remedy is to pool data from multiple comparable studies and conduct integrative analysis. Integrative analysis of multiple -omics data sets is challenging because of the high dimensionality of data and heterogeneity among studies. In this article, for marker selection in integrative analysis of data from multiple heterogeneous studies, we propose a 2-norm group bridge penalization approach. This approach can effectively identify markers with consistent effects across multiple studies and accommodate the heterogeneity among studies. We propose an efficient computational algorithm and establish the asymptotic consistency property. Simulations and applications in cancer profiling studies show satisfactory performance of the proposed approach.
doi:10.1093/biostatistics/kxr004
PMCID: PMC3169668  PMID: 21415015
High-dimensional data; Integrative analysis; 2-norm group bridge
15.  Classifying tissue samples from measurements on cells with within-class tissue sample heterogeneity 
Biostatistics (Oxford, England)  2011;12(4):695-709.
We consider here the problem of classifying a macro-level object based on measurements of embedded (micro-level) observations within each object, for example, classifying a patient based on measurements on a collection of a random number of their cells. Classification problems with this hierarchical, nested structure have not received the same statistical understanding as the general classification problem. Some heuristic approaches have been developed and a few authors have proposed formal statistical models. We focus on the problem where heterogeneity exists between the macro-level objects within a class. We propose a model-based statistical methodology that models the log-odds of the macro-level object belonging to a class using a latent-class variable model to account for this heterogeneity. The latent classes are estimated by clustering the macro-level object density estimates. We apply this method to the detection of patients with cervical neoplasia based on quantitative cytology measurements on cells in a Papanicolaou smear. Quantitative cytology is much cheaper and potentially can take less time than the current standard of care. The results show that the automated quantitative cytology using the proposed method is roughly equivalent to clinical cytopathology and shows significant improvement over a statistical model that does not account for the heterogeneity of the data.
doi:10.1093/biostatistics/kxr010
PMCID: PMC3169670  PMID: 21642388
Automating cervical neoplasia screening; Clustering densities; Cumulative log-odds; Functional data clustering; Macro-level classification; Quantitative cytology
16.  Recursive partitioning of resistant mutations for longitudinal markers based on a U-type score 
Biostatistics (Oxford, England)  2011;12(4):750-762.
Development of human immunodeficiency virus resistance mutations is a major cause of failure of antiretroviral treatment. We develop a recursive partitioning method to correlate high-dimensional viral sequences with repeatedly measured outcomes. The splitting criterion of this procedure is based on a class of U-type score statistics. The proposed method is flexible enough to apply to a broad range of problems involving longitudinal outcomes. Simulation studies are performed to explore the finite-sample properties of the proposed method, which is also illustrated through analysis of data collected in 3 phase II clinical trials testing the antiretroviral drug efavirenz.
doi:10.1093/biostatistics/kxr011
PMCID: PMC3169671  PMID: 21596729
Antiretroviral drugs; Longitudinal data; Recursive partitioning; Repeated measurements; Resistance mutations; Tree method
17.  A fused lasso latent feature model for analyzing multi-sample aCGH data 
Biostatistics (Oxford, England)  2011;12(4):776-791.
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
doi:10.1093/biostatistics/kxr012
PMCID: PMC3169672  PMID: 21642389
Cancer; DNA copy number; False discovery rate; Mutation
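The fused lasso penalty mentioned in the abstract above can be illustrated with a short sketch. This is the generic textbook form of the penalty applied to a single feature vector, not the FLLat objective itself; the abstract does not say how the paper parameterizes the penalty, so the names `lam1` and `lam2` are hypothetical.

```python
from typing import Sequence


def fused_lasso_penalty(beta: Sequence[float], lam1: float, lam2: float) -> float:
    """Generic fused lasso penalty for one feature vector.

    lam1 weights the sum of absolute values (encourages zeros);
    lam2 weights the sum of absolute differences between neighbouring
    positions (encourages piecewise-constant segments).
    """
    sparsity = sum(abs(b) for b in beta)
    smoothness = sum(abs(b1 - b0) for b0, b1 in zip(beta, beta[1:]))
    return lam1 * sparsity + lam2 * smoothness


# A single contiguous gained region versus a fragmented profile.
contiguous = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
fragmented = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
penalty_contiguous = fused_lasso_penalty(contiguous, lam1=1.0, lam2=2.0)
penalty_fragmented = fused_lasso_penalty(fragmented, lam1=1.0, lam2=2.0)
```

With these weights the contiguous profile is penalized less than the fragmented one even though it has more nonzero entries, which is why this kind of penalty favors contiguous regions of copy number change.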
18.  A shared parameter model for the estimation of longitudinal concomitant intervention effects 
Biostatistics (Oxford, England)  2011;12(4):737-749.
We investigate a change-point approach for modeling and estimating the regression effects caused by a concomitant intervention in a longitudinal study. Since a concomitant intervention is often introduced when a patient's health status exhibits undesirable trends, statistical models without properly incorporating the intervention and its starting time may lead to biased estimates of the intervention effects. We propose a shared parameter change-point model to evaluate the pre- and postintervention time trends of the response and develop a likelihood-based method for estimating the intervention effects and other parameters. Application and statistical properties of our method are demonstrated through a longitudinal clinical trial in depression and heart disease and a simulation study.
doi:10.1093/biostatistics/kxq084
PMCID: PMC3202304  PMID: 21262930
Change-point model; Concomitant intervention; Likelihood; Longitudinal study; Shared parameter model
19.  Estimating the acute health effects of coarse particulate matter accounting for exposure measurement error 
Biostatistics (Oxford, England)  2011;12(4):637-652.
In air pollution epidemiology, there is a growing interest in estimating the health effects of coarse particulate matter (PM) with aerodynamic diameter between 2.5 and 10 μm. Coarse PM concentrations can exhibit considerable spatial heterogeneity because the particles travel shorter distances and do not remain suspended in the atmosphere for an extended period of time. In this paper, we develop a modeling approach for estimating the short-term effects of air pollution in time series analysis when the ambient concentrations vary spatially within the study region. Specifically, our approach quantifies the error in the exposure variable by characterizing, on any given day, the disagreement in ambient concentrations measured across monitoring stations. This is accomplished by viewing monitor-level measurements as error-prone repeated measurements of the unobserved population average exposure. Inference is carried out in a Bayesian framework to fully account for uncertainty in the estimation of model parameters. Finally, by using different exposure indicators, we investigate the sensitivity of the association between coarse PM and daily hospital admissions based on a recent national multisite time series analysis. Among Medicare enrollees from 59 US counties between 1999 and 2005, we find a consistent positive association between coarse PM and same-day admission for cardiovascular diseases.
doi:10.1093/biostatistics/kxr002
PMCID: PMC3202305  PMID: 21297159
Air pollution; Coarse particulate matter; Exposure measurement error; Multisite time series analysis
20.  A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data 
Biostatistics (Oxford, England)  2012;14(2):232-243.
Recent developments in RNA-sequencing (RNA-seq) technology have led to a rapid increase in gene expression data in the form of counts. RNA-seq can be used for a variety of applications; however, identifying differential expression (DE) remains a key task in functional genomics. There have been a number of statistical methods for DE detection for RNA-seq data. One common feature of several leading methods is the use of the negative binomial (Gamma–Poisson mixture) model. That is, the unobserved gene expression is modeled by a gamma random variable and, given the expression, the sequencing read counts are modeled as Poisson. The distinct feature in various methods is how the variance, or dispersion, in the Gamma distribution is modeled and estimated. We evaluate several large public RNA-seq datasets and find that the estimated dispersion in existing methods does not adequately capture the heterogeneity of biological variance among samples. We present a new empirical Bayes shrinkage estimate of the dispersion parameters and demonstrate improved DE detection.
doi:10.1093/biostatistics/kxs033
PMCID: PMC3590927  PMID: 23001152
Differential expression; Empirical Bayes; RNA sequencing; Shrinkage estimator
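The Gamma–Poisson construction described in the abstract above can be sketched as a small simulation. It draws a gamma-distributed expression level with mean `mu` and then a Poisson count given that level, so the count has mean `mu` and variance `mu + phi * mu**2`, where `phi` is the dispersion. This is only the standard mixture representation, not the shrinkage estimator proposed in the paper; the function names and parameter values are illustrative assumptions.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw a Poisson count by Knuth's multiplication method.

    Adequate for the modest rates used in this sketch.
    """
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_gamma_poisson(mu: float, phi: float, rng: random.Random) -> int:
    """Draw one count from the Gamma-Poisson (negative binomial) mixture."""
    # Gamma with shape 1/phi and scale mu*phi has mean mu and variance phi*mu^2.
    lam = rng.gammavariate(1.0 / phi, mu * phi)
    return sample_poisson(lam, rng)


def nb_variance(mu: float, phi: float) -> float:
    """Theoretical variance of the mixture: Poisson part plus gamma part."""
    return mu + phi * mu ** 2


rng = random.Random(0)
mu, phi = 5.0, 0.5
draws = [sample_gamma_poisson(mu, phi, rng) for _ in range(20_000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
```

Comparing `sample_var` with `nb_variance(mu, phi)` shows the overdispersion relative to a plain Poisson model, whose variance would equal `mu`.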
21.  Contact intervals, survival analysis of epidemic data, and estimation of R0 
Biostatistics (Oxford, England)  2010;12(3):548-566.
We argue that the time from the onset of infectiousness to infectious contact, which we call the “contact interval,” is a better basis for inference in epidemic data than the generation or serial interval. Since contact intervals can be right censored, survival analysis is the natural approach to estimation. Estimates of the contact interval distribution can be used to estimate R0 in both mass-action and network-based models. We apply these methods to 2 data sets from the 2009 influenza A(H1N1) pandemic.
doi:10.1093/biostatistics/kxq068
PMCID: PMC3114649  PMID: 21071607
Basic reproductive number (R0); Epidemic data; Generation intervals; Survival analysis
22.  Partial linear inference for a 2-stage outcome-dependent sampling design with a continuous outcome 
Biostatistics (Oxford, England)  2010;12(3):506-520.
The outcome-dependent sampling (ODS) design, which allows observation of exposure variable to depend on the outcome, has been shown to be cost efficient. In this article, we propose a new statistical inference method, an estimated penalized likelihood method, for a partial linear model in the setting of a 2-stage ODS with a continuous outcome. We develop the asymptotic properties and conduct simulation studies to demonstrate the performance of the proposed estimator. A real environmental study data set is used to illustrate the proposed method.
doi:10.1093/biostatistics/kxq070
PMCID: PMC3114650  PMID: 21156990
Biased sampling; Partial linear model; P-spline; Validation sample; 2-stage
23.  Evaluation of diagnostic accuracy in detecting ordered symptom statuses without a gold standard 
Biostatistics (Oxford, England)  2011;12(3):567-581.
Our research is motivated by 2 methodological problems in assessing diagnostic accuracy of traditional Chinese medicine (TCM) doctors in detecting a particular symptom whose true status has an ordinal scale and is unknown—imperfect gold standard bias and ordinal scale symptom status. In this paper, we proposed a nonparametric maximum likelihood method for estimating and comparing the accuracy of different doctors in detecting a particular symptom without a gold standard when the true symptom status had an ordered multiple class. In addition, we extended the concept of the area under the receiver operating characteristic curve to a hyper-dimensional overall accuracy for diagnostic accuracy and alternative graphs for displaying a visual result. The simulation studies showed that the proposed method had good performance in terms of bias and mean squared error. Finally, we applied our method to our motivating example on assessing the diagnostic abilities of 5 TCM doctors in detecting symptoms related to Chills disease.
doi:10.1093/biostatistics/kxq075
PMCID: PMC3114651  PMID: 21209155
Bootstrap; Diagnostic accuracy; EM algorithm; MSE; Ordinal tests; Traditional Chinese medicine (TCM); Volume under the ROC surface (VUS)
24.  A continuous-index Bayesian hidden Markov model for prediction of nucleosome positioning in genomic DNA 
Biostatistics (Oxford, England)  2010;12(3):462-477.
Nucleosomes are units of chromatin structure, consisting of DNA sequence wrapped around proteins called “histones.” Nucleosomes occur at variable intervals throughout genomic DNA and prevent transcription factor (TF) binding by blocking TF access to the DNA. A map of nucleosomal locations would enable researchers to detect TF binding sites with greater efficiency. Our objective is to construct an accurate genomic map of nucleosome-free regions (NFRs) based on data from high-throughput genomic tiling arrays in yeast. These high-volume data typically have a complex structure in the form of dependence on neighboring probes as well as underlying DNA sequence, variable-sized gaps, and missing data. We propose a novel continuous-index model appropriate for non-equispaced tiling array data that simultaneously incorporates DNA sequence features relevant to nucleosome formation. Simulation studies and an application to a yeast nucleosomal assay demonstrate the advantages of using the new modeling framework, as well as its robustness to distributional misspecifications. Our results reinforce the previous biological hypothesis that higher-order nucleotide combinations are important in distinguishing nucleosomal regions from NFRs.
doi:10.1093/biostatistics/kxq077
PMCID: PMC3114652  PMID: 21193724
Chromatin structure; Data augmentation; FAIRE; Tiling arrays
25.  Efficient p-value evaluation for resampling-based tests 
Biostatistics (Oxford, England)  2011;12(3):582-593.
The resampling-based test, which often relies on permutation or bootstrap procedures, has been widely used for statistical hypothesis testing when the asymptotic distribution of the test statistic is unavailable or unreliable. It requires repeated calculations of the test statistic on a large number of simulated data sets for its significance level assessment, and thus it could become very computationally intensive. Here, we propose an efficient p-value evaluation procedure by adapting the stochastic approximation Markov chain Monte Carlo algorithm. The new procedure can be used easily for estimating the p-value for any resampling-based test. We show through numeric simulations that the proposed procedure can be 100–500 000 times as efficient (in terms of computing time) as the standard resampling-based procedure when evaluating a test statistic with a small p-value (e.g. less than 10⁻⁶). With its computational burden reduced by this proposed procedure, the versatile resampling-based test would become computationally feasible for a much wider range of applications. We demonstrate the application of the new method by applying it to a large-scale genetic association study of prostate cancer.
doi:10.1093/biostatistics/kxq078
PMCID: PMC3114653  PMID: 21209154
Bootstrap procedures; Genetic association studies; p-value; Resampling-based tests; Stochastic approximation Markov chain Monte Carlo
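For context, the standard resampling-based procedure that the abstract above uses as its baseline can be sketched as a two-sample permutation test. This is the plain Monte Carlo version, not the stochastic approximation Markov chain Monte Carlo procedure proposed in the paper; the function name, the choice of test statistic (absolute difference in means), and the defaults are illustrative assumptions.

```python
import random
from typing import Sequence


def permutation_p_value(
    x: Sequence[float],
    y: Sequence[float],
    n_perm: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n_x = len(x)
    observed = abs(sum(x) / n_x - sum(y) / len(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled observations
        stat = abs(sum(pooled[:n_x]) / n_x - sum(pooled[n_x:]) / len(y))
        if stat >= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)


p_null = permutation_p_value([1, 2, 3], [1, 2, 3], n_perm=500)
p_shift = permutation_p_value(list(range(10, 18)), list(range(8)), n_perm=2000)
```

Because the smallest value this estimator can return is `1 / (n_perm + 1)`, resolving a p-value near 10⁻⁶ needs on the order of millions of permutations, which is the computational cost the paper sets out to reduce.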
