Results 1-25 (140)

1.  NONPARAMETRIC INFERENCE PROCEDURE FOR PERCENTILES OF THE RANDOM EFFECTS DISTRIBUTION IN META-ANALYSIS 
The annals of applied statistics  2010;4(1):520-532.
To investigate whether treating cancer patients with erythropoiesis-stimulating agents (ESAs) would increase the mortality risk, Bennett et al. [Journal of the American Medical Association 299 (2008) 914–924] conducted a meta-analysis with the data from 52 phase III trials comparing ESAs with placebo or standard of care. With a standard parametric random effects modeling approach, the study concluded that ESA administration was significantly associated with increased average mortality risk. In this article we present a simple nonparametric inference procedure for the distribution of the random effects. We re-analyzed the ESA mortality data with the new method. Our results about the center of the random effects distribution were markedly different from those reported by Bennett et al. Moreover, our procedure, which estimates the distribution of the random effects, as opposed to just a simple population average, suggests that the ESA may be beneficial with respect to mortality for approximately a quarter of the study populations. This new meta-analysis technique can be implemented with study-level summary statistics. In contrast to existing methods for parametric random effects models, the validity of our proposal does not require the number of studies involved to be large. From the results of an extensive numerical study, we find that the new procedure performs well even with moderate individual study sample sizes.
doi:10.1214/09-AOAS280SUPP
PMCID: PMC4321956
Bivariate beta; conditional permutation test; erythropoiesis-stimulating agents; logit-normal; two-level hierarchical model
2.  Longitudinal High-Dimensional Principal Components Analysis with Application to Diffusion Tensor Imaging of Multiple Sclerosis 
The annals of applied statistics  2014;8(4):2175-2202.
We develop a flexible framework for modeling high-dimensional imaging data observed longitudinally. The approach decomposes the observed variability of repeatedly measured high-dimensional observations into three additive components: a subject-specific imaging random intercept that quantifies the cross-sectional variability, a subject-specific imaging slope that quantifies the dynamic irreversible deformation over multiple realizations, and a subject-visit specific imaging deviation that quantifies exchangeable effects between visits. The proposed method is very fast, scalable to studies including ultra-high dimensional data, and can easily be adapted to and executed on modest computing infrastructures. The method is applied to the longitudinal analysis of diffusion tensor imaging (DTI) data of the corpus callosum of multiple sclerosis (MS) subjects. The study includes 176 subjects observed at 466 visits. For each subject and visit the study contains a registered DTI scan of the corpus callosum at roughly 30,000 voxels.
PMCID: PMC4316386  PMID: 25663955
principal components; linear mixed model; diffusion tensor imaging; brain imaging data; multiple sclerosis
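A schematic of the three-part decomposition described above may help fix ideas; the notation below is hypothetical rather than the authors' own:

    Y_{ij}(v) = \mu(v) + X_i(v) + t_{ij} Z_i(v) + W_{ij}(v),

where Y_{ij}(v) is the image measurement at voxel v for subject i at visit j, \mu(v) is the population mean, X_i(v) is the subject-specific imaging random intercept, Z_i(v) is the subject-specific imaging slope scaled by visit time t_{ij}, and W_{ij}(v) is the exchangeable subject-visit deviation.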
3.  [No title available] 
PMCID: PMC4295721  PMID: 25598858
4.  SEPARABLE FACTOR ANALYSIS WITH APPLICATIONS TO MORTALITY DATA 
The annals of applied statistics  2014;8(1):120-147.
Human mortality data sets can be expressed as multiway data arrays, the dimensions of which correspond to categories by which mortality rates are reported, such as age, sex, country and year. Regression models for such data typically assume an independent error distribution or an error model that allows for dependence along at most one or two dimensions of the data array. However, failing to account for other dependencies can lead to inefficient estimates of regression parameters, inaccurate standard errors and poor predictions. An alternative to assuming independent errors is to allow for dependence along each dimension of the array using a separable covariance model. However, the number of parameters in this model increases rapidly with the dimensions of the array and, for many arrays, maximum likelihood estimates of the covariance parameters do not exist. In this paper, we propose a submodel of the separable covariance model that estimates the covariance matrix for each dimension as having factor analytic structure. This model can be viewed as an extension of factor analysis to array-valued data, as it uses a factor model to estimate the covariance along each dimension of the array. We discuss properties of this model as they relate to ordinary factor analysis, describe maximum likelihood and Bayesian estimation methods, and provide a likelihood ratio testing procedure for selecting the factor model ranks. We apply this methodology to the analysis of data from the Human Mortality Database, and show in a cross-validation experiment how it outperforms simpler methods. Additionally, we use this model to impute mortality rates for countries that have no mortality data for several years. Unlike other approaches, our methodology is able to estimate similarities between the mortality rates of countries, time periods and sexes, and use this information to assist with the imputations.
PMCID: PMC4256680  PMID: 25489353
Array normal; Kronecker product; multiway data; Bayesian estimation; imputation
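To make the covariance structure concrete, here is a minimal numpy sketch of a separable covariance with a factor-analytic model on each dimension; the array sizes, ranks, and dimension labels are illustrative assumptions, not the Human Mortality Database layout.

    import numpy as np

    # Separable covariance with a factor-analytic model per dimension.
    rng = np.random.default_rng(0)
    dims = [4, 2, 6]      # e.g. age groups x sexes x years (hypothetical)
    ranks = [2, 1, 2]     # factor-model rank chosen for each dimension

    def factor_cov(p, r):
        """Rank-r factor-analytic covariance: Lambda Lambda' + D."""
        lam = rng.normal(size=(p, r))
        d = rng.uniform(0.5, 1.5, size=p)
        return lam @ lam.T + np.diag(d)

    covs = [factor_cov(p, r) for p, r in zip(dims, ranks)]

    # The covariance of the vectorized array is the Kronecker product of
    # the per-dimension covariances; the factor structure keeps the
    # parameter count small even though the full matrix here is 48 x 48.
    sigma = covs[0]
    for c in covs[1:]:
        sigma = np.kron(sigma, c)
    print(sigma.shape)  # (48, 48)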
5.  GENE-LEVEL PHARMACOGENETIC ANALYSIS ON SURVIVAL OUTCOMES USING GENE-TRAIT SIMILARITY REGRESSION 
The annals of applied statistics  2014;8(2):1232-1255.
Gene/pathway-based methods are drawing significant attention due to their usefulness in detecting rare and common variants that affect disease susceptibility. The biological mechanism of drug responses indicates that a gene-based analysis has even greater potential in pharmacogenetics. Motivated by a study from the Vitamin Intervention for Stroke Prevention (VISP) trial, we develop a gene-trait similarity regression for survival analysis to assess the effect of a gene or pathway on time-to-event outcomes. The similarity regression has a general framework that covers a range of survival models, such as the proportional hazards model and the proportional odds model. The inference procedure developed under the proportional hazards model is robust against model misspecification. We derive the equivalence between the similarity survival regression and a random effects model, which further unifies the current variance-component based methods. We demonstrate the effectiveness of the proposed method through simulation studies. In addition, we apply the method to the VISP trial data to identify the genes that exhibit an association with the risk of a recurrent stroke. The TCN2 gene was found to be associated with the recurrent stroke risk in the low-dose arm. This gene may impact recurrent stroke risk in response to cofactor therapy.
PMCID: PMC4091797  PMID: 25018788
association study; gene/pathway; pharmacogenetics; similarity regression; survival data; proportional odds model; proportional hazards model
6.  COMBINING ISOTONIC REGRESSION AND EM ALGORITHM TO PREDICT GENETIC RISK UNDER MONOTONICITY CONSTRAINT 
The annals of applied statistics  2014;8(2):1182-1208.
In certain genetic studies, clinicians and genetic counselors are interested in estimating the cumulative risk of a disease for individuals with and without a rare deleterious mutation. Estimating the cumulative risk is difficult, however, when the estimates are based on family history data. Often, the genetic mutation status in many family members is unknown; instead, only estimated probabilities of a patient having a certain mutation status are available. Also, ages of disease-onset are subject to right censoring. Existing methods to estimate the cumulative risk using such family-based data only provide estimates at individual time points, and are not guaranteed to be monotonic or non-negative. In this paper, we develop a novel method that combines Expectation-Maximization and isotonic regression to estimate the cumulative risk across the entire support. Our estimator is monotonic, satisfies self-consistent estimating equations, and has high power in detecting differences between the cumulative risks of different populations. Application of our estimator to a Parkinson’s disease (PD) study provides the age-at-onset distribution of PD in PARK2 mutation carriers and non-carriers, and reveals a significant difference between the distributions in compound heterozygous carriers and non-carriers, but not between heterozygous carriers and non-carriers.
doi:10.1214/14-AOAS730
PMCID: PMC4231830  PMID: 25404955
Binomial likelihood; Parkinson’s disease; Pool adjacent violators algorithm; Self-consistent estimating equations
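The isotonic step can be illustrated with scikit-learn's pool-adjacent-violators implementation; the ages, raw risks, and weights below are made up, and the full method would alternate this step with an E-step over the unknown carrier statuses.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    # Pool adjacent violators: turn noisy pointwise risk estimates into a
    # monotone cumulative-risk curve. Numbers are illustrative only.
    ages = np.array([40, 45, 50, 55, 60, 65, 70])
    raw_risk = np.array([0.02, 0.05, 0.04, 0.11, 0.10, 0.18, 0.22])  # not monotone
    weights = np.array([120, 110, 95, 80, 60, 45, 30])  # e.g. people at risk

    iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True)
    monotone_risk = iso.fit_transform(ages, raw_risk, sample_weight=weights)
    print(monotone_risk)  # nondecreasing, stays in [0, 1]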
7.  Local Tests for Identifying Anisotropic Diffusion Areas in Human Brain with DTI 
The annals of applied statistics  2013;7(1):201-225.
Diffusion tensor imaging (DTI) plays a key role in analyzing the physical structures of biological tissues, particularly in reconstructing fiber tracts of the human brain in vivo. On the one hand, eigenvalues of diffusion tensors (DTs) estimated from diffusion weighted imaging (DWI) data usually contain systematic bias, which subsequently biases the diffusivity measurements popularly adopted in fiber tracking algorithms. On the other hand, correctly accounting for the spatial information is important in the construction of these diffusivity measurements since the fiber tracts are typically spatially structured. This paper aims to establish test-based approaches to identify anisotropic water diffusion areas in the human brain. These areas in turn indicate the areas passed by fiber tracts. Our proposed test statistic not only takes into account the bias components in eigenvalue estimates, but also incorporates the spatial information of neighboring voxels. Under mild regularity conditions, we demonstrate that the proposed test statistic asymptotically follows a χ² distribution under the null hypothesis. Simulation and real DTI data examples are provided to illustrate the efficacy of our proposed methods.
PMCID: PMC4280843  PMID: 25558295
Brain tissue; diffusion tensor; eigenvalue; fiber tracts; local test; quantitative scalar
8.  LEVERAGING LOCAL IDENTITY-BY-DESCENT INCREASES THE POWER OF CASE/CONTROL GWAS WITH RELATED INDIVIDUALS 
The annals of applied statistics  2014;8(2):974-998.
Large case/control genome-wide association studies (GWAS) often include groups of related individuals with known relationships. When testing for associations at a given locus, current methods incorporate only the familial relationships between individuals. Here, we introduce the chromosome-based Quasi Likelihood Score (cQLS) statistic that incorporates local Identity-By-Descent (IBD) to increase the power to detect associations. In studies robust to population stratification, such as those with case/control sibling pairs, simulations show that the study power can be increased by over 50%. In our example, a GWAS examining late-onset Alzheimer's disease, the p-values among the most strongly associated SNPs in the APOE gene tend to decrease, with the smallest p-value decreasing from 1.23 × 10⁻⁸ to 7.70 × 10⁻⁹. Furthermore, as a part of our simulations, we reevaluate our expectations about the use of families in GWAS. We show that, although adding only half as many unique chromosomes, genotyping affected siblings is more efficient than genotyping randomly ascertained cases. We also show that genotyping cases with a family history of disease will be less beneficial when searching for SNPs with smaller effect sizes.
PMCID: PMC4275846  PMID: 25544865
cQLS; GWAS; related individuals; case-control
9.  Imputation of Truncated p-Values For Meta-Analysis Methods and Its Genomic Application 
The annals of applied statistics  2014;8(4):2150-2174.
Microarray analysis to monitor expression activities in thousands of genes simultaneously has become routine in biomedical research during the past decade. A tremendous amount of expression profiles are generated and stored in the public domain, and information integration by meta-analysis to detect differentially expressed (DE) genes has become popular to obtain increased statistical power and validated findings. Methods that aggregate transformed p-value evidence have been widely used in genomic settings, among which Fisher's and Stouffer's methods are the most popular ones. In practice, raw data and p-values of DE evidence are often not available in genomic studies that are to be combined. Instead, only the detected DE gene lists under a certain p-value threshold (e.g., DE genes with p-value < 0.001) are reported in journal publications. The truncated p-value information makes the aforementioned meta-analysis methods inapplicable and researchers are forced to apply a less efficient vote counting method or naïvely drop the studies with incomplete information. The purpose of this paper is to develop effective meta-analysis methods for such situations with partially censored p-values. We developed and compared three imputation methods—mean imputation, single random imputation and multiple imputation—for a general class of evidence aggregation methods of which Fisher's and Stouffer's methods are special examples. The null distribution of each method was analytically derived and subsequent inference and genomic analysis frameworks were established. Simulations were performed to investigate the type I error, power and the control of false discovery rate (FDR) for (correlated) gene expression data. The proposed methods were applied to several genomic applications in colorectal cancer, pain and liquid association analysis of major depressive disorder (MDD). The results showed that imputation methods outperformed existing naïve approaches. Mean imputation and multiple imputation methods performed the best and are recommended for future applications.
doi:10.1214/14-AOAS747
PMCID: PMC4274812  PMID: 25541588
Microarray analysis; meta-analysis; Fisher's method; Stouffer's method; missing value imputation
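A minimal sketch of the mean-imputation idea for Fisher's method, assuming a single reporting threshold tau; the paper derives the exact null distribution, whereas this illustration simply plugs imputed values into the usual chi-square reference.

    import numpy as np
    from scipy import stats

    tau = 0.001                              # only p < tau were published
    reported = np.array([2e-4, 7e-5, 8e-4])  # observed p-values (p < tau)
    n_censored = 2                           # studies reporting only "p >= tau"

    # Under the null, p | p >= tau is Uniform(tau, 1), with mean (1 + tau) / 2.
    imputed = np.full(n_censored, (1 + tau) / 2)
    p = np.concatenate([reported, imputed])

    fisher_stat = -2 * np.sum(np.log(p))
    # Chi-square reference with 2k df; with imputation this is approximate.
    p_combined = stats.chi2.sf(fisher_stat, df=2 * len(p))
    print(fisher_stat, p_combined)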
10.  EFFECT OF BREASTFEEDING ON GASTROINTESTINAL INFECTION IN INFANTS: A TARGETED MAXIMUM LIKELIHOOD APPROACH FOR CLUSTERED LONGITUDINAL DATA 
The annals of applied statistics  2014;8(2):703-725.
The PROmotion of Breastfeeding Intervention Trial (PROBIT) cluster-randomized a program encouraging breastfeeding to new mothers in hospital centers. The original studies indicated that this intervention successfully increased duration of breastfeeding and lowered rates of gastrointestinal tract infections in newborns. Additional scientific and popular interest lies in determining the causal effect of longer breastfeeding on gastrointestinal infection. In this study, we estimate the expected infection count under various lengths of breastfeeding in order to estimate the effect of breastfeeding duration on infection. Due to the presence of baseline and time-dependent confounding, specialized “causal” estimation methods are required. We demonstrate the double-robust method of Targeted Maximum Likelihood Estimation (TMLE) in the context of this application and review some related methods and the adjustments required to account for clustering. We compare TMLE (implemented both parametrically and using a data-adaptive algorithm) to other causal methods for this example. In addition, we conduct a simulation study to determine (1) the effectiveness of controlling for clustering indicators when cluster-specific confounders are unmeasured and (2) the importance of using data-adaptive TMLE.
PMCID: PMC4259272  PMID: 25505499
Causal inference; G-computation; inverse probability weighting; marginal effects; missing data; pediatrics
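For contrast with TMLE, here is a sketch of plain parametric G-computation, one of the related methods mentioned above; the variable names and simulated data are hypothetical stand-ins for the PROBIT variables, and TMLE would add a targeting step on top of this.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 500
    df = pd.DataFrame({
        "bf_months": rng.integers(0, 13, n),   # breastfeeding duration
        "mat_educ": rng.integers(0, 3, n),     # baseline confounder
        "infections": rng.poisson(1.0, n),     # GI infection count
    })

    # Step 1: fit an outcome model for the infection count.
    fit = smf.glm("infections ~ bf_months + C(mat_educ)",
                  data=df, family=sm.families.Poisson()).fit()

    # Step 2: predict each child's count under fixed durations, then average.
    for a in (3, 6, 12):
        counterfactual = df.assign(bf_months=a)
        print(a, fit.predict(counterfactual).mean())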
11.  Finite-Sample Equivalence in Statistical Models for Presence-Only Data 
The annals of applied statistics  2012;7(4):1917-1939.
Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the IPP intensity function is a more natural object of inference in presence-only studies than occurrence probability (which is only defined with reference to quadrat size), and why presence-only data only allows estimation of relative, and not absolute, intensity of species occurrence.
All three of the above techniques amount to parametric density estimation under the same exponential family model (in the case of the IPP, the fitted density is multiplied by the number of presence records to obtain a fitted intensity). We show that IPP and Maxent give the exact same estimate for this density, but logistic regression in general yields a different estimate in finite samples. When the model is misspecified—as it practically always is—logistic regression and the IPP may have substantially different asymptotic limits with large data sets. We propose “infinitely weighted logistic regression,” which is exactly equivalent to the IPP in finite samples. Consequently, many already-implemented methods extending logistic regression can also extend the Maxent and IPP models in directly analogous ways using this technique.
doi:10.1214/13-AOAS667
PMCID: PMC4258396  PMID: 25493106
Presence-only data; logistic regression; maximum entropy; Poisson process models; species modeling; case-control sampling
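A minimal sketch of infinitely weighted logistic regression on synthetic data: presence points get weight 1, uniform background points get a large weight W standing in for W → ∞; the covariate and sample sizes are arbitrary choices.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    n_pres, n_back = 200, 1000
    x_pres = rng.normal(1.0, 1.0, size=(n_pres, 1))   # presence covariates
    x_back = rng.uniform(-4, 4, size=(n_back, 1))     # uniform background sample

    X = np.vstack([x_pres, x_back])
    y = np.concatenate([np.ones(n_pres), np.zeros(n_back)])
    W = 1e4                                            # "infinite" weight
    w = np.concatenate([np.ones(n_pres), np.full(n_back, W)])

    fit = LogisticRegression(C=1e10).fit(X, y, sample_weight=w)  # ~unpenalized
    print(fit.coef_)   # slope estimates; the intercept absorbs the weighting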
12.  CLUSTERING SOUTH AFRICAN HOUSEHOLDS BASED ON THEIR ASSET STATUS USING LATENT VARIABLE MODELS 
The annals of applied statistics  2014;8(2):747-776.
The Agincourt Health and Demographic Surveillance System has since 2001 conducted a biannual household asset survey in order to quantify household socio-economic status (SES) in a rural population living in northeast South Africa. The survey contains binary, ordinal and nominal items. In the absence of income or expenditure data, the SES landscape in the study population is explored and described by clustering the households into homogeneous groups based on their asset status.
A model-based approach to clustering the Agincourt households, based on latent variable models, is proposed. In the case of modeling binary or ordinal items, item response theory models are employed. For nominal survey items, a factor analysis model, similar in nature to a multinomial probit model, is used. Both model types have an underlying latent variable structure—this similarity is exploited and the models are combined to produce a hybrid model capable of handling mixed data types. Further, a mixture of the hybrid models is considered to provide clustering capabilities within the context of mixed binary, ordinal and nominal response data. The proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD).
The MFA-MD model is applied to the survey data to cluster the Agincourt households into homogeneous groups. The model is estimated within the Bayesian paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings result, providing insight to the different socio-economic strata within the Agincourt region.
doi:10.1214/14-AOAS726
PMCID: PMC4256055  PMID: 25485026
Clustering; mixed data; item response theory; Metropolis-within-Gibbs
13.  DISCUSSION OF: TREELETS—AN ADAPTIVE MULTI-SCALE BASIS FOR SPARSE UNORDERED DATA 
The annals of applied statistics  2008;2(2):489-493.
We would like to congratulate Lee, Nadler and Wasserman on their contribution to clustering and data reduction methods for high p and low n situations. A composite of clustering and traditional principal components analysis, treelets is an innovative method for multi-resolution analysis of unordered data. It is an improvement over traditional PCA and an important contribution to clustering methodology. Their paper presents theory and supporting applications addressing the two main goals of the treelet method: (1) uncovering the underlying structure of the data and (2) reducing the data prior to applying statistical learning methods. We will organize our discussion into two main parts to address their methodology in terms of each of these two goals. We will present and discuss treelets as a clustering algorithm and as an improvement over traditional PCA. We will also discuss the applicability of treelets to more general data, in particular, the application of treelets to microarray data.
doi:10.1214/07-AOAS137
PMCID: PMC4251495  PMID: 25478036
14.  BAYESIAN DATA AUGMENTATION DOSE FINDING WITH CONTINUAL REASSESSMENT METHOD AND DELAYED TOXICITY 
The annals of applied statistics  2013;7(4):1837-2457.
A major practical impediment when implementing adaptive dose-finding designs is that the toxicity outcome used by the decision rules may not be observed shortly after the initiation of the treatment. To address this issue, we propose the data augmentation continual re-assessment method (DA-CRM) for dose finding. By naturally treating the unobserved toxicities as missing data, we show that such missing data are nonignorable in the sense that the missingness depends on the unobserved outcomes. The Bayesian data augmentation approach is used to sample both the missing data and model parameters from their posterior full conditional distributions. We evaluate the performance of the DA-CRM through extensive simulation studies, and also compare it with other existing methods. The results show that the proposed design satisfactorily resolves the issues related to late-onset toxicities and possesses desirable operating characteristics: treating patients more safely, and also selecting the maximum tolerated dose with a higher probability. The new DA-CRM is illustrated with two phase I cancer clinical trials.
doi:10.1214/13-AOAS661
PMCID: PMC3972824  PMID: 24707327
Bayesian adaptive design; Late-onset toxicity; Nonignorable missing data; Phase I clinical trial
15.  ANALYSIS OF MULTIPLE SCLEROSIS LESIONS VIA SPATIALLY VARYING COEFFICIENTS 
The annals of applied statistics  2014;8(2):1095-1118.
Magnetic resonance imaging (MRI) plays a vital role in the scientific investigation and clinical management of multiple sclerosis. Analyses of binary multiple sclerosis lesion maps are typically “mass univariate” and conducted with standard linear models that are ill suited to the binary nature of the data and ignore the spatial dependence between nearby voxels (volume elements). Smoothing the lesion maps does not entirely eliminate the non-Gaussian nature of the data and requires an arbitrary choice of the smoothing parameter. Here we present a Bayesian spatial model to accurately model binary lesion maps and to determine whether there is spatial dependence between lesion location and subject-specific covariates such as MS subtype, age, gender, disease duration and disease severity measures. We apply our model to binary lesion maps derived from T2-weighted MRI images from 250 multiple sclerosis patients classified into five clinical subtypes, and demonstrate unique modeling and predictive capabilities over existing methods.
PMCID: PMC4243942  PMID: 25431633
Image analysis; Multiple sclerosis; Magnetic resonance imaging; Lesion probability map; Markov random fields; Conditional autoregressive model; Spatially varying coefficients
16.  A BAYESIAN HIERARCHICAL SPATIAL POINT PROCESS MODEL FOR MULTI-TYPE NEUROIMAGING META-ANALYSIS 
The annals of applied statistics  2014;8(3):1800-1824.
Neuroimaging meta-analysis is an important tool for finding consistent effects over studies that each usually have 20 or fewer subjects. Interest in meta-analysis in brain mapping is also driven by a recent focus on so-called “reverse inference”: whereas traditional “forward inference” identifies the regions of the brain involved in a task, a reverse inference identifies the cognitive processes that a task engages. Such reverse inferences, however, require a set of meta-analyses, one for each possible cognitive domain. Existing methods for neuroimaging meta-analysis have significant limitations: commonly used methods are not model based, do not provide interpretable parameter estimates, and only produce null hypothesis inferences; further, they are generally designed for a single group of studies and cannot produce reverse inferences. In this work we address these limitations by adopting a non-parametric Bayesian approach for meta-analysis of data from multiple classes or types of studies. In particular, foci from each type of study are modeled as a cluster process driven by a random intensity function that is modeled as a kernel convolution of a gamma random field. The type-specific gamma random fields are linked and modeled as a realization of a common gamma random field, shared by all types, that induces correlation between study types and mimics the behavior of a univariate mixed effects model. We illustrate our model on simulation studies and a meta-analysis of five emotions from 219 studies, and check model fit by a posterior predictive assessment. In addition, we implement reverse inference by using the model to predict study type from a newly presented study. We evaluate this predictive performance via leave-one-out cross validation that is efficiently implemented using importance sampling techniques.
PMCID: PMC4241351  PMID: 25426185
Bayesian Spatial Point Processes; Classification; Hierarchical model; Random Intensity Measure; Neuroimaging meta-analysis
17.  MAXIMUM LIKELIHOOD ESTIMATION FOR SOCIAL NETWORK DYNAMICS 
The annals of applied statistics  2010;4(2):567-588.
A model for network panel data is discussed, based on the assumption that the observed data are discrete observations of a continuous-time Markov process on the space of all directed graphs on a given node set, in which changes in tie variables are independent conditional on the current graph. The model for tie changes is parametric and designed for applications to social network analysis, where the network dynamics can be interpreted as being generated by choices made by the social actors represented by the nodes of the graph. An algorithm for calculating the Maximum Likelihood estimator is presented, based on data augmentation and stochastic approximation. An application to an evolving friendship network is given and a small simulation study is presented which suggests that for small data sets the Maximum Likelihood estimator is more efficient than the earlier proposed Method of Moments estimator.
doi:10.1214/09-AOAS313
PMCID: PMC4236314  PMID: 25419259
Graphs; Longitudinal data; Method of moments; Stochastic approximation; Robbins-Monro algorithm
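The stochastic-approximation engine can be illustrated generically; the toy moment equation below stands in for the network likelihood score, which is the actual target in the paper.

    import numpy as np

    # Robbins-Monro stochastic approximation: solve E[f(theta)] = 0 from
    # noisy evaluations of f.
    rng = np.random.default_rng(3)

    def noisy_f(theta):
        # noisy observation of f(theta) = theta - 2 (root at theta = 2)
        return theta - 2.0 + rng.normal(scale=0.5)

    theta = 0.0
    for n in range(1, 2001):
        a_n = 1.0 / n    # steps with sum a_n = inf and sum a_n^2 < inf
        theta = theta - a_n * noisy_f(theta)
    print(theta)         # converges toward 2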
18.  MULTIPLE TESTING OF LOCAL MAXIMA FOR DETECTION OF PEAKS IN CHIP-SEQ DATA 
The annals of applied statistics  2013;7(1):471-494.
A topological multiple testing approach to peak detection is proposed for the problem of detecting transcription factor binding sites in ChIP-Seq data. After kernel smoothing of the tag counts over the genome, the presence of a peak is tested at each observed local maximum, followed by multiple testing correction at the desired false discovery rate level. Valid p-values for candidate peaks are computed via Monte Carlo simulations of smoothed Poisson sequences, whose background Poisson rates are obtained via linear regression from a control sample at two different scales. The proposed method identifies nearby binding sites that other methods do not.
PMCID: PMC4233463  PMID: 25411587
false discovery rate; kernel smoothing; matched filter; Poisson sequence; topological inference
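A sketch of the first two steps, kernel smoothing and local-maximum detection, on simulated Poisson tag counts; the formal p-value computation and FDR correction from the paper are not reproduced here.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d
    from scipy.signal import argrelmax

    rng = np.random.default_rng(4)
    counts = rng.poisson(2.0, size=5000).astype(float)   # background tags
    counts[2480:2520] += rng.poisson(8.0, size=40)       # a planted binding site

    smoothed = gaussian_filter1d(counts, sigma=20.0)     # kernel smoothing
    candidates = argrelmax(smoothed)[0]                  # observed local maxima
    # Report the five tallest candidate peaks.
    print(candidates[np.argsort(smoothed[candidates])[::-1][:5]])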
19.  VOXEL-LEVEL MAPPING OF TRACER KINETICS IN PET STUDIES: A STATISTICAL APPROACH EMPHASIZING TISSUE LIFE TABLES 
The annals of applied statistics  2014;8(2):1065-1094.
Most radiotracers used in dynamic positron emission tomography (PET) scanning act in a linear time-invariant fashion so that the measured time-course data are a convolution between the time course of the tracer in the arterial supply and the local tissue impulse response, known as the tissue residue function. In statistical terms the residue is a life table for the transit time of injected radiotracer atoms. The residue provides a description of the tracer kinetic information measurable by a dynamic PET scan. Decomposition of the residue function allows separation of rapid vascular kinetics from slower blood-tissue exchanges and tissue retention. For voxel-level analysis, we propose that residues be modeled by mixtures of nonparametrically derived basis residues obtained by segmentation of the full data volume. Spatial and temporal aspects of diagnostics associated with voxel-level model fitting are emphasized. Illustrative examples, some involving cancer imaging studies, are presented. Data from cerebral PET scanning with 18F fluoro-deoxyglucose (FDG) and 15O water (H2O) in normal subjects are used to evaluate the approach. Cross-validation is used to make regional comparisons between residues estimated using adaptive mixture models and those obtained with more conventional compartmental modeling techniques. Simulation studies are used to theoretically examine mean square error performance and to explore the benefit of voxel-level analysis when the primary interest is a statistical summary of regional kinetics. The work highlights the contribution that multivariate analysis tools and life-table concepts can make in the recovery of local metabolic information from dynamic PET studies, particularly ones in which the assumptions of compartmental-like models, with residues that are sums of exponentials, might not be certain.
PMCID: PMC4225726  PMID: 25392718
Kinetic analysis; life-table; mixture modeling; PET
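The convolution model in the opening sentence can be written directly in discretized form; the input and residue curves below are synthetic toys.

    import numpy as np

    # Measured tissue time-activity curve = arterial input convolved with
    # the residue function, discretized on the frame grid.
    dt = 1.0                               # frame length (s), hypothetical
    t = np.arange(0, 120, dt)
    arterial = t * np.exp(-t / 10.0)       # toy arterial input C_a(t)
    residue = np.exp(-t / 40.0)            # toy residue R(t): tracer survival

    tissue = np.convolve(arterial, residue)[: t.size] * dt   # C_T(t)
    print(tissue[:5])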
20.  HYPOTHESIS SETTING AND ORDER STATISTIC FOR ROBUST GENOMIC META-ANALYSIS 
The annals of applied statistics  2014;8(2):777-800.
Meta-analysis techniques have been widely developed and applied in genomic applications, especially for combining multiple transcriptomic studies. In this paper, we propose an order statistic of p-values (rth ordered p-value, rOP) across combined studies as the test statistic. We illustrate different hypothesis settings that detect gene markers differentially expressed (DE) “in all studies”, “in the majority of studies”, or “in one or more studies”, and specify rOP as a suitable method for detecting DE genes “in the majority of studies”. We develop methods to estimate the parameter r in rOP for real applications. Statistical properties such as its asymptotic behavior and a one-sided testing correction for detecting markers of concordant expression changes are explored. Power calculation and simulation show better performance of rOP compared to classical Fisher's method, Stouffer's method, minimum p-value method and maximum p-value method under the focused hypothesis setting. Theoretically, rOP is found connected to the naïve vote counting method and can be viewed as a generalized form of vote counting with better statistical properties. The method is applied to three microarray meta-analysis examples including major depressive disorder, brain cancer and diabetes. The results demonstrate rOP as a more generalizable, robust and sensitive statistical framework to detect disease-related markers.
PMCID: PMC4222050  PMID: 25383132
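A sketch of the rOP statistic: if the n p-values are i.i.d. Uniform(0,1) under the joint null, the rth order statistic follows a Beta(r, n − r + 1) distribution, which yields a combined p-value directly; the numbers below are made up.

    import numpy as np
    from scipy import stats

    p_values = np.array([0.001, 0.004, 0.03, 0.12, 0.47, 0.81])  # one gene, 6 studies
    n = p_values.size
    r = 4                                  # "majority of studies" threshold

    rop = np.sort(p_values)[r - 1]         # the rth smallest p-value
    p_combined = stats.beta.cdf(rop, r, n - r + 1)
    print(rop, p_combined)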
21.  UNEXPECTED PROPERTIES OF BANDWIDTH CHOICE WHEN SMOOTHING DISCRETE DATA FOR CONSTRUCTING A FUNCTIONAL DATA CLASSIFIER 
The annals of statistics  2013;41(6):2739-2767.
The data functions that are studied in the course of functional data analysis are assembled from discrete data, and the level of smoothing that is used is generally that which is appropriate for accurate approximation of the conceptually smooth functions that were not actually observed. Existing literature shows that this approach is effective, and even optimal, when using functional data methods for prediction or hypothesis testing. However, in the present paper we show that this approach is not effective in classification problems. There, a useful rule of thumb is that undersmoothing is often desirable, but there are several surprising qualifications to that approach. First, the effect of smoothing the training data can be more significant than that of smoothing the new data set to be classified; second, undersmoothing is not always the right approach, and in fact in some cases using a relatively large bandwidth can be more effective; and third, these perverse results are the consequence of very unusual properties of error rates, expressed as functions of smoothing parameters. For example, the orders of magnitude of optimal smoothing parameter choices depend on the signs and sizes of terms in an expansion of error rate, and those signs and sizes can vary dramatically from one setting to another, even for the same classifier.
doi:10.1214/13-AOS1158
PMCID: PMC4191932  PMID: 25309640
Centroid method; discrimination; kernel smoothing; quadratic discrimination; smoothing parameter choice; training data
22.  A SEMI-PARAMETRIC BAYESIAN MODEL OF INTER- AND INTRA-EXAMINER AGREEMENT FOR PERIODONTAL PROBING DEPTH 
The annals of applied statistics  2014;8(1):331-351.
Periodontal probing depth is a measure of periodontitis severity. We develop a Bayesian hierarchical model linking true pocket depth to both observed and recorded values of periodontal probing depth, while permitting correlation among measures obtained from the same mouth and between duplicate examiners’ measures obtained at the same periodontal site. Periodontal site-specific examiner effects are modeled as arising from a Dirichlet process mixture, facilitating identification of classes of sites that are measured with similar bias. Using simulated data, we demonstrate the model's ability to recover examiner site-specific bias and variance heterogeneity and to provide cluster-adjusted point and interval agreement estimates. We conclude with an analysis of data from a probing depth calibration training exercise.
doi:10.1214/13-AOAS688
PMCID: PMC4175569  PMID: 25264473
Agreement; cluster-correlated data; clustering; Dirichlet process mixture model; measurement error; periodontal disease; weighted kappa
23.  A NEW METHOD OF PEAK DETECTION FOR ANALYSIS OF COMPREHENSIVE TWO-DIMENSIONAL GAS CHROMATOGRAPHY MASS SPECTROMETRY DATA 
The annals of applied statistics  2014;8(2):1209-1231.
We develop a novel peak detection algorithm for the analysis of comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GC×GC-TOF MS) data using normal-exponential-Bernoulli (NEB) and mixture probability models. The algorithm first performs baseline correction and denoising simultaneously using the NEB model, which also defines peak regions. Peaks are then picked using a mixture of probability distributions to deal with co-eluting peaks. Peak merging is further carried out based on the mass spectral similarities among the peaks within the same peak group. The algorithm is evaluated using experimental data to study the effect of different cut-offs of the conditional Bayes factors and the effect of different mixture models including Poisson, truncated Gaussian, Gaussian, Gamma, and exponentially modified Gaussian (EMG) distributions, and the optimal version is identified using a trial-and-error approach. We then compare the new algorithm with two existing algorithms in terms of compound identification. Data analysis shows that the developed algorithm can detect the peaks with lower false discovery rates than the existing algorithms, and a less complicated peak picking model is a promising alternative to the more complicated and widely used EMG mixture models.
PMCID: PMC4175529  PMID: 25264474
Bayes factor; GC×GC-TOF MS; metabolomics; mixture model; normal-exponential-Bernoulli (NEB) model; peak detection
24.  TOXICITY PROFILING OF ENGINEERED NANOMATERIALS VIA MULTIVARIATE DOSE-RESPONSE SURFACE MODELING 
The annals of applied statistics  2012;6(4):1707-1729.
New generation in vitro high-throughput screening (HTS) assays for the assessment of engineered nanomaterials provide an opportunity to learn how these particles interact at the cellular level, particularly in relation to injury pathways. These types of assays are often characterized by small sample sizes, high measurement error and high dimensionality, as multiple cytotoxicity outcomes are measured across an array of doses and durations of exposure. In this paper we propose a probability model for the toxicity profiling of engineered nanomaterials. A hierarchical structure is used to account for the multivariate nature of the data by modeling dependence between outcomes and thereby combining information across cytotoxicity pathways. In this framework we are able to provide a flexible surface-response model that provides inference and generalizations of various classical risk assessment parameters. We discuss applications of this model to data on eight nanoparticles evaluated in relation to four cytotoxicity parameters.
doi:10.1214/12-AOAS563
PMCID: PMC4151981  PMID: 25191531
Additive models; dose-response models; hierarchical models; multivariate; nanotoxicology
25.  Bayesian Non-Parametric Hierarchical Modeling for Multiple Membership Data in Grouped Attendance Interventions 
The annals of applied statistics  2013;7(2).
We develop a dependent Dirichlet process (DDP) model for repeated measures multiple membership (MM) data. This data structure arises in studies under which an intervention is delivered to each client through a sequence of elements which overlap with those of other clients on different occasions. Our interest concentrates on study designs for which the overlaps of sequences occur for clients who receive an intervention in a shared or grouped fashion whose memberships may change over multiple treatment events. Our motivating application focuses on evaluation of the effectiveness of a group therapy intervention with treatment delivered through a sequence of cognitive behavioral therapy session blocks, called modules. An open-enrollment protocol permits entry of clients at the beginning of any new module in a manner that may produce unique MM sequences across clients. We begin with a model that sums client and multiple membership module random effect terms, which are assumed independent. Our MM DDP model relaxes the assumption of conditionally independent client and module random effects by specifying a collection of random distributions for the client effect parameters that are indexed by the unique set of module attendances. We demonstrate how this construction facilitates examining heterogeneity in the relative effectiveness of group therapy modules over repeated measurement occasions.
doi:10.1214/12-AOAS620
PMCID: PMC3833697  PMID: 24273629
Bayesian hierarchical models; Conditional autoregressive prior; Dependent Dirichlet process; Group therapy; Growth curve; Mental health; Multiple membership; Non-parametric priors; Substance abuse treatment
