1.  Genomic outlier profile analysis: mixture models, null hypotheses, and nonparametric estimation 
In most analyses of large-scale genomic data sets, differential expression is assessed by testing for differences in the mean of the distributions between 2 groups. A recent finding by Tomlins and others (2005) concerns a different pattern of differential expression, in which a fraction of samples in one group show overexpression relative to samples in the other group. In this work, we describe a general mixture model framework for the assessment of this type of expression, called outlier profile analysis. We start by considering the single-gene situation and establishing results on identifiability. We propose 2 nonparametric estimation procedures that have natural links to familiar multiple testing procedures. We then develop multivariate extensions of this methodology to handle genome-wide measurements. The proposed methodologies are compared using simulation studies as well as data from a prostate cancer gene expression study.
doi:10.1093/biostatistics/kxn015
PMCID: PMC2605210  PMID: 18539648
Bonferroni correction; DNA microarray; False discovery rate; Goodness of fit; Multiple comparisons; Uniform distribution
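As a toy illustration of the outlier pattern this entry describes (all numbers invented, and a simple percentile-cutoff count standing in for the authors' estimators): when only a fraction of cases overexpress a gene, a mean-based t-test dilutes the signal, while a count of cases above a control percentile has a binomial null derived from the uniform distribution of control percentiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
controls = rng.normal(0.0, 1.0, size=50)
cases = rng.normal(0.0, 1.0, size=50)
cases[:6] += 3.0                       # only 12% of cases overexpress

# Mean-based test: diluted signal, since most cases resemble controls.
t_p = stats.ttest_ind(cases, controls).pvalue

# Outlier statistic: cases above the control 95th percentile; under the
# null this count is Binomial(n_cases, 0.05).
cut = np.quantile(controls, 0.95)
k = int(np.sum(cases > cut))
out_p = stats.binomtest(k, len(cases), 0.05, alternative="greater").pvalue
print(f"t-test p = {t_p:.3f}; outliers = {k}/50, outlier p = {out_p:.4f}")
```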
2.  A method for constructing a confidence bound for the actual error rate of a prediction rule in high dimensions
Biostatistics (Oxford, England)  2008;10(2):282-296.
Constructing a confidence interval for the actual, conditional error rate of a prediction rule from multivariate data is problematic because this error rate is not a population parameter in the traditional sense—it is a functional of the training set. When the training set changes, so does this “parameter.” A valid method for constructing confidence intervals for the actual error rate had been previously developed by McLachlan. However, McLachlan's method cannot be applied in many cancer research settings because it requires the number of samples to be much larger than the number of dimensions (n >> p), and it assumes that no dimension-reducing feature selection step is performed. Here, an alternative to McLachlan's method is presented that can be applied when p >> n, with an additional adjustment in the presence of feature selection. Coverage probabilities of the new method are shown to be nominal or conservative over a wide range of scenarios. The new method is relatively simple to implement and not computationally burdensome.
doi:10.1093/biostatistics/kxn035
PMCID: PMC2733174  PMID: 19039030
Accuracy; Confidence interval; Error rate; Prediction
3.  Measurement error caused by spatial misalignment in environmental epidemiology
Biostatistics (Oxford, England)  2008;10(2):258-274.
In many environmental epidemiology studies, the locations and/or times of exposure measurements and health assessments do not match. In such settings, health effects analyses often use the predictions from an exposure model as a covariate in a regression model. Such exposure predictions contain some measurement error as the predicted values do not equal the true exposures. We provide a framework for spatial measurement error modeling, showing that smoothing induces a Berkson-type measurement error with nondiagonal error structure. From this viewpoint, we review the existing approaches to estimation in a linear regression health model, including direct use of the spatial predictions and exposure simulation, and explore some modified approaches, including Bayesian models and out-of-sample regression calibration, motivated by measurement error principles. We then extend this work to the generalized linear model framework for health outcomes. Based on analytical considerations and simulation results, we compare the performance of all these approaches under several spatial models for exposure. Our comparisons underscore several important points. First, exposure simulation can perform very poorly under certain realistic scenarios. Second, the relative performance of the different methods depends on the nature of the underlying exposure surface. Third, traditional measurement error concepts can help to explain the relative practical performance of the different methods. We apply the methods to data on the association between levels of particulate matter and birth weight in the greater Boston area.
doi:10.1093/biostatistics/kxn033
PMCID: PMC2733173  PMID: 18927119
Air pollution; Measurement error; Predictions; Spatial misalignment
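A minimal simulation of the phenomenon this abstract describes, under invented settings (a shared predicted component plus independent error, standing in for a smoothed spatial exposure surface): using the smoothed prediction as the covariate behaves like Berkson error and leaves the slope roughly unbiased, whereas classical measurement error attenuates it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
w = rng.normal(0, 1, n)                # smooth component (the "prediction")
x = w + rng.normal(0, 0.7, n)          # true exposure = prediction + Berkson error
y = 0.5 * x + rng.normal(0, 1, n)      # health model with true slope 0.5

# Berkson-type error: regressing y on the prediction w stays ~unbiased.
b_berkson = np.polyfit(w, y, 1)[0]

# Classical error for contrast: a noisy measurement of x attenuates the slope.
x_noisy = x + rng.normal(0, 1.0, n)
b_classical = np.polyfit(x_noisy, y, 1)[0]
print(f"slope on prediction (Berkson): {b_berkson:.2f}; "
      f"slope on noisy exposure (classical): {b_classical:.2f}")
```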
4.  Generalized linear models with unspecified reference distribution
Biostatistics (Oxford, England)  2008;10(2):205-218.
We propose a new class of semiparametric generalized linear models. As with existing models, these models are specified via a linear predictor and a link function for the mean of response Y as a function of predictors X. Here, however, the “baseline” distribution of Y at a given reference mean μ0 is left unspecified and is estimated from the data. The response distribution when the mean differs from μ0 is then generated via exponential tilting of the baseline distribution, yielding a response model that is a natural exponential family, with corresponding canonical link and variance functions. The resulting model has a level of flexibility similar to the popular proportional odds model. Maximum likelihood estimation is developed for response distributions with finite support, and the new model is studied and illustrated through simulations and example analyses from aging research.
doi:10.1093/biostatistics/kxn030
PMCID: PMC2733172  PMID: 18824517
Baseline distribution; Canonical link; Density ratio model; Exponential tilting; Linear exponential family; Natural exponential family; Quasi-likelihood; Semiparametric model
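The exponential-tilting construction can be sketched numerically: starting from a baseline pmf with reference mean μ0 = 2 (an arbitrary illustrative choice, not estimated from data as in the paper), solve for the tilt that yields a target mean; this is the canonical-link computation in this model family.

```python
import numpy as np
from scipy.optimize import brentq

support = np.arange(5)                       # finite support {0,...,4}
p0 = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # baseline pmf, mean mu0 = 2.0

def tilted_pmf(theta):
    w = p0 * np.exp(theta * support)
    return w / w.sum()

def tilted_mean(theta):
    return tilted_pmf(theta) @ support

# Solve for the tilt giving mean 2.8 (inverting the canonical link).
theta = brentq(lambda t: tilted_mean(t) - 2.8, -10, 10)
p = tilted_pmf(theta)
print(f"theta = {theta:.3f}, tilted mean = {p @ support:.3f}, "
      f"variance at mu=2.8: {p @ (support - 2.8)**2:.3f}")
```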
5.  Biomarker evaluation and comparison using the controls as a reference population
Biostatistics (Oxford, England)  2008;10(2):228-244.
The classification accuracy of a continuous marker is typically evaluated with the receiver operating characteristic (ROC) curve. In this paper, we study an alternative conceptual framework, the “percentile value.” In this framework, the controls only provide a reference distribution to standardize the marker. The analysis proceeds by analyzing the standardized marker in cases. The approach is shown to be equivalent to ROC analysis. Advantages are that it provides a framework familiar to a broad spectrum of biostatisticians and it opens up avenues for new statistical techniques in biomarker evaluation. We develop several new procedures based on this framework for comparing biomarkers and biomarker performance in different populations. We develop methods that adjust such comparisons for covariates. The methods are illustrated on data from 2 cancer biomarker studies.
doi:10.1093/biostatistics/kxn029
PMCID: PMC2648906  PMID: 18755739
Biomarker; Classification; Covariate adjustment; Percentile value; ROC; Standardization
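A short sketch of the percentile-value idea on simulated data: each case marker value is standardized to its percentile in the control distribution, and the mean percentile value reproduces the empirical AUC, which is the equivalence with ROC analysis noted in the abstract.

```python
import numpy as np

rng = np.random.default_rng(2)
controls = rng.normal(0, 1, 200)
cases = rng.normal(1, 1, 150)

# Percentile value: empirical control CDF evaluated at each case value.
pv = np.searchsorted(np.sort(controls), cases) / len(controls)

# Empirical AUC computed directly by pairwise comparison.
auc = np.mean(cases[:, None] > controls[None, :])
print(f"mean percentile value = {pv.mean():.4f}, empirical AUC = {auc:.4f}")
```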
6.  Statistical monitoring of clinical trials with multivariate response and/or multiple arms: a flexible approach
Biostatistics (Oxford, England)  2008;10(2):310-323.
Randomized clinical trials with a multivariate response and/or multiple treatment arms are increasingly common, in part because of their efficiency and a greater concern about balancing risks with benefits. In some trials, the specific types and magnitudes of treatment group differences that would warrant early termination cannot easily be specified prior to the onset of the trial and/or could change as the trial progresses. This underscores the need for more flexible monitoring methods than traditional approaches. This paper extends the repeated confidence bands approach for interim monitoring to more general settings where there can be a multivariate response and/or multiple treatment arms and where the metrics for comparing treatment groups can change during the conduct of the trial. We illustrate the approach using the results of a recent AIDS clinical trial and examine its efficiency and robustness via simulation.
doi:10.1093/biostatistics/kxn037
PMCID: PMC2648904  PMID: 19015160
Group sequential analysis; Interim review; Multiple comparisons; Multiple end points; Nonparametric inference; Repeated confidence bands
7.  Optimal 2-stage design with given power in association studies
Biostatistics (Oxford, England)  2008;10(2):324-326.
doi:10.1093/biostatistics/kxn038
PMCID: PMC2648901  PMID: 19052147
8.  Exact and efficient inference procedure for meta-analysis and its application to the analysis of independent 2 × 2 tables with all available data but without artificial continuity correction
Biostatistics (Oxford, England)  2008;10(2):275-281.
Recently, meta-analysis has been widely utilized to combine information across comparative clinical studies for evaluating drug efficacy or safety profile. When dealing with rather rare events, a substantial proportion of studies may not have any events of interest. Conventional methods either exclude such studies or add an arbitrary positive value to each cell of the corresponding 2×2 tables in the analysis. In this article, we present a simple, effective procedure to make valid inferences about the parameter of interest with all available data without artificial continuity corrections. We then use the procedure to analyze the data from 48 comparative trials involving rosiglitazone with respect to its possible cardiovascular toxicity.
doi:10.1093/biostatistics/kxn034
PMCID: PMC2648899  PMID: 18922759
Continuity correction for zero events; Exact inference procedure; Odds ratio; Risk difference
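A toy numerical illustration of the motivation (invented counts; this is the problem being addressed, not the authors' procedure): with a zero event cell the sample log odds ratio is undefined, and the conventional fix of adding a constant to every cell shifts the estimate in an arbitrary way.

```python
import math

a, b, c, d = 0, 100, 3, 97   # events/non-events in treatment and control

def log_or(a, b, c, d, cc=0.0):
    a, b, c, d = (x + cc for x in (a, b, c, d))
    return math.log(a * d / (b * c))

# log_or(a, b, c, d)         # ValueError: log(0) is undefined when a = 0
print(f"cc = 0.5 : log OR = {log_or(a, b, c, d, 0.5):+.3f}")
print(f"cc = 0.1 : log OR = {log_or(a, b, c, d, 0.1):+.3f}")
```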
9.  Modified test statistics by inter-voxel variance shrinkage with an application to fMRI
Biostatistics (Oxford, England)  2008;10(2):219-227.
Functional magnetic resonance imaging (fMRI) is a noninvasive technique which is commonly used to quantify changes in blood oxygenation and flow coupled to neuronal activation. One of the primary goals of fMRI studies is to identify localized brain regions where neuronal activation levels vary between groups. Single-voxel t-tests have been commonly used to determine whether activation related to the protocol differs across groups. Due to the generally limited number of subjects within each study, accurate estimation of variance at each voxel is difficult. Thus, combining information across voxels is desirable in order to improve efficiency. Here, we construct a hierarchical model and apply an empirical Bayesian framework for the analysis of group fMRI data, employing techniques used in high-throughput genomic studies. The key idea is to shrink residual variances by combining information across voxels and subsequently to construct an improved test statistic. This hierarchical model results in a shrinkage of voxel-wise residual sample variances toward a common value. The shrunken estimator for voxel-specific variance components in the group analyses outperforms the classical residual error estimator in terms of mean-squared error. Moreover, the shrunken test statistic decreases false-positive rates when testing differences in brain contrast maps across a wide range of simulation studies. This methodology was also applied to experimental data regarding a cognitive activation task.
doi:10.1093/biostatistics/kxn028
PMCID: PMC3159431  PMID: 18723853
General linear model; Group analysis; Hierarchical models; Image analysis; Shrinkage estimation
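A rough sketch of the shrinkage idea, in the spirit of empirical-Bayes moderated statistics from genomics: each voxel's residual variance is pulled toward a common value before forming the test statistic. The prior degrees of freedom and target variance below are fixed by hand for illustration, whereas a hierarchical model like the paper's estimates them from all voxels.

```python
import numpy as np

rng = np.random.default_rng(3)
n_vox, n_sub = 1000, 12
contrasts = rng.normal(0.0, 1.0, size=(n_vox, n_sub))   # group contrast maps

mean = contrasts.mean(axis=1)
s2 = contrasts.var(axis=1, ddof=1)                       # voxel-wise variances
d = n_sub - 1                                            # residual df

# Shrink toward a common variance s2_0 with prior df d0 (placeholders).
d0, s2_0 = 4.0, s2.mean()
s2_shrunk = (d0 * s2_0 + d * s2) / (d0 + d)

t_classic = mean / np.sqrt(s2 / n_sub)
t_moderated = mean / np.sqrt(s2_shrunk / n_sub)          # gains ~d0 extra df
print(f"sd of classic t: {t_classic.std():.2f}, "
      f"sd of moderated t: {t_moderated.std():.2f}")
```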
10.  Letter to the editor
Biostatistics (Oxford, England)  2008;10(1):201-203.
doi:10.1093/biostatistics/kxn040
PMCID: PMC2733159  PMID: 19039031
11.  Bayesian hierarchically weighted finite mixture models for samples of distributions
Biostatistics (Oxford, England)  2008;10(1):155-171.
Finite mixtures of Gaussian distributions are known to provide an accurate approximation to any unknown density. Motivated by DNA repair studies in which data are collected for samples of cells from different individuals, we propose a class of hierarchically weighted finite mixture models. The modeling framework incorporates a collection of k Gaussian basis distributions, with the individual-specific response densities expressed as mixtures of these bases. To allow heterogeneity among individuals and predictor effects, we model the mixture weights, while treating the basis distributions as unknown but common to all distributions. This results in a flexible hierarchical model for samples of distributions. We consider analysis of variance–type structures and a parsimonious latent factor representation, which leads to simplified inferences on non-Gaussian covariance structures. Methods for posterior computation are developed, and the model is used to select genetic predictors of baseline DNA damage, susceptibility to induced damage, and rate of repair.
doi:10.1093/biostatistics/kxn024
PMCID: PMC2733158  PMID: 18708650
Comet assay; Finite mixture model; Genotoxicity; Hierarchical functional data; Latent factor; Samples of distributions; Stochastic search
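The hierarchical structure can be sketched directly: k Gaussian basis distributions are common to everyone, and only the mixture weights are individual-specific. All parameter values below are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

mu = np.array([-2.0, 0.0, 2.5])        # shared basis means
sd = np.array([0.5, 1.0, 0.8])         # shared basis scales

# Individual-specific weights (rows sum to 1), e.g. driven by covariates.
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

def density(y, w):
    """Individual response density: a mixture of the common bases."""
    return np.sum(w * norm.pdf(y[:, None], mu, sd), axis=1)

y = np.linspace(-4, 5, 5)
for i, w in enumerate(W):
    print(f"individual {i}: f(y) = {np.round(density(y, w), 3)}")
```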
12.  Effective communication of standard errors and confidence intervals
doi:10.1093/biostatistics/kxn014
PMCID: PMC2639348  PMID: 18550565
13.  StepBrothers: inferring partially shared ancestries among recombinant viral sequences
Biostatistics (Oxford, England)  2008;10(1):106-120.
Phylogeneticists have developed several statistical methods to infer recombination among molecular sequences that are evolutionarily related. Of these methods, Markov change-point models currently provide the most coherent framework. Yet, the Markov assumption is faulty in that the inferred relatedness of homologous sequences across regions divided by recombinant events is not independent, particularly for nonrecombinant sequences as they share the same history. To correct this limitation, we introduce a novel random tips (RT) model. The model springs from the idea that a recombinant sequence inherits its characters from an unknown number of ancestral full-length sequences, of which one only observes the incomplete portions. The RT model decomposes recombinant sequences into their ancestral portions and then augments each portion onto the data set as unique partially observed sequences. This data augmentation generates a random number of sequences related to each other through a single inferable tree with the same random number of tips. While intuitively pleasing, this single tree corrects the independence assumptions plaguing previous methods while permitting the detection of recombination. The single tree also allows for inference of the relative times of recombination events and generalizes to incorporate multiple recombinant sequences. This generalization answers important questions with which previous models struggle. For example, we demonstrate that a group of human immunodeficiency virus type 1 recombinant viruses from Argentina, previously thought to have the same recombinant history, actually consists of 2 groups: one, a clonal expansion of a reference sequence, and another that predates the formation of the reference sequence. In another example, we demonstrate that 2 hepatitis B virus recombinant strains share similar splicing locations, suggesting a common descent of the 2 viruses. We implement and run both examples in a software package called StepBrothers, freely available to interested parties.
doi:10.1093/biostatistics/kxn019
PMCID: PMC2639346  PMID: 18562348
Bayesian; Hepatitis B virus; Human Immunodeficiency Virus; Phylogeny; Recombination
14.  Estimating the capacity for improvement in risk prediction with a marker
Biostatistics (Oxford, England)  2008;10(1):172-186.
Consider a set of baseline predictors X to predict a binary outcome D and let Y be a novel marker or predictor. This paper is concerned with evaluating the performance of the augmented risk model P(D = 1|Y,X) compared with the baseline model P(D = 1|X). The diagnostic likelihood ratio, DLRX(y), quantifies the change in risk obtained with knowledge of Y = y for a subject with baseline risk factors X. The notion is commonly used in clinical medicine to quantify the increment in risk prediction due to Y. It is contrasted here with the notion of the covariate-adjusted effect of Y in the augmented risk model. We also propose methods for making inference about DLRX(y). Case–control study designs are accommodated. The methods provide a mechanism to investigate if the predictive information in Y varies with baseline covariates. In addition, we show that when combined with a baseline risk model and information about the population distribution of Y given X, covariate-specific predictiveness curves can be estimated. These curves are useful to an individual in deciding whether ascertainment of Y is likely to be informative for him or her. We illustrate with data from 2 studies: one is a study of the performance of hearing screening tests for infants, and the other concerns the value of serum creatinine in diagnosing renal artery stenosis.
doi:10.1093/biostatistics/kxn025
PMCID: PMC2639345  PMID: 18714084
Biomarker; Classification; Diagnostic likelihood ratio; Diagnostic test; Logistic regression; Posterior probability
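A worked numerical example of the diagnostic likelihood ratio update this entry describes (all numbers invented): knowledge of Y = y multiplies the baseline (pretest) odds of disease by DLRX(y).

```python
pretest_risk = 0.10                   # P(D=1 | X) from the baseline model
dlr = 4.0                             # DLR_X(y) for the observed marker value

pretest_odds = pretest_risk / (1 - pretest_risk)
posttest_odds = pretest_odds * dlr    # odds update by the likelihood ratio
posttest_risk = posttest_odds / (1 + posttest_odds)
print(f"risk rises from {pretest_risk:.0%} to {posttest_risk:.0%}")
```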
15.  Sample size for positive and negative predictive value in diagnostic research using case–control designs
Biostatistics (Oxford, England)  2008;10(1):94-105.
Important properties of diagnostic methods are their sensitivity, specificity, and positive and negative predictive values (PPV and NPV). These methods are typically assessed via case–control samples, which include one cohort of cases known to have the disease and a second control cohort of disease-free subjects. Such studies give direct estimates of sensitivity and specificity but only indirect estimates of PPV and NPV, which also depend on the disease prevalence in the tested population. The motivating example arises in assay testing, where usage is contemplated in populations with known prevalences. Further instances include biomarker development, where subjects are selected from a population with known prevalence and assessment of PPV and NPV is crucial, and the assessment of diagnostic imaging procedures for rare diseases, where case–control studies may be the only feasible designs. We develop formulas for optimal allocation of the sample between the case and control cohorts and for computing sample size when the goal of the study is to prove that the test procedure exceeds pre-stated bounds for PPV and/or NPV. Surprisingly, the optimal sampling schemes for many purposes are highly unbalanced, even when information is desired on both PPV and NPV.
doi:10.1093/biostatistics/kxn018
PMCID: PMC3668447  PMID: 18556677
Biomarkers; Case–control study; Diagnostic testing; Optimal allocation; Sensitivity; Specificity
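The indirect estimation this abstract refers to is Bayes' rule: sensitivity and specificity from the case-control sample combine with an assumed population prevalence to give PPV and NPV. A small sketch with illustrative values, showing how strongly the predictive values depend on prevalence:

```python
def ppv_npv(sens, spec, prev):
    """Predictive values from sensitivity, specificity, and prevalence."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

for prev in (0.01, 0.10, 0.30):
    ppv, npv = ppv_npv(sens=0.90, spec=0.95, prev=prev)
    print(f"prevalence {prev:.0%}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```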
16.  On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates
Biostatistics (Oxford, England)  2008;9(4):735-749.
A typical longitudinal study prospectively collects both repeated measures of a health status outcome as well as covariates that are used either as the primary predictor of interest or as important adjustment factors. In many situations, all covariates are measured on the entire study cohort. However, in some scenarios the primary covariates are time dependent yet may be ascertained retrospectively after completion of the study. One common example would be covariate measurements based on stored biological specimens such as blood plasma. While authors have previously proposed generalizations of the standard case–control design in which the clustered outcome measurements are used to selectively ascertain covariates (Neuhaus and Jewell, 1990) and therefore provide resource efficient collection of information, these designs do not appear to be commonly used. One potential barrier to the use of longitudinal outcome-dependent sampling designs would be the lack of a flexible class of likelihood-based analysis methods. With the relatively recent development of flexible and practical methods such as generalized linear mixed models (Breslow and Clayton, 1993) and marginalized models for categorical longitudinal data (see Heagerty and Zeger, 2000, for an overview), the class of likelihood-based methods is now sufficiently well developed to capture the major forms of longitudinal correlation found in biomedical repeated measures data. Therefore, the goal of this manuscript is to promote the consideration of outcome-dependent longitudinal sampling designs and to both outline and evaluate the basic conditional likelihood analysis allowing for valid statistical inference.
doi:10.1093/biostatistics/kxn006
PMCID: PMC2733177  PMID: 18372397
Binary data; Longitudinal data analysis; Marginal models; Marginalized models; Outcome-dependent sampling; Time-dependent covariates
17.  A Bayesian approach to functional-based multilevel modeling of longitudinal data: applications to environmental epidemiology
Biostatistics (Oxford, England)  2008;9(4):686-699.
Flexible multilevel models are proposed to allow for cluster-specific smooth estimation of growth curves in a mixed-effects modeling format that includes subject-specific random effects on the growth parameters. Attention is then focused on models that examine between-cluster comparisons of the effects of an ecologic covariate of interest (e.g. air pollution) on nonlinear functionals of growth curves (e.g. maximum rate of growth). A Gibbs sampling approach is used to get posterior mean estimates of nonlinear functionals along with their uncertainty estimates. A second-stage ecologic random-effects model is used to examine the association between a covariate of interest (e.g. air pollution) and the nonlinear functionals. A unified estimation procedure is presented along with its computational and theoretical details. The models are motivated by, and illustrated with, lung function and air pollution data from the Southern California Children's Health Study.
doi:10.1093/biostatistics/kxm059
PMCID: PMC2733176  PMID: 18349036
Air pollution; Correlated data; Growth curves; Mixed-effects; Splines
18.  Estimating time-to-event from longitudinal ordinal data using random-effects Markov models: application to multiple sclerosis progression
Biostatistics (Oxford, England)  2008;9(4):750-764.
Longitudinal ordinal data are common in many scientific studies, including those of multiple sclerosis (MS), and are frequently modeled using Markov dependency. Several authors have proposed random-effects Markov models to account for heterogeneity in the population. In this paper, we go one step further and study prediction based on random-effects Markov models. In particular, we show how to calculate the probabilities of future events and confidence intervals for those probabilities, given observed data on the ordinal outcome and a set of covariates, and how to update them over time. We discuss the usefulness of depicting these probabilities for visualization and interpretation of model results and illustrate our method using data from a phase III clinical trial that evaluated the utility of interferon beta-1a (trade name Avonex) in patients with relapsing–remitting MS.
doi:10.1093/biostatistics/kxn008
PMCID: PMC2536724  PMID: 18424785
Markov model; Ordinal response; Prediction; Transition model
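The prediction task can be sketched for a plain Markov chain (an invented transition matrix with no random effects or covariates, unlike the paper's model): probabilities of future states come from powers of the transition matrix and can be updated as new states are observed.

```python
import numpy as np

P = np.array([[0.80, 0.15, 0.05],      # transition probabilities between
              [0.10, 0.75, 0.15],      # three ordinal disability states
              [0.00, 0.10, 0.90]])

state = np.array([0.0, 1.0, 0.0])      # patient currently in the middle state
for k in (1, 2, 5):
    probs = state @ np.linalg.matrix_power(P, k)
    print(f"{k}-step-ahead probabilities: {np.round(probs, 3)}")
```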
19.  Optimal screening for promising genes in 2-stage designs
Biostatistics (Oxford, England)  2008;9(4):700-714.
Detecting genetic markers with biologically relevant effects remains a challenge due to multiple testing. Standard analysis methods focus on evidence against the null and protect primarily the type I error. On the other hand, the worthwhile alternative is specified for power calculations at the design stage. The balanced test as proposed by Moerkerke and others (2006) and Moerkerke and Goetghebeur (2006) incorporates this alternative directly in the decision criterion to achieve better power. Genetic markers are selected and ranked in order of the balance of evidence they contain against the null and the target alternative. In this paper, we build on this guiding principle to develop 2-stage designs for screening genetic markers when the cost of measurements is high. For a given marker, a first sample may already provide sufficient evidence for or against the alternative. If not, more data are gathered at the second stage which is then followed by a binary decision based on all available data. By optimizing parameters which determine the decision process over the 2 stages (such as the area of the “gray” zone which leads to the gathering of extra data), the expected cost per marker can be reduced substantially. We also demonstrate that, compared to 1-stage designs, 2-stage designs achieve a better balance between true negatives and positives for the same cost.
doi:10.1093/biostatistics/kxn002
PMCID: PMC2536725  PMID: 18349035
Alternative p-value; Balanced test; Cost-efficient screening; False discovery rate; Gene selection; Multiple testing; Optimal designs; Two-stage designs
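A schematic of the per-marker 2-stage decision process described above, with arbitrary placeholder thresholds rather than the cost-optimized values the paper derives from the target alternative: a first-stage statistic either decides immediately or falls in the gray zone that triggers second-stage data.

```python
import numpy as np

rng = np.random.default_rng(4)

def two_stage_decision(z1, lo=-0.5, hi=2.5, final=1.64, info_ratio=1.0):
    if z1 >= hi:
        return "select", 1              # strong evidence for the alternative
    if z1 <= lo:
        return "drop", 1                # strong evidence against it
    # Gray zone: collect stage-2 data, combine z-scores by information weight.
    z2 = rng.normal()                   # stage-2 statistic (null case here)
    z = (z1 + np.sqrt(info_ratio) * z2) / np.sqrt(1 + info_ratio)
    return ("select" if z >= final else "drop"), 2

results = [two_stage_decision(rng.normal()) for _ in range(10_000)]
stage2 = np.mean([stage == 2 for _, stage in results])
selected = np.mean([d == "select" for d, _ in results])
print(f"stage-2 needed for {stage2:.1%} of markers; {selected:.1%} selected")
```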
20.  Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies
Biostatistics (Oxford, England)  2008;9(4):621-634.
Genome-wide association studies (GWAS) provide an important approach to identifying common genetic variants that predispose to human disease. A typical GWAS may genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) located throughout the human genome in a set of cases and controls. Logistic regression is often used to test for association between a SNP genotype and case versus control status, with corresponding odds ratios (ORs) typically reported only for those SNPs meeting selection criteria. However, when these estimates are based on the original data used to detect the variant, the results are affected by a selection bias sometimes referred to as the “winner's curse” (Capen and others, 1971). The actual genetic association is typically overestimated. We show that such selection bias may be severe in the sense that the conditional expectation of the standard OR estimator may be quite far away from the underlying parameter. Standard confidence intervals (CIs) may also have coverage far from the desired rate for the selected ORs. We propose and evaluate 3 bias-reduced estimators, and corresponding weighted estimators that combine corrected and uncorrected estimators, to reduce selection bias. Their corresponding CIs are also proposed. We study the performance of these estimators using simulated data sets and show that they reduce the bias and give CI coverage close to the desired level under various scenarios, even for associations having only small statistical power.
doi:10.1093/biostatistics/kxn001
PMCID: PMC2536726  PMID: 18310059
Bias-reduced estimator; Genome-wide association study; Odds ratio; Selection adjusted confidence interval; Selection bias
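A toy simulation of the selection bias described above (effect size, standard error, and selection threshold are all invented): conditioning on passing a significance threshold makes the reported OR estimates overshoot the truth.

```python
import numpy as np

rng = np.random.default_rng(5)
n_snps = 100_000
true_log_or = 0.10                         # same modest effect for every SNP
se = 0.08                                  # common standard error
est = true_log_or + rng.normal(0.0, se, n_snps)

selected = est / se > 3.0                  # illustrative selection threshold
print(f"selected {selected.mean():.1%} of SNPs")
print(f"mean selected OR: {np.exp(est[selected].mean()):.3f} "
      f"(true OR {np.exp(true_log_or):.3f})")
```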
21.  Extension of the SAEM algorithm for nonlinear mixed models with 2 levels of random effects
Biostatistics (Oxford, England)  2008;10(1):121-135.
This article focuses on parameter estimation for multilevel nonlinear mixed-effects models (MNLMEMs). These models are used to analyze data presenting multiple hierarchical levels of grouping (cluster data, clinical trials with several observation periods, …). The variability of the individual parameters of the regression function is thus decomposed into a between-subject variability and higher levels of variability (for example, within-subject variability). We propose maximum likelihood estimates of the parameters of those MNLMEMs with two levels of random effects, using an extension of the SAEM-MCMC algorithm. The extended SAEM algorithm is split into an explicit direct EM part and a stochastic EM part. Compared to the original algorithm, additional sufficient statistics have to be approximated by relying on the conditional distribution of the second level of random effects. This estimation method is evaluated on simulated pharmacokinetic cross-over trials mimicking theophylline concentration data. Results obtained on those data sets with either the SAEM algorithm or the FOCE algorithm (implemented in the nlme function of the R software) are compared: the biases and RMSEs of almost all the SAEM estimates are smaller than those of FOCE. Finally, we apply the extended SAEM algorithm to analyze the pharmacokinetic interaction of tenofovir with atazanavir, a novel protease inhibitor, from the ANRS 107-Puzzle 2 study. A significant decrease in the area under the curve of atazanavir is found in patients receiving both treatments.
doi:10.1093/biostatistics/kxn020
PMCID: PMC2722900  PMID: 18583352
Algorithms; Anti-HIV Agents/therapeutic use; Area Under Curve; Bias (Epidemiology); Biometry/methods; Cluster Analysis; Cross-Over Studies; Drug Interactions; Humans; Likelihood Functions; Markov Chains; Monte Carlo Method; Nonlinear Dynamics; Oligopeptides/pharmacokinetics; Pyridines/pharmacokinetics; Regression Analysis; Theophylline/pharmacokinetics; Therapeutic Equivalency; Time Factors; Multilevel nonlinear mixed effects models; SAEM algorithm; Multiple periods; Cross-over trial; Bioequivalence trials
22.  Mixture models with multiple levels, with application to the analysis of multifactor gene expression data
Biostatistics (Oxford, England)  2008;9(3):540-554.
Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data-summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we may gain new insight into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated. In this paper, we propose 2 model selection procedures for model-based clustering. Model selection in model-based clustering has to date focused on the identification of data dimensions that are relevant for clustering. However, in more complex data structures, with multiple experimental factors, such an approach does not provide easily interpreted clustering outcomes. We propose a mixture model with multiple levels that provides sparse representations both “within” and “between” cluster profiles. We explore various flexible “within-cluster” parameterizations and discuss how efficient parameterizations can greatly enhance the objective interpretability of the generated clusters. Moreover, we allow for a sparse “between-cluster” representation with a different number of clusters at different levels of an experimental factor of interest. This enhances interpretability of clusters generated in multiple-factor contexts. Interpretable cluster profiles can assist in detecting biologically relevant groups of genes that may be missed with less efficient parameterizations. We use our multilevel mixture model to mine a proliferating cell line expression data set for annotational context and regulatory motifs. We also investigate the performance of the multilevel clustering approach on several simulated data sets.
doi:10.1093/biostatistics/kxm051
PMCID: PMC3294320  PMID: 18256042
Clustering; Gene expression; Mixture model; Model selection; Profile expectation–maximization
23.  A simulation-based marginal method for longitudinal data with dropout and mismeasured covariates
Biostatistics (Oxford, England)  2008;9(3):501-512.
Longitudinal data often contain missing observations and error-prone covariates. Extensive attention has been directed to analysis methods to adjust for the bias induced by missing observations. There is relatively little work on investigating the effects of covariate measurement error on estimation of the response parameters, especially on simultaneously accounting for the biases induced by both missing values and mismeasured covariates. It is not clear what the impact of ignoring measurement error is when analyzing longitudinal data with both missing observations and error-prone covariates. In this article, we study the effects of covariate measurement error on estimation of the response parameters for longitudinal studies. We develop an inference method that adjusts for the biases induced by measurement error as well as by missingness. The proposed method does not require the full specification of the distribution of the response vector but only requires modeling its mean and variance structures. Furthermore, the proposed method employs the so-called functional modeling strategy to handle the covariate process, with the distribution of covariates left unspecified. These features, plus the simplicity of implementation, make the proposed method very attractive. In this paper, we establish the asymptotic properties for the resulting estimators. With the proposed method, we conduct sensitivity analyses on a cohort data set arising from the Framingham Heart Study. Simulation studies are carried out to evaluate the impact of ignoring covariate measurement error and to assess the performance of the proposed method.
doi:10.1093/biostatistics/kxm054
PMCID: PMC3294321  PMID: 18199691
Estimating equations; Longitudinal data; Measurement error; Missing data; Simulation and extrapolation method
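The simulation-and-extrapolation (SIMEX) idea named in the keywords can be sketched in a simple linear-regression setting (a stand-in for the paper's longitudinal models; all settings invented): refit with extra measurement error added at several levels lambda, then extrapolate the coefficient back to lambda = -1, the error-free case.

```python
import numpy as np

rng = np.random.default_rng(6)
n, beta, sig_u = 5000, 0.5, 0.6
x = rng.normal(0, 1, n)
w = x + rng.normal(0, sig_u, n)               # error-prone covariate
y = beta * x + rng.normal(0, 1, n)

lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
slopes = []
for lam in lambdas:
    # Average over pseudo-datasets with error variance inflated by lambda.
    b = [np.polyfit(w + rng.normal(0, np.sqrt(lam) * sig_u, n), y, 1)[0]
         for _ in range(20)]
    slopes.append(np.mean(b))

# Quadratic extrapolation of slope(lambda) back to lambda = -1.
coef = np.polyfit(lambdas, slopes, 2)
print(f"naive slope: {slopes[0]:.3f}, "
      f"SIMEX slope: {np.polyval(coef, -1.0):.3f}, true: {beta}")
```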
24.  Microarray background correction: maximum likelihood estimation for the normal–exponential convolution
Biostatistics (Oxford, England)  2008;10(2):352-363.
Background correction is an important preprocessing step for microarray data that attempts to adjust the data for the ambient intensity surrounding each feature. The “normexp” method models the observed pixel intensities as the sum of 2 random variables, one normally distributed and the other exponentially distributed, representing background noise and signal, respectively. Using a saddle-point approximation, Ritchie and others (2007) found normexp to be the best background correction method for 2-color microarray data. This article develops the normexp method further by improving the estimation of the parameters. A complete mathematical development is given of the normexp model and the associated saddle-point approximation. Some subtle numerical programming issues that caused the original normexp method to fail occasionally on unusual data sets are solved. A practical and reliable algorithm is developed for exact maximum likelihood estimation (MLE) using high-quality optimization software and using the saddle-point estimates as starting values. “MLE” is shown to outperform heuristic estimators proposed by other authors, both in terms of estimation accuracy and in terms of performance on real data. The saddle-point approximation is an adequate replacement in most practical situations. The performance of normexp for assessing differential expression is improved by adding a small offset to the corrected intensities.
doi:10.1093/biostatistics/kxn042
PMCID: PMC2648902  PMID: 19068485
2-color microarray; Background correction; Maximum likelihood; Nelder-Mead algorithm; Newton-Raphson algorithm; Normal-exponential convolution
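A hedged sketch of the normexp model (an illustration, not the paper's or limma's implementation): the observed intensity is a normal-plus-exponential convolution, which is scipy's exponnorm distribution, so its generic fit() yields a maximum likelihood estimate that can be plugged into the standard conditional-mean background correction.

```python
import numpy as np
from scipy.stats import exponnorm, norm

rng = np.random.default_rng(7)
mu, sigma, alpha = 100.0, 10.0, 200.0            # invented true parameters
x = rng.normal(mu, sigma, 5000) + rng.exponential(alpha, 5000)

# MLE via exponnorm, parameterized as K = alpha/sigma, loc = mu, scale = sigma.
K, loc, scale = exponnorm.fit(x, 10.0)           # 10.0 is a starting guess for K
mu_hat, sigma_hat, alpha_hat = loc, scale, K * scale

# Corrected intensity: E[signal | observed], which is always positive.
mu_sx = x - mu_hat - sigma_hat**2 / alpha_hat
z = mu_sx / sigma_hat
corrected = mu_sx + sigma_hat * norm.pdf(z) / norm.cdf(z)
print(f"mu = {mu_hat:.1f}, sigma = {sigma_hat:.1f}, alpha = {alpha_hat:.1f}; "
      f"min corrected = {corrected.min():.3g} > 0")
```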
