We have developed a modified Patient Rule-Induction Method (PRIM) as an alternative strategy for analyzing representative samples of non-experimental human data to estimate and test the role of genomic variations as predictors of disease risk in etiologically heterogeneous sub-samples. The proposed strategy hits a computational limit when the number of genomic variations (predictor variables) under study is large (> 500), because permutations are used to generate the null distribution for testing the significance of each term (defined by values of particular variables) that characterizes a sub-sample of individuals during the peeling and pasting processes. As an alternative, in this paper we introduce a theoretical strategy for quickly calculating Type I and Type II errors when evaluating terms in the peeling and pasting processes of a PRIM analysis; with a permutation-based hypothesis test, the former is underestimated and the latter is unavailable. The resulting savings in computational time make it possible to consider larger numbers of genomic variations (an example genome-wide association study is given) when selecting statistically significant terms for PRIM prediction models.
PRIM; classification; GWAS
Novel uses of automated flow cytometry technology for measuring levels of protein markers on thousands to millions of cells are driving an increasing need for relevant, customized Bayesian mixture modelling approaches in many areas of biomedical research and application. In immune profiling studies across many biological areas, traditional flow cytometry measures relative levels of abundance of marker proteins using fluorescently labeled tags that identify specific markers by a single color. One specific and important recent development in this area is the use of combinatorial marker assays, in which each marker is targeted with a probe that is labeled with two or more fluorescent tags. The use of several colors enables the identification of, in principle, combinatorially increasing numbers of subtypes of cells, each identified by a subset of colors. This represents a major advance in the ability to characterize variation in immune responses involving larger numbers of functionally differentiated cell subtypes. We describe novel classes of Markov chain Monte Carlo methods for model fitting that exploit distributed GPU (graphics processing unit) implementation. We discuss issues of cellular subtype identification in this novel, general model framework, and provide a detailed example using simulated data. We then describe application to a data set from an experimental study of antigen-specific T-cell subtyping using combinatorially encoded assays in human blood samples. Summary comments discuss broader questions in applications in immunology, and aspects of statistical computation.
Dirichlet process mixtures; GPU computing; Hierarchical model; Immune profiling; Immune response biomarkers; Large data sets; Markov chain Monte Carlo; Massive mixture models; Multimers; Posterior simulation; Relabeling; T-cell subtyping
Cancer patients often develop multiple malignancies, which may be either metastatic spread of a previous cancer (clonal tumors) or new primary cancers (independent tumors). If the diagnosis cannot easily be made from pathology review, the patterns of somatic mutations in the tumors can be compared. Previously we developed statistical methods for testing the clonality of two tumors using their loss of heterozygosity (LOH) profiles at several candidate markers. These methods can be applied to all possible pairs of tumors when multiple tumors are analyzed, but this strategy can lead to inconsistent results and a loss of statistical power. In this work we extend clonality tests to three or more malignancies from the same patient. A non-parametric test can be performed using any possible subset of tumors, with subsequent adjustment for multiple testing. A parametric likelihood model is developed for three or four tumors, and it can be used to estimate the phylogenetic tree of the tumors. The proposed tests are more powerful than the combination of all possible pairwise tests.
clonality; LOH; metastasis; multiple tumors; concordant mutations test; likelihood ratio test
High-throughput expression profiling allows the simultaneous measurement of tens of thousands of genes. These data have motivated the development of reliable biomarkers for identifying and diagnosing disease subtypes. Many methods have been developed in the literature for analyzing such data, including diagonal discriminant analysis, support vector machines, and k-nearest neighbor methods. Diagonal discriminant methods have been shown to perform well for high-dimensional data with small sample sizes, but despite their popularity, their independence assumption is unlikely to hold in practice. Recently, a gene-module-based linear discriminant analysis strategy was proposed that utilizes the correlation among genes in discriminant analysis. However, that approach can be underpowered when the sample sizes of the two classes are unbalanced. In this paper, we propose to correct the biases in the discriminant scores of block-diagonal discriminant analysis. In simulation studies, our proposed method outperforms other approaches in various settings. We also illustrate our proposed discriminant analysis method on microarray data studies.
bias-correction; block-diagonal; classification; high-dimensional data; linear discriminant analysis
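As background for the abstract above, a minimal sketch of the baseline rule may help. This is plain diagonal LDA under the independence assumption the abstract criticizes; the paper's block-diagonal structure and its bias correction for unbalanced classes are not reproduced here, and the data and class statistics below are invented for illustration.

```python
import math

def diag_lda_scores(x, class_stats):
    """Score a sample under diagonal LDA: features are treated as
    independent Gaussians, so each class log-likelihood is a sum of
    per-feature terms plus the log prior."""
    scores = {}
    for label, (means, variances, prior) in class_stats.items():
        ll = math.log(prior)
        for xj, mj, vj in zip(x, means, variances):
            ll += -0.5 * (xj - mj) ** 2 / vj - 0.5 * math.log(2 * math.pi * vj)
        scores[label] = ll
    return scores

# Invented two-class example with unbalanced priors (0.8 vs 0.2).
stats = {
    "A": ([0.0, 0.0], [1.0, 1.0], 0.8),
    "B": ([2.0, 2.0], [1.0, 1.0], 0.2),
}
scores = diag_lda_scores([1.9, 2.1], stats)
predicted = max(scores, key=scores.get)
```

With unbalanced priors or sample sizes, scores like these become systematically biased toward the larger class, which is the problem the paper's correction targets.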
DNA methylation is a well-recognized epigenetic mechanism that has been the subject of a growing body of literature typically focused on the identification and study of profiles of DNA methylation and their association with human diseases and exposures. In recent years, a number of unsupervised clustering algorithms, both parametric and non-parametric, have been proposed for clustering large-scale DNA methylation data. However, most of these approaches do not incorporate known biological relationships of measured features, and in some cases, rely on unrealistic assumptions regarding the nature of DNA methylation. Here, we propose a modified version of a recursively partitioned mixture model (RPMM) that integrates information related to the proximity of CpG loci within the genome to inform correlation structures on which subsequent clustering analysis is based. Using simulations and four methylation data sets, we demonstrate that integrating biologically informative correlation structures within RPMM resulted in improved goodness-of-fit, clustering consistency, and the ability to detect biologically meaningful clusters compared to methods that ignore such correlation. Integrating biologically-informed correlation structures to enhance modeling techniques is motivated by the rapid increase in resolution of DNA methylation microarrays and the increasing understanding of the biology of this epigenetic mechanism.
finite mixture models; epigenetics; genomic data; model-based clustering
RNA sequencing (RNA-Seq) is the current method of choice for characterizing transcriptomes and quantifying gene expression changes. This next-generation sequencing-based method provides unprecedented depth and resolution. The negative binomial (NB) probability distribution has been shown to be a useful model for frequencies of mapped RNA-Seq reads and consequently provides a basis for statistical analysis of gene expression. Negative binomial exact tests are available for two-group comparisons but do not extend to negative binomial regression analysis, which is important for examining gene expression as a function of explanatory variables and for adjusted group comparisons accounting for other factors. We address the adequacy of available large-sample tests for the small sample sizes typically available from RNA-Seq studies and consider a higher-order asymptotic (HOA) adjustment to likelihood ratio tests. We demonstrate that 1) the HOA-adjusted likelihood ratio test is practically indistinguishable from the exact test in situations where the exact test is available, 2) the type I error of the HOA test matches the nominal specification in regression settings we examined via simulation, and 3) the power of the likelihood ratio test does not appear to be affected by the HOA adjustment. This work helps clarify the accuracy of the unadjusted likelihood ratio test and the degree of improvement available with the HOA adjustment. Furthermore, the HOA test may be preferable even when the exact test is available because it does not require ad hoc library size adjustments.
RNA-Seq; higher-order asymptotics; negative binomial; regression; overdispersion; extra-Poisson variation
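The unadjusted likelihood ratio test that the abstract takes as its starting point can be sketched as follows. This is a minimal two-group NB likelihood ratio test with the dispersion treated as known; the HOA adjustment itself and library size normalization are not shown, and the counts are invented.

```python
from math import lgamma, log

def nb_loglik(counts, mu, phi):
    """NB log-likelihood with mean mu > 0 and dispersion phi, so that
    variance = mu + phi * mu**2 (the usual RNA-Seq parameterization)."""
    r = 1.0 / phi
    return sum(lgamma(y + r) - lgamma(r) - lgamma(y + 1)
               + r * log(r / (r + mu)) + y * log(mu / (r + mu))
               for y in counts)

def lrt_two_group(g1, g2, phi):
    """Likelihood ratio statistic for H0: equal means vs group-specific
    means, with dispersion phi treated as known. Under H0 it is
    approximately chi-squared with 1 df; the HOA adjustment refines
    exactly this large-sample approximation."""
    mu0 = sum(g1 + g2) / len(g1 + g2)
    ll0 = nb_loglik(g1, mu0, phi) + nb_loglik(g2, mu0, phi)
    ll1 = (nb_loglik(g1, sum(g1) / len(g1), phi)
           + nb_loglik(g2, sum(g2) / len(g2), phi))
    return 2.0 * (ll1 - ll0)

# Invented counts for one gene in two groups of three samples each.
stat = lrt_two_group([10, 12, 9], [30, 28, 35], phi=0.1)
```

With three samples per group, the chi-squared approximation invoked here is exactly where small-sample accuracy becomes questionable, motivating the HOA correction.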
Despite their importance in biology and biomedicine, binary traits that change over time have not been well explored in genetic mapping. In this article, we develop a statistical model for mapping quantitative trait loci (QTLs) that govern longitudinal responses of binary traits. The model is constructed within the maximum likelihood framework, in which the association between binary responses is modeled in terms of conditional log odds-ratios. With this parameterization, the maximum likelihood estimates (MLEs) of the marginal mean parameters are robust to misspecification of the time dependence. We implement an iterative procedure to obtain the MLEs of the QTL genotype-specific parameters that define longitudinal binary responses. The usefulness of the model was validated by analyzing a real example in rice. Simulation studies were performed to investigate the statistical properties of the model, showing that it has power to identify and map specific QTLs responsible for the temporal pattern of binary traits.
binary trait; dynamic trait; functional mapping; maximum likelihood estimate
Genetic and other scientific studies routinely generate very many predictor variables, which can be naturally grouped, with predictors in the same groups being highly correlated. It is desirable to incorporate the hierarchical structure of the predictor variables into generalized linear models for simultaneous variable selection and coefficient estimation. We propose two prior distributions: hierarchical Cauchy and double-exponential distributions, on coefficients in generalized linear models. The hierarchical priors include both variable-specific and group-specific tuning parameters, thereby not only adopting different shrinkage for different coefficients and different groups but also providing a way to pool the information within groups. We fit generalized linear models with the proposed hierarchical priors by incorporating flexible expectation-maximization (EM) algorithms into the standard iteratively weighted least squares as implemented in the general statistical package R. The methods are illustrated with data from an experiment to identify genetic polymorphisms for survival of mice following infection with Listeria monocytogenes. The performance of the proposed procedures is further assessed via simulation studies. The methods are implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Adaptive Lasso; Bayesian inference; Generalized linear model; Genetic polymorphisms; Grouped variables; Hierarchical model; High-dimensional data; Shrinkage prior
Sequence data from a series of homologous DNA segments from related organisms are typically polymorphic at many sites, and these polymorphisms are the result of evolutionary processes. Such data may be used to estimate the substitution rates as well as the variability of these rates. Careful characterization of the distribution of this variation is essential for accurate estimation of evolutionary distances and phylogeny reconstruction among these sequences. Many researchers have recognized the importance of the variability of substitution rates, which most have modeled using a discrete gamma distribution. Some have extended these methods to explicitly account for the correlation of substitution rates among sites using hidden Markov models; others have proposed context-dependent substitution rate schemes. We accommodate these correlations using a composite likelihood method based on a bivariate gamma distribution, which is more flexible than hidden Markov models in terms of correlation structure and more computationally tractable compared to the context-dependent schemes. We show that the estimates have good theoretical properties. We also use simulations to compare the maximum composite likelihood estimates to those obtained from maximum likelihood based on the independence assumption. We use data from the mitochondrial DNA of ten primates to obtain maximum composite likelihood estimates of the mean substitution rate, overdispersion, and correlation parameters, and use these estimates in a parametric phylogenetic bootstrap to assess the impact of serial correlation on the estimates of substitution rates and branch lengths.
bivariate negative binomial distribution; composite likelihood; substitution rate; phylogeny; parametric bootstrap
To extract full information from samples of DNA sequence data, it is necessary to use sophisticated model-based techniques such as importance sampling under the coalescent. However, these are limited in the size of the datasets they can handle efficiently. Chen and Liu (2000) introduced the idea of stopping-time resampling and showed that it can dramatically improve the efficiency of importance sampling methods under a finite-alleles coalescent model. In this paper, a new framework is developed for designing stopping-time resampling schemes under more general models. It is implemented on data from both infinite-sites and stepwise models of mutation, and extended to incorporate crossover recombination. A simulation study shows that this new framework offers a substantial improvement in the accuracy of likelihood estimation over a range of parameters, while a direct application of the scheme of Chen and Liu (2000) can actually degrade the accuracy of the estimate. The method imposes no additional computational burden and is robust to the choice of parameters.
coalescence; importance sampling; resampling; infinite sites
Propensity scores are commonly used to address confounding in observational studies. However, they have not been previously adapted to deal with bias in genetic association studies. We propose an extension of our previous method (Zhao et al., 2009) that uses a multilevel propensity score approach and allows one to estimate the effect of a genotype under an additive model and also simultaneously adjusts for confounders such as genetic ancestry and patient and disease characteristics. Using simulation studies, we demonstrate that this extended genetic propensity score (eGPS) can adequately adjust and consistently correct for bias due to confounding in a variety of circumstances. Under all simulation scenarios, the eGPS method yields estimates with bias close to 0 (mean=0.018, standard error=0.01). Our method also preserves statistical properties such as coverage probability, Type I error, and power. We illustrate this approach in a population-based genetic association study of testicular germ cell tumors and KITLG and SPRY4 susceptibility genes. We conclude that our method provides a novel and broadly applicable analytic strategy for obtaining less biased and more valid estimates of genetic associations.
population-based genetic association; propensity scores; population stratification; confounding; genetic and non-genetic covariates; susceptibility genes
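The stratification idea underlying propensity score adjustment can be illustrated in a few lines. This sketch assumes a binary genotype indicator and propensities that have already been estimated elsewhere (e.g. from ancestry and clinical covariates); the paper's multilevel eGPS for additive (0/1/2) genotypes is not reproduced, and all data below are synthetic.

```python
def stratified_effect(records, n_strata=5):
    """Coarse propensity-score stratification: records are
    (propensity, genotype, outcome) triples with genotype in {0, 1}.
    Subjects are binned by propensity, the genotype effect is estimated
    within each bin, and the bin estimates are averaged with weights
    proportional to bin size."""
    records = sorted(records)                  # order by propensity
    size = len(records) // n_strata
    effects, weights = [], []
    for i in range(n_strata):
        stratum = (records[i * size:(i + 1) * size]
                   if i < n_strata - 1 else records[i * size:])
        carriers = [y for _, g, y in stratum if g == 1]
        others = [y for _, g, y in stratum if g == 0]
        if carriers and others:
            effects.append(sum(carriers) / len(carriers)
                           - sum(others) / len(others))
            weights.append(len(stratum))
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)

# Synthetic data: outcome rises with the confounder, genotype adds 1.0.
records = ([(i / 10, 0, float(i)) for i in range(10)]
           + [(i / 10, 1, float(i) + 1.0) for i in range(10)])
est = stratified_effect(records)
```

A naive pooled comparison would be confounded by the propensity-linked trend; within strata the confounder is roughly constant, so the genotype effect of 1.0 is recovered.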
Most approaches for analyzing ChIP-Seq data focus on inferring exact protein binding sites from a single library. However, frequently multiple ChIP-Seq libraries derived from differing cell lines or tissue types from the same individual may be available. In such a situation, a separate analysis for each tissue or cell line may be inefficient. Here, we describe a novel method to analyze such data that intelligently uses the joint information from multiple related ChIP-Seq libraries. We present our method as a two-stage procedure. First, separate single cell line analysis is performed for each cell line. Here, we use a novel mixture regression approach to infer the subset of genes that are most likely to be involved in protein binding in each cell line. In the second step, we combine the separate single cell line analyses using an Empirical Bayes algorithm that implicitly incorporates inter-cell line correlation. We demonstrate the usefulness of our method using both simulated data and real H3K4me3 and H3K27me3 histone methylation libraries.
Empirical Bayes; EM; ChIP-Seq; histone methylation
Measurements from microarrays and other high-throughput technologies are susceptible to non-biological artifacts like batch effects. It is known that batch effects can alter or obscure the set of significant results and biological conclusions in high-throughput studies. Here we examine the impact of batch effects on predictors built from genomic technologies. To investigate batch effects, we collected publicly available gene expression measurements with known outcomes, and estimated batches using date. Using these data we show (1) the impact of batch effects on prediction depends on the correlation between outcome and batch in the training data, and (2) removing expression measurements most affected by batch before building predictors may improve the accuracy of those predictors. These results suggest that (1) training sets should be designed to minimize correlation between batches and outcome, and (2) methods for identifying batch-affected probes should be developed to improve prediction results for studies with high correlation between batches and outcome.
batch effects; prediction; microarrays; reproducibility; research design
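The filtering idea in the abstract's second suggestion can be sketched simply: score each feature's association with batch, then drop the most batch-affected features before training a predictor. The statistic and cutoff below are illustrative choices, not the paper's procedure, and the toy expression matrix is invented.

```python
def batch_t(values, batches):
    """Welch-style t statistic measuring how strongly one feature
    (a gene's expression across arrays) tracks batch membership.
    A tiny constant guards against zero variance."""
    a = [v for v, b in zip(values, batches) if b == 0]
    c = [v for v, b in zip(values, batches) if b == 1]
    ma, mc = sum(a) / len(a), sum(c) / len(c)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vc = sum((x - mc) ** 2 for x in c) / (len(c) - 1)
    return (ma - mc) / ((va / len(a) + vc / len(c)) ** 0.5 + 1e-12)

def drop_batch_affected(expr, batches, k):
    """Return indices of features kept after removing the k features
    most associated with batch (one illustrative way to pre-filter
    before building a predictor)."""
    ranked = sorted((abs(batch_t(row, batches)), i)
                    for i, row in enumerate(expr))
    return sorted(i for _, i in ranked[:len(expr) - k])

# Feature 0 shifts sharply between batches; feature 1 does not.
expr = [[0, 0, 0, 5, 5, 5],
        [1, 2, 1, 2, 1, 2]]
batches = [0, 0, 0, 1, 1, 1]
keep = drop_batch_affected(expr, batches, k=1)
```

As the abstract notes, this kind of filtering helps most when batch and outcome are correlated in the training data, which is exactly when batch-affected features masquerade as predictive ones.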
For a better understanding of the biology of an organism, a complete description is needed of all regions of the genome that are actively transcribed. Tiling arrays are used for this purpose. They allow for the discovery of novel transcripts and the assessment of differential expression between two or more experimental conditions such as genotype, treatment, tissue, etc. In the tiling array literature, many efforts are devoted to transcript discovery, whereas more recent developments also focus on differential expression. To our knowledge, however, no methods for tiling arrays have been described that can simultaneously assess transcript discovery and identify differentially expressed transcripts. In this paper, we adapt wavelet-based functional models to the context of tiling arrays. The high dimensionality of the data led us to avoid inference based on Bayesian MCMC methods. Instead, we introduce a fast empirical Bayes method that provides adaptive regularization of the functional effects. A simulation study and a case study illustrate that our approach is well suited for the simultaneous assessment of transcript discovery and differential expression in tiling array studies, and that it outperforms methods that accomplish only one of these tasks.
tiling microarray; wavelets; adaptive regularization; transcript discovery; differential expression; genomics; Arabidopsis thaliana
Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date.
de novo mutations; pedigree; short-read data; mutation rates; trio model
We present a method for improving the power of linkage analysis by detecting chromosome segments shared identical by descent (IBD) by individuals not known to be related. Existing Markov chain Monte Carlo methods sample descent patterns on pedigrees conditional on observed marker data. These patterns can be stored as IBD graphs, which express shared ancestry only, rather than specific family relationships. A model for IBD between unrelated individuals allows the estimation of coancestry between individuals in different pedigrees. IBD graphs on separate pedigrees can then be combined using these estimates. We report results from analyses of three sets of simulated marker data on two different pedigrees. We show that when families share a gene for a trait due to shared ancestry on the order of tens of generations, our method can detect a linkage signal when independent analyses of the families do not.
linkage; pedigrees; gene coancestry; IBD
Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within biological pathways, the incorporation of prior pathways information into a statistical model is expected to increase the power to detect true associations in a genetic association study. Most existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the dependence patterns among SNPs within pathways.
We use a sparse regression model, with SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait. Notable features of our “pathways group lasso with adaptive weights” (P-GLAW) algorithm include the incorporation of all pathways in a single regression model, an adaptive pathway weighting procedure that accounts for factors biasing pathway selection, and the use of a bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes account of the presence of overlapping pathways and uses a novel combination of techniques to optimise model estimation, making it fast to run, even on whole genome datasets.
In a comparison study with an alternative pathways method based on univariate SNP statistics, our method demonstrates high sensitivity and specificity for the detection of important pathways, showing the greatest relative gains in performance where marginal SNP effect sizes are small.
pathways; GWAS; quantitative traits; group lasso; penalised regression; Alzheimer’s disease; imaging genetics
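The core operation in any group lasso solver is the group soft-thresholding step, which shrinks each SNP group jointly and drops whole groups whose norm falls below a threshold. The sketch below shows only this proximal operator; P-GLAW's adaptive weighting scheme, overlap handling, and bootstrap ranking are not reproduced, and the numbers are invented.

```python
from math import sqrt

def group_soft_threshold(beta, lam, weight):
    """Proximal operator of a weighted group lasso penalty: rescale the
    group's coefficient vector toward zero, and zero it out entirely if
    its Euclidean norm is below lam * weight. The weight is where an
    adaptive scheme could counteract factors biasing group selection."""
    norm = sqrt(sum(b * b for b in beta))
    scale = max(0.0, 1.0 - lam * weight / norm) if norm > 0 else 0.0
    return [scale * b for b in beta]

shrunk = group_soft_threshold([3.0, 4.0], lam=1.0, weight=1.0)  # norm 5, kept
zeroed = group_soft_threshold([0.1, 0.1], lam=1.0, weight=1.0)  # group dropped
```

Because the whole group is thresholded by its joint norm, SNPs in a selected pathway enter or leave the model together, which is the behavior that makes the penalty suitable for pathway selection.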
Varying depth of high-throughput sequencing reads along a chromosome makes it possible to observe copy number variants (CNVs) in a sample relative to a reference. In exome and other targeted sequencing projects, technical factors increase variation in read depth while reducing the number of observed locations, adding difficulty to the problem of identifying CNVs. We present a hidden Markov model for detecting CNVs from raw read count data, using background read depth from a control set as well as other positional covariates such as GC-content. The model, exomeCopy, is applied to a large chromosome X exome sequencing project, identifying a list of large unique CNVs. CNVs predicted by the model and experimentally validated are then recovered using a cross-platform control set from publicly available exome sequencing data. Simulations show high sensitivity for detecting heterozygous and homozygous CNVs, outperforming normalization and state-of-the-art segmentation methods.
exome sequencing; targeted sequencing; CNV; copy number variant; HMM; hidden Markov model
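The decoding step of a read-depth CNV HMM can be sketched with a plain Viterbi pass. This simplification uses Poisson emissions whose means are copy-state multiples of a background depth; exomeCopy itself additionally models covariates such as GC-content and uses a richer emission model, so treat the states, transition probabilities, and counts below as invented.

```python
from math import log, lgamma

def pois_logpmf(k, lam):
    """Poisson log-probability of count k with mean lam."""
    return k * log(lam) - lam - lgamma(k + 1)

def viterbi_cnv(counts, background, states=(0.01, 0.5, 1.0, 1.5), stay=0.99):
    """Viterbi decoding of copy-number states from raw read counts.
    Each state is a multiplier on the background depth (1.0 = normal
    two copies, 0.5 = heterozygous deletion, etc.); `stay` is the
    self-transition probability that discourages spurious state changes."""
    n = len(states)
    switch = log((1 - stay) / (n - 1))
    trans = [[log(stay) if i == j else switch for j in range(n)]
             for i in range(n)]
    v = [pois_logpmf(counts[0], states[s] * background[0]) for s in range(n)]
    back = []
    for t in range(1, len(counts)):
        nv, bp = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: v[i] + trans[i][j])
            nv.append(v[best] + trans[best][j]
                      + pois_logpmf(counts[t], states[j] * background[t]))
            bp.append(best)
        v = nv
        back.append(bp)
    path = [max(range(n), key=lambda s: v[s])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Depth drops to about half the background over positions 2-4,
# consistent with a heterozygous deletion (state index 1).
background = [100.0] * 6
counts = [102, 95, 50, 48, 52, 101]
path = viterbi_cnv(counts, background)
```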
In studies of dynamic molecular networks in systems biology, experiments are increasingly exploiting technologies such as flow cytometry to generate data on marginal distributions of a few network nodes at snapshots in time. For example, levels of intracellular expression of a few genes, or cell surface protein markers, can be assayed at a series of interim time points and assumed steady-states under experimentally stimulated growth conditions in small cellular systems. Such marginal data on a small number of cellular markers will typically carry very limited information on the parameters and structure of dynamic network models, though experiments will typically be designed to expose variation in cellular phenotypes that are inherently related to some aspects of model parametrization and structure. Our work addresses statistical questions of how to integrate such data with dynamic stochastic models in order to properly quantify the information—or lack of information—it carries relative to models assumed. We present a Bayesian computational strategy coupled with a novel approach to summarizing and numerically characterizing biological phenotypes that are represented in terms of the resulting sample distributions of cellular markers. We build on Bayesian simulation methods and mixture modeling to define the approach to linking mechanistic mathematical models of network dynamics to snapshot data, using a toggle switch example integrating simulated and real data as context.
approximate Bayesian computation (ABC); biological signatures; dynamic stochastic network models; flow cytometry data; posterior simulation; synthetic gene circuit; systems biology; toggle switch model
Pathway or gene set analysis has become an increasingly popular approach for analyzing high-throughput biological experiments such as microarray gene expression studies. The purpose of pathway analysis is to identify differentially expressed pathways associated with outcomes. Important challenges in pathway analysis are selecting a subset of genes contributing most to association with clinical phenotypes and conducting statistical tests of association for the pathways efficiently. We propose a two-stage analysis strategy: (1) extract latent variables representing activities within each pathway using a dimension reduction approach based on adaptive elastic-net sparse principal component analysis; (2) integrate the latent variables with the regression modeling framework to analyze studies with different types of outcomes such as binary, continuous or survival outcomes. Our proposed approach is computationally efficient. For each pathway, because the latent variables are estimated in an unsupervised fashion without using disease outcome information, in the sample label permutation testing procedure, the latent variables only need to be calculated once rather than for each permutation resample. Using both simulated and real datasets, we show our approach performed favorably when compared with five other currently available pathway testing methods.
gene expression; microarray; pathway analysis; sparse principal component analysis
Gene perturbation experiments are commonly used for the reconstruction of gene regulatory networks. Typical experimental methodology imposes persistent changes on the network. The resulting data must therefore be interpreted as a steady state from an altered gene regulatory network, rather than a direct observation of the original network. In this article an implicit modeling methodology is proposed in which the unperturbed network of interest is scored by first modeling the persistent perturbation, then predicting the steady state, which may then be compared to the observed data. This results in a many-to-one inverse problem, so a computational Bayesian approach is used to assess model uncertainty.
The methodology is first demonstrated on a number of synthetic networks. It is shown that the Bayesian approach correctly assigns high posterior probability to the network structure and steady state behavior. Further, it is demonstrated that where uncertainty of model features is indicated, the uncertainty may be accurately resolved with further perturbation experiments. The methodology is then applied to the modeling of a gene regulatory network using perturbation data from nine genes which have been shown to respond synergistically to known oncogenic mutations. A hypothetical model emerges which conforms to reported regulatory properties of these genes. Furthermore, the Bayesian methodology is shown to be consistent in the sense that multiple randomized applications of the fitting algorithm converge to an approximately common posterior density on the space of models. Such consistency is generally not feasible for algorithms which report only single models. We conclude that fully Bayesian methods, coupled with models which accurately account for experimental constraints, are a suitable tool for the inference of gene regulatory networks, in terms of accuracy, estimation of model uncertainty, and experimental design.
boolean network; gene perturbation; Bayesian modeling
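The abstract's central point, that perturbation data reflect a steady state of an altered network rather than the original one, can be made concrete with a tiny Boolean network. The sketch below clamps the perturbed gene and iterates to a fixed point; the network, clamping choice, and update scheme are invented for illustration and are far simpler than the paper's Bayesian scoring.

```python
def steady_state(rules, init, clamp, max_iter=100):
    """Iterate a Boolean network synchronously to a fixed point while
    holding the perturbed genes at their clamped values (a persistent
    perturbation). Returns the fixed point, or None on a limit cycle.
    `rules` maps each gene to a function of the full state dict."""
    state = dict(init)
    state.update(clamp)
    for _ in range(max_iter):
        new = {g: f(state) for g, f in rules.items()}
        new.update(clamp)                 # perturbation stays in force
        if new == state:
            return state
        state = new
    return None

# Invented cascade: A sustains itself, A activates B, B activates C.
rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}
knockdown = steady_state(rules, {"A": True, "B": True, "C": True},
                         clamp={"A": False})
```

The observed steady state after knocking down A has B and C off as well, so the data inform the wiring of the unperturbed network only through this indirect, many-to-one mapping, which is why the paper resorts to Bayesian model scoring.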
Germline mosaicism is a genetic condition in which some germ cells of an individual contain a mutation. This condition violates the assumptions underlying classic genetic analysis and may lead to failure of such analysis. In this work we extend the statistical model used for genetic linkage analysis in order to incorporate germline mosaicism. We develop a likelihood ratio test for detecting whether a genetic trait has been introduced into a pedigree by germline mosaicism. We analyze the statistical properties of this test and evaluate its performance via computer simulations. We demonstrate that genetic linkage analysis has high power to identify linkage in the presence of germline mosaicism when our extended model is used. We further use this extended model to provide solid statistical evidence that the MDN syndrome studied by Genzer-Nir et al. has been introduced by germline mosaicism.
linkage analysis; germline mosaicism; statistical test; statistical model
In this paper, we develop a Genetic Algorithm that can address the fundamental problem of how one should weight the summary statistics included in an approximate Bayesian computation analysis built around an accept/reject algorithm, and how one might choose the tolerance for that analysis. We then demonstrate that using weighted statistics, and a well-chosen tolerance, in such an approximate Bayesian computation approach can result in improved performance, when compared to unweighted analyses, using one example drawn purely from statistics and two drawn from the estimation of population genetics parameters.
approximate Bayesian computation; genetic algorithms; summary statistics
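The accept/reject algorithm that the abstract's genetic algorithm is built around can be sketched directly. Here the weights and tolerance are fixed by hand rather than tuned by a genetic algorithm, and the toy problem (inferring a normal mean from its sample mean and variance) is invented; down-weighting the uninformative variance statistic illustrates why weighting matters.

```python
import random

def abc_reject(observed, simulate, prior_sample, weights, tol,
               n=5000, seed=1):
    """Accept/reject ABC with a weighted Euclidean distance on summary
    statistics: draw theta from the prior, simulate data, and keep theta
    when the weighted distance to the observed statistics is within tol."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n):
        theta = prior_sample(rng)
        stats = simulate(theta, rng)
        dist = sum(w * (s - o) ** 2
                   for w, s, o in zip(weights, stats, observed)) ** 0.5
        if dist <= tol:
            accepted.append(theta)
    return accepted

def simulate(theta, rng):
    """Simulate 30 N(theta, 1) draws; return (sample mean, sample var)."""
    xs = [rng.gauss(theta, 1.0) for _ in range(30)]
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return (m, v)

# Observed statistics (2.0, 1.0); the variance carries no information
# about theta, so it receives a small weight.
post = abc_reject((2.0, 1.0), simulate,
                  lambda rng: rng.uniform(-5.0, 5.0),
                  weights=(1.0, 0.1), tol=0.3)
```

The posterior sample concentrates near theta = 2; searching over `weights` and `tol` to optimize such behavior is exactly the role the paper assigns to its genetic algorithm.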