To extract full information from samples of DNA sequence data, it is necessary to use sophisticated model-based techniques such as importance sampling under the coalescent. However, these are limited in the size of datasets they can handle efficiently. Chen and Liu (2000) introduced the idea of stopping-time resampling and showed that it can dramatically improve the efficiency of importance sampling methods under a finite-alleles coalescent model. In this paper, a new framework is developed for designing stopping-time resampling schemes under more general models. It is implemented on data from both infinite-sites and stepwise models of mutation, and extended to incorporate crossover recombination. A simulation study shows that this new framework offers a substantial improvement in the accuracy of likelihood estimation over a range of parameters, while a direct application of the scheme of Chen and Liu (2000) can actually diminish that accuracy. The method imposes no additional computational burden and is robust to the choice of parameters.
coalescence; importance sampling; resampling; infinite sites
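As a rough illustration of the resampling idea above: the paper resamples at carefully designed stopping times, whereas this sketch uses the common effective-sample-size trigger as a stand-in criterion; all names and the toy weights are hypothetical.

```python
import random

def effective_sample_size(weights):
    # ESS = (sum w)^2 / sum w^2; a small ESS signals weight degeneracy,
    # the problem resampling is meant to correct
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

def resample(particles, weights, rng):
    # Multinomial resampling: draw particles in proportion to their weights,
    # then reset every weight to the mean so the estimator stays unbiased
    mean_w = sum(weights) / len(weights)
    new = rng.choices(particles, weights=weights, k=len(particles))
    return new, [mean_w] * len(particles)

rng = random.Random(1)
particles = list(range(8))
weights = [0.001] * 7 + [1.0]  # severely degenerate toy weights
if effective_sample_size(weights) < len(weights) / 2:
    particles, weights = resample(particles, weights, rng)
```

In a coalescent importance sampler the particles would be partial genealogies; the point of a stopping-time scheme is to choose comparable moments at which to apply the resampling step.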
Propensity scores are commonly used to address confounding in observational studies. However, they have not been previously adapted to deal with bias in genetic association studies. We propose an extension of our previous method (Zhao et al., 2009) that uses a multilevel propensity score approach and allows one to estimate the effect of a genotype under an additive model and also simultaneously adjusts for confounders such as genetic ancestry and patient and disease characteristics. Using simulation studies, we demonstrate that this extended genetic propensity score (eGPS) can adequately adjust and consistently correct for bias due to confounding in a variety of circumstances. Under all simulation scenarios, the eGPS method yields estimates with bias close to 0 (mean=0.018, standard error=0.01). Our method also preserves statistical properties such as coverage probability, Type I error, and power. We illustrate this approach in a population-based genetic association study of testicular germ cell tumors and KITLG and SPRY4 susceptibility genes. We conclude that our method provides a novel and broadly applicable analytic strategy for obtaining less biased and more valid estimates of genetic associations.
population-based genetic association; propensity scores; population stratification; confounding; genetic and non-genetic covariates; susceptibility genes
Most approaches for analyzing ChIP-Seq data are focused on inferring exact protein binding sites from a single library. However, frequently multiple ChIP-Seq libraries derived from differing cell lines or tissue types from the same individual may be available. In such a situation, a separate analysis for each tissue or cell line may be inefficient. Here, we describe a novel method to analyze such data that intelligently uses the joint information from multiple related ChIP-Seq libraries. We present our method as a two-stage procedure. First, separate single cell line analysis is performed for each cell line. Here, we use a novel mixture regression approach to infer the subset of genes that are most likely to be involved in protein binding in each cell line. In the second step, we combine the separate single cell line analyses using an Empirical Bayes algorithm that implicitly incorporates inter-cell line correlation. We demonstrate the usefulness of our method using both simulated data, as well as real H3K4me3 and H3K27me3 histone methylation libraries.
Empirical Bayes; EM; ChIP-Seq; histone methylation
Measurements from microarrays and other high-throughput technologies are susceptible to non-biological artifacts like batch effects. It is known that batch effects can alter or obscure the set of significant results and biological conclusions in high-throughput studies. Here we examine the impact of batch effects on predictors built from genomic technologies. To investigate batch effects, we collected publicly available gene expression measurements with known outcomes, and estimated batches using date. Using these data we show (1) the impact of batch effects on prediction depends on the correlation between outcome and batch in the training data, and (2) removing expression measurements most affected by batch before building predictors may improve the accuracy of those predictors. These results suggest that (1) training sets should be designed to minimize correlation between batches and outcome, and (2) methods for identifying batch-affected probes should be developed to improve prediction results for studies with high correlation between batches and outcome.
batch effects; prediction; microarrays; reproducibility; research design
For a better understanding of the biology of an organism, a complete description is needed of all regions of the genome that are actively transcribed. Tiling arrays are used for this purpose. They allow for the discovery of novel transcripts and the assessment of differential expression between two or more experimental conditions such as genotype, treatment, tissue, etc. In the tiling array literature, many efforts are devoted to transcript discovery, whereas more recent developments also focus on differential expression. To our knowledge, however, no methods for tiling arrays have been described that can simultaneously assess transcript discovery and identify differentially expressed transcripts. In this paper, we adapt wavelet-based functional models to the context of tiling arrays. The high dimensionality of the data triggered us to avoid inference based on Bayesian MCMC methods. Instead, we introduce a fast empirical Bayes method that provides adaptive regularization of the functional effects. A simulation study and a case study illustrate that our approach is well suited for the simultaneous assessment of transcript discovery and differential expression in tiling array studies, and that it outperforms methods that accomplish only one of these tasks.
tiling microarray; wavelets; adaptive regularization; transcript discovery; differential expression; genomics; Arabidopsis thaliana
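To make the wavelet machinery concrete, here is a minimal sketch of one Haar transform level plus soft-thresholding of detail coefficients; this is a generic illustration, not the paper's empirical Bayes regularization, and all names are hypothetical.

```python
import math

def haar_step(signal):
    # One level of the orthonormal Haar transform: pairwise sums (smooth part)
    # and pairwise differences (detail part), each scaled by 1/sqrt(2)
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def soft_threshold(x, t):
    # Shrink small detail coefficients toward zero; the paper's adaptive
    # empirical Bayes regularization plays an analogous smoothing role
    return 0.0 if abs(x) <= t else (x - t if x > 0 else x + t)
```

Applied along a chromosome, a flat signal yields zero detail coefficients, so thresholding suppresses probe-level noise while preserving genuine transcript boundaries.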
Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date.
de novo mutations; pedigree; short-read data; mutation rates; trio model
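The Mendelian-violation signal mentioned above can be sketched with a simple trio check; this is a deterministic toy version, not the paper's probabilistic posterior over genotypes, and the function name is hypothetical.

```python
def mendelian_consistent(mother, father, child):
    # A diploid genotype is a pair of alleles; the trio obeys Mendelian
    # inheritance iff the child could take one allele from each parent
    for m in mother:
        for f in father:
            if sorted((m, f)) == sorted(child):
                return True
    return False

# ("A","A") x ("A","G") parents cannot produce a ("G","G") child; at a
# confidently called site, that pattern suggests a de novo mutation
# (or a sequencing error, which the full model weighs probabilistically)
```

The paper's framework replaces this hard rule with posterior probabilities, so sequencing error and de novo mutation compete as explanations at each site.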
We present a method for improving the power of linkage analysis by detecting chromosome segments shared identical by descent (IBD) by individuals not known to be related. Existing Markov chain Monte Carlo methods sample descent patterns on pedigrees conditional on observed marker data. These patterns can be stored as IBD graphs, which express shared ancestry only, rather than specific family relationships. A model for IBD between unrelated individuals allows the estimation of coancestry between individuals in different pedigrees. IBD graphs on separate pedigrees can then be combined using these estimates. We report results from analyses of three sets of simulated marker data on two different pedigrees. We show that when families share a gene for a trait due to shared ancestry on the order of tens of generations, our method can detect a linkage signal when independent analyses of the families do not.
linkage; pedigrees; gene coancestry; IBD
Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within biological pathways, the incorporation of prior pathways information into a statistical model is expected to increase the power to detect true associations in a genetic association study. Most existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the dependence patterns among SNPs within pathways.
We use a sparse regression model, with SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait. Notable features of our “pathways group lasso with adaptive weights” (P-GLAW) algorithm include the incorporation of all pathways in a single regression model, an adaptive pathway weighting procedure that accounts for factors biasing pathway selection, and the use of a bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes account of the presence of overlapping pathways and uses a novel combination of techniques to optimise model estimation, making it fast to run, even on whole genome datasets.
In a comparison study with an alternative pathways method based on univariate SNP statistics, our method demonstrates high sensitivity and specificity for the detection of important pathways, showing the greatest relative gains in performance where marginal SNP effect sizes are small.
pathways; GWAs; quantitative traits; group lasso; penalised regression; Alzheimer’s disease; imaging genetics
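The group-level selection behavior of a group lasso can be sketched via its proximal operator, block soft-thresholding; this generic step is not the full P-GLAW algorithm (which adds adaptive weights, overlap handling, and bootstrap ranking), and the names are hypothetical.

```python
import math

def group_soft_threshold(beta_group, penalty):
    # Block soft-thresholding, the proximal step behind the group lasso:
    # the group's L2 norm is shrunk, so a pathway's SNP coefficients are
    # either all zeroed out (pathway excluded) or all kept and shrunk
    norm = math.sqrt(sum(b * b for b in beta_group))
    if norm <= penalty:
        return [0.0] * len(beta_group)
    scale = 1 - penalty / norm
    return [scale * b for b in beta_group]
```

This all-or-nothing behavior at the group level is what lets the method select whole pathways rather than isolated SNPs.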
Varying depth of high-throughput sequencing reads along a chromosome makes it possible to observe copy number variants (CNVs) in a sample relative to a reference. In exome and other targeted sequencing projects, technical factors increase variation in read depth while reducing the number of observed locations, adding difficulty to the problem of identifying CNVs. We present a hidden Markov model for detecting CNVs from raw read count data, using background read depth from a control set as well as other positional covariates such as GC-content. The model, exomeCopy, is applied to a large chromosome X exome sequencing project identifying a list of large unique CNVs. CNVs predicted by the model and experimentally validated are then recovered using a cross-platform control set from publicly available exome sequencing data. Simulations show high sensitivity for detecting heterozygous and homozygous CNVs, outperforming normalization and state-of-the-art segmentation methods.
exome sequencing; targeted sequencing; CNV; copy number variant; HMM; hidden Markov model
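The decoding step of such an HMM can be sketched with generic Viterbi decoding over copy-number states; note that exomeCopy's actual emission model ties read counts to background depth and GC-content, which is omitted here, and the names are hypothetical.

```python
def viterbi(obs_loglik, log_trans, log_init):
    # obs_loglik[t][s]: log-likelihood of the read count at position t
    # under copy-number state s; returns the most probable state path
    n_states = len(log_init)
    v = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(obs_loglik)):
        ptr, nv = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: v[r] + log_trans[r][s])
            ptr.append(best)
            nv.append(v[best] + log_trans[best][s] + obs_loglik[t][s])
        v, back = nv, back + [ptr]
    # Backtrack from the best final state to recover the path
    path = [max(range(n_states), key=lambda s: v[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With sticky transition probabilities, isolated noisy positions are absorbed into the surrounding state, which is why HMM segmentation is robust to the high read-depth variance of targeted data.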
In studies of dynamic molecular networks in systems biology, experiments are increasingly exploiting technologies such as flow cytometry to generate data on marginal distributions of a few network nodes at snapshots in time. For example, levels of intracellular expression of a few genes, or cell surface protein markers, can be assayed at a series of interim time points and assumed steady-states under experimentally stimulated growth conditions in small cellular systems. Such marginal data on a small number of cellular markers will typically carry very limited information on the parameters and structure of dynamic network models, though experiments will typically be designed to expose variation in cellular phenotypes that are inherently related to some aspects of model parametrization and structure. Our work addresses statistical questions of how to integrate such data with dynamic stochastic models in order to properly quantify the information—or lack of information—it carries relative to models assumed. We present a Bayesian computational strategy coupled with a novel approach to summarizing and numerically characterizing biological phenotypes that are represented in terms of the resulting sample distributions of cellular markers. We build on Bayesian simulation methods and mixture modeling to define the approach to linking mechanistic mathematical models of network dynamics to snapshot data, using a toggle switch example integrating simulated and real data as context.
approximate Bayesian computation (ABC); biological signatures; dynamic stochastic network models; flow cytometry data; posterior simulation; synthetic gene circuit; systems biology; toggle switch model
Pathway or gene set analysis has become an increasingly popular approach for analyzing high-throughput biological experiments such as microarray gene expression studies. The purpose of pathway analysis is to identify differentially expressed pathways associated with outcomes. Important challenges in pathway analysis are selecting a subset of genes contributing most to association with clinical phenotypes and conducting statistical tests of association for the pathways efficiently. We propose a two-stage analysis strategy: (1) extract latent variables representing activities within each pathway using a dimension reduction approach based on adaptive elastic-net sparse principal component analysis; (2) integrate the latent variables with the regression modeling framework to analyze studies with different types of outcomes such as binary, continuous or survival outcomes. Our proposed approach is computationally efficient. For each pathway, because the latent variables are estimated in an unsupervised fashion without using disease outcome information, in the sample label permutation testing procedure, the latent variables only need to be calculated once rather than for each permutation resample. Using both simulated and real datasets, we show our approach performed favorably when compared with five other currently available pathway testing methods.
gene expression; microarray; pathway analysis; sparse principal component analysis
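The computational shortcut described above — latent variables estimated once, without outcome labels, so only labels are permuted — can be sketched as follows; the association score here is a placeholder absolute covariance, not the paper's regression-based statistic, and all names are hypothetical.

```python
import random

def permutation_pvalue(latent, outcome, n_perm=999, seed=0):
    # The pathway's latent variable is fixed across permutations because it
    # was estimated unsupervised; only the outcome labels are shuffled
    def score(y):
        my, ml = sum(y) / len(y), sum(latent) / len(latent)
        return abs(sum((a - ml) * (b - my) for a, b in zip(latent, y)))
    observed = score(outcome)
    rng = random.Random(seed)
    count = sum(score(rng.sample(outcome, len(outcome))) >= observed
                for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)  # add-one avoids a p-value of zero
```

Because the latent variable never has to be re-estimated inside the loop, the cost per permutation is just one score evaluation.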
Gene perturbation experiments are commonly used for the reconstruction of gene regulatory networks. Typical experimental methodology imposes persistent changes on the network. The resulting data must therefore be interpreted as a steady state from an altered gene regulatory network, rather than a direct observation of the original network. In this article an implicit modeling methodology is proposed in which the unperturbed network of interest is scored by first modeling the persistent perturbation, then predicting the steady state, which may then be compared to the observed data. This results in a many-to-one inverse problem, so a computational Bayesian approach is used to assess model uncertainty.
The methodology is first demonstrated on a number of synthetic networks. It is shown that the Bayesian approach correctly assigns high posterior probability to the network structure and steady state behavior. Further, it is demonstrated that where uncertainty of model features is indicated, the uncertainty may be accurately resolved with further perturbation experiments. The methodology is then applied to the modeling of a gene regulatory network using perturbation data from nine genes which have been shown to respond synergistically to known oncogenic mutations. A hypothetical model emerges which conforms to reported regulatory properties of these genes. Furthermore, the Bayesian methodology is shown to be consistent in the sense that multiple randomized applications of the fitting algorithm converge to an approximately common posterior density on the space of models. Such consistency is generally not feasible for algorithms which report only single models. We conclude that fully Bayesian methods, coupled with models which accurately account for experimental constraints, are a suitable tool for the inference of gene regulatory networks, in terms of accuracy, estimation of model uncertainty, and experimental design.
boolean network; gene perturbation; Bayesian modeling
Germline mosaicism is a genetic condition in which some germ cells of an individual contain a mutation. This condition violates the assumptions underlying classic genetic analysis and may lead to failure of such analysis. In this work we extend the statistical model used for genetic linkage analysis in order to incorporate germline mosaicism. We develop a likelihood ratio test for detecting whether a genetic trait has been introduced into a pedigree by germline mosaicism. We analyze the statistical properties of this test and evaluate its performance via computer simulations. We demonstrate that genetic linkage analysis has high power to identify linkage in the presence of germline mosaicism when our extended model is used. We further use this extended model to provide solid statistical evidence that the MDN syndrome studied by Genzer-Nir et al. has been introduced by germline mosaicism.
linkage analysis; germline mosaicism; statistical test; statistical model
In this paper, we develop a Genetic Algorithm that can address the fundamental problem of how one should weight the summary statistics included in an approximate Bayesian computation analysis built around an accept/reject algorithm, and how one might choose the tolerance for that analysis. We then demonstrate that using weighted statistics, and a well-chosen tolerance, in such an approximate Bayesian computation approach can result in improved performance, when compared to unweighted analyses, using one example drawn purely from statistics and two drawn from the estimation of population genetics parameters.
approximate Bayesian computation; genetic algorithms; summary statistics
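A minimal sketch of the accept/reject ABC setting the paper tunes: the weights and tolerance below are exactly the quantities its genetic algorithm would optimize, but here they are fixed by hand, and the toy model (inferring a normal mean from its sample mean) is an assumption for illustration.

```python
import math, random

def abc_reject(observed_stats, simulate, prior_draw, weights, tol, n, seed=0):
    # Accept a prior draw when the weighted Euclidean distance between its
    # simulated summary statistics and the observed ones is below tol
    rng = random.Random(seed)
    accepted = []
    for _ in range(n):
        theta = prior_draw(rng)
        sim = simulate(theta, rng)
        d = math.sqrt(sum(w * (s - o) ** 2
                          for w, s, o in zip(weights, sim, observed_stats)))
        if d < tol:
            accepted.append(theta)
    return accepted

# toy example: one summary statistic (the sample mean of 20 draws)
post = abc_reject([0.0],
                  simulate=lambda th, r: [sum(r.gauss(th, 1) for _ in range(20)) / 20],
                  prior_draw=lambda r: r.uniform(-5, 5),
                  weights=[1.0], tol=0.3, n=2000)
```

With several statistics of differing informativeness, uniform weights waste tolerance on noisy statistics; that mismatch is what the weight-tuning addresses.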
Gene expression microarray experiments with few replications lead to great variability in estimates of gene variances. Several Bayesian methods have been developed to reduce this variability and to increase power. Thus far, moderated t methods assumed a constant coefficient of variation (CV) for the gene variances. We provide evidence against this assumption, and extend the method by allowing the CV to vary with gene expression. Our CV varying method, which we refer to as the fully moderated t-statistic, was compared to three other methods (ordinary t, and two moderated t predecessors). A simulation study and a familiar spike-in data set were used to assess the performance of the testing methods. The results showed that our CV varying method had higher power than the other three methods, identified a greater number of true positives in spike-in data, fit simulated data under varying assumptions very well, and in a real data set better identified higher expressing genes that were consistent with functional pathways associated with the experiments.
empirical Bayes; microarray data analysis; variance smoothing
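The variance-shrinkage step common to moderated t methods can be written in a few lines; in the paper's fully moderated version the prior values would vary with expression level rather than being constant, and the function below is a hedged sketch with hypothetical names.

```python
def moderated_t(mean_diff, s2, df, s2_prior, df_prior, n_eff):
    # Empirical Bayes shrinkage: the gene-wise sample variance s2 (df degrees
    # of freedom) is pulled toward a prior variance s2_prior (df_prior
    # pseudo-degrees of freedom) before forming the t-statistic
    s2_post = (df_prior * s2_prior + df * s2) / (df_prior + df)
    return mean_diff / (s2_post / n_eff) ** 0.5
```

Letting s2_prior and df_prior depend on expression level is the paper's extension: lowly and highly expressed genes then borrow strength from genes with similar expression rather than from all genes at once.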
In the past few years, several entropy-based tests have been proposed for testing either single SNP association or gene-gene interaction. These tests are mainly based on Shannon entropy and have higher statistical power when compared to standard χ2 tests. In this paper, we extend some of these tests using a more generalized entropy definition, Rényi entropy, of which Shannon entropy is the special case of order 1. The order λ (>0) of Rényi entropy weights the events (genotype/haplotype) according to their probabilities (frequencies). Higher λ places more emphasis on higher probability events, while smaller λ (close to 0) tends to assign weights more equally. Thus, by properly choosing λ, one can potentially increase the power of a test or reduce its p-value. We conducted simulation as well as real data analyses to assess the impact of the order λ and the performance of these generalized tests. The results showed that for the dominant model the order-2 test was more powerful, while for the multiplicative model the order-1 and order-2 tests had similar power. The analyses indicate that the choice of λ depends on the underlying genetic model and that Shannon entropy is not necessarily the most powerful entropy measure for constructing genetic association or interaction tests.
Rényi entropy; genetic association; gene-gene interaction
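The entropy family underlying these tests is easy to state directly; this is the standard Rényi definition from the abstract, with the order-1 case handled as its Shannon limit.

```python
import math

def renyi_entropy(p, lam):
    # H_lambda(p) = log(sum_i p_i^lambda) / (1 - lambda) for lambda != 1;
    # as lambda -> 1 this converges to Shannon entropy -sum p_i log p_i,
    # so Shannon is the order-1 special case mentioned in the abstract
    if abs(lam - 1.0) < 1e-9:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return math.log(sum(pi ** lam for pi in p if pi > 0)) / (1 - lam)
```

For a non-uniform genotype distribution, higher orders weight common genotypes more heavily and yield smaller entropy, which is how the choice of λ reshapes a test statistic.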
Certain residues have no known function yet are co-conserved across distantly related protein families and diverse organisms, suggesting that they perform critical roles associated with as-yet-unidentified molecular properties and mechanisms. This raises the question of how to obtain additional clues regarding these mysterious biochemical phenomena with a view to formulating experimentally testable hypotheses. One approach is to access the implicit biochemical information encoded within the vast amount of genomic sequence data now becoming available. Here, a new Gibbs sampling strategy is formulated and implemented that can partition hundreds of thousands of sequences within a major protein class into multiple, functionally-divergent categories based on those pattern residues that best discriminate between categories. The sampler precisely defines the partition and pattern for each category by explicitly modeling unrelated, non-functional and related-yet-divergent proteins that would otherwise obscure the analysis. To aid biological interpretation, auxiliary routines can characterize pattern residues within available crystal structures and identify those structures most likely to shed light on the roles of pattern residues. This approach can be used to define and annotate automatically subgroup-specific conserved domain profiles based on statistically-rigorous empirical criteria rather than on the subjective and labor-intensive process of manual curation. Incorporating such profiles into domain database search sites (such as the NCBI BLAST site) will provide biologists with previously inaccessible molecular information useful for hypothesis generation and experimental design. Analyses of P-loop GTPases and of AAA+ ATPases illustrate the sampler's ability to obtain such information.
protein sequence/structural analysis; Markov chain Monte Carlo sampling; Bayesian partitioning with pattern selection
One important use of statistical methods in application to biological data is measurement of evidence, or assessment of the degree to which data support one or another hypothesis. While there is a small literature on this topic, it seems safe to say that consensus has not yet been reached regarding how best, or most accurately, to measure statistical evidence. Here, we propose considering the problem as a measurement problem, rather than as a statistical problem per se, and we explore the consequences of this shift in perspective. Our arguments here are part of an ongoing research program focused on exploiting deep parallelisms between foundations of thermodynamics and foundations of “evidentialism,” in order to derive an absolute scale for the measurement of evidence, a general framework in the context of which that scale is validated, and the many ancillary benefits that come from having such a framework in place.
statistical evidence; measurement; thermodynamics
The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is both computationally efficient and models genetic causal mechanisms well. Despite its growing ubiquity, use of RF in the literature has been inconsistent and often less than optimal. The purpose of this review is to break down the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.
machine learning; SNP; genome wide association studies
The general availability of reliable and affordable genotyping technology has enabled genetic association studies to move beyond small case-control studies to large prospective studies. For prospective studies, genetic information can be integrated into the analysis via haplotypes, with focus on their association with a censored survival outcome. We develop non-iterative, regression-based methods to estimate associations between common haplotypes and a censored survival outcome in large cohort studies. Our non-iterative methods—weighted estimation and weighted haplotype combination—are both based on the Cox regression model, but differ in how the imputed haplotypes are integrated into the model. Our approaches enable haplotype imputation to be performed once as a simple data-processing step, and thus avoid implementation based on sophisticated algorithms that iterate between haplotype imputation and risk estimation. We show that non-iterative weighted estimation and weighted haplotype combination provide valid tests for genetic associations and reliable estimates of moderate associations between common haplotypes and a censored survival outcome, and are straightforward to implement in standard statistical software. We apply the methods to an analysis of HSPB7-CLCNKA haplotypes and risk of adverse outcomes in a prospective cohort study of outpatients with chronic heart failure.
Cox regression; phase ambiguity; prospective study; unphased genotypes
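The core imputation idea — replacing an unknown haplotype pair with phase probabilities — can be sketched as an expected haplotype count; this toy version illustrates only the data-processing step, not the paper's weighted estimation or weighted haplotype combination methods, and the two-SNP example is hypothetical.

```python
def expected_haplotype_count(phase_probs, haplotype):
    # Given P(phase pair | observed genotype) for each phase-consistent
    # haplotype pair, return the expected number of copies of `haplotype`;
    # this expected dosage can then enter a Cox model as a covariate
    return sum(p * (h1, h2).count(haplotype)
               for (h1, h2), p in phase_probs.items())

# ambiguous two-SNP genotype with two possible phasings
probs = {("AC", "GT"): 0.7, ("AT", "GC"): 0.3}
```

Because the dosages are computed once per subject, no iteration between haplotype imputation and risk estimation is needed, which is the non-iterative property the abstract emphasizes.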
Array-based Comparative Genomic Hybridization (aCGH) is a microarray-based technology that assists in identification of DNA sequence copy number changes across the genome. Examination of differences in instability phenotype, or pattern of copy number alterations, between cancer subtypes can aid in classification of cancers and lead to better understanding of the underlying cytogenetic mechanism. Instability phenotypes are composed of a variety of copy number alteration features including height or magnitude of copy number alteration level, frequency of transition between copy number states such as gain and loss, and total number of altered clones or probes. That is, instability phenotype is multivariate in nature. Current methods of instability phenotype assessment, however, are limited to univariate measures and are therefore limited in both sensitivity and interpretability. In this paper, a novel method of instability assessment is presented that is based on the Engler et al. (2006) pseudolikelihood approach for aCGH data analysis. Through use of a pseudolikelihood ratio test (PLRT), more sensitive assessment of instability phenotype differences between cancer subtypes is possible. Evaluation of the PLRT method is conducted through analysis of a meningioma data set and through simulation studies. Results are shown to be more accurate and more easily interpretable than current measures of instability assessment.
array-CGH; genetic instability; pseudolikelihood ratio test
Missing phenotype data can be a major hurdle to mapping quantitative trait loci (QTL). Though in many cases experiments may be designed to minimize the occurrence of missing data, it is often unavoidable in practice; thus, statistical methods to account for missing data are needed. In this paper we describe an approach for conjoining multiple imputation and QTL mapping. Methods are applied to map genes associated with increased breathing effort in mice after lung inflammation due to allergen challenge in developing lines of the Collaborative Cross, a new mouse genetics resource. Missing data poses a particular challenge in this study because the desired phenotype summary to be mapped is a function of incompletely observed dose-response curves. Comparison of the multiple imputation approach to two naive approaches for handling missing data suggest that these simpler methods may yield poor results: ignoring missing data through a complete case analysis may lead to incorrect conclusions, while using a last observation carried forward procedure, which does not account for uncertainty in the imputed values, may lead to anti-conservative inference. The proposed approach is widely applicable to other studies with missing phenotype data.
multiple imputation; missing data; quantitative trait loci
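The pooling step that distinguishes multiple imputation from last-observation-carried-forward can be sketched with Rubin's rules; this is the standard combining formula, shown here with hypothetical inputs, not the paper's QTL-specific machinery.

```python
def rubin_pool(estimates, variances):
    # Rubin's rules for combining analyses of m multiply imputed datasets:
    # the total variance adds the average within-imputation variance and an
    # inflated between-imputation variance — the uncertainty component that
    # single-imputation schemes like LOCF ignore, yielding anti-conservative
    # inference
    m = len(estimates)
    qbar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return qbar, total
```

In a QTL scan, each imputed dataset yields its own LOD or effect estimate at a locus; pooling with these rules propagates the phenotype-imputation uncertainty into the final inference.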
Simultaneously performing many hypothesis tests is a problem commonly encountered in high-dimensional biology. In this setting, a large set of p-values is calculated from many related features measured simultaneously. Classical statistics provides a criterion for defining what a “correct” p-value is when performing a single hypothesis test. We show here that even when each p-value is marginally correct under this single hypothesis criterion, it may be the case that the joint behavior of the entire set of p-values is problematic. On the other hand, there are cases where each p-value is marginally incorrect, yet the joint distribution of the set of p-values is satisfactory. Here, we propose a criterion defining a well behaved set of simultaneously calculated p-values that provides precise control of common error rates and we introduce diagnostic procedures for assessing whether the criterion is satisfied with simulations. Multiple testing p-values that satisfy our new criterion avoid potentially large study specific errors, but also satisfy the usual assumptions for strong control of false discovery rates and family-wise error rates. We utilize the new criterion and proposed diagnostics to investigate two common issues in high-dimensional multiple testing for genomics: dependent multiple hypothesis tests and pooled versus test-specific null distributions.
false discovery rate; multiple testing dependence; pooled null statistics
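The error-rate control at stake above is typified by the Benjamini-Hochberg step-up procedure, sketched here; BH's false discovery rate guarantee is one of the properties that presumes the set of p-values is well behaved in the sense the abstract proposes.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    # BH step-up: sort the m p-values, find the largest rank i with
    # p_(i) <= alpha * i / m, and reject the hypotheses with the i
    # smallest p-values; returns the set of rejected indices
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return {order[i] for i in range(k)}
```

If the joint distribution of the p-values is pathological — even with each p-value marginally correct — the realized false discovery proportion of such a procedure can stray far from its nominal level, which is the study-specific error the criterion guards against.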
An important challenge in statistical genomics concerns integrating experimental data with exogenous information about gene function. A number of statistical methods are available to address this challenge, but most do not accommodate complexities in the functional record. To infer activity of a functional category (e.g., a gene ontology term), most methods use gene-level data on that category, but do not use other functional properties of the same genes. Not doing so creates undue errors in inference. Recent developments in model-based category analysis aim to overcome this difficulty, but in attempting to do so they are faced with serious computational problems. This paper investigates statistical properties and the structure of posterior computation in one such model for the analysis of functional category data. We examine the graphical structures underlying posterior computation in the original parameterization and in a new parameterization aimed at leveraging elements of the model. We characterize identifiability of the underlying activation states, describe a new prior distribution, and introduce approximations that aim to support numerical methods for posterior inference.
probabilistic graphical modeling; role model; gene-set analysis