Varying depth of high-throughput sequencing reads along a chromosome makes it
possible to observe copy number variants (CNVs) in a sample relative to a
reference. In exome and other targeted sequencing projects, technical factors
increase variation in read depth while reducing the number of observed
locations, adding difficulty to the problem of identifying CNVs. We present a
hidden Markov model for detecting CNVs from raw read count data, using
background read depth from a control set as well as other positional covariates
such as GC-content. The model, exomeCopy, is applied to a large chromosome X
exome sequencing project, identifying a list of large unique CNVs. CNVs predicted
by the model and experimentally validated are then recovered using a
cross-platform control set from publicly available exome sequencing data.
Simulations show high sensitivity for detecting heterozygous and homozygous
CNVs, outperforming normalization and state-of-the-art segmentation methods.
exome sequencing; targeted sequencing; CNV; copy number variant; HMM; hidden Markov model
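The core of such an approach can be sketched as a small hidden Markov model: each copy-number state scales the expected background-normalized read count, and Viterbi decoding recovers the most likely state path. The sketch below is a minimal illustration of the idea, not exomeCopy's actual implementation (which uses negative binomial emissions and additional covariates such as GC-content); the Poisson emissions, state set, and transition probabilities here are illustrative choices.

```python
import numpy as np
from scipy.stats import poisson

def viterbi_cnv(counts, background, states=(0, 1, 2, 3, 4), p_stay=0.99):
    """Decode copy-number states from windowed read counts.

    counts:      observed read counts per window for the sample
    background:  expected counts per window at normal (diploid) copy number,
                 e.g. scaled control-set depth after GC correction
    """
    counts = np.asarray(counts)
    background = np.asarray(background, dtype=float)
    S, T = len(states), len(counts)
    # Poisson emission means scale with copy number / 2; a small floor keeps
    # the homozygous-deletion state (copy number 0) well defined.
    means = np.maximum(np.outer(np.array(states) / 2.0, background), 1e-3)
    log_e = poisson.logpmf(counts, means)                     # S x T
    log_a = np.full((S, S), np.log((1 - p_stay) / (S - 1)))   # transitions
    np.fill_diagonal(log_a, np.log(p_stay))
    v = log_e[:, 0] + np.log(1e-6)       # start heavily favoring diploid
    v[states.index(2)] = log_e[states.index(2), 0]
    back = np.zeros((S, T), dtype=int)
    for t in range(1, T):
        scores = v[:, None] + log_a      # scores[i, j]: prev state i -> j
        back[:, t] = np.argmax(scores, axis=0)
        v = scores[back[:, t], np.arange(S)] + log_e[:, t]
    path = [int(np.argmax(v))]
    for t in range(T - 1, 0, -1):
        path.append(back[path[-1], t])
    return [states[s] for s in reversed(path)]
```

On a toy example with background depth 100 per window, a run of windows at half depth is decoded as a heterozygous deletion (copy number 1).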
In studies of dynamic molecular networks in systems biology, experiments are increasingly exploiting technologies such as flow cytometry to generate data on marginal distributions of a few network nodes at snapshots in time. For example, levels of intracellular expression of a few genes, or cell surface protein markers, can be assayed at a series of interim time points and assumed steady-states under experimentally stimulated growth conditions in small cellular systems. Such marginal data on a small number of cellular markers will typically carry very limited information on the parameters and structure of dynamic network models, though experiments will typically be designed to expose variation in cellular phenotypes that are inherently related to some aspects of model parametrization and structure. Our work addresses statistical questions of how to integrate such data with dynamic stochastic models in order to properly quantify the information—or lack of information—it carries relative to models assumed. We present a Bayesian computational strategy coupled with a novel approach to summarizing and numerically characterizing biological phenotypes that are represented in terms of the resulting sample distributions of cellular markers. We build on Bayesian simulation methods and mixture modeling to define the approach to linking mechanistic mathematical models of network dynamics to snapshot data, using a toggle switch example integrating simulated and real data as context.
approximate Bayesian computation (ABC); biological signatures; dynamic stochastic network models; flow cytometry data; posterior simulation; synthetic gene circuit; systems biology; toggle switch model
Pathway or gene set analysis has become an increasingly popular approach for analyzing high-throughput biological experiments such as microarray gene expression studies. The purpose of pathway analysis is to identify differentially expressed pathways associated with outcomes. Important challenges in pathway analysis are selecting a subset of genes contributing most to association with clinical phenotypes and conducting statistical tests of association for the pathways efficiently. We propose a two-stage analysis strategy: (1) extract latent variables representing activities within each pathway using a dimension reduction approach based on adaptive elastic-net sparse principal component analysis; (2) integrate the latent variables with the regression modeling framework to analyze studies with different types of outcomes such as binary, continuous or survival outcomes. Our proposed approach is computationally efficient. For each pathway, because the latent variables are estimated in an unsupervised fashion without using disease outcome information, in the sample label permutation testing procedure, the latent variables only need to be calculated once rather than for each permutation resample. Using both simulated and real datasets, we show our approach performed favorably when compared with five other currently available pathway testing methods.
gene expression; microarray; pathway analysis; sparse principal component analysis
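The two-stage strategy, and the computational shortcut in the permutation test, can be sketched as follows. This is a toy illustration: scikit-learn's plain SparsePCA stands in for the adaptive elastic-net sparse PCA of the method, the data are simulated, and the association statistic is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: 60 samples, one pathway of 20 genes; the first 5 genes load on
# a shared "pathway activity" that also drives a binary outcome.
n, p = 60, 20
f = rng.normal(size=(n, 1))                      # latent pathway activity
X = rng.normal(size=(n, p))
X[:, :5] += 2.0 * f
y = (f[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Stage 1: one sparse latent variable for the pathway, fit unsupervised.
z = SparsePCA(n_components=1, alpha=0.5, random_state=0).fit_transform(X)

# Stage 2: test association by permuting sample labels.  Because stage 1
# never saw y, z is computed once and reused for every permutation resample.
def assoc_stat(z, y):
    return abs(LogisticRegression().fit(z, y).coef_[0, 0])

obs = assoc_stat(z, y)
perm = [assoc_stat(z, rng.permutation(y)) for _ in range(200)]
p_value = (1 + sum(s >= obs for s in perm)) / 201
```

The key efficiency point is visible in the loop: only the cheap stage-2 fit is repeated per resample, never the sparse PCA.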
Gene perturbation experiments are commonly used for the reconstruction of gene regulatory networks. Typical experimental methodology imposes persistent changes on the network. The resulting data must therefore be interpreted as a steady state from an altered gene regulatory network, rather than a direct observation of the original network. In this article an implicit modeling methodology is proposed in which the unperturbed network of interest is scored by first modeling the persistent perturbation, then predicting the steady state, which may then be compared to the observed data. This results in a many-to-one inverse problem, so a computational Bayesian approach is used to assess model uncertainty.
The methodology is first demonstrated on a number of synthetic networks. It is shown that the Bayesian approach correctly assigns high posterior probability to the network structure and steady state behavior. Further, it is demonstrated that where uncertainty of model features is indicated, the uncertainty may be accurately resolved with further perturbation experiments. The methodology is then applied to the modeling of a gene regulatory network using perturbation data from nine genes which have been shown to respond synergistically to known oncogenic mutations. A hypothetical model emerges which conforms to reported regulatory properties of these genes. Furthermore, the Bayesian methodology is shown to be consistent in the sense that multiple randomized applications of the fitting algorithm converge to an approximately common posterior density on the space of models. Such consistency is generally not feasible for algorithms which report only single models. We conclude that fully Bayesian methods, coupled with models which accurately account for experimental constraints, are a suitable tool for the inference of gene regulatory networks, in terms of accuracy, estimation of model uncertainty, and experimental design.
boolean network; gene perturbation; Bayesian modeling
Germline mosaicism is a genetic condition in which some germ cells of an individual contain a mutation. This condition violates the assumptions underlying classic genetic analysis and may lead to failure of such analysis. In this work we extend the statistical model used for genetic linkage analysis in order to incorporate germline mosaicism. We develop a likelihood ratio test for detecting whether a genetic trait has been introduced into a pedigree by germline mosaicism. We analyze the statistical properties of this test and evaluate its performance via computer simulations. We demonstrate that genetic linkage analysis has high power to identify linkage in the presence of germline mosaicism when our extended model is used. We further use this extended model to provide solid statistical evidence that the MDN syndrome studied by Genzer-Nir et al. has been introduced by germline mosaicism.
linkage analysis; germline mosaicism; statistical test; statistical model
In this paper, we develop a Genetic Algorithm that can address the fundamental problem of how one should weight the summary statistics included in an approximate Bayesian computation analysis built around an accept/reject algorithm, and how one might choose the tolerance for that analysis. We then demonstrate that using weighted statistics, and a well-chosen tolerance, in such an approximate Bayesian computation approach can result in improved performance, when compared to unweighted analyses, using one example drawn purely from statistics and two drawn from the estimation of population genetics parameters.
approximate Bayesian computation; genetic algorithms; summary statistics
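The accept/reject ABC scheme the paper builds on can be sketched as below. Here the summary-statistic weights and the tolerance are fixed by hand, whereas the paper's contribution is to tune them with a genetic algorithm; the target (the mean of a normal with known scale) and all numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: we wish to infer the mean of a normal with known sd = 1.
obs = rng.normal(loc=2.0, scale=1.0, size=50)

def summaries(x):
    return np.array([np.mean(x), np.median(x)])

s_obs = summaries(obs)
weights = np.array([1.0, 0.2])    # down-weight the redundant/noisier summary

accepted = []
for _ in range(20000):
    theta = rng.uniform(-10, 10)              # draw from the prior
    sim = rng.normal(theta, 1.0, size=50)     # simulate data under theta
    dist = np.sqrt(np.sum(weights * (summaries(sim) - s_obs) ** 2))
    if dist < 0.3:                            # tolerance
        accepted.append(theta)

posterior_mean = np.mean(accepted)
```

The accepted draws approximate the posterior; a genetic algorithm would search over `weights` and the tolerance to improve this approximation.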
Gene expression microarray experiments with few replications lead to great variability in estimates of gene variances. Several Bayesian methods have been developed to reduce this variability and to increase power. Thus far, moderated t methods have assumed a constant coefficient of variation (CV) for the gene variances. We provide evidence against this assumption, and extend the method by allowing the CV to vary with gene expression. Our CV-varying method, which we refer to as the fully moderated t-statistic, was compared to three other methods (ordinary t, and two moderated t predecessors). A simulation study and a familiar spike-in data set were used to assess the performance of the testing methods. The results showed that our CV-varying method had higher power than the other three methods, identified a greater number of true positives in spike-in data, fit simulated data under varying assumptions very well, and in a real data set better identified higher-expressing genes that were consistent with functional pathways associated with the experiments.
empirical Bayes; microarray data analysis; variance smoothing
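The key idea, a prior variance that varies with mean expression, can be sketched as below. The running-mean smoother and the fixed prior degrees of freedom `d0` are simplifications standing in for the paper's estimation procedure.

```python
import numpy as np

def fully_moderated_t(x1, x2, d0=4.0, window=101):
    """Two-sample moderated t with an expression-dependent prior variance.

    x1, x2: genes x replicates arrays for the two conditions.  The prior
    variance s0^2(A) is a running mean of log sample variance along the
    axis of mean expression A (a stand-in for the paper's smoother), and
    d0 fixes the strength of the shrinkage toward that prior.
    """
    n1, n2 = x1.shape[1], x2.shape[1]
    d = n1 + n2 - 2
    s2 = (np.var(x1, axis=1, ddof=1) * (n1 - 1)
          + np.var(x2, axis=1, ddof=1) * (n2 - 1)) / d      # pooled variance
    a = (x1.mean(axis=1) + x2.mean(axis=1)) / 2             # mean expression
    order = np.argsort(a)
    log_s2 = np.log(s2[order])
    ones = np.ones(window)
    # running mean with a properly shrinking window at the edges
    smooth = (np.convolve(log_s2, ones, mode="same")
              / np.convolve(np.ones_like(log_s2), ones, mode="same"))
    s0_sq = np.empty_like(s2)
    s0_sq[order] = np.exp(smooth)
    s2_shrunk = (d0 * s0_sq + d * s2) / (d0 + d)            # shrink to prior
    return (x1.mean(axis=1) - x2.mean(axis=1)) / np.sqrt(
        s2_shrunk * (1 / n1 + 1 / n2))
```

With a constant-CV prior, `s0_sq` would be a single number; letting it track mean expression is the extension the abstract describes.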
In the past few years, several entropy-based tests have been proposed for testing either single SNP association or gene-gene interaction. These tests are mainly based on Shannon entropy and have higher statistical power when compared to standard χ2 tests. In this paper, we extend some of these tests using a more generalized entropy definition, Rényi entropy, where Shannon entropy is a special case of order 1. The order λ (>0) of Rényi entropy weights the events (genotype/haplotype) according to their probabilities (frequencies). Higher λ places more emphasis on higher probability events while smaller λ (close to 0) tends to assign weights more equally. Thus, by properly choosing the λ, one can potentially increase the power of the tests or the p-value level of significance. We conducted simulation as well as real data analyses to assess the impact of the order λ and the performance of these generalized tests. The results showed that for dominant model the order 2 test was more powerful and for multiplicative model the order 1 or 2 had similar power. The analyses indicate that the choice of λ depends on the underlying genetic model and Shannon entropy is not necessarily the most powerful entropy measure for constructing genetic association or interaction tests.
Rényi entropy; genetic association; gene-gene interaction
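The generalized entropy itself is straightforward to compute; a minimal implementation, with the Shannon case recovered in the limit λ → 1, is:

```python
import numpy as np

def renyi_entropy(p, lam):
    """Rényi entropy of order lam > 0 for a probability vector p;
    lam -> 1 recovers Shannon entropy as a limiting case."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(lam, 1.0):
        return -np.sum(p * np.log(p))              # Shannon limit
    return np.log(np.sum(p ** lam)) / (1.0 - lam)
```

For a uniform genotype distribution the entropy equals log of the number of categories for every order, while for a skewed distribution larger λ (emphasizing high-frequency genotypes) gives a smaller entropy than the Shannon value.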
Certain residues have no known function yet are co-conserved across distantly related protein families and diverse organisms, suggesting that they perform critical roles associated with as-yet-unidentified molecular properties and mechanisms. This raises the question of how to obtain additional clues regarding these mysterious biochemical phenomena with a view to formulating experimentally testable hypotheses. One approach is to access the implicit biochemical information encoded within the vast amount of genomic sequence data now becoming available. Here, a new Gibbs sampling strategy is formulated and implemented that can partition hundreds of thousands of sequences within a major protein class into multiple, functionally-divergent categories based on those pattern residues that best discriminate between categories. The sampler precisely defines the partition and pattern for each category by explicitly modeling unrelated, non-functional and related-yet-divergent proteins that would otherwise obscure the analysis. To aid biological interpretation, auxiliary routines can characterize pattern residues within available crystal structures and identify those structures most likely to shed light on the roles of pattern residues. This approach can be used to automatically define and annotate subgroup-specific conserved domain profiles based on statistically-rigorous empirical criteria rather than on the subjective and labor-intensive process of manual curation. Incorporating such profiles into domain database search sites (such as the NCBI BLAST site) will provide biologists with previously inaccessible molecular information useful for hypothesis generation and experimental design. Analyses of P-loop GTPases and of AAA+ ATPases illustrate the sampler's ability to obtain such information.
protein sequence/structural analysis; Markov chain Monte Carlo sampling; Bayesian partitioning with pattern selection
One important use of statistical methods in application to biological data is measurement of evidence, or assessment of the degree to which data support one or another hypothesis. While there is a small literature on this topic, it seems safe to say that consensus has not yet been reached regarding how best, or most accurately, to measure statistical evidence. Here, we propose considering the problem as a measurement problem, rather than as a statistical problem per se, and we explore the consequences of this shift in perspective. Our arguments here are part of an ongoing research program focused on exploiting deep parallelisms between foundations of thermodynamics and foundations of “evidentialism,” in order to derive an absolute scale for the measurement of evidence, a general framework in the context of which that scale is validated, and the many ancillary benefits that come from having such a framework in place.
statistical evidence; measurement; thermodynamics
The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is computationally efficient and models genetic causal mechanisms well. With its growing ubiquity, however, use of RF in the literature has been inconsistent and often less than optimal. The purpose of this review is to break down the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as on discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.
machine learning; SNP; genome wide association studies
Array-based Comparative Genomic Hybridization (aCGH) is a microarray-based technology that assists in identification of DNA sequence copy number changes across the genome. Examination of differences in instability phenotype, or pattern of copy number alterations, between cancer subtypes can aid in classification of cancers and lead to better understanding of the underlying cytogenetic mechanism. Instability phenotypes are composed of a variety of copy number alteration features including height or magnitude of copy number alteration level, frequency of transition between copy number states such as gain and loss, and total number of altered clones or probes. That is, instability phenotype is multivariate in nature. Current methods of instability phenotype assessment, however, are limited to univariate measures and are therefore limited in both sensitivity and interpretability. In this paper, a novel method of instability assessment is presented that is based on the Engler et al. (2006) pseudolikelihood approach for aCGH data analysis. Through use of a pseudolikelihood ratio test (PLRT), more sensitive assessment of instability phenotype differences between cancer subtypes is possible. Evaluation of the PLRT method is conducted through analysis of a meningioma data set and through simulation studies. Results are shown to be more accurate and more easily interpretable than current measures of instability assessment.
array-CGH; genetic instability; pseudolikelihood ratio test
Missing phenotype data can be a major hurdle to mapping quantitative trait loci (QTL). Though in many cases experiments may be designed to minimize the occurrence of missing data, it is often unavoidable in practice; thus, statistical methods to account for missing data are needed. In this paper we describe an approach for conjoining multiple imputation and QTL mapping. Methods are applied to map genes associated with increased breathing effort in mice after lung inflammation due to allergen challenge in developing lines of the Collaborative Cross, a new mouse genetics resource. Missing data poses a particular challenge in this study because the desired phenotype summary to be mapped is a function of incompletely observed dose-response curves. Comparison of the multiple imputation approach to two naive approaches for handling missing data suggest that these simpler methods may yield poor results: ignoring missing data through a complete case analysis may lead to incorrect conclusions, while using a last observation carried forward procedure, which does not account for uncertainty in the imputed values, may lead to anti-conservative inference. The proposed approach is widely applicable to other studies with missing phenotype data.
multiple imputation; missing data; quantitative trait loci
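Once each imputed data set has been analyzed, the per-imputation effect estimates are combined by Rubin's rules, which is the standard pooling step in any multiple-imputation analysis; a minimal sketch:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates by Rubin's rules.

    estimates, variances: length-m sequences holding the effect estimate
    and its squared standard error from each completed (imputed) data set.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()                      # pooled point estimate
    ubar = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                    # between-imputation variance
    total = ubar + (1 + 1 / m) * b       # total variance of qbar
    return qbar, total
```

The between-imputation term `b` is exactly what a last-observation-carried-forward analysis discards, which is why such an analysis can be anti-conservative.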
Simultaneously performing many hypothesis tests is a problem commonly encountered in high-dimensional biology. In this setting, a large set of p-values is calculated from many related features measured simultaneously. Classical statistics provides a criterion for defining what a “correct” p-value is when performing a single hypothesis test. We show here that even when each p-value is marginally correct under this single hypothesis criterion, it may be the case that the joint behavior of the entire set of p-values is problematic. On the other hand, there are cases where each p-value is marginally incorrect, yet the joint distribution of the set of p-values is satisfactory. Here, we propose a criterion defining a well-behaved set of simultaneously calculated p-values that provides precise control of common error rates, and we introduce simulation-based diagnostic procedures for assessing whether the criterion is satisfied. Multiple testing p-values that satisfy our new criterion avoid potentially large study-specific errors, but also satisfy the usual assumptions for strong control of false discovery rates and family-wise error rates. We utilize the new criterion and proposed diagnostics to investigate two common issues in high-dimensional multiple testing for genomics: dependent multiple hypothesis tests and pooled versus test-specific null distributions.
false discovery rate; multiple testing dependence; pooled null statistics
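One simple diagnostic in this spirit (an illustration, not the paper's procedure) is to simulate many null studies of correlated tests and compare the study-to-study variability of the rejection count with its independent-test counterpart. Each p-value below is marginally uniform, yet equicorrelation makes the set of p-values behave erratically as a whole.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def rejection_counts(rho, m=200, n_studies=2000, alpha=0.05):
    """Simulate n_studies null 'studies' of m equicorrelated z-tests and
    record how many p-values fall below alpha in each study.  Marginally,
    every p-value is exactly uniform regardless of rho."""
    shared = rng.normal(size=(n_studies, 1))          # study-wide factor
    noise = rng.normal(size=(n_studies, m))
    z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * noise
    p = 2 * norm.sf(np.abs(z))                        # two-sided p-values
    return (p < alpha).sum(axis=1)

indep = rejection_counts(rho=0.0)
dep = rejection_counts(rho=0.8)
# Same average rejection rate, very different study-to-study variability.
```

The mean rejection count is the same in both settings, but under strong dependence its variance across studies is inflated far beyond the binomial value, which is the kind of study-specific error the criterion is designed to flag.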
Phylogenetic trees describe evolutionary relationships between related organisms
(taxa). One approach to estimating phylogenetic trees supposes that a matrix of
estimated evolutionary distances between taxa is available. Agglomerative
methods have been proposed in which closely related taxon-pairs are successively
combined to form ancestral taxa. Several of these computationally efficient
agglomerative algorithms involve steps to reduce the variance in estimated
distances. We propose an agglomerative phylogenetic method which focuses on
statistical modeling of variance components in distance estimates. We consider
how these variance components evolve during the agglomerative process. Our
method simultaneously produces two topologically identical rooted trees, one
tree having branch lengths proportional to elapsed time, and the other having
branch lengths proportional to underlying evolutionary divergence. The method
models two major sources of variation which have been separately discussed in
the literature: noise, reflecting inaccuracies in measuring divergences, and
distortion, reflecting randomness in the amounts of divergence in different
parts of the tree. The methodology is based on successive hierarchical
generalized least-squares regressions. It involves only means, variances and
covariances of distance estimates, thereby avoiding full distributional
assumptions. Exploitation of the algebraic structure of the estimation leads to
an algorithm with computational complexity comparable to the leading published
agglomerative methods. A parametric bootstrap procedure allows full uncertainty
in the phylogenetic reconstruction to be assessed. Software implementing the
methodology may be freely downloaded from StatTree.
agglomerative method; distance matrix; generalized least-squares regression; phylogenetic tree; variance-components model
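For background, the classical agglomerative baseline (UPGMA, which simply averages distances rather than modeling their variance components) can be written in a few lines; this illustrates the successive taxon-pair merging the abstract refers to, not the paper's generalized least-squares method.

```python
import numpy as np

def upgma(d, labels):
    """Classical UPGMA agglomeration of a distance matrix.  Under a
    molecular clock the resulting ultrametric tree has branch lengths
    proportional to elapsed time."""
    d = np.asarray(d, dtype=float).copy()
    clusters = [(lab, 1) for lab in labels]          # (subtree, leaf count)
    while len(clusters) > 1:
        n = len(clusters)
        # find the closest pair of current clusters
        i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: d[ij[0], ij[1]])
        (a, na), (b, nb) = clusters[i], clusters[j]
        merged = ((a, b), na + nb)
        # size-weighted average distance from the new cluster to the rest
        keep = [k for k in range(n) if k not in (i, j)]
        new_row = [(na * d[i, k] + nb * d[j, k]) / (na + nb) for k in keep]
        d = d[np.ix_(keep, keep)]
        d = np.pad(d, ((0, 1), (0, 1)))
        d[-1, :-1] = new_row
        d[:-1, -1] = new_row
        clusters = [clusters[k] for k in keep] + [merged]
    return clusters[0][0]
```

The paper's method replaces the simple averaging step with hierarchical generalized least-squares updates that track noise and distortion variance components.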
Information-theoretic metrics have been proposed for studying gene-gene and gene-environment interactions in genetic epidemiology. Although these metrics have proven very promising, they are typically interpreted in the context of communications and information transmission, diminishing their tangibility for epidemiologists and statisticians.
In this paper, we clarify the interpretation of information-theoretic metrics. In particular, we develop the methods so that their relation to the global properties of probability models is made clear and contrast them with log-linear models for multinomial data. Hopefully, a better understanding of their properties and probabilistic implications will promote their acceptance and correct usage in genetic epidemiology. Our novel development also suggests new approaches to model search and computation.
gene-environment interaction; gene-gene interactions; K-way Interaction Index; information theory
Recently, the amount of high-dimensional data has exploded, creating new analytical challenges for human genetics. Furthermore, much evidence suggests that common complex diseases may be due to complex etiologies such as gene-gene interactions, which are difficult to identify in high-dimensional data using traditional statistical approaches. Data-mining approaches are gaining popularity for variable selection in association studies, and one of the most commonly used methods to evaluate potential gene-gene interactions is Multifactor Dimensionality Reduction (MDR). Additionally, a number of penalized regression techniques, such as Lasso, are gaining popularity within the statistical community and are now being applied to association studies, including extensions for interactions. In this study, we compare the performance of MDR, the traditional lasso with L1 penalty (TL1), and the group lasso for categorical data with group-wise L1 penalty (GL1) to detect gene-gene interactions through a broad range of simulations.
We find that each method has both advantages and disadvantages, and relative performance is context dependent. TL1 frequently over-fits, identifying false positive as well as true positive loci. MDR has higher power for epistatic models that exhibit independent main effects; for both Lasso methods, main effects tend to dominate. For purely epistatic models, GL1 has the best performance for lower minor allele frequencies, but MDR performs best for higher frequencies. These results provide guidance of when each approach might be best suited for detecting and characterizing interactions with different mechanisms.
Multifactor Dimensionality Reduction (MDR); Lasso; gene-gene interactions
We develop recent work on using graphical models for linkage disequilibrium to provide efficient programs for model fitting, phasing, and imputation of missing data in large data sets. Two important features contribute to the computational efficiency: the separation of the model fitting and phasing-imputation processes into different programs, and holding in memory only the data within a moving window of loci during model fitting. Optimal parameter values were chosen by cross-validation to maximize the probability of correctly imputing masked genotypes. The best accuracy obtained is slightly below that from the Beagle program of Browning and Browning, and our fitting program is slower. However, for large data sets, it uses less storage. For a reference set of n individuals genotyped at m markers, the time and storage required for fitting a graphical model are approximately O(nm) and O(n+m), respectively. To impute the phases and missing data on n individuals using an already fitted graphical model requires O(nm) time and O(m) storage. While the times for fitting and imputation are both O(nm), the imputation process is considerably faster; thus, once a model is estimated from a reference data set, the marginal cost of phasing and imputing further samples is very low.
phasing-imputation; cross validation; SNP genotype assays
In longitudinal and repeated measures data analysis, often the goal is to determine the effect of a treatment or aspect on a particular outcome (e.g., disease progression). We consider a semiparametric repeated measures regression model, where the parametric component models effect of the variable of interest and any modification by other covariates. The expectation of this parametric component over the other covariates is a measure of variable importance. Here, we present a targeted maximum likelihood estimator of the finite dimensional regression parameter, which is easily estimated using standard software for generalized estimating equations.
The targeted maximum likelihood method provides double robust and locally efficient estimates of the variable importance parameters and inference based on the influence curve. We demonstrate these properties through simulation under correct and incorrect model specification, and apply our method in practice to estimating the activity of transcription factor (TF) over cell cycle in yeast. We specifically target the importance of SWI4, SWI6, MBP1, MCM1, ACE2, FKH2, NDD1, and SWI5.
The semiparametric model allows us to determine the importance of a TF at specific time points by specifying time indicators as potential effect modifiers of the TF. Our results are promising, showing significant importance trends during the expected time periods. This methodology can also be used as a variable importance analysis tool to assess the effect of a large number of variables such as gene expressions or single nucleotide polymorphisms.
targeted maximum likelihood; semiparametric; repeated measures; longitudinal; transcription factors
Characterizing dynamic gene expression patterns and predicting patient outcome are significant problems now and will attract even more interest as large-scale clinical investigation of microarrays proceeds. However, no method has yet been developed for predicting patient outcome from longitudinal gene expression, where patients' gene expression is monitored across time. Here, we propose a novel prediction approach for patient survival time that makes use of the time course structure of gene expression. This method is applied to a burn study. The genes involved in the final predictors are enriched in the inflammatory response and immune system related pathways. Moreover, our method is consistently better than prediction methods using gene expression from individual time points or simply pooling gene expression from each time point.
prediction; time course; gene expression; survival
Complex diseases will have multiple functional sites, and it will be invaluable to understand the cross-locus interaction in terms of linkage disequilibrium (LD) between those sites (epistasis) in addition to the haplotype-LD effects. We investigated the statistical properties of a class of matrix-based statistics to assess this epistasis. These statistical methods include two LD contrast tests (Zaykin et al., 2006) and partial least squares regression (Wang et al., 2008). To estimate Type 1 error rates and power, we simulated multiple two-variant disease models using the SIMLA software package. SIMLA allows for the joint action of up to two disease genes in the simulated data with all possible multiplicative interaction effects between them. Our goal was to detect an interaction between multiple disease-causing variants by means of their linkage disequilibrium (LD) patterns with other markers. We measured the effects that marginal disease effect size, haplotype LD, disease prevalence and minor allele frequency have on cross-locus interaction (epistasis).
In the setting of strong allele effects and strong interaction, the correlation between the two disease genes was weak (r = 0.2). In a complex system with multiple correlations (both marginal and interaction), it was difficult to determine the source of a significant result. Despite these complications, the partial least squares and modified LD contrast methods maintained adequate power to detect the epistatic effects; however, for many of the analyses we often could not separate interaction from a strong marginal effect. While we did not exhaust the entire parameter space of possible models, we do provide guidance on the effects that population parameters have on cross-locus interaction.
epistasis; linkage disequilibrium; complex disease; cardiovascular disease
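The LD contrast idea can be sketched directly: compare the marker correlation matrices of cases and controls, and calibrate the difference by permuting affection status. This is a generic illustration (Frobenius norm of the difference, permutation p-value), not the specific statistics of Zaykin et al. or the partial least squares formulation.

```python
import numpy as np

def ld_contrast(g_case, g_ctrl, n_perm=500, seed=0):
    """LD-contrast test comparing marker correlation matrices between
    cases and controls, with a permutation p-value.

    g_case, g_ctrl: individuals x markers genotype matrices (0/1/2).
    """
    rng = np.random.default_rng(seed)

    def stat(a, b):
        # Frobenius norm of the difference between correlation matrices
        return np.linalg.norm(np.corrcoef(a, rowvar=False)
                              - np.corrcoef(b, rowvar=False))

    obs = stat(g_case, g_ctrl)
    pooled = np.vstack([g_case, g_ctrl])
    n1 = len(g_case)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))   # shuffle affection status
        if stat(pooled[idx[:n1]], pooled[idx[n1:]]) >= obs:
            exceed += 1
    return obs, (1 + exceed) / (1 + n_perm)
```

In the extreme toy case below, two markers are in perfect LD among cases but independent among controls, and the permutation test flags the contrast.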
The recent emergence of massively parallel sequencing technologies has enabled an increasing number of human genome re-sequencing studies, notable among them being the 1000 Genomes Project. The main aim of these studies is to identify the yet unknown genetic variants in a genomic region, mostly low frequency variants (frequency less than 5%). We propose here a set of statistical tools that address how to optimally design such studies in order to increase the number of genetic variants we expect to discover. Within this framework, the tradeoff between lower coverage for more individuals and higher coverage for fewer individuals can be naturally solved.
The methods here are also useful for estimating the number of genetic variants missed in a discovery study performed at low coverage.
We show applications to simulated data based on coalescent models and to sequence data from the ENCODE project. In particular, we show the extent to which combining data from multiple populations in a discovery study may increase the number of genetic variants identified relative to studies on single populations.
species problem; variant discovery studies; sequencing technologies
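The coverage-versus-sample-size tradeoff can be made concrete with a deliberately simplified model (not the paper's coalescent-based framework): a variant is discovered if at least one carrier is sampled and that carrier, assumed heterozygous, contributes at least `min_reads` alternate-allele reads, with read counts taken as Poisson. All numbers below are illustrative only.

```python
import numpy as np
from scipy.stats import poisson

def p_discover(f, n, c, min_reads=2):
    """Toy probability that a variant of population frequency f is found
    when sequencing n diploid individuals at mean depth c."""
    p_carrier = 1.0 - (1.0 - f) ** (2 * n)          # >= 1 copy sampled
    # alternate-allele reads from a heterozygote ~ Poisson(c / 2)
    p_covered = poisson.sf(min_reads - 1, c / 2.0)  # >= min_reads alt reads
    return p_carrier * p_covered

# Fixed budget: n * c held constant -- compare low/medium/high coverage designs
budget = 4000.0
designs = [(50, budget / 50), (200, budget / 200), (800, budget / 800)]
probs = [p_discover(0.01, n, c) for n, c in designs]
```

For a 1% variant under this toy model, very high coverage on few individuals loses carriers while very low coverage on many individuals loses reads, so an intermediate design maximizes discovery, which is the tradeoff the abstract formalizes.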
High density tiling arrays are an effective strategy for genome-wide identification of transcription factor binding regions. Sliding window methods that calculate moving averages of log ratios or t-statistics have been useful for the analysis of tiling array data. Here, we present a method that generalizes the moving average approach to evaluate sliding windows of p-values by using combined p-value statistics. In particular, the combined p-value framework can be useful in situations when taking averages of the corresponding test-statistic for the hypothesis may not be appropriate or when it is difficult to assess the significance of these averages. We exhibit the strengths of the combined p-values methods on Drosophila tiling array data and assess their ability to predict genomic regions enriched for transcription factor binding. The predictions are evaluated based on their proximity to target genes and their enrichment of known transcription factor binding sites. We also present an application for the generalization of the moving average based on integrating two different tiling array experiments.
transcription factor; binding sequence; tiling array; combined p-value
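A minimal version of the combined p-value window can use Fisher's method, one of several possible combiners; the χ² reference distribution below assumes independent probe-level p-values, an assumption a real tiling-array analysis would need to address.

```python
import numpy as np
from scipy.stats import chi2

def sliding_fisher(pvals, k):
    """Fisher's combined p-value over sliding windows of k probes."""
    pvals = np.asarray(pvals, dtype=float)
    out = np.empty(len(pvals) - k + 1)
    for i in range(len(out)):
        # Fisher's statistic: -2 * sum(log p) ~ chi-square with 2k df
        stat = -2.0 * np.sum(np.log(pvals[i:i + k]))
        out[i] = chi2.sf(stat, df=2 * k)
    return out
```

A run of small probe-level p-values produces a sharply significant window even when no single probe is extreme enough on its own, which is the point of combining across the window.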
There has been increasing interest in predicting patients’ survival after therapy by investigating gene expression microarray data. In regression and classification models with high-dimensional genomic data, boosting has been successfully applied to build accurate predictive models and conduct variable selection simultaneously. We propose Buckley-James boosting for semiparametric accelerated failure time models with right censored survival data, which can be used to predict the survival of future patients from high-dimensional genomic data. In the spirit of the adaptive LASSO, twin boosting is also incorporated to fit sparser models. The proposed methods provide a unified approach to fitting linear models and non-linear effects models with possible interactions, and can perform variable selection and parameter estimation simultaneously. The proposed methods are evaluated by simulations and applied to a recent microarray gene expression data set for patients with diffuse large B-cell lymphoma under the current gold standard therapy.
boosting; accelerated failure time model; Buckley-James estimator; censored survival data; LASSO; variable selection
Cellular functions of living organisms are carried out through complex systems of interacting components. Including such interactions in the analysis, and considering sub-systems defined by biological pathways instead of individual components (e.g. genes), can lead to new findings about complex biological mechanisms. Networks are often used to capture such interactions and can be incorporated in models to improve the efficiency in estimation and inference. In this paper, we propose a model for incorporating external information about interactions among genes (proteins/metabolites) in differential analysis of gene sets. We exploit the framework of mixed linear models and propose a flexible inference procedure for analysis of changes in biological pathways. The proposed method facilitates the analysis of complex experiments, including multiple experimental conditions and temporal correlations among observations. We propose an efficient iterative algorithm for estimation of the model parameters and show that the proposed framework is asymptotically robust to the presence of noise in the network information. The performance of the proposed model is illustrated through the analysis of gene expression data for environmental stress response (ESR) in yeast, as well as simulated data sets.
gene network; enrichment analysis; gene set analysis; complex experiments; spatio-temporal model; mixed linear model; systems biology