Results 1-25 (82)

1.  Efficient identification of context dependent subgroups of risk from genome wide association studies 
We have developed a modified Patient Rule-Induction Method (PRIM) as an alternative strategy for analyzing representative samples of non-experimental human data to estimate and test the role of genomic variations as predictors of disease risk in etiologically heterogeneous sub-samples. The proposed strategy reaches a computational limit when the number of genomic variations (predictor variables) under study is large (> 500), because permutations are used to generate a null distribution for testing the significance of a term (defined by values of particular variables) that characterizes a sub-sample of individuals through the peeling and pasting processes. As an alternative, in this paper we introduce a theoretical strategy that allows quick calculation of the Type I and Type II errors incurred in evaluating terms during the peeling and pasting processes of a PRIM analysis; a permutation-based hypothesis test underestimates the former and cannot provide the latter. The resultant savings in computational time make it possible to consider larger numbers of genomic variations (an example genome-wide association study is given) when selecting statistically significant terms for PRIM prediction models.
doi:10.1515/sagmb-2013-0062
PMCID: PMC4171947  PMID: 24570412
PRIM; classification; GWAS
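To make the peeling process concrete, here is a minimal Python sketch of one greedy peeling pass on a numeric predictor matrix. The function name, peel fraction, and stopping rule are illustrative assumptions; the published, modified PRIM additionally tests each candidate term for statistical significance.

```python
import numpy as np

def prim_peel(X, y, peel_alpha=0.05, min_support=0.1):
    """One greedy peeling pass of the Patient Rule-Induction Method (PRIM).

    Repeatedly trims roughly the `peel_alpha` fraction of observations
    from the low or high edge of whichever predictor most increases the
    mean outcome inside the remaining box. Returns the box bounds and a
    mask of retained observations. Illustrative only: the modified PRIM
    described above also tests each peel for significance.
    """
    n, p = X.shape
    inside = np.ones(n, dtype=bool)
    box = [(-np.inf, np.inf)] * p
    while inside.mean() > min_support:
        best_gain, best = -np.inf, None
        for j in range(p):
            xj = X[inside, j]
            lo, hi = np.quantile(xj, [peel_alpha, 1 - peel_alpha])
            for bound, keep in (("lo", X[:, j] >= lo), ("hi", X[:, j] <= hi)):
                trial = inside & keep
                if trial.sum() == 0 or trial.sum() == inside.sum():
                    continue
                gain = y[trial].mean() - y[inside].mean()
                if gain > best_gain:
                    best_gain, best = gain, (j, bound, lo, hi, trial)
        if best is None or best_gain <= 0:
            break  # no peel improves the in-box mean outcome
        j, bound, lo, hi, trial = best
        old_lo, old_hi = box[j]
        box[j] = (max(old_lo, lo), old_hi) if bound == "lo" else (old_lo, min(old_hi, hi))
        inside = trial
    return box, inside
```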
2.  Hierarchical Bayesian mixture modelling for antigen-specific T-cell subtyping in combinatorially encoded flow cytometry studies 
Novel uses of automated flow cytometry technology for measuring levels of protein markers on thousands to millions of cells are driving an increasing need for relevant, customized Bayesian mixture modelling approaches in many areas of biomedical research and application. In studies of immune profiling in many biological areas, traditional flow cytometry measures relative levels of abundance of marker proteins using fluorescently labeled tags that identify specific markers by a single color. One specific and important recent development in this area is the use of combinatorial marker assays, in which each marker is targeted with a probe that is labeled with two or more fluorescent tags. The use of several colors enables the identification of, in principle, combinatorially increasing numbers of cell subtypes, each identified by a subset of colors. This represents a major advance in the ability to characterize variation in immune responses involving larger numbers of functionally differentiated cell subtypes. We describe novel classes of Markov chain Monte Carlo methods for model fitting that exploit distributed GPU (graphics processing unit) implementation. We discuss issues of cellular subtype identification in this novel, general model framework, and provide a detailed example using simulated data. We then describe application to a data set from an experimental study of antigen-specific T-cell subtyping using combinatorially encoded assays in human blood samples. Summary comments discuss broader questions in applications in immunology and aspects of statistical computation.
doi:10.1515/sagmb-2012-0001
PMCID: PMC4155753  PMID: 23629459
Dirichlet process mixtures; GPU computing; Hierarchical model; Immune profiling; Immune response biomarkers; Large data sets; Markov chain Monte Carlo; Massive mixture models; Multimers; Posterior simulation; Relabeling; T-cell subtyping
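As a small-scale illustration of mixture-based subtyping, the sketch below fits a truncated Dirichlet-process Gaussian mixture to simulated two-color intensities with scikit-learn. It uses variational inference rather than the paper's GPU-distributed MCMC, and the simulated subtypes and settings are assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Simulated two-color fluorescence intensities for three cell subtypes.
data = np.vstack([
    rng.normal([2.0, 0.5], 0.3, size=(500, 2)),   # subtype A: color 1 only
    rng.normal([0.5, 2.0], 0.3, size=(300, 2)),   # subtype B: color 2 only
    rng.normal([2.0, 2.0], 0.3, size=(200, 2)),   # subtype C: both colors
])

# Truncated Dirichlet-process Gaussian mixture; surplus components are
# shrunk toward zero weight rather than fixed in advance.
dpm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500, random_state=0,
).fit(data)
labels = dpm.predict(data)
print("occupied components:", np.unique(labels).size)
```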
3.  Testing clonality of three and more tumors using their loss of heterozygosity profiles 
Statistical Applications in Genetics and Molecular Biology. 2012;11(4).
Cancer patients often develop multiple malignancies that may be either metastatic spread of a previous cancer (clonal tumors) or new primary cancers (independent tumors). If diagnosis cannot be easily made on the basis of the pathology review, the patterns of somatic mutations in the tumors can be compared. Previously we have developed statistical methods for testing clonality of two tumors using their loss of heterozygosity (LOH) profiles at several candidate markers. These methods can be applied to all possible pairs of tumors when multiple tumors are analyzed, but this strategy can lead to inconsistent results and loss of statistical power. In this work we extend clonality tests to three and more malignancies from the same patient. A non-parametric test can be performed using any possible subset of tumors, with subsequent adjustment for multiple testing. A parametric likelihood model is developed for 3 or 4 tumors, and it can be used to estimate the phylogenetic tree of tumors. The proposed tests are more powerful than the combination of all possible pairwise tests.
doi:10.1515/1544-6115.1757
PMCID: PMC4097189  PMID: 22850066
clonality; LOH; metastasis; multiple tumors; concordant mutations test; likelihood ratio test
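A bare-bones version of the permutation logic can be sketched as follows. The statistic (markers at which every tumor shows LOH) and the null scheme are simplifications of the published tests, which also use which parental allele was lost.

```python
import numpy as np

def clonality_pvalue(loh, n_perm=10000, seed=0):
    """Permutation test of clonality for k tumors from one patient.

    `loh` is a (k, m) 0/1 array: LOH calls for k tumors at m markers.
    Statistic: number of markers at which every tumor shows LOH.
    Null: each tumor's LOH events fall on random markers, preserving
    each tumor's marginal LOH count. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    observed = np.all(loh == 1, axis=0).sum()
    hits = 0
    for _ in range(n_perm):
        perm = np.array([rng.permutation(row) for row in loh])
        if np.all(perm == 1, axis=0).sum() >= observed:
            hits += 1
    return hits / n_perm
```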
4.  Block-diagonal discriminant analysis and its bias-corrected rules 
High-throughput expression profiling allows simultaneous measurement of tens of thousands of genes. These data have motivated the development of reliable biomarkers for disease subtype identification and diagnosis. Many methods have been developed in the literature for analyzing these data, such as diagonal discriminant analysis, support vector machines, and k-nearest neighbor methods. The diagonal discriminant methods have been shown to perform well for high-dimensional data with small sample sizes, but despite their popularity they rest on an independence assumption that is unlikely to hold in practice. Recently, a gene-module-based linear discriminant analysis strategy has been proposed that utilizes the correlation among genes in discriminant analysis. However, that approach can be underpowered when the samples of the two classes are unbalanced. In this paper, we propose to correct the biases in the discriminant scores of block-diagonal discriminant analysis. In simulation studies, our proposed method outperforms other approaches in various settings. We also illustrate our proposed discriminant analysis method for analyzing microarray data studies.
doi:10.1515/sagmb-2012-0017
PMCID: PMC3856189  PMID: 23735433
bias-correction; block-diagonal; classification; high-dimensional data; linear discriminant analysis
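The basic block-diagonal discriminant rule can be sketched in a few lines of Python; the bias correction that is the paper's contribution is not reproduced here, and the block partition and ridge stabilizer are assumptions.

```python
import numpy as np

def bdda_scores(X_train, y_train, X_test, blocks):
    """Linear discriminant scores under a block-diagonal covariance.

    `blocks` is a list of index arrays partitioning the features into
    gene modules; covariance is estimated within blocks only, taken as
    zero across blocks. Sketch of the uncorrected rule only.
    """
    classes = np.unique(y_train)
    scores = np.zeros((X_test.shape[0], classes.size))
    for b in blocks:
        Xb = X_train[:, b]
        # Pooled within-class covariance of this block (ridge-stabilized).
        resid = np.vstack([Xb[y_train == c] - Xb[y_train == c].mean(0) for c in classes])
        S = np.atleast_2d(np.cov(resid, rowvar=False)) + 1e-3 * np.eye(len(b))
        Sinv = np.linalg.inv(S)
        for i, c in enumerate(classes):
            mu = Xb[y_train == c].mean(0)
            d = X_test[:, b] - mu
            scores[:, i] += -0.5 * np.einsum("ij,jk,ik->i", d, Sinv, d)
    return classes, scores  # predict with classes[scores.argmax(1)]
```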
5.  Recursively partitioned mixture model clustering of DNA methylation data using biologically informed correlation structures 
DNA methylation is a well-recognized epigenetic mechanism that has been the subject of a growing body of literature typically focused on the identification and study of profiles of DNA methylation and their association with human diseases and exposures. In recent years, a number of unsupervised clustering algorithms, both parametric and non-parametric, have been proposed for clustering large-scale DNA methylation data. However, most of these approaches do not incorporate known biological relationships of measured features, and in some cases, rely on unrealistic assumptions regarding the nature of DNA methylation. Here, we propose a modified version of a recursively partitioned mixture model (RPMM) that integrates information related to the proximity of CpG loci within the genome to inform correlation structures from which subsequent clustering analysis is based. Using simulations and four methylation data sets, we demonstrate that integrating biologically informative correlation structures within RPMM resulted in improved goodness-of-fit, clustering consistency, and the ability to detect biologically meaningful clusters compared to methods which ignore such correlation. Integrating biologically-informed correlation structures to enhance modeling techniques is motivated by the rapid increase in resolution of DNA methylation microarrays and the increasing understanding of the biology of this epigenetic mechanism.
doi:10.1515/sagmb-2012-0068
PMCID: PMC4007267  PMID: 23468465
finite mixture models; epigenetics; genomic data; model-based clustering
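A minimal sketch of a proximity-informed correlation structure appears below; the exponential decay and length scale are assumptions standing in for whatever structure the paper derives from CpG locations.

```python
import numpy as np

def cpg_proximity_corr(positions, length_scale=1000.0):
    """Biologically informed correlation matrix: methylation at CpG
    loci that are close on the genome is assumed more correlated,
    decaying exponentially with base-pair distance. The length scale
    is an assumed tuning parameter.
    """
    d = np.abs(positions[:, None] - positions[None, :])
    return np.exp(-d / length_scale)
```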
7.  Higher Order Asymptotics for Negative Binomial Regression Inferences from RNA-Sequencing Data 
RNA sequencing (RNA-Seq) is the current method of choice for characterizing transcriptomes and quantifying gene expression changes. This next generation sequencing-based method provides unprecedented depth and resolution. The negative binomial (NB) probability distribution has been shown to be a useful model for frequencies of mapped RNA-Seq reads and consequently provides a basis for statistical analysis of gene expression. Negative binomial exact tests are available for two-group comparisons but do not extend to negative binomial regression analysis, which is important for examining gene expression as a function of explanatory variables and for adjusted group comparisons accounting for other factors. We address the adequacy of available large-sample tests for the small sample sizes typically available from RNA-Seq studies and consider a higher-order asymptotic (HOA) adjustment to likelihood ratio tests. We demonstrate that 1) the HOA-adjusted likelihood ratio test is practically indistinguishable from the exact test in situations where the exact test is available, 2) the type I error of the HOA test matches the nominal specification in regression settings we examined via simulation, and 3) the power of the likelihood ratio test does not appear to be affected by the HOA adjustment. This work helps clarify the accuracy of the unadjusted likelihood ratio test and the degree of improvement available with the HOA adjustment. Furthermore, the HOA test may be preferable even when the exact test is available because it does not require ad hoc library size adjustments.
doi:10.1515/sagmb-2012-0071
PMCID: PMC3683603  PMID: 23502340
RNA-Seq; higher-order asymptotics; negative binomial; regression; overdispersion; extra-Poisson variation
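For reference, the unadjusted likelihood ratio test that the paper takes as its starting point can be run with statsmodels as below; the simulated design and dispersion are assumptions, and the HOA adjustment itself is not shown.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 20                                    # small sample, typical of RNA-Seq studies
group = rng.integers(0, 2, n)             # treatment indicator
loglib = rng.normal(15, 0.2, n)           # log library size, an adjustment covariate
mu = np.exp(3 + 0.9 * group + 0.3 * (loglib - 15))
y = rng.negative_binomial(n=5, p=5 / (5 + mu))   # NB counts with mean mu

X_full = sm.add_constant(np.column_stack([group, loglib]))
X_null = sm.add_constant(loglib)

# Unadjusted likelihood ratio test of the group effect; the paper's
# higher-order asymptotic (HOA) correction would modify this statistic.
llf_full = sm.NegativeBinomial(y, X_full).fit(disp=0).llf
llf_null = sm.NegativeBinomial(y, X_null).fit(disp=0).llf
lrt = 2 * (llf_full - llf_null)
print("LRT =", lrt, "p =", chi2.sf(lrt, df=1))
```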
8.  A Maximum Likelihood Approach to Functional Mapping of Longitudinal Binary Traits 
Despite their importance in biology and biomedicine, genetic mapping of binary traits that change over time has not been well explored. In this article, we develop a statistical model for mapping quantitative trait loci (QTLs) that govern longitudinal responses of binary traits. The model is constructed within the maximum likelihood framework, by which the association between binary responses is modeled in terms of conditional log odds-ratios. With this parameterization, the maximum likelihood estimates (MLEs) of marginal mean parameters are robust to misspecification of the time dependence. We implement an iterative procedure to obtain the MLEs of QTL genotype-specific parameters that define longitudinal binary responses. The usefulness of the model was validated by analyzing a real example in rice. Simulation studies were performed to investigate the statistical properties of the model, showing that it has power to identify and map specific QTLs responsible for the temporal pattern of binary traits.
doi:10.1515/1544-6115.1675
PMCID: PMC3856886  PMID: 23183762
binary trait; dynamic trait; functional mapping; maximum likelihood estimate
9.  Hierarchical Shrinkage Priors and Model Fitting for High-dimensional Generalized Linear Models 
Statistical Applications in Genetics and Molecular Biology. 2012;11(6).
Genetic and other scientific studies routinely generate very many predictor variables, which can be naturally grouped, with predictors in the same groups being highly correlated. It is desirable to incorporate the hierarchical structure of the predictor variables into generalized linear models for simultaneous variable selection and coefficient estimation. We propose two prior distributions on coefficients in generalized linear models: hierarchical Cauchy and hierarchical double-exponential distributions. The hierarchical priors include both variable-specific and group-specific tuning parameters, thereby not only adopting different shrinkage for different coefficients and different groups but also providing a way to pool the information within groups. We fit generalized linear models with the proposed hierarchical priors by incorporating flexible expectation-maximization (EM) algorithms into the standard iteratively weighted least squares as implemented in the general statistical package R. The methods are illustrated with data from an experiment to identify genetic polymorphisms for survival of mice following infection with Listeria monocytogenes. The performance of the proposed procedures is further assessed via simulation studies. The methods are implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
doi:10.1515/1544-6115.1803
PMCID: PMC3658361  PMID: 23192052
Adaptive Lasso; Bayesian inference; Generalized linear model; Genetic polymorphisms; Grouped variables; Hierarchical model; High-dimensional data; Shrinkage prior
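The EM idea can be illustrated for the simplest case, a single-level double-exponential prior on linear-model coefficients, using its normal scale-mixture representation; the hierarchical group-level parameters and the GLM/IWLS machinery of the paper are omitted, and `lam` is an assumed tuning value.

```python
import numpy as np

def em_laplace_shrinkage(X, y, lam=1.0, n_iter=100, eps=1e-6):
    """EM for a double-exponential (Laplace) shrinkage prior on
    linear-model coefficients. The Laplace prior is a scale mixture of
    normals, so each E-step yields coefficient-specific ridge penalties
    lam / |beta_j| and each M-step is a weighted ridge solve. Sketch of
    the single-level prior only.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        w = lam / np.maximum(np.abs(beta), eps)                     # E-step: expected precisions
        beta_new = np.linalg.solve(X.T @ X + np.diag(w), X.T @ y)   # M-step: weighted ridge
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            break
        beta = beta_new
    return beta
```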
10.  Composite Likelihood Modeling of Neighboring Site Correlations of DNA Sequence Substitution Rates 
Sequence data from a series of homologous DNA segments from related organisms are typically polymorphic at many sites, and these polymorphisms are the result of evolutionary processes. Such data may be used to estimate the substitution rates as well as the variability of these rates. Careful characterization of the distribution of this variation is essential for accurate estimation of evolutionary distances and phylogeny reconstruction among these sequences. Many researchers have recognized the importance of the variability of substitution rates, which most have modeled using a discrete gamma distribution. Some have extended these methods to explicitly account for the correlation of substitution rates among sites using hidden Markov models; others have proposed context-dependent substitution rate schemes. We accommodate these correlations using a composite likelihood method based on a bivariate gamma distribution, which is more flexible than hidden Markov models in terms of correlation structure and more computationally tractable compared to the context-dependent schemes. We show that the estimates have good theoretical properties. We also use simulations to compare the maximum composite likelihood estimates to those obtained from maximum likelihood based on the independence assumption. We use data from the mitochondrial DNA of ten primates to obtain maximum composite likelihood estimates of the mean substitution rate, overdispersion, and correlation parameters, and use these estimates in a parametric phylogenetic bootstrap to assess the impact of serial correlation on the estimates of substitution rates and branch lengths.
doi:10.2202/1544-6115.1391
PMCID: PMC3839231  PMID: 19222389
bivariate negative binomial distribution; composite likelihood; substitution rate; phylogeny; parametric bootstrap
11.  Stopping-Time Resampling and Population Genetic Inference under Coalescent Models 
To extract full information from samples of DNA sequence data, it is necessary to use sophisticated model-based techniques such as importance sampling under the coalescent. However, these are limited in the size of datasets they can handle efficiently. Chen and Liu (2000) introduced the idea of stopping-time resampling and showed that it can dramatically improve the efficiency of importance sampling methods under a finite-alleles coalescent model. In this paper, a new framework is developed for designing stopping-time resampling schemes under more general models. It is implemented on data from both infinite-sites and stepwise models of mutation, and extended to incorporate crossover recombination. A simulation study shows that this new framework offers a substantial improvement in the accuracy of likelihood estimation over a range of parameters, while a direct application of the scheme of Chen and Liu (2000) can actually degrade the accuracy of the estimate. The method imposes no additional computational burden and is robust to the choice of parameters.
doi:10.2202/1544-6115.1770
PMCID: PMC3800802  PMID: 22499685
coalescence; importance sampling; resampling; infinite sites
12.  Analyzing Genetic Association Studies with an Extended Propensity Score Approach 
Statistical Applications in Genetics and Molecular Biology. 2012;11(5).
Propensity scores are commonly used to address confounding in observational studies. However, they have not been previously adapted to deal with bias in genetic association studies. We propose an extension of our previous method (Zhao et al., 2009) that uses a multilevel propensity score approach and allows one to estimate the effect of a genotype under an additive model and also simultaneously adjusts for confounders such as genetic ancestry and patient and disease characteristics. Using simulation studies, we demonstrate that this extended genetic propensity score (eGPS) can adequately adjust and consistently correct for bias due to confounding in a variety of circumstances. Under all simulation scenarios, the eGPS method yields estimates with bias close to 0 (mean=0.018, standard error=0.01). Our method also preserves statistical properties such as coverage probability, Type I error, and power. We illustrate this approach in a population-based genetic association study of testicular germ cell tumors and KITLG and SPRY4 susceptibility genes. We conclude that our method provides a novel and broadly applicable analytic strategy for obtaining less biased and more valid estimates of genetic associations.
doi:10.1515/1544-6115.1790
PMCID: PMC3518898  PMID: 23104843
population-based genetic association; propensity scores; population stratification; confounding; genetic and non-genetic covariates; susceptibility genes
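A schematic of the idea, not the eGPS method itself: model genotype on confounders, summarize the fitted class probabilities as an expected genotype count, and adjust the outcome model for that scalar. All simulated quantities and model choices below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
ancestry = rng.normal(size=n)                     # confounder (e.g., genetic ancestry)
p_allele = 1 / (1 + np.exp(-(-0.5 + ancestry)))   # allele frequency varies with ancestry
g = rng.binomial(2, p_allele)                     # additive genotype coding 0/1/2
y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 0.4 * g + 0.8 * ancestry))))

# Multinomial model for genotype given confounders; the expected genotype
# count serves as a scalar propensity summary under the additive coding.
ps_model = LogisticRegression(max_iter=1000).fit(ancestry.reshape(-1, 1), g)
e_g = ps_model.predict_proba(ancestry.reshape(-1, 1)) @ np.arange(3)

# Outcome model adjusted for the propensity summary rather than raw covariates.
X = sm.add_constant(np.column_stack([g, e_g]))
print(sm.Logit(y, X).fit(disp=0).params)  # genotype coefficient, PS-adjusted
```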
13.  A New Approach for the Joint Analysis of Multiple ChIP-Seq Libraries with Application to Histone Modification 
Most approaches for analyzing ChIP-Seq data are focused on inferring exact protein binding sites from a single library. However, frequently multiple ChIP-Seq libraries derived from differing cell lines or tissue types from the same individual may be available. In such a situation, a separate analysis for each tissue or cell line may be inefficient. Here, we describe a novel method to analyze such data that intelligently uses the joint information from multiple related ChIP-Seq libraries. We present our method as a two-stage procedure. First, separate single cell line analysis is performed for each cell line. Here, we use a novel mixture regression approach to infer the subset of genes that are most likely to be involved in protein binding in each cell line. In the second step, we combine the separate single cell line analyses using an Empirical Bayes algorithm that implicitly incorporates inter-cell line correlation. We demonstrate the usefulness of our method using both simulated data, as well as real H3K4me3 and H3K27me3 histone methylation libraries.
doi:10.1515/1544-6115.1660
PMCID: PMC3770480  PMID: 22499701
Empirical Bayes; EM; ChIP-Seq; histone methylation
14.  The practical effect of batch on genomic prediction 
Measurements from microarrays and other high-throughput technologies are susceptible to non-biological artifacts like batch effects. It is known that batch effects can alter or obscure the set of significant results and biological conclusions in high-throughput studies. Here we examine the impact of batch effects on predictors built from genomic technologies. To investigate batch effects, we collected publicly available gene expression measurements with known outcomes, and estimated batches using date. Using these data we show (1) the impact of batch effects on prediction depends on the correlation between outcome and batch in the training data, and (2) removing expression measurements most affected by batch before building predictors may improve the accuracy of those predictors. These results suggest that (1) training sets should be designed to minimize correlation between batches and outcome, and (2) methods for identifying batch-affected probes should be developed to improve prediction results for studies with high correlation between batches and outcome.
PMCID: PMC3760371  PMID: 22611599
batch effects; prediction; microarrays; reproducibility; research design
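A minimal sketch of the second suggestion, removing the most batch-affected features before training, assuming batches are known (e.g., estimated from processing date); the F-statistic ranking and drop fraction are illustrative choices.

```python
import numpy as np
from scipy.stats import f_oneway

def drop_batch_probes(X, batch, frac=0.2):
    """Rank features by a one-way ANOVA F statistic across batches and
    keep all but the most batch-affected fraction. A schematic of the
    strategy the abstract suggests; `frac` is an assumed tuning choice.
    """
    F = np.array([f_oneway(*(X[batch == b, j] for b in np.unique(batch))).statistic
                  for j in range(X.shape[1])])
    keep = np.argsort(F)[: int(X.shape[1] * (1 - frac))]
    return keep

# Usage: keep = drop_batch_probes(X_train, batch_train), then train any
# classifier on X_train[:, keep] and evaluate on X_test[:, keep].
```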
15.  Fast Wavelet Based Functional Models for Transcriptome Analysis with Tiling Arrays 
For a better understanding of the biology of an organism, a complete description is needed of all regions of the genome that are actively transcribed. Tiling arrays are used for this purpose. They allow for the discovery of novel transcripts and the assessment of differential expression between two or more experimental conditions such as genotype, treatment, tissue, etc. In the tiling array literature, many efforts are devoted to transcript discovery, whereas more recent developments also focus on differential expression. To our knowledge, however, no methods for tiling arrays have been described that can simultaneously assess transcript discovery and identify differentially expressed transcripts. In this paper, we adapt wavelet-based functional models to the context of tiling arrays. The high dimensionality of the data led us to avoid inference based on Bayesian MCMC methods. Instead, we introduce a fast empirical Bayes method that provides adaptive regularization of the functional effects. A simulation study and a case study illustrate that our approach is well suited for the simultaneous assessment of transcript discovery and differential expression in tiling array studies, and that it outperforms methods that accomplish only one of these tasks.
doi:10.2202/1544-6115.1726
PMCID: PMC3750750  PMID: 22499683
tiling microarray; wavelets; adaptive regularization; transcript discovery; differential expression; genomics; Arabidopsis thaliana
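As a generic stand-in for the wavelet machinery, the sketch below soft-thresholds detail coefficients at the universal threshold using PyWavelets; the paper instead sets the regularization adaptively by empirical Bayes, so the wavelet family, level, and threshold here are assumptions.

```python
import numpy as np
import pywt

def wavelet_smooth(signal, wavelet="haar", level=5):
    """Wavelet regularization of a probe-level tiling-array signal:
    decompose, soft-threshold the detail coefficients at a universal
    threshold, reconstruct. A generic denoising sketch, not the
    paper's empirical Bayes shrinkage.
    """
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745     # noise scale from finest level
    thr = sigma * np.sqrt(2 * np.log(len(signal)))     # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]
```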
16.  A Family-Based Probabilistic Method for Capturing De Novo Mutations from High-Throughput Short-Read Sequencing Data 
Statistical Applications in Genetics and Molecular Biology. 2012;11(2).
Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date.
doi:10.2202/1544-6115.1713
PMCID: PMC3728889  PMID: 22499693
de novo mutations; pedigree; short-read data; mutation rates; trio model
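A simplified single-site version of a trio model conveys the idea: sum over parental and child genotypes, weighing Hardy-Weinberg priors, Mendelian transmission with a small de novo probability, and binomial read likelihoods. All parameter values and the exact mixture form are assumptions, not the published pedigree-wide framework.

```python
import numpy as np
from itertools import product
from scipy.stats import binom

def denovo_posterior(reads, q=0.01, mu=1e-8, err=0.005):
    """Posterior probability of a de novo mutation at one biallelic site.

    `reads` maps 'mom', 'dad', 'child' to (ref, alt) read counts.
    Parental genotype priors follow Hardy-Weinberg with alt frequency
    q; child genotypes get Mendelian transmission probabilities, with
    a small probability mu of a de novo event. Read likelihoods are
    binomial with error rate err. A single-site sketch only.
    """
    hwe = np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])  # P(genotype 0/1/2)
    p_alt = np.array([err, 0.5, 1 - err])                    # P(alt read | genotype)

    def read_lik(person, g):
        r, a = reads[person]
        return binom.pmf(a, r + a, p_alt[g])

    num = den = 0.0
    for gm, gf, gc in product(range(3), repeat=3):
        pm, pf = gm / 2.0, gf / 2.0        # P(alt allele transmitted | parent genotype)
        mendel = np.array([(1 - pm) * (1 - pf), pm * (1 - pf) + (1 - pm) * pf, pm * pf])
        lik_reads = read_lik("mom", gm) * read_lik("dad", gf) * read_lik("child", gc)
        den += hwe[gm] * hwe[gf] * ((1 - mu) * mendel[gc] + mu / 3.0) * lik_reads
        num += hwe[gm] * hwe[gf] * (mu / 3.0) * lik_reads   # de novo branch only
    return num / den

# Example: both parents look homozygous reference, child looks heterozygous.
print(denovo_posterior({"mom": (30, 0), "dad": (28, 0), "child": (15, 14)}))
```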
17.  Improving Pedigree-based Linkage Analysis by Estimating Coancestry Among Families 
Statistical Applications in Genetics and Molecular Biology. 2012;11(2).
We present a method for improving the power of linkage analysis by detecting chromosome segments shared identical by descent (IBD) by individuals not known to be related. Existing Markov chain Monte Carlo methods sample descent patterns on pedigrees conditional on observed marker data. These patterns can be stored as IBD graphs, which express shared ancestry only, rather than specific family relationships. A model for IBD between unrelated individuals allows the estimation of coancestry between individuals in different pedigrees. IBD graphs on separate pedigrees can then be combined using these estimates. We report results from analyses of three sets of simulated marker data on two different pedigrees. We show that when families share a gene for a trait due to shared ancestry on the order of tens of generations, our method can detect a linkage signal when independent analyses of the families do not.
doi:10.2202/1544-6115.1718
PMCID: PMC3368982  PMID: 22499699
linkage; pedigrees; gene coancestry; IBD
18.  Candidate Pathway Based Analysis for Cleft Lip with or without Cleft Palate 
Statistical Applications in Genetics and Molecular Biology. 2012;11(2).
doi:10.2202/1544-6115.1717
PMCID: PMC3404742  PMID: 22499695
19.  Fast Identification of Biological Pathways Associated with a Quantitative Trait Using Group Lasso with Overlaps 
Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within biological pathways, the incorporation of prior pathways information into a statistical model is expected to increase the power to detect true associations in a genetic association study. Most existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the dependence patterns among SNPs within pathways.
We use a sparse regression model, with SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait. Notable features of our “pathways group lasso with adaptive weights” (P-GLAW) algorithm include the incorporation of all pathways in a single regression model, an adaptive pathway weighting procedure that accounts for factors biasing pathway selection, and the use of a bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes account of the presence of overlapping pathways and uses a novel combination of techniques to optimise model estimation, making it fast to run, even on whole genome datasets.
In a comparison study with an alternative pathways method based on univariate SNP statistics, our method demonstrates high sensitivity and specificity for the detection of important pathways, showing the greatest relative gains in performance where marginal SNP effect sizes are small.
doi:10.2202/1544-6115.1755
PMCID: PMC3491888  PMID: 22499682
pathways; GWAs; quantitative traits; group lasso; penalised regression; Alzheimer’s disease; imaging genetics
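The core computational step, block soft-thresholding inside a proximal gradient loop, can be sketched as follows for non-overlapping groups; P-GLAW's handling of overlaps, adaptive weights, and bootstrap ranking are not shown, and the step size and penalty are assumptions.

```python
import numpy as np

def group_lasso_prox_descent(X, y, groups, lam=1.0, n_iter=500):
    """Proximal gradient descent for the (non-overlapping) group lasso
    on a quantitative trait: 0.5*||y - X beta||^2 + lam * sum_g ||beta_g||_2.
    `groups` is a list of SNP-index arrays, one per pathway.
    """
    step = 1.0 / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))
        for g in groups:
            norm = np.linalg.norm(z[g])
            # Block soft-thresholding: whole pathways are zeroed together.
            z[g] = max(0.0, 1 - step * lam / norm) * z[g] if norm > 0 else 0.0
        beta = z
    return beta
```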
20.  Modeling Read Counts for CNV Detection in Exome Sequencing Data 
Varying depth of high-throughput sequencing reads along a chromosome makes it possible to observe copy number variants (CNVs) in a sample relative to a reference. In exome and other targeted sequencing projects, technical factors increase variation in read depth while reducing the number of observed locations, adding difficulty to the problem of identifying CNVs. We present a hidden Markov model for detecting CNVs from raw read count data, using background read depth from a control set as well as other positional covariates such as GC-content. The model, exomeCopy, is applied to a large chromosome X exome sequencing project identifying a list of large unique CNVs. CNVs predicted by the model and experimentally validated are then recovered using a cross-platform control set from publicly available exome sequencing data. Simulations show high sensitivity for detecting heterozygous and homozygous CNVs, outperforming normalization and state-of-the-art segmentation methods.
doi:10.2202/1544-6115.1732
PMCID: PMC3517018  PMID: 23089826
exome sequencing; targeted sequencing; CNV; copy number variant; HMM; hidden Markov model
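A bare-bones Viterbi decoder in the spirit of exomeCopy is sketched below, with Poisson emissions scaled by background depth; the copy-ratio states, transition probabilities, and omission of positional covariates such as GC-content are simplifying assumptions.

```python
import numpy as np
from scipy.stats import poisson

def viterbi_cnv(counts, background, states=(0.01, 0.5, 1.0, 1.5), stay=0.999):
    """Viterbi decoding of copy-number states from exome read counts.

    Emissions are Poisson with mean = copy ratio * background depth per
    target; `states` are copy ratios (homozygous deletion, heterozygous
    deletion, normal, duplication) and `stay` the self-transition
    probability. Returns the decoded copy ratio per target.
    """
    K, T = len(states), len(counts)
    log_emit = np.array([poisson.logpmf(counts, r * background) for r in states])
    log_trans = np.full((K, K), np.log((1 - stay) / (K - 1)))
    np.fill_diagonal(log_trans, np.log(stay))
    delta = log_emit[:, 0].copy()
    back = np.zeros((K, T), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans          # cand[from, to]
        back[:, t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + log_emit[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return np.array(states)[path]
```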
21.  Bayesian Learning from Marginal Data in Bionetwork Models 
In studies of dynamic molecular networks in systems biology, experiments are increasingly exploiting technologies such as flow cytometry to generate data on marginal distributions of a few network nodes at snapshots in time. For example, levels of intracellular expression of a few genes, or cell surface protein markers, can be assayed at a series of interim time points and assumed steady-states under experimentally stimulated growth conditions in small cellular systems. Such marginal data on a small number of cellular markers will typically carry very limited information on the parameters and structure of dynamic network models, though experiments will typically be designed to expose variation in cellular phenotypes that are inherently related to some aspects of model parametrization and structure. Our work addresses statistical questions of how to integrate such data with dynamic stochastic models in order to properly quantify the information—or lack of information—it carries relative to models assumed. We present a Bayesian computational strategy coupled with a novel approach to summarizing and numerically characterizing biological phenotypes that are represented in terms of the resulting sample distributions of cellular markers. We build on Bayesian simulation methods and mixture modeling to define the approach to linking mechanistic mathematical models of network dynamics to snapshot data, using a toggle switch example integrating simulated and real data as context.
doi:10.2202/1544-6115.1684
PMCID: PMC3215428  PMID: 23089812
approximate Bayesian computation (ABC); biological signatures; dynamic stochastic network models; flow cytometry data; posterior simulation; synthetic gene circuit; systems biology; toggle switch model
22.  Adaptive Elastic-Net Sparse Principal Component Analysis for Pathway Association Testing 
Pathway or gene set analysis has become an increasingly popular approach for analyzing high-throughput biological experiments such as microarray gene expression studies. The purpose of pathway analysis is to identify differentially expressed pathways associated with outcomes. Important challenges in pathway analysis are selecting a subset of genes contributing most to association with clinical phenotypes and conducting statistical tests of association for the pathways efficiently. We propose a two-stage analysis strategy: (1) extract latent variables representing activities within each pathway using a dimension reduction approach based on adaptive elastic-net sparse principal component analysis; (2) integrate the latent variables with the regression modeling framework to analyze studies with different types of outcomes such as binary, continuous or survival outcomes. Our proposed approach is computationally efficient. For each pathway, because the latent variables are estimated in an unsupervised fashion without using disease outcome information, in the sample label permutation testing procedure, the latent variables only need to be calculated once rather than for each permutation resample. Using both simulated and real datasets, we show our approach performed favorably when compared with five other currently available pathway testing methods.
doi:10.2202/1544-6115.1697
PMCID: PMC3215429  PMID: 23089825
gene expression; microarray; pathway analysis; sparse principal component analysis
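The two-stage flow can be sketched with scikit-learn's SparsePCA standing in for the adaptive elastic-net variant; because stage 1 never sees the outcome, the latent variables are computed once and reused across permutation resamples. Parameter choices below are assumptions.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(3)
X_pathway = rng.normal(size=(100, 30))   # expression of 30 genes in one pathway

# Stage 1: unsupervised latent pathway activity. Because the outcome is
# never used here, this is computed once, not per permutation resample.
z = SparsePCA(n_components=1, alpha=1.0, random_state=0).fit_transform(X_pathway)

# Stage 2: association of z with an outcome would be tested by plugging
# it into a regression model suited to the outcome type, with
# significance from permuting the sample labels of the outcome only.
print(z.shape)  # (100, 1) latent variable, one value per sample
```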
23.  Fitting Boolean Networks from Steady State Perturbation Data 
Gene perturbation experiments are commonly used for the reconstruction of gene regulatory networks. Typical experimental methodology imposes persistent changes on the network. The resulting data must therefore be interpreted as a steady state from an altered gene regulatory network, rather than a direct observation of the original network. In this article an implicit modeling methodology is proposed in which the unperturbed network of interest is scored by first modeling the persistent perturbation, then predicting the steady state, which may then be compared to the observed data. This results in a many-to-one inverse problem, so a computational Bayesian approach is used to assess model uncertainty.
The methodology is first demonstrated on a number of synthetic networks. It is shown that the Bayesian approach correctly assigns high posterior probability to the network structure and steady state behavior. Further, it is demonstrated that where uncertainty of model features is indicated, the uncertainty may be accurately resolved with further perturbation experiments. The methodology is then applied to the modeling of a gene regulatory network using perturbation data from nine genes which have been shown to respond synergistically to known oncogenic mutations. A hypothetical model emerges which conforms to reported regulatory properties of these genes. Furthermore, the Bayesian methodology is shown to be consistent in the sense that multiple randomized applications of the fitting algorithm converge to an approximately common posterior density on the space of models. Such consistency is generally not feasible for algorithms which report only single models. We conclude that fully Bayesian methods, coupled with models which accurately account for experimental constraints, are a suitable tool for the inference of gene regulatory networks, in terms of accuracy, estimation of model uncertainty, and experimental design.
doi:10.2202/1544-6115.1727
PMCID: PMC3215431  PMID: 23089817
boolean network; gene perturbation; Bayesian modeling
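A toy enumeration shows the modeling setup: a persistent perturbation is represented by clamping a gene, and candidate networks are judged by their perturbed steady states. The three-gene toggle network below is an invented example, and exhaustive enumeration only suits small networks.

```python
from itertools import product

def steady_states(rules, clamp=None):
    """Enumerate fixed points of a synchronous Boolean network, with an
    optional persistent perturbation that clamps a gene to a value.
    `rules` maps each gene index to a function of the full state tuple.
    A toy enumeration for small networks only.
    """
    n = len(rules)
    fixed = []
    for state in product((0, 1), repeat=n):
        nxt = tuple(clamp[i] if clamp and i in clamp else int(rules[i](state))
                    for i in range(n))
        if nxt == state:
            fixed.append(state)
    return fixed

# Mutual-repression toggle (genes 0 and 1) driving a reporter (gene 2).
rules = {0: lambda s: not s[1], 1: lambda s: not s[0], 2: lambda s: s[0]}
print("unperturbed:", steady_states(rules))            # two stable states
print("gene 0 knocked out:", steady_states(rules, clamp={0: 0}))
```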
24.  Genetic Linkage Analysis in the Presence of Germline Mosaicism 
Germline mosaicism is a genetic condition in which some germ cells of an individual contain a mutation. This condition violates the assumptions underlying classic genetic analysis and may lead to failure of such analysis. In this work we extend the statistical model used for genetic linkage analysis in order to incorporate germline mosaicism. We develop a likelihood ratio test for detecting whether a genetic trait has been introduced into a pedigree by germline mosaicism. We analyze the statistical properties of this test and evaluate its performance via computer simulations. We demonstrate that genetic linkage analysis has high power to identify linkage in the presence of germline mosaicism when our extended model is used. We further use this extended model to provide solid statistical evidence that the MDN syndrome studied by Genzer-Nir et al. has been introduced by germline mosaicism.
doi:10.2202/1544-6115.1709
PMCID: PMC3215430  PMID: 23089820
linkage analysis; germline mosaicism; statistical test; statistical model
25.  Choice of Summary Statistic Weights in Approximate Bayesian Computation 
In this paper, we develop a Genetic Algorithm that can address the fundamental problem of how one should weight the summary statistics included in an approximate Bayesian computation analysis built around an accept/reject algorithm, and how one might choose the tolerance for that analysis. We then demonstrate that using weighted statistics, and a well-chosen tolerance, in such an approximate Bayesian computation approach can result in improved performance, when compared to unweighted analyses, using one example drawn purely from statistics and two drawn from the estimation of population genetics parameters.
doi:10.2202/1544-6115.1586
PMCID: PMC3192002  PMID: 23089822
approximate Bayesian computation; genetic algorithms; summary statistics
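The object being tuned can be made concrete with a generic weighted accept/reject ABC sketch; here the weights and tolerance are taken as given, whereas the paper's genetic algorithm would choose them. The toy model at the end is an assumption for demonstration.

```python
import numpy as np

def abc_weighted(observed_stats, prior_draw, simulate, summaries,
                 weights, tol, n_sim=20_000, seed=0):
    """Accept/reject ABC with weighted summary statistics: a parameter
    draw is accepted when the weighted Euclidean distance between
    simulated and observed summaries falls below `tol`.
    """
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_sim):
        theta = prior_draw(rng)
        s = summaries(simulate(theta, rng))
        dist = np.sqrt(np.sum(weights * (s - observed_stats) ** 2))
        if dist < tol:
            accepted.append(theta)
    return np.array(accepted)

# Toy usage: infer a normal mean from (mean, sd) summaries, weighting the mean more.
obs = np.array([0.3, 1.0])
post = abc_weighted(obs,
                    prior_draw=lambda rng: rng.uniform(-2, 2),
                    simulate=lambda th, rng: rng.normal(th, 1.0, size=50),
                    summaries=lambda x: np.array([x.mean(), x.std()]),
                    weights=np.array([1.0, 0.2]), tol=0.15)
print(post.mean(), post.size)
```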
