This paper details the design, evaluation, and implementation of a framework for detecting and modeling nonlinearity between a binary outcome and a continuous predictor variable adjusted for covariates in complex samples. The framework provides familiar-looking parameterizations of output in terms of linear slope coefficients and odds ratios. Estimation methods focus on maximum likelihood optimization of piecewise linear free-knot splines formulated as B-splines. Correctly specifying the optimal number and positions of the knots improves the model, but is marked by computational intensity and numerical instability. Our inference methods utilize both parametric and nonparametric bootstrapping. Unlike other nonlinear modeling packages, this framework is designed to incorporate multistage survey sample designs common to nationally representative datasets. We illustrate the approach and evaluate its performance in specifying the correct number of knots under various conditions with an example using body mass index (BMI; kg/m2) and the complex multi-stage sampling design from the Third National Health and Nutrition Examination Survey to simulate binary mortality outcomes data having realistic nonlinear sample-weighted risk associations with BMI. BMI and mortality data provide a particularly apt example and area of application since BMI is commonly recorded in large health surveys with complex designs, often categorized for modeling, and nonlinearly related to mortality. When complex sample design considerations were ignored, our method was generally similar to or more accurate than two common model selection procedures, Schwarz’s Bayesian Information Criterion (BIC) and Akaike’s Information Criterion (AIC), in terms of correctly selecting the correct number of knots. Our approach provided accurate knot selections when complex sampling weights were incorporated, while AIC and BIC were not effective under these conditions.
Free-knot splines; nonlinear modeling; logistic regression; bootstrap; complex samples; body mass index
Testing for marginal associations between numerous genetic variants and disease may miss complex relationships among variables (e.g., gene-gene interactions). Bayesian approaches can model multiple variables together and offer advantages over conventional model building strategies, including using existing biological evidence as modeling priors and acknowledging that many models may fit the data well. With many candidate variables, Bayesian approaches to variable selection rely on algorithms to approximate the posterior distribution of models, such as Markov-Chain Monte Carlo (MCMC). Unfortunately, MCMC is difficult to parallelize and requires many iterations to adequately sample the posterior. We introduce a scalable algorithm called PEAK that improves the efficiency of MCMC by dividing a large set of variables into related groups using a rooted graph that resembles a mountain peak. Our algorithm takes advantage of parallel computing and existing biological databases when available.
By using graphs to manage a model space with more than 500,000 candidate variables, we were able to improve MCMC efficiency and uncover the true simulated causal variables, including a gene-gene interaction. We applied PEAK to a case-control study of childhood asthma with 2,521 genetic variants. We used an informative graph for oxidative stress derived from Gene Ontology and identified several variants in ERBB4, OXR1, and BCL2 with strong evidence for associations with childhood asthma.
We introduced an extremely flexible analysis framework capable of efficiently performing Bayesian variable selection on many candidate variables. The PEAK algorithm can be provided with an informative graph, which can be advantageous when considering gene-gene interactions, or a symmetric graph, which simply divides the model space into manageable regions. The PEAK framework is compatible with various model forms, allowing for the algorithm to be configured for different study designs and applications, such as pathway or rare-variant analyses, by simple modifications to the model likelihood and proposal functions.
Autoregressive regression coefficients for Anopheles arabiensis aquatic habitat models are usually assessed using global error techniques and are reported as error covariance matrices. A global statistic, however, will summarize error estimates from multiple habitat locations. This makes it difficult to identify where there are clusters of An. arabiensis aquatic habitats of acceptable prediction. It is therefore useful to conduct some form of spatial error analysis to detect clusters of An. arabiensis aquatic habitats based on uncertainty residuals from individual sampled habitats. In this research, a method of error estimation for spatial simulation models was demonstrated using autocorrelation indices and eigenfunction spatial filters to distinguish among the effects of parameter uncertainty on a stochastic simulation of ecological sampled Anopheles aquatic habitat covariates. A test for diagnostic checking error residuals in an An. arabiensis aquatic habitat model may enable intervention efforts targeting productive habitats clusters, based on larval/pupal productivity, by using the asymptotic distribution of parameter estimates from a residual autocovariance matrix. The models considered in this research extends a normal regression analysis previously considered in the literature.
Field and remote-sampled data were collected during July 2006 to December 2007 in Karima rice-village complex in Mwea, Kenya. SAS 9.1.4® was used to explore univariate statistics, correlations, distributions, and to generate global autocorrelation statistics from the ecological sampled datasets. A local autocorrelation index was also generated using spatial covariance parameters (i.e., Moran's Indices) in a SAS/GIS® database. The Moran's statistic was decomposed into orthogonal and uncorrelated synthetic map pattern components using a Poisson model with a gamma-distributed mean (i.e. negative binomial regression). The eigenfunction values from the spatial configuration matrices were then used to define expectations for prior distributions using a Markov chain Monte Carlo (MCMC) algorithm. A set of posterior means were defined in WinBUGS 1.4.3®. After the model had converged, samples from the conditional distributions were used to summarize the posterior distribution of the parameters. Thereafter, a spatial residual trend analyses was used to evaluate variance uncertainty propagation in the model using an autocovariance error matrix.
By specifying coefficient estimates in a Bayesian framework, the covariate number of tillers was found to be a significant predictor, positively associated with An. arabiensis aquatic habitats. The spatial filter models accounted for approximately 19% redundant locational information in the ecological sampled An. arabiensis aquatic habitat data. In the residual error estimation model there was significant positive autocorrelation (i.e., clustering of habitats in geographic space) based on log-transformed larval/pupal data and the sampled covariate depth of habitat.
An autocorrelation error covariance matrix and a spatial filter analyses can prioritize mosquito control strategies by providing a computationally attractive and feasible description of variance uncertainty estimates for correctly identifying clusters of prolific An. arabiensis aquatic habitats based on larval/pupal productivity.
In this work, we propose penalized spline based methods for functional mixed effects models with varying coefficients. We decompose longitudinal outcomes as a sum of several terms: a population mean function, covariates with time-varying coefficients, functional subject-specific random effects and residual measurement error processes. Using penalized splines, we propose nonparametric estimation of the population mean function, varying-coefficient, random subject-specific curves and the associated covariance function which represents between-subject variation and the variance function of the residual measurement errors which represents within-subject variation. Proposed methods offer flexible estimation of both the population-level and subject-level curves. In addition, decomposing variability of the outcomes as a between-subject and a within-subject source is useful in identifying the dominant variance component therefore optimally model a covariance function. We use a likelihood based method to select multiple smoothing parameters. Furthermore, we study the asymptotics of the baseline P-spline estimator with longitudinal data. We conduct simulation studies to investigate performance of the proposed methods. The benefit of the between- and within-subject covariance decomposition is illustrated through an analysis of Berkeley growth data where we identified clearly distinct patterns of the between- and within-subject covariance functions of children's heights. We also apply the proposed methods to estimate the effect of anti-hypertensive treatment from the Framingham Heart Study data.
Multi-level functional data; Functional random effects; Semiparametric longitudinal data analysis
When introduced into a novel environment, mammals establish in it a preferred place marked by the highest number of visits and highest cumulative time spent in it. Examination of exploratory behavior in reference to this “home base” highlights important features of its organization. It might therefore be fruitful to search for other types of marked places in mouse exploratory behavior and examine their influence on overall behavior.
Examination of path curvatures of mice exploring a large empty arena revealed the presence of circumscribed locales marked by the performance of tortuous paths full of twists and turns. We term these places knots, and the behavior performed in them—knot-scribbling. There is typically no more than one knot per session; it has distinct boundaries and it is maintained both within and across sessions. Knots are mostly situated in the place of introduction into the arena, here away from walls. Knots are not characterized by the features of a home base, except for a high speed during inbound and a low speed during outbound paths. The establishment of knots is enhanced by injecting the mouse with saline and placing it in an exposed portion of the arena, suggesting that stress and the arousal associated with it consolidate a long-term contingency between a particular locale and knot-scribbling.
In an environment devoid of proximal cues mice mark a locale associated with arousal by twisting and turning in it. This creates a self-generated, often centrally located landmark. The tortuosity of the path traced during the behavior implies almost concurrent multiple views of the environment. Knot-scribbling could therefore function as a way to obtain an overview of the entire environment, allowing re-calibration of the mouse's locale map and compass directions. The rich vestibular input generated by scribbling could improve the interpretation of the visual scene.
Exploration is a central component of human and animal behavior that has been studied in rodents for almost a century. It is presently one of the main models for studying the interface between behavior, genetics, drugs, and the brain. Until recently the exploration of an open field by rodents has been considered to be largely stochastic. Lately, this behavior is being gradually deciphered, revealing reference places called home bases, from which the animals perform roundtrips into the environment, tracing well-trodden paths whose features contribute to our understanding of navigation, locational memory, cognition-, and emotion-related behavior. Using advanced computational tools we discover so-called knots, preferred places visited sporadically by mice. Mice perform in these places twists and turns. The measurement of speed on the way in and out of knots reveals that they are attractive for the mice. Knot formation is enhanced by stress, suggesting that stress-related arousal assigns these locales with a special significance that is reinstated by subsequent visits to them. The twists and turns could provide the mouse with multiple views that turn knots into navigational landmarks as well as with rich vestibular input that might improve the perception and subsequent interpretation of the visual input.
Aims: To illustrate the contribution of smoothing methods to modelling exposure-response data, Cox models with penalised splines were used to reanalyse lung cancer risk in a cohort of workers exposed to silica in California's diatomaceous earth industry. To encourage application of this approach, computer code is provided.
Methods: Relying on graphic plots of hazard ratios as smooth functions of exposure, the sensitivity of the curve to amount of smoothing, length of the exposure lag, and the influence of the highest exposures was evaluated. Trimming and data transformations were used to down-weight influential observations.
Results: The estimated hazard ratio increased steeply with cumulative silica exposure before flattening and then declining over the sparser regions of exposure. The curve was sensitive to changes in degrees of freedom, but insensitive to the number or location of knots. As the length of lag increased, so did the maximum hazard ratio, but the shape was similar. Deleting the two highest exposed subjects eliminated the top half of the range and allowed the hazard ratio to continue to rise. The shape of the splines suggested a parametric model with log hazard as a linear function of log transformed exposure would fit well.
Conclusions: This flexible statistical approach reduces the dependence on a priori assumptions, while pointing to a suitable parametric model if one exists. In the absence of an appropriate parametric form, however, splines can provide exposure-response information useful for aetiological research and public health intervention.
The assumption of proportional hazards (PH) fundamental to the Cox PH model sometimes may not hold in practice. In this paper, we propose a generalization of the Cox PH model in terms of the cumulative hazard function taking a form similar to the Cox PH model, with the extension that the baseline cumulative hazard function is raised to a power function. Our model allows for interaction between covariates and the baseline hazard and it also includes, for the two sample problem, the case of two Weibull distributions and two extreme value distributions differing in both scale and shape parameters. The partial likelihood approach can not be applied here to estimate the model parameters. We use the full likelihood approach via a cubic B-spline approximation for the baseline hazard to estimate the model parameters. A semi-automatic procedure for knot selection based on Akaike’s Information Criterion is developed. We illustrate the applicability of our approach using real-life data.
censored survival data analysis; crossing hazards; Frailty model; maximum likelihood; regression; spline function; Akaike information criterion; Weibull distribution; extreme value distribution
Longitudinal data are routinely collected in biomedical research studies. A natural model describing longitudinal data decomposes an individual’s outcome as the sum of a population mean function and random subject-specific deviations. When parametric assumptions are too restrictive, methods modeling the population mean function and the random subject-specific functions nonparametrically are in demand. In some applications, it is desirable to estimate a covariance function of random subject-specific deviations. In this work, flexible yet computationally efficient methods are developed for a general class of semiparametric mixed effects models, where the functional forms of the population mean and the subject-specific curves are unspecified. We estimate nonparametric components of the model by penalized spline (P-spline, ), and reparametrize the random curve covariance function by a modified Cholesky decomposition  which allows for unconstrained estimation of a positive semidefinite matrix. To provide smooth estimates, we penalize roughness of fitted curves and derive closed form solutions in the maximization step of an EM algorithm. In addition, we present models and methods for longitudinal family data where subjects in a family are correlated and we decompose the covariance function into a subject-level source and observation-level source. We apply these methods to the multi-level Framingham Heart Study data to estimate age-specific heritability of systolic blood pressure (SBP) nonparametrically.
Multi-level functional data; Cholesky decomposition; Age-specific heritability; Framingham Heart Study
Multilevel functional data is collected in many biomedical studies. For example, in a study of the effect of Nimodipine on patients with subarachnoid hemorrhage (SAH), patients underwent multiple 4-hour treatment cycles. Within each treatment cycle, subjects’ vital signs were reported every 10 minutes. This data has a natural multilevel structure with treatment cycles nested within subjects and measurements nested within cycles. Most literature on nonparametric analysis of such multilevel functional data focus on conditional approaches using functional mixed effects models. However, parameters obtained from the conditional models do not have direct interpretations as population average effects. When population effects are of interest, we may employ marginal regression models. In this work, we propose marginal approaches to fit multilevel functional data through penalized spline generalized estimating equation (penalized spline GEE). The procedure is effective for modeling multilevel correlated generalized outcomes as well as continuous outcomes without suffering from numerical difficulties. We provide a variance estimator robust to misspecification of correlation structure. We investigate the large sample properties of the penalized spline GEE estimator with multilevel continuous data and show that the asymptotics falls into two categories. In the small knots scenario, the estimated mean function is asymptotically efficient when the true correlation function is used and the asymptotic bias does not depend on the working correlation matrix. In the large knots scenario, both the asymptotic bias and variance depend on the working correlation. We propose a new method to select the smoothing parameter for penalized spline GEE based on an estimate of the asymptotic mean squared error (MSE). We conduct extensive simulation studies to examine property of the proposed estimator under different correlation structures and sensitivity of the variance estimation to the choice of smoothing parameter. Finally, we apply the methods to the SAH study to evaluate a recent debate on discontinuing the use of Nimodipine in the clinical community.
Penalized spline; GEE; Semiparametric models; Longitudinal data; Functional data
Regression on the basis function of B-splines has been advocated as an alternative to orthogonal polynomials in random regression analyses. Basic theory of splines in mixed model analyses is reviewed, and estimates from analyses of weights of Australian Angus cattle from birth to 820 days of age are presented. Data comprised 84 533 records on 20 731 animals in 43 herds, with a high proportion of animals with 4 or more weights recorded. Changes in weights with age were modelled through B-splines of age at recording. A total of thirteen analyses, considering different combinations of linear, quadratic and cubic B-splines and up to six knots, were carried out. Results showed good agreement for all ages with many records, but fluctuated where data were sparse. On the whole, analyses using B-splines appeared more robust against "end-of-range" problems and yielded more consistent and accurate estimates of the first eigenfunctions than previous, polynomial analyses. A model fitting quadratic B-splines, with knots at 0, 200, 400, 600 and 821 days and a total of 91 covariance components, appeared to be a good compromise between detailedness of the model, number of parameters to be estimated, plausibility of results, and fit, measured as residual mean square error.
covariance function; growth; beef cattle; random regression; B-splines
Malaria is a major public health issue in Burundi in terms of both morbidity and mortality, with around 2.5 million clinical cases and more than 15,000 deaths each year. It is still the single main cause of mortality in pregnant women and children below five years of age. Because of the severe health and economic burden of malaria, there is still a growing need for methods that will help to understand the influencing factors. Several studies/researches have been done on the subject yielding different results as which factors are most responsible for the increase in malaria transmission. This paper considers the modelling of the dependence of malaria cases on spatial determinants and climatic covariates including rainfall, temperature and humidity in Burundi.
The analysis carried out in this work exploits real monthly data collected in the area of Burundi over 12 years (1996-2007). Semi-parametric regression models are used. The spatial analysis is based on a geo-additive model using provinces as the geographic units of study. The spatial effect is split into structured (correlated) and unstructured (uncorrelated) components. Inference is fully Bayesian and uses Markov chain Monte Carlo techniques. The effects of the continuous covariates are modelled by cubic p-splines with 20 equidistant knots and second order random walk penalty. For the spatially correlated effect, Markov random field prior is chosen. The spatially uncorrelated effects are assumed to be i.i.d. Gaussian. The effects of climatic covariates and the effects of other spatial determinants are estimated simultaneously in a unified regression framework.
The results obtained from the proposed model suggest that although malaria incidence in a given month is strongly positively associated with the minimum temperature of the previous months, regional patterns of malaria that are related to factors other than climatic variables have been identified, without being able to explain them.
In this paper, semiparametric models are used to model the effects of both climatic covariates and spatial effects on malaria distribution in Burundi. The results obtained from the proposed models suggest a strong positive association between malaria incidence in a given month and the minimum temperature of the previous month. From the spatial effects, important spatial patterns of malaria that are related to factors other than climatic variables are identified. Potential explanations (factors) could be related to socio-economic conditions, food shortage, limited access to health care service, precarious housing, promiscuity, poor hygienic conditions, limited access to drinking water, land use (rice paddies for example), displacement of the population (due to armed conflicts).
The folding pathway and rate coefficients of the folding of a knotted protein are calculated for a potential energy function with minimal energetic frustration. A kinetic transition network is constructed using the discrete path sampling approach, and the resulting potential energy surface is visualized by constructing disconnectivity graphs. Owing to topological constraints, the low-lying portion of the landscape consists of three distinct regions, corresponding to the native knotted state and to configurations where either the N or C terminus is not yet folded into the knot. The fastest folding pathways from denatured states exhibit early formation of the N terminus portion of the knot and a rate-determining step where the C terminus is incorporated. The low-lying minima with the N terminus knotted and the C terminus free therefore constitute an off-pathway intermediate for this model. The insertion of both the N and C termini into the knot occurs late in the folding process, creating large energy barriers that are the rate limiting steps in the folding process. When compared to other protein folding proteins of a similar length, this system folds over six orders of magnitude more slowly.
Proteins are chains, which must fold into a compact structure for the molecule to perform its biological function. There are a large number of ways the molecule can move into this final shape. Proteins have evolved sequences that perform this difficult task by having strong biases toward the final shape, while not getting stuck in different structures along the way. One way proteins can be trapped is by forming a knot in the chain. For the most part, proteins are remarkable in avoiding knotting. However, in order to function a few proteins form knots. We show how a model protein is able to knot itself, and estimate how fast this process occurs. Our goal is to treat a small and uncomplicated protein to estimate the fastest rate possible for the folding of a knotted protein. This rate is interesting when compared to the speed of folding of other proteins. We have visualized how the molecule changes shape to its functional position, and examined other paths the molecule may take.
The functional coefficient regression models assume that the regression coefficients vary with some “threshold” variable, providing appreciable flexibility in capturing the underlying dynamics in data and avoiding the so-called “curse of dimensionality” in multivariate nonparametric estimation. We first investigate the estimation, inference, and forecasting for the functional coefficient regression models with dependent observations via penalized splines. The P-spline approach, as a direct ridge regression shrinkage type global smoothing method, is computationally efficient and stable. With established fixed-knot asymptotics, inference is readily available. Exact inference can be obtained for fixed smoothing parameter λ, which is most appealing for finite samples. Our penalized spline approach gives an explicit model expression, which also enables multi-step-ahead forecasting via simulations. Furthermore, we examine different methods of choosing the important smoothing parameter λ: modified multi-fold cross-validation (MCV), generalized cross-validation (GCV), and an extension of empirical bias bandwidth selection (EBBS) to P-splines. In addition, we implement smoothing parameter selection using mixed model framework through restricted maximum likelihood (REML) for P-spline functional coefficient regression models with independent observations. The P-spline approach also easily allows different smoothness for different functional coefficients, which is enabled by assigning different penalty λ accordingly. We demonstrate the proposed approach by both simulation examples and a real data application.
Most existing likelihood-based methods for fitting historical demographic models to DNA sequence polymorphism data to do not scale feasibly up to the level of whole-genome data sets. Computational economies can be achieved by incorporating two forms of pseudo-likelihood: composite and approximate likelihood methods. Composite likelihood enables scaling up to large data sets because it takes the product of marginal likelihoods as an estimator of the likelihood of the complete data set. This approach is especially useful when a large number of genomic regions constitutes the data set. Additionally, approximate likelihood methods can reduce the dimensionality of the data by summarizing the information in the original data by either a sufficient statistic, or a set of statistics. Both composite and approximate likelihood methods hold promise for analyzing large data sets or for use in situations where the underlying demographic model is complex and has many parameters. This paper considers a simple demographic model of allopatric divergence between two populations, in which one of the population is hypothesized to have experienced a founder event, or population bottleneck. A large resequencing data set from human populations is summarized by the joint frequency spectrum, which is a matrix of the genomic frequency spectrum of derived base frequencies in two populations. A Bayesian Metropolis-coupled Markov chain Monte Carlo (MCMCMC) method for parameter estimation is developed that uses both composite and likelihood methods and is applied to the three different pairwise combinations of the human population resequence data. The accuracy of the method is also tested on data sets sampled from a simulated population model with known parameters.
The Bayesian MCMCMC method also estimates the ratio of effective population size for the X chromosome versus that of the autosomes. The method is shown to estimate, with reasonable accuracy, demographic parameters from three simulated data sets that vary in the magnitude of a founder event and a skew in the effective population size of the X chromosome relative to the autosomes. The behavior of the Markov chain is also examined and shown to convergence to its stationary distribution, while also showing high levels of parameter mixing. The analysis of three pairwise comparisons of sub-Saharan African human populations with non-African human populations do not provide unequivocal support for a strong non-African founder event from these nuclear data. The estimates do however suggest a skew in the ratio of X chromosome to autosome effective population size that is greater than one. However in all three cases, the 95% highest posterior density interval for this ratio does include three-fourths, the value expected under an equal breeding sex ratio.
The implementation of composite and approximate likelihood methods in a framework that includes MCMCMC demographic parameter estimation shows great promise for being flexible and computationally efficient enough to scale up to the level of whole-genome polymorphism and divergence analysis. Further work must be done to characterize the effects of the assumption of linkage equilibrium among genomic regions that is crucial to the validity of applying the composite likelihood method.
Many phenomena of fundamental importance to biology and biomedicine arise as a dynamic curve, such as organ growth and HIV dynamics. The genetic mapping of these traits is challenged by longitudinal variables measured at irregular and possibly subject-specific time points, in which case nonnegative definiteness of the estimated covariance matrix needs to be guaranteed. We present a semiparametric approach for genetic mapping within the mixture-model setting by jointly modeling mean and covariance structures for irregular longitudinal data. Penalized spline is used to model the mean functions of individual QTL genotypes as latent variables while an extended generalized linear model is used to approximate the covariance matrix. The parameters for modeling the mean-covariances are estimated by MCMC, using Gibbs sampler and Metropolis Hastings algorithm. We derive the full conditional distributions for the mean and covariance parameters and compute Bayes factors to test the hypothesis about the existence of significant QTLs. The model was used to screen the existence of specific QTLs for age-specific change of body mass index with a sparse longitudinal dataset. The new model provides powerful means for broadening the application of genetic mapping to reveal the genetic control of dynamic traits.
Cholesky decomposition; genetic mapping; MCMC; penalized spline; quantitative trait loci
Accurate and fast estimation of genetic parameters that underlie quantitative traits using mixed linear models with additive and dominance effects is of great importance in both natural and breeding populations. Here, we propose a new fast adaptive Markov chain Monte Carlo (MCMC) sampling algorithm for the estimation of genetic parameters in the linear mixed model with several random effects. In the learning phase of our algorithm, we use the hybrid Gibbs sampler to learn the covariance structure of the variance components. In the second phase of the algorithm, we use this covariance structure to formulate an effective proposal distribution for a Metropolis-Hastings algorithm, which uses a likelihood function in which the random effects have been integrated out. Compared with the hybrid Gibbs sampler, the new algorithm had better mixing properties and was approximately twice as fast to run. Our new algorithm was able to detect different modes in the posterior distribution. In addition, the posterior mode estimates from the adaptive MCMC method were close to the REML (residual maximum likelihood) estimates. Moreover, our exponential prior for inverse variance components was vague and enabled the estimated mode of the posterior variance to be practically zero, which was in agreement with the support from the likelihood (in the case of no dominance). The method performance is illustrated using simulated data sets with replicates and field data in barley.
adaptive MCMC; identifiability problem; Bayesian analysis; Gibbs sampling; estimation of genetic parameters
Stochastic simulations of coarse-grained protein models are used to investigate the propensity to form knots in early stages of protein folding. The study is carried out comparatively for two homologous carbamoyltransferases, a natively-knotted N-acetylornithine carbamoyltransferase (AOTCase) and an unknotted ornithine carbamoyltransferase (OTCase). In addition, two different sets of pairwise amino acid interactions are considered: one promoting exclusively native interactions, and the other additionally including non-native quasi-chemical and electrostatic interactions. With the former model neither protein shows a propensity to form knots. With the additional non-native interactions, knotting propensity remains negligible for the natively-unknotted OTCase while for AOTCase it is much enhanced. Analysis of the trajectories suggests that the different entanglement of the two transcarbamylases follows from the tendency of the C-terminal to point away from (for OTCase) or approach and eventually thread (for AOTCase) other regions of partly-folded protein. The analysis of the OTCase/AOTCase pair clarifies that natively-knotted proteins can spontaneously knot during early folding stages and that non-native sequence-dependent interactions are important for promoting and disfavouring early knotting events.
Knotted proteins provide an ideal ground for examining how amino acid interactions (which are local) can favor their folding into a native state of non-trivial topology (which is a global property). Some of the mechanisms that can aid knot formation are investigated here by comparing coarse-grained folding simulations of two enzymes that are structurally similar, and yet have natively knotted and unknotted states, respectively. In folding simulations that exclusively promote the formation of native contacts, neither protein forms knots. Strikingly, when sequence-dependent non-native interactions between amino acids are introduced, one observes knotting events but only for the natively-knotted protein. The results support the importance of non-native interactions in favoring or disfavoring knotting events in the early stages of folding.
We wanted to compare growth differences between 13 Escherichia coli strains exposed to various concentrations of the growth inhibitor lactoferrin in two different types of broth (Syncase and Luria-Bertani (LB)). To carry this out, we present a simple statistical procedure that separates microbial growth curves that are due to natural random perturbations and growth curves that are more likely caused by biological differences.
Bacterial growth was determined using optical density data (OD) recorded for triplicates at 620 nm for 18 hours for each strain. Each resulting growth curve was divided into three equally spaced intervals. We propose a procedure using linear spline regression with two knots to compute the slopes of each interval in the bacterial growth curves. These slopes are subsequently used to estimate a 95% confidence interval based on an appropriate statistical distribution. Slopes outside the confidence interval were considered as significantly different from slopes within. We also demonstrate the use of related, but more advanced methods known collectively as generalized additive models (GAMs) to model growth. In addition to impressive curve fitting capabilities with corresponding confidence intervals, GAM’s allow for the computation of derivatives, i.e. growth rate estimation, with respect to each time point.
The results from our proposed procedure agreed well with the observed data. The results indicated that there were substantial growth differences between the E. coli strains. Most strains exhibited improved growth in the nutrient rich LB broth compared to Syncase. The inhibiting effect of lactoferrin varied between the different strains. The atypical enteropathogenic aEPEC-2 grew, on average, faster in both broths than the other strains tested while the enteroinvasive strains, EIEC-6 and EIEC-7 grew slower. The enterotoxigenic ETEC-5 strain, exhibited exceptional growth in Syncase broth, but slower growth in LB broth.
Our results do not indicate clear growth differences between pathogroups or pathogenic versus non-pathogenic E. coli.
Escherichia coli; Growth curves; Spline regression
Gene duplication with subsequent interaction divergence is one of the primary driving forces in the evolution of genetic systems. Yet little is known about the precise mechanisms and the role of duplication divergence in the evolution of protein networks from the prokaryote and eukaryote domains. We developed a novel, model-based approach for Bayesian inference on biological network data that centres on approximate Bayesian computation, or likelihood-free inference. Instead of computing the intractable likelihood of the protein network topology, our method summarizes key features of the network and, based on these, uses a MCMC algorithm to approximate the posterior distribution of the model parameters. This allowed us to reliably fit a flexible mixture model that captures hallmarks of evolution by gene duplication and subfunctionalization to protein interaction network data of Helicobacter pylori and Plasmodium falciparum. The 80% credible intervals for the duplication–divergence component are [0.64, 0.98] for H. pylori and [0.87, 0.99] for P. falciparum. The remaining parameter estimates are not inconsistent with sequence data. An extensive sensitivity analysis showed that incompleteness of PIN data does not largely affect the analysis of models of protein network evolution, and that the degree sequence alone barely captures the evolutionary footprints of protein networks relative to other statistics. Our likelihood-free inference approach enables a fully Bayesian analysis of a complex and highly stochastic system that is otherwise intractable at present. Modelling the evolutionary history of PIN data, it transpires that only the simultaneous analysis of several global aspects of protein networks enables credible and consistent inference to be made from available datasets. Our results indicate that gene duplication has played a larger part in the network evolution of the eukaryote than in the prokaryote, and suggests that single gene duplications with immediate divergence alone may explain more than 60% of biological network data in both domains.
The importance of gene duplication to biological evolution has been recognized since the 1930s. For more than a decade, substantial evidence has been collected from genomic sequence data in order to elucidate the importance and the mechanisms of gene duplication; however, most biological characteristics arise from complex interactions between the cell's numerous constituents. Recently, preliminary descriptions of the protein interaction networks have become available for species of different domains. Adapting novel techniques in stochastic simulation, the authors demonstrate that evolutionary inferences can be drawn from large-scale, incomplete network data by fitting a stochastic model of network growth that captures hallmarks of evolution by duplication and divergence. They have also analyzed the effect of summarizing protein networks in different ways, and show that a reliable and consistent analysis requires many aspects of network data to be considered jointly; in contrast to what is commonly done in practice. Their results indicate that duplication and divergence has played a larger role in the network evolution of the eukaryote P. falciparum than in the prokaryote H. pylori, and emphasize at least for the eukaryote the potential importance of subfunctionalization in network evolution.
The estimation of demographic parameters from genetic data often requires the computation of likelihoods. However, the likelihood function is computationally intractable for many realistic evolutionary models, and the use of Bayesian inference has therefore been limited to very simple models. The situation changed recently with the advent of Approximate Bayesian Computation (ABC) algorithms allowing one to obtain parameter posterior distributions based on simulations not requiring likelihood computations.
Here we present ABCtoolbox, a series of open source programs to perform Approximate Bayesian Computations (ABC). It implements various ABC algorithms including rejection sampling, MCMC without likelihood, a Particle-based sampler and ABC-GLM. ABCtoolbox is bundled with, but not limited to, a program that allows parameter inference in a population genetics context and the simultaneous use of different types of markers with different ploidy levels. In addition, ABCtoolbox can also interact with most simulation and summary statistics computation programs. The usability of the ABCtoolbox is demonstrated by inferring the evolutionary history of two evolutionary lineages of Microtus arvalis. Using nuclear microsatellites and mitochondrial sequence data in the same estimation procedure enabled us to infer sex-specific population sizes and migration rates and to find that males show smaller population sizes but much higher levels of migration than females.
ABCtoolbox allows a user to perform all the necessary steps of a full ABC analysis, from parameter sampling from prior distributions, data simulations, computation of summary statistics, estimation of posterior distributions, model choice, validation of the estimation procedure, and visualization of the results.
We consider a random effects quantile regression analysis of clustered data and propose a semiparametric approach using empirical likelihood. The random regression coefficients are assumed independent with a common mean, following parametrically specified distributions. The common mean corresponds to the population-average effects of explanatory variables on the conditional quantile of interest, while the random coefficients represent cluster specific deviations in the covariate effects. We formulate the estimation of the random coefficients as an estimating equations problem and use empirical likelihood to incorporate the parametric likelihood of the random coefficients. A likelihood-like statistical criterion function is yield, which we show is asymptotically concave in a neighborhood of the true parameter value and motivates its maximizer as a natural estimator. We use Markov Chain Monte Carlo (MCMC) samplers in the Bayesian framework, and propose the resulting quasi-posterior mean as an estimator. We show that the proposed estimator of the population-level parameter is asymptotically normal and the estimators of the random coefficients are shrunk toward the population-level parameter in the first order asymptotic sense. These asymptotic results do not require Gaussian random effects, and the empirical likelihood based likelihood-like criterion function is free of parameters related to the error densities. This makes the proposed approach both flexible and computationally simple. We illustrate the methodology with two real data examples.
Empirical likelihood; Markov Chain Monte Carlo; Quasi-posterior distribution
It was predicted recently that sufficiently complex knots on a linear wormlike chain can have a metastable size, preventing their spontaneous expansion. We tested this prediction via computer simulations for 71 and 10151 knots. We calculated the equilibrium distributions of knot size S for both knots. By using the umbrella sampling we were able to obtain the distributions over a wide range of S values. The distributions were converted into the dependencies of knot free energy on S. The obtained free energy profiles have no pronounced local minima, so there are no metastable knot sizes for these knots. We also performed Brownian dynamics simulation of 71 knot relaxation that started from a very tight knot conformation. The simulation showed that knot expansion is a fast process compared to knot displacement along the chain contour by diffusion.
knots in polymers; knot relaxation; Brownian dynamics of polymer
Fitting spline curves to data points is a very important issue in many applied fields. It is also challenging, because these curves typically depend on many continuous variables in a highly interrelated nonlinear way. In general, it is not possible to compute these parameters analytically, so the problem is formulated as a continuous nonlinear optimization problem, for which traditional optimization techniques usually fail. This paper presents a new bioinspired method to tackle this issue. In this method, optimization is performed through a combination of two techniques. Firstly, we apply the indirect approach to the knots, in which they are not initially the subject of optimization but precomputed with a coarse approximation scheme. Secondly, a powerful bioinspired metaheuristic technique, the firefly algorithm, is applied to optimization of data parameterization; then, the knot vector is refined by using De Boor's method, thus yielding a better approximation to the optimal knot vector. This scheme converts the original nonlinear continuous optimization problem into a convex optimization problem, solved by singular value decomposition. Our method is applied to some illustrative real-world examples from the CAD/CAM field. Our experimental results show that the proposed scheme can solve the original continuous nonlinear optimization problem very efficiently.
The patterning cascade model of tooth morphogenesis accounts for shape development through the interaction of a small number of genes. In the model, gene expression both directs development and is controlled by the shape of developing teeth. Enamel knots (zones of nonproliferating epithelium) mark the future sites of cusps. In order to form, a new enamel knot must escape the inhibitory fields surrounding other enamel knots before crown components become spatially fixed as morphogenesis ceases. Because cusp location on a fully formed tooth reflects enamel knot placement and tooth size is limited by the cessation of morphogenesis, the model predicts that cusp expression varies with intercusp spacing relative to tooth size. Although previous studies in humans have supported the model's implications, here we directly test the model's predictions for the expression, size, and symmetry of Carabelli cusp, a variation present in many human populations.
In a dental cast sample of upper first molars (M1s) (187 rights, 189 lefts, and 185 antimeric pairs), we measured tooth area and intercusp distances with a Hirox digital microscope. We assessed Carabelli expression quantitatively as an area in a subsample and qualitatively using two typological schemes in the full sample. As predicted, low relative intercusp distance is associated with Carabelli expression in both right and left samples using either qualitative or quantitative measures. Furthermore, asymmetry in Carabelli area is associated with asymmetry in relative intercusp spacing.
These findings support the model's predictions for Carabelli cusp expression both across and within individuals. By comparing right-left pairs of the same individual, our data show that small variations in developmental timing or spacing of enamel knots can influence cusp pattern independently of genotype. Our findings suggest that during evolution new cusps may first appear as a result of small changes in the spacing of enamel knots relative to crown size.
A Bayesian alignment model (BAM) is proposed for alignment of liquid chromatography-mass spectrometry (LC-MS) data. BAM belongs to the category of profile-based approaches, which are composed of two major components: a prototype function and a set of mapping functions. Appropriate estimation of these functions is crucial for good alignment results. BAM uses Markov chain Monte Carlo (MCMC) methods to draw inference on the model parameters and improves on existing MCMC-based alignment methods through 1) the implementation of an efficient MCMC sampler and 2) an adaptive selection of knots. A block Metropolis-Hastings algorithm that mitigates the problem of the MCMC sampler getting stuck at local modes of the posterior distribution is used for the update of the mapping function coefficients. In addition, a stochastic search variable selection (SSVS) methodology is used to determine the number and positions of knots. We applied BAM to a simulated data set, an LC-MS proteomic data set, and two LC-MS metabolomic data sets, and compared its performance with the Bayesian hierarchical curve registration (BHCR) model, the dynamic time-warping (DTW) model, and the continuous profile model (CPM). The advantage of applying appropriate profile-based retention time correction prior to performing a feature-based approach is also demonstrated through the metabolomic data sets.
Alignment; Bayesian inference; block Metropolis-Hastings algorithm; liquid chromatography-mass spectrometry (LC-MS); Markov chain Monte Carlo (MCMC); stochastic search variable selection (SSVS)