Technological developments have increased the feasibility of large scale genetic association studies. Densely typed genetic markers are obtained using SNP arrays, next-generation sequencing technologies and imputation. However, SNPs typed using these methods can be highly correlated due to linkage disequilibrium among them, and standard multiple regression techniques fail with these data sets due to their high dimensionality and correlation structure. There has been increasing interest in using penalised regression in the analysis of high dimensional data. Ridge regression is one such penalised regression technique which does not perform variable selection, instead estimating a regression coefficient for each predictor variable. It is therefore desirable to obtain an estimate of the significance of each ridge regression coefficient.
We develop and evaluate a test of significance for ridge regression coefficients. Using simulation studies, we demonstrate that the performance of the test is comparable to that of a permutation test, with the advantage of a much-reduced computational cost. We introduce the p-value trace, a plot of the negative logarithm of the p-values of ridge regression coefficients with increasing shrinkage parameter, which enables the visualisation of the change in p-value of the regression coefficients with increasing penalisation. We apply the proposed method to a lung cancer case-control data set from EPIC, the European Prospective Investigation into Cancer and Nutrition.
The proposed test is a useful alternative to a permutation test for the estimation of the significance of ridge regression coefficients, at a much-reduced computational cost. The p-value trace is an informative graphical tool for evaluating the results of a test of significance of ridge regression coefficients as the shrinkage parameter increases, and the proposed test makes its production computationally feasible.
In this study we compared different statistical procedures for estimating SNP effects using the simulated data set from the XII QTL-MAS workshop. Five procedures were considered and tested in a reference population, i.e., the first four generations, from which phenotypes and genotypes were available. The procedures can be interpreted as variants of ridge regression, with different ways for defining the shrinkage parameter. Comparisons were made with respect to the correlation between genomic and conventional estimated breeding values. Moderate correlations were obtained from all methods. Two of them were used to predict genomic breeding values in the last three generations. Correlations between these and the true breeding values were also moderate. We concluded that the ridge regression procedures applied in this study did not outperform the simple use of a ratio of variances in a mixed model method, both providing moderate accuracies of predicted genomic breeding values.
The present study was conducted to predict survival time in patients with diffuse large B-cell lymphoma, DLBCL, based on microarray data using Cox regression model combined with seven dimension reduction methods. This historical cohort included 2042 gene expression measurements from 40 patients with DLBCL. In order to predict survival, a combination of Cox regression model was used with seven methods for dimension reduction or shrinkage including univariate selection, forward stepwise selection, principal component regression, supervised principal component regression, partial least squares regression, ridge regression and Losso. The capacity of predictions was examined by three different criteria including log rank test, prognostic index and deviance. MATLAB r2008a and RKWard software were used for data analysis. Based on our findings, performance of ridge regression was better than other methods. Based on ridge regression coefficients and a given cut point value, 16 genes were selected. By using forward stepwise selection method in Cox regression model, it was indicated that the expression of genes GENE3555X and GENE3807X decreased the survival time (P=0.008 and P=0.003, respectively), whereas the genes GENE3228X and GENE1551X increased survival time (P=0.002 and P<0.001, respectively). This study indicated that ridge regression method had higher capacity than other dimension reduction methods for the prediction of survival time in patients with DLBCL. Furthermore, a combination of statistical methods and microarray data could help to detect influential genes in survival.
Lymphoma; gene expression; microarray; survival analysis; dimension reduction; ridge regression
Graphical Gaussian models are popular tools for the estimation of (undirected) gene association networks from microarray data. A key issue when the number of variables greatly exceeds the number of samples is the estimation of the matrix of partial correlations. Since the (Moore-Penrose) inverse of the sample covariance matrix leads to poor estimates in this scenario, standard methods are inappropriate and adequate regularization techniques are needed. Popular approaches include biased estimates of the covariance matrix and high-dimensional regression schemes, such as the Lasso and Partial Least Squares.
In this article, we investigate a general framework for combining regularized regression methods with the estimation of Graphical Gaussian models. This framework includes various existing methods as well as two new approaches based on ridge regression and adaptive lasso, respectively. These methods are extensively compared both qualitatively and quantitatively within a simulation study and through an application to six diverse real data sets. In addition, all proposed algorithms are implemented in the R package "parcor", available from the R repository CRAN.
In our simulation studies, the investigated non-sparse regression methods, i.e. Ridge Regression and Partial Least Squares, exhibit rather conservative behavior when combined with (local) false discovery rate multiple testing in order to decide whether or not an edge is present in the network. For networks with higher densities, the difference in performance of the methods decreases. For sparse networks, we confirm the Lasso's well known tendency towards selecting too many edges, whereas the two-stage adaptive Lasso is an interesting alternative that provides sparser solutions. In our simulations, both sparse and non-sparse methods are able to reconstruct networks with cluster structures. On six real data sets, we also clearly distinguish the results obtained using the non-sparse methods and those obtained using the sparse methods where specification of the regularization parameter automatically means model selection. In five out of six data sets, Partial Least Squares selects very dense networks. Furthermore, for data that violate the assumption of uncorrelated observations (due to replications), the Lasso and the adaptive Lasso yield very complex structures, indicating that they might not be suited under these conditions. The shrinkage approach is more stable than the regression based approaches when using subsampling.
An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates p in the model greatly exceeds the sample size n (sometimes termed the “p > n” problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which p < n, and then extended to the p > n case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner’s g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived including existence of the prior and posterior moment generating functions, tail behavior, as well as connections to Gaussian priors and Jeffreys’ prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate its advantages over Gaussian, as well as g-priors, in high dimensional settings.
Fisher Information; g-prior; Importance sampling; Model identifiability; Prior elicitation
Motivation: Penalized regression methods have been adopted widely for high-dimensional feature selection and prediction in many bioinformatic and biostatistical contexts. While their theoretical properties are well-understood, specific methodology for their optimal application to genomic data has not been determined.
Results: Through simulation of contrasting scenarios of correlated high-dimensional survival data, we compared the LASSO, Ridge and Elastic Net penalties for prediction and variable selection. We found that a 2D tuning of the Elastic Net penalties was necessary to avoid mimicking the performance of LASSO or Ridge regression. Furthermore, we found that in a simulated scenario favoring the LASSO penalty, a univariate pre-filter made the Elastic Net behave more like Ridge regression, which was detrimental to prediction performance. We demonstrate the real-life application of these methods to predicting the survival of cancer patients from microarray data, and to classification of obese and lean individuals from metagenomic data. Based on these results, we provide an optimized set of guidelines for the application of penalized regression for reproducible class comparison and prediction with genomic data.
Availability and Implementation: A parallelized implementation of the methods presented for regression and for simulation of synthetic data is provided as the pensim R package, available at http://cran.r-project.org/web/packages/pensim/index.html.
Contact: firstname.lastname@example.org; email@example.com
Supplementary Information: Supplementary data are available at Bioinformatics online.
The Cartesian sampled three-dimensional HNCO experiment is inherently limited in time resolution and sensitivity for the real time measurement of protein hydrogen exchange. This is largely overcome by use of the radial HNCO experiment that employs the use of optimized sampling angles. The significant practical limitation presented by use of three-dimensional data is the large data storage and processing requirements necessary and is largely overcome by taking advantage of the inherent capabilities of the 2D-FT to process selective frequency space without artifact or limitation. Decomposition of angle spectra into positive and negative ridge components provides increased resolution and allows statistical averaging of intensity and therefore increased precision. Strategies for averaging ridge cross sections within and between angle spectra are developed to allow further statistical approaches for increasing the precision of measured hydrogen occupancy. Intensity artifacts potentially introduced by over-pulsing are effectively eliminated by use of the BEST approach.
hydrogen exchange; radial sampling; angle selection; two-dimensional FT
It is widely believed that both common and rare variants contribute to the risks of common diseases or complex traits and the cumulative effects of multiple rare variants can explain a significant proportion of trait variances. Advances in high-throughput DNA sequencing technologies allow us to genotype rare causal variants and investigate the effects of such rare variants on complex traits. We developed an adaptive ridge regression method to analyze the collective effects of multiple variants in the same gene or the same functional unit. Our model focuses on continuous trait and incorporates covariate factors to remove potential confounding effects. The proposed method estimates and tests multiple rare variants collectively but does not depend on the assumption of same direction of each rare variant effect. Compared with the Bayesian hierarchical generalized linear model approach, the state-of-the-art method of rare variant detection, the proposed new method is easy to implement, yet it has higher statistical power. Application of the new method is demonstrated using the well-known data from the Dallas Heart Study.
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.
regression; classification; n ≪ p; covariance regularization
Survival prediction from a large number of covariates is a current focus of statistical and medical research. In this paper, we study a methodology known as the compound covariate prediction performed under univariate Cox proportional hazard models. We demonstrate via simulations and real data analysis that the compound covariate method generally competes well with ridge regression and Lasso methods, both already well-studied methods for predicting survival outcomes with a large number of covariates. Furthermore, we develop a refinement of the compound covariate method by incorporating likelihood information from multivariate Cox models. The new proposal is an adaptive method that borrows information contained in both the univariate and multivariate Cox regression estimators. We show that the new proposal has a theoretical justification from a statistical large sample theory and is naturally interpreted as a shrinkage-type estimator, a popular class of estimators in statistical literature. Two datasets, the primary biliary cirrhosis of the liver data and the non-small-cell lung cancer data, are used for illustration. The proposed method is implemented in R package “compound.Cox” available in CRAN at http://cran.r-project.org/.
Mixed-effects logistic regression models are described for analysis of longitudinal ordinal outcomes, where observations are observed clustered within subjects. Random effects are included in the model to account for the correlation of the clustered observations. Typically, the error variance and the variance of the random effects are considered to be homogeneous. These variance terms characterize the within-subjects (i.e., error variance) and between-subjects (i.e., random-effects variance) variation in the data. In this article, we describe how covariates can influence these variances, and also extend the standard logistic mixed model by adding a subject-level random effect to the within-subject variance specification. This permits subjects to have influence on the mean, or location, and variability, or (square of the) scale, of their responses. Additionally, we allow the random effects to be correlated. We illustrate application of these models for ordinal data using Ecological Momentary Assessment (EMA) data, or intensive longitudinal data, from an adolescent smoking study. These mixed-effects ordinal location scale models have useful applications in mental health research where outcomes are often ordinal and there is interest in subject heterogeneity, both between- and within-subjects.
Complex variation; Mood variation; Heterogeneity; Variance modeling
Mineral imbalance in the body may significantly contribute to the development and course of hypertension. In this paper, blood pressure figures have been linked to the levels of Fe, Ca, Mg, Zn, Cu, Na and K in hair. The research sample was composed of young men (n = 91) aged 13–21, from the town of Mafinga, Iringa District, Tanzania. The data collected included their age, tribal background and weekly diet. Based on body mass index, the participants were categorised into pre-defined subgroups. To examine how the minerals in question affect blood pressure, correlation analysis and multiple ridge regression analysis were performed. Analysis of ridge regression findings for the researched group (n = 91) shows that the minerals under scrutiny account for systolic blood pressure variation in 13 % and in 15 % for diastolic blood pressure variation. After including two additional variables—calendar age and body mass index—in regression analysis, the ultimate coefficient of determination (R2) changes for systolic blood pressure and remains the same for diastolic blood pressure (R2 = 0.194 and R2 = 0.156, respectively). Nutritional analysis shows that the students included in the study received insufficient calories per day (1,500–2,200 kcal). The group of students with abnormal blood pressure were not aware of their poor health. Research findings may result from progressive environmental changes and poor nutrition in terms of food quantity and quality, which had an impact on the subjects’ blood pressure. Hair analysis used to determine mineral content in the body may be an auxiliary tool in identifying the links between factors leading to the development of hypertension.
Trace elements; SBP; DBP; Fe; Ca; Mg; Zn; Cu; Na; K; BMI; Africa; Tanzania; Bantu
In dementia screening tests, item selection for shortening an existing screening test can be achieved using multiple logistic regression. However, maximum likelihood estimates for such logistic regression models often experience serious bias or even non-existence because of separation and multicollinearity problems resulting from a large number of highly correlated items. Firth (1993, Biometrika, 80(1), 27–38) proposed a penalized likelihood estimator for generalized linear models and it was shown to reduce bias and the non-existence problems. The ridge regression has been used in logistic regression to stabilize the estimates in cases of multicollinearity. However, neither solves the problems for each other. In this paper, we propose a double penalized maximum likelihood estimator combining Firth’s penalized likelihood equation with a ridge parameter. We present a simulation study evaluating the empirical performance of the double penalized likelihood estimator in small to moderate sample sizes. We demonstrate the proposed approach using a current screening data from a community-based dementia study.
Logistic regression; maximum likelihood; penalized maximum likelihood; ridge regression; item selection
Accurate prediction of genomic breeding values (GEBVs) requires numerous markers. However, predictive accuracy can be enhanced by excluding markers with no effects or with inconsistent effects among crosses that can adversely affect the prediction of GEBVs.
We present three different approaches for pre-selecting markers prior to predicting GEBVs using four different BLUP methods, including ridge regression and three spatial models. Performances of the models were evaluated using 5-fold cross-validation.
Results and conclusions
Ridge regression and the spatial models gave essentially similar fits. Pre-selecting markers was evidently beneficial since excluding markers with inconsistent effects among crosses increased the correlation between GEBVs and true breeding values of the non-phenotyped individuals from 0.607 (using all markers) to 0.625 (using pre-selected markers). Moreover, extension of the ridge regression model to allow for heterogeneous variances between the most significant subset and the complementary subset of pre-selected markers increased predictive accuracy (from 0.625 to 0.648) for the simulated dataset for the QTL-MAS 2010 workshop.
The success of genome-wide selection (GS) approaches will depend crucially on the availability of efficient and easy-to-use computational tools. Therefore, approaches that can be implemented using mixed models hold particular promise and deserve detailed study. A particular class of mixed models suitable for GS is given by geostatistical mixed models, when genetic distance is treated analogously to spatial distance in geostatistics.
We consider various spatial mixed models for use in GS. The analyses presented for the QTL-MAS 2009 dataset pay particular attention to the modelling of residual errors as well as of polygenetic effects.
It is shown that geostatistical models are viable alternatives to ridge regression, one of the common approaches to GS. Correlations between genome-wide estimated breeding values and true breeding values were between 0.879 and 0.889. In the example considered, we did not find a large effect of the residual error variance modelling, largely because error variances were very small. A variance components model reflecting the pedigree of the crosses did not provide an improved fit.
We conclude that geostatistical models deserve further study as a tool to GS that is easily implemented in a mixed model package.
This paper proposes an automatic algorithm for the montage of OCT data sets, which produces a composite 3D OCT image over a large field of view out of several separate, partially overlapping OCT data sets. First the OCT fundus images (OFIs) are registered, using blood vessel ridges as the feature of interest and a two step iterative procedure to minimize the distance between all matching point pairs over the set of OFIs. Then the OCT data sets are merged to form a full 3D montage using cross-correlation. The algorithm was tested using an imaging protocol consisting of 8 OCT images for each eye, overlapping to cover a total retinal region of approximately 50x35 degrees. The results for 3 normal eyes and 3 eyes with retinal degeneration are analyzed, showing registration errors of 1.5±0.3 and 2.0±0.8 pixels respectively.
(110.4500) Optical coherence tomography; (100.0100) Image processing; (170.4460) Ophthalmic optics and devices; (170.5755) Retina scanning
The problem of simultaneous covariate selection and parameter inference for spatial regression models is considered. Previous research has shown that failure to take spatial correlation into account can influence the outcome of standard model selection methods. A Markov chain Monte Carlo (MCMC) method is investigated for the calculation of parameter estimates and posterior model probabilities for spatial regression models. The method can accommodate normal and non-normal response data and a large number of covariates. Thus the method is very flexible and can be used to fit spatial linear models, spatial linear mixed models, and spatial generalized linear mixed models (GLMMs). The Bayesian MCMC method also allows a priori unequal weighting of covariates, which is not possible with many model selection methods such as Akaike's information criterion (AIC). The proposed method is demonstrated on two data sets. The first is the whiptail lizard data set which has been previously analyzed by other researchers investigating model selection methods. Our results confirmed the previous analysis suggesting that sandy soil and ant abundance were strongly associated with lizard abundance. The second data set concerned pollution tolerant fish abundance in relation to several environmental factors. Results indicate that abundance is positively related to Strahler stream order and a habitat quality index. Abundance is negatively related to percent watershed disturbance.
Axonal growth cone collapse is accompanied by a reduction in filopodial F-actin. We demonstrate here that semaphorin 3A (Sema3A) induces a coordinated rearrangement of Sema3A receptors and F-actin during growth cone collapse. Differential interference contrast microscopy reveals that some sites of Sema3A-induced F-actin reorganization correlate with discrete vacuoles, structures involved in endocytosis. Endocytosis of FITC-dextran by the growth cone is enhanced during Sema3A treatment, and sites of dextran accumulation colocalize with actin-rich vacuoles and ridges of membrane. Furthermore, the Sema3A receptor proteins, neuropilin-1 and plexin, and the Sema3A signaling molecule, rac1, also reorganize to vacuoles and membrane ridges after Sema3A treatment. These data support a model whereby Sema3A stimulates endocytosis by focal and coordinated rearrangement of receptor and cytoskeletal elements. Dextran accumulation is also increased in retinal ganglion cell (RGC) growth cones, in response to ephrin A5, and in RGC and DRG growth cones, in response to myelin and phorbol-ester. Therefore, enhanced endocytosis may be a general principle of physiologic growth cone collapse. We suggest that growth cone collapse is mediated by both actin filament rearrangements and alterations in membrane dynamics.
membrane dynamics; ephrins; dextran uptake; axon guidance; axon repulsion
Missing covariate data is common in observational studies of time to an event, especially when covariates are repeatedly measured over time. Failure to account for the missing data can lead to bias or loss of efficiency, especially when the data are non-ignorably missing. Previous work has focused on the case of fixed covariates rather than those that are repeatedly measured over the follow-up period, so here we present a selection model that allows for proportional hazards regression with time-varying covariates when some covariates may be non-ignorably missing. We develop a fully Bayesian model and obtain posterior estimates of the parameters via the Gibbs sampler in WinBUGS. We illustrate our model with an analysis of post-diagnosis weight change and survival after breast cancer diagnosis in the Long Island Breast Cancer Study Project (LIBCSP) follow-up study. Our results indicate that post-diagnosis weight gain is associated with lower all-cause and breast cancer specific survival among women diagnosed with new primary breast cancer. Our sensitivity analysis showed only slight differences between models with different assumptions on the missing data mechanism yet the complete case analysis yielded markedly different results.
proportional hazards regression; non-ignorably missing data; missing covariates; selection model
Functional data are increasingly encountered in scientific studies, and their high dimensionality and complexity lead to many analytical challenges. Various methods for functional data analysis have been developed, including functional response regression methods that involve regression of a functional response on univariate/multivariate predictors with nonparametrically represented functional coefficients. In existing methods, however, the functional regression can be sensitive to outlying curves and outlying regions of curves, so is not robust. In this paper, we introduce a new Bayesian method, robust functional mixed models (R-FMM), for performing robust functional regression within the general functional mixed model framework, which includes multiple continuous or categorical predictors and random effect functions accommodating potential between-function correlation induced by the experimental design. The underlying model involves a hierarchical scale mixture model for the fixed effects, random effect and residual error functions. These modeling assumptions across curves result in robust nonparametric estimators of the fixed and random effect functions which down-weight outlying curves and regions of curves, and produce statistics that can be used to flag global and local outliers. These assumptions also lead to distributions across wavelet coefficients that have outstanding sparsity and adaptive shrinkage properties, with great flexibility for the data to determine the sparsity and the heaviness of the tails. Together with the down-weighting of outliers, these within-curve properties lead to fixed and random effect function estimates that appear in our simulations to be remarkably adaptive in their ability to remove spurious features yet retain true features of the functions. We have developed general code to implement this fully Bayesian method that is automatic, requiring the user to only provide the functional data and design matrices. It is efficient enough to handle large data sets, and yields posterior samples of all model parameters that can be used to perform desired Bayesian estimation and inference. Although we present details for a specific implementation of the R-FMM using specific distributional choices in the hierarchical model, 1D functions, and wavelet transforms, the method can be applied more generally using other heavy-tailed distributions, higher dimensional functions (e.g. images), and using other invertible transformations as alternatives to wavelets.
Adaptive LASSO; Bayesian methods; False discovery rate; Functional Data Analysis; Mixed models; Robust regression; Scale mixtures of normals; Sparsity Priors; Variable Selection; Wavelets
American Indian children have high rates of overweight and obesity, which may be partially attributable to screen-time behavior. Young children's screen-time behavior is strongly influenced by their environment and their parents' behavior. We explored whether parental television watching time, parental perceptions of children's screen time, and media-related resources in the home are related to screen time (ie, television, DVD/video, video game, and computer use) among Oglala Lakota youth residing on or near the Pine Ridge Reservation in South Dakota.
We collected baseline data from 431 child and parent/caregiver pairs who participated in Bright Start, a group-randomized, controlled, school-based obesity prevention trial to reduce excess weight gain. Controlling for demographic characteristics, we used linear regression analysis to assess associations between children's screen time and parental television watching time, parental perceptions of children's screen time, and availability of media-related household resources.
The most parsimonious model for explaining child screen time included the children's sex, parental body mass index, parental television watching time, how often the child watched television after school or in the evening, parental perception that the child spent too much time playing video games, how often the parent limited the child's television time, and the presence of a VCR/DVD player or video game player in the home (F7,367 = 14.67; P < .001; adjusted R
2 = .37). The presence of a television in the bedroom did not contribute significantly to the model.
Changes in parental television watching time, parental influence over children's screen-time behavior, and availability of media-related resources in the home could decrease screen time and may be used as a strategy for reducing overweight and obesity in American Indian children.
We explore the applicability of an additive treatment of substituent effects to the analysis and design of HIV protease inhibitors. Affinity data for a set of inhibitors with a common chemical framework were analyzed to provide estimates of the free energy contribution of each chemical substituent. These estimates were then used to design new inhibitors, whose high affinities were confirmed by synthesis and experimental testing. Derivations of additive models by least-squares and ridge-regression methods were found to yield statistically similar results. The additivity approach was also compared with standard molecular descriptor-based QSAR; the latter was not found to provide superior predictions. Crystallographic studies of HIV protease-inhibitor complexes help explain the perhaps surprisingly high degree of substituent additivity in this system, and allow some of the additivity coefficients to be rationalized on a structural basis.
Global change affects alpine ecosystems by, among many effects, by altering plant distributions and community composition. However, forecasting alpine vegetation change is challenged by a scarcity of studies observing change in fixed plots spanning decadal-time scales. We present in this article a probabilistic modeling approach that forecasts vegetation change on Niwot Ridge, CO using plant abundance data collected from marked plots established in 1971 and resampled in 1991 and 2001. Assuming future change can be inferred from past change, we extrapolate change for 100 years from 1971 and correlate trends for each plant community with time series environmental data (1971–2001). Models predict a decreased extent of Snowbed vegetation and an increased extent of Shrub Tundra by 2071. Mean annual maximum temperature and nitrogen deposition were the primary a posteriori correlates of plant community change. This modeling effort is useful for generating hypotheses of future vegetation change that can be tested with future sampling efforts.
Probabilistic modeling; Alpine vegetation change; Snowbeds; Climate warming; Plant community change
Biological networks are constantly subjected to random perturbations, and efficient feedback and compensatory mechanisms exist to maintain their stability. There is an increased interest in building gene regulatory networks (GRNs) from temporal gene expression data because of their numerous applications in life sciences. However, because of the limited number of time points at which gene expressions can be gathered in practice, computational techniques of building GRN often lead to inaccuracies and instabilities. This paper investigates the stability of sparse auto-regressive models of building GRN from gene expression data.
Criteria for evaluating the stability of estimating GRN structure are proposed. Thereby, stability of multivariate vector autoregressive (MVAR) methods - ridge, lasso, and elastic-net - of building GRN were studied by simulating temporal gene expression datasets on scale-free topologies as well as on real data gathered over Hela cell-cycle. Effects of the number of time points on the stability of constructing GRN are investigated. When the number of time points are relatively low compared to the size of network, both accuracy and stability are adversely affected. At least, the number of time points equal to the number of genes in the network are needed to achieve decent accuracy and stability of the networks. Our results on synthetic data indicate that the stability of lasso and elastic-net MVAR methods are comparable, and their accuracies are much higher than the ridge MVAR. As the size of the network grows, the number of time points required to achieve acceptable accuracy and stability are much less relative to the number of genes in the network. The effects of false negatives are easier to improve by increasing the number time points than those due to false positives. Application to HeLa cell-cycle gene expression dataset shows that biologically stable GRN can be obtained by introducing perturbations to the data.
Accuracy and stability of building GRN are crucial for investigation of gene regulations. Sparse MVAR techniques such as lasso and elastic-net provide accurate and stable methods for building even GRN of small size. The effect of false negatives is corrected much easier with the increased number of time points than those due to false positives. With real data, we demonstrate how stable networks can be derived by introducing random perturbation to data.
Longitudinal studies are helpful in understanding how subtle associations between factors of interest change over time. Our goal is to apply statistical methods which are appropriate for analyzing longitudinal data to a repeated measures epidemiological study as a tutorial in the appropriate use and interpretation of random effects models. To motivate their use, we study the association of alcohol consumption on markers of HIV disease progression in an observational cohort. To make valid inferences, the association among measurements correlated within a subject must be taken into account.
We describe a linear mixed effects regression framework that accounts for the clustering of longitudinal data and that can be fit using standard statistical software. We apply the linear mixed effects model to a previously published dataset of HIV infected individuals with a history of alcohol problems who are receiving HAART (n = 197). The researchers were interested in determining the effect of alcohol use on HIV disease progression over time. Fitting a linear mixed effects multiple regression model with a random intercept and random slope for each subject accounts for the association of observations within subjects and yields parameters interpretable as in ordinary multiple regression. A significant interaction between alcohol use and adherence to HAART is found: subjects who use alcohol and are not fully adherent to their HIV medications had higher log RNA (ribonucleic acid) viral load levels than fully adherent non-drinkers, fully adherent alcohol users, and non-drinkers who were not fully adherent.
Longitudinal studies are increasingly common in epidemiological research. Software routines that account for correlation between repeated measures using linear mixed effects methods are now generally available and straightforward to utilize. These models allow the relaxation of assumptions needed for approaches such as repeated measures ANOVA, and should be routinely incorporated into the analysis of cohort studies.