Technological developments have increased the feasibility of large scale genetic association studies. Densely typed genetic markers are obtained using SNP arrays, next-generation sequencing technologies and imputation. However, SNPs typed using these methods can be highly correlated due to linkage disequilibrium among them, and standard multiple regression techniques fail with these data sets due to their high dimensionality and correlation structure. There has been increasing interest in using penalised regression in the analysis of high dimensional data. Ridge regression is one such penalised regression technique which does not perform variable selection, instead estimating a regression coefficient for each predictor variable. It is therefore desirable to obtain an estimate of the significance of each ridge regression coefficient.
We develop and evaluate a test of significance for ridge regression coefficients. Using simulation studies, we demonstrate that the performance of the test is comparable to that of a permutation test, with the advantage of a much-reduced computational cost. We introduce the p-value trace, a plot of the negative logarithm of the p-values of ridge regression coefficients with increasing shrinkage parameter, which enables the visualisation of the change in p-value of the regression coefficients with increasing penalisation. We apply the proposed method to a lung cancer case-control data set from EPIC, the European Prospective Investigation into Cancer and Nutrition.
The proposed test is a useful alternative to a permutation test for the estimation of the significance of ridge regression coefficients, at a much-reduced computational cost. The p-value trace is an informative graphical tool for evaluating the results of a test of significance of ridge regression coefficients as the shrinkage parameter increases, and the proposed test makes its production computationally feasible.
In this study we compared different statistical procedures for estimating SNP effects using the simulated data set from the XII QTL-MAS workshop. Five procedures were considered and tested in a reference population, i.e., the first four generations, from which phenotypes and genotypes were available. The procedures can be interpreted as variants of ridge regression, with different ways for defining the shrinkage parameter. Comparisons were made with respect to the correlation between genomic and conventional estimated breeding values. Moderate correlations were obtained from all methods. Two of them were used to predict genomic breeding values in the last three generations. Correlations between these and the true breeding values were also moderate. We concluded that the ridge regression procedures applied in this study did not outperform the simple use of a ratio of variances in a mixed model method, both providing moderate accuracies of predicted genomic breeding values.
In some biomedical studies, biomarkers are measured repeatedly along some spatial structure or over time and are subject to measurement error. In these studies, it is often of interest to evaluate associations between a clinical endpoint and these biomarkers (also known as functional biomarkers). There are potentially two levels of correlation in such data, namely, between repeated measurements of a biomarker from the same subject and between multiple biomarkers from the same subject; none of the existing methods accounts for correlation between multiple functional biomarkers. We propose a Bayesian approach to model a clinical outcome of interest (e.g., risk for colorectal cancer) in the presence of multiple functional biomarkers while accounting for potential correlation. Our simulations show that the proposed approach achieves good performance in finite samples under various settings. In the presence of substantial or moderate correlation, the proposed approach outperforms an existing approach that does not account for correlation. The proposed approach is applied to a study of biomarkers of risk for colorectal neoplasms and our results show that the risk for colorectal cancer is associated with two functional biomarkers, APC and TGF-α, in particular, with their values in the region between the proliferating zone and the differentiating zone of colorectal crypts.
Bayesian models; Correlated data; Functional biomarker; Measurement error
Within a neurodevelopmental model of schizophrenia, prenatal developmental deviations are implicated as early signs of increased risk for future illness. External markers of central nervous system maldevelopment may provide information regarding the nature and timing of prenatal disruptions among individuals with schizophrenia. One such marker is dermatoglyphic abnormalities (DAs) or unusual epidermal ridge patterns. Studies targeting DAs as a potential sign of early developmental disruption have yielded mixed results with regard to the strength of the association between DAs and schizophrenia. The current study aimed to resolve these inconsistencies by conducting a meta-analysis examining the six most commonly cited dermatoglyphic features among individuals with diagnoses of schizophrenia. Twenty-two studies published between 1968 and 2012 were included. Results indicated significant but small effects for total finger ridge count and total A-B ridge count, with lower counts among individuals with schizophrenia relative to controls. Other DAs examined in the current meta-analysis did not yield significant effects. Total finger ridge count and total A-B ridge count appear to yield the most reliable dermatoglyphic differences between individuals with and without schizophrenia.
schizophrenia; dermatoglyphics; meta-analysis; neurodevelopment
Stable myoelectric control of hand prostheses remains an open problem. The only successful human–machine interface is surface electromyography, typically allowing control of a few degrees of freedom. Machine learning techniques may have the potential to remove these limitations, but their performance is thus far inadequate: myoelectric signals change over time under the influence of various factors, deteriorating control performance. It is therefore necessary, in the standard approach, to regularly retrain a new model from scratch. We hereby propose a non-linear incremental learning method in which occasional updates with a modest amount of novel training data allow continual adaptation to the changes in the signals. In particular, Incremental Ridge Regression and an approximation of the Gaussian Kernel known as Random Fourier Features are combined to predict finger forces from myoelectric signals, both finger-by-finger and grouped in grasping patterns. We show that the approach is effective and practically applicable to this problem by first analyzing its performance while predicting single-finger forces. Surface electromyography and finger forces were collected from 10 intact subjects during four sessions spread over two different days; the results of the analysis show that small incremental updates are indeed effective to maintain a stable level of performance. Subsequently, we employed the same method on-line to teleoperate a humanoid robotic arm equipped with a state-of-the-art commercial prosthetic hand. The subject could reliably grasp, carry and release everyday-life objects, enforcing stable grasping irrespective of the signal changes, hand/arm movements and wrist pronation and supination.
surface electromyography; machine learning; incremental learning; human–machine interfaces; rehabilitation robotics; force control
Ridge regression with heteroscedastic marker variances provides an alternative to Bayesian genome-wide prediction methods. Our objectives were to suggest new methods to determine marker-specific shrinkage factors for heteroscedastic ridge regression and to investigate their properties with respect to computational efficiency and accuracy of estimated effects. We analyzed published data sets of maize, wheat, and sugar beet as well as simulated data with the new methods. Ridge regression with shrinkage factors that were proportional to single-marker analysis of variance estimates of variance components (i.e., RRWA) was the fastest method. It required computation times of less than 1 sec for medium-sized data sets, which have dimensions that are common in plant breeding. A modification of the expectation-maximization algorithm that yields heteroscedastic marker variances (i.e., RMLV) resulted in the most accurate marker effect estimates. It outperformed the homoscedastic ridge regression approach for best linear unbiased prediction in particular for situations with high marker density and strong linkage disequilibrium along the chromosomes, a situation that occurs often in plant breeding populations. We conclude that the RRWA and RMLV approaches provide alternatives to the commonly used Bayesian methods, in particular for applications in which computational feasibility or accuracy of effect estimates are important, such as detection or functional analysis of genes or planning crosses.
genome-wide prediction; ridge regression; heteroscedastic marker variances; linkage disequilibrium; plant breeding populations; GenPred; Shared data resources
The present study was conducted to predict survival time in patients with diffuse large B-cell lymphoma, DLBCL, based on microarray data using Cox regression model combined with seven dimension reduction methods. This historical cohort included 2042 gene expression measurements from 40 patients with DLBCL. In order to predict survival, a combination of Cox regression model was used with seven methods for dimension reduction or shrinkage including univariate selection, forward stepwise selection, principal component regression, supervised principal component regression, partial least squares regression, ridge regression and Losso. The capacity of predictions was examined by three different criteria including log rank test, prognostic index and deviance. MATLAB r2008a and RKWard software were used for data analysis. Based on our findings, performance of ridge regression was better than other methods. Based on ridge regression coefficients and a given cut point value, 16 genes were selected. By using forward stepwise selection method in Cox regression model, it was indicated that the expression of genes GENE3555X and GENE3807X decreased the survival time (P=0.008 and P=0.003, respectively), whereas the genes GENE3228X and GENE1551X increased survival time (P=0.002 and P<0.001, respectively). This study indicated that ridge regression method had higher capacity than other dimension reduction methods for the prediction of survival time in patients with DLBCL. Furthermore, a combination of statistical methods and microarray data could help to detect influential genes in survival.
Lymphoma; gene expression; microarray; survival analysis; dimension reduction; ridge regression
Graphical Gaussian models are popular tools for the estimation of (undirected) gene association networks from microarray data. A key issue when the number of variables greatly exceeds the number of samples is the estimation of the matrix of partial correlations. Since the (Moore-Penrose) inverse of the sample covariance matrix leads to poor estimates in this scenario, standard methods are inappropriate and adequate regularization techniques are needed. Popular approaches include biased estimates of the covariance matrix and high-dimensional regression schemes, such as the Lasso and Partial Least Squares.
In this article, we investigate a general framework for combining regularized regression methods with the estimation of Graphical Gaussian models. This framework includes various existing methods as well as two new approaches based on ridge regression and adaptive lasso, respectively. These methods are extensively compared both qualitatively and quantitatively within a simulation study and through an application to six diverse real data sets. In addition, all proposed algorithms are implemented in the R package "parcor", available from the R repository CRAN.
In our simulation studies, the investigated non-sparse regression methods, i.e. Ridge Regression and Partial Least Squares, exhibit rather conservative behavior when combined with (local) false discovery rate multiple testing in order to decide whether or not an edge is present in the network. For networks with higher densities, the difference in performance of the methods decreases. For sparse networks, we confirm the Lasso's well known tendency towards selecting too many edges, whereas the two-stage adaptive Lasso is an interesting alternative that provides sparser solutions. In our simulations, both sparse and non-sparse methods are able to reconstruct networks with cluster structures. On six real data sets, we also clearly distinguish the results obtained using the non-sparse methods and those obtained using the sparse methods where specification of the regularization parameter automatically means model selection. In five out of six data sets, Partial Least Squares selects very dense networks. Furthermore, for data that violate the assumption of uncorrelated observations (due to replications), the Lasso and the adaptive Lasso yield very complex structures, indicating that they might not be suited under these conditions. The shrinkage approach is more stable than the regression based approaches when using subsampling.
An important challenge in analyzing high dimensional data in regression settings is that of facing a situation in which the number of covariates p in the model greatly exceeds the sample size n (sometimes termed the “p > n” problem). In this article, we develop a novel specification for a general class of prior distributions, called Information Matrix (IM) priors, for high-dimensional generalized linear models. The priors are first developed for settings in which p < n, and then extended to the p > n case by defining a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner’s g-prior for Gaussian linear models. Various theoretical properties of the prior and implied posterior are derived including existence of the prior and posterior moment generating functions, tail behavior, as well as connections to Gaussian priors and Jeffreys’ prior. Several simulation studies and an application to a nucleosomal positioning data set demonstrate its advantages over Gaussian, as well as g-priors, in high dimensional settings.
Fisher Information; g-prior; Importance sampling; Model identifiability; Prior elicitation
Motivation: Penalized regression methods have been adopted widely for high-dimensional feature selection and prediction in many bioinformatic and biostatistical contexts. While their theoretical properties are well-understood, specific methodology for their optimal application to genomic data has not been determined.
Results: Through simulation of contrasting scenarios of correlated high-dimensional survival data, we compared the LASSO, Ridge and Elastic Net penalties for prediction and variable selection. We found that a 2D tuning of the Elastic Net penalties was necessary to avoid mimicking the performance of LASSO or Ridge regression. Furthermore, we found that in a simulated scenario favoring the LASSO penalty, a univariate pre-filter made the Elastic Net behave more like Ridge regression, which was detrimental to prediction performance. We demonstrate the real-life application of these methods to predicting the survival of cancer patients from microarray data, and to classification of obese and lean individuals from metagenomic data. Based on these results, we provide an optimized set of guidelines for the application of penalized regression for reproducible class comparison and prediction with genomic data.
Availability and Implementation: A parallelized implementation of the methods presented for regression and for simulation of synthetic data is provided as the pensim R package, available at http://cran.r-project.org/web/packages/pensim/index.html.
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information: Supplementary data are available at Bioinformatics online.
The Cartesian sampled three-dimensional HNCO experiment is inherently limited in time resolution and sensitivity for the real time measurement of protein hydrogen exchange. This is largely overcome by use of the radial HNCO experiment that employs the use of optimized sampling angles. The significant practical limitation presented by use of three-dimensional data is the large data storage and processing requirements necessary and is largely overcome by taking advantage of the inherent capabilities of the 2D-FT to process selective frequency space without artifact or limitation. Decomposition of angle spectra into positive and negative ridge components provides increased resolution and allows statistical averaging of intensity and therefore increased precision. Strategies for averaging ridge cross sections within and between angle spectra are developed to allow further statistical approaches for increasing the precision of measured hydrogen occupancy. Intensity artifacts potentially introduced by over-pulsing are effectively eliminated by use of the BEST approach.
hydrogen exchange; radial sampling; angle selection; two-dimensional FT
It is widely believed that both common and rare variants contribute to the risks of common diseases or complex traits and the cumulative effects of multiple rare variants can explain a significant proportion of trait variances. Advances in high-throughput DNA sequencing technologies allow us to genotype rare causal variants and investigate the effects of such rare variants on complex traits. We developed an adaptive ridge regression method to analyze the collective effects of multiple variants in the same gene or the same functional unit. Our model focuses on continuous trait and incorporates covariate factors to remove potential confounding effects. The proposed method estimates and tests multiple rare variants collectively but does not depend on the assumption of same direction of each rare variant effect. Compared with the Bayesian hierarchical generalized linear model approach, the state-of-the-art method of rare variant detection, the proposed new method is easy to implement, yet it has higher statistical power. Application of the new method is demonstrated using the well-known data from the Dallas Heart Study.
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.
regression; classification; n ≪ p; covariance regularization
Because every disease has its unique survival pattern, it is necessary
to find a suitable model to simulate followups. DNA microarray is a useful technique to detect thousands of gene expressions at one time and is usually employed to classify different types of cancer. We propose combination methods of penalized regression models and nonnegative matrix
factorization (NMF) for predicting survival. We tried L1- (lasso), L2- (ridge), and L1-L2 combined (elastic net) penalized regression for diffuse large B-cell lymphoma (DLBCL) patients'
microarray data and found that L1-L2 combined method predicts survival best with the smallest logrank P value. Furthermore, 80% of selected genes have been reported to correlate with carcinogenesis or lymphoma. Through NMF we found that DLBCL patients can be divided into 4 groups clearly, and it implies that DLBCL may have 4 subtypes which have a little different survival patterns. Next we excluded some patients who were indicated hard to classify in NMF and executed three penalized regression models again. We found that the performance of survival prediction has been improved with lower logrank P values. Therefore, we conclude that after preselection of patients by NMF, penalized regression models can predict DLBCL patients' survival successfully.
Survival prediction from a large number of covariates is a current focus of statistical and medical research. In this paper, we study a methodology known as the compound covariate prediction performed under univariate Cox proportional hazard models. We demonstrate via simulations and real data analysis that the compound covariate method generally competes well with ridge regression and Lasso methods, both already well-studied methods for predicting survival outcomes with a large number of covariates. Furthermore, we develop a refinement of the compound covariate method by incorporating likelihood information from multivariate Cox models. The new proposal is an adaptive method that borrows information contained in both the univariate and multivariate Cox regression estimators. We show that the new proposal has a theoretical justification from a statistical large sample theory and is naturally interpreted as a shrinkage-type estimator, a popular class of estimators in statistical literature. Two datasets, the primary biliary cirrhosis of the liver data and the non-small-cell lung cancer data, are used for illustration. The proposed method is implemented in R package “compound.Cox” available in CRAN at http://cran.r-project.org/.
Mixed-effects logistic regression models are described for analysis of longitudinal ordinal outcomes, where observations are observed clustered within subjects. Random effects are included in the model to account for the correlation of the clustered observations. Typically, the error variance and the variance of the random effects are considered to be homogeneous. These variance terms characterize the within-subjects (i.e., error variance) and between-subjects (i.e., random-effects variance) variation in the data. In this article, we describe how covariates can influence these variances, and also extend the standard logistic mixed model by adding a subject-level random effect to the within-subject variance specification. This permits subjects to have influence on the mean, or location, and variability, or (square of the) scale, of their responses. Additionally, we allow the random effects to be correlated. We illustrate application of these models for ordinal data using Ecological Momentary Assessment (EMA) data, or intensive longitudinal data, from an adolescent smoking study. These mixed-effects ordinal location scale models have useful applications in mental health research where outcomes are often ordinal and there is interest in subject heterogeneity, both between- and within-subjects.
Complex variation; Mood variation; Heterogeneity; Variance modeling
Mineral imbalance in the body may significantly contribute to the development and course of hypertension. In this paper, blood pressure figures have been linked to the levels of Fe, Ca, Mg, Zn, Cu, Na and K in hair. The research sample was composed of young men (n = 91) aged 13–21, from the town of Mafinga, Iringa District, Tanzania. The data collected included their age, tribal background and weekly diet. Based on body mass index, the participants were categorised into pre-defined subgroups. To examine how the minerals in question affect blood pressure, correlation analysis and multiple ridge regression analysis were performed. Analysis of ridge regression findings for the researched group (n = 91) shows that the minerals under scrutiny account for systolic blood pressure variation in 13 % and in 15 % for diastolic blood pressure variation. After including two additional variables—calendar age and body mass index—in regression analysis, the ultimate coefficient of determination (R2) changes for systolic blood pressure and remains the same for diastolic blood pressure (R2 = 0.194 and R2 = 0.156, respectively). Nutritional analysis shows that the students included in the study received insufficient calories per day (1,500–2,200 kcal). The group of students with abnormal blood pressure were not aware of their poor health. Research findings may result from progressive environmental changes and poor nutrition in terms of food quantity and quality, which had an impact on the subjects’ blood pressure. Hair analysis used to determine mineral content in the body may be an auxiliary tool in identifying the links between factors leading to the development of hypertension.
Trace elements; SBP; DBP; Fe; Ca; Mg; Zn; Cu; Na; K; BMI; Africa; Tanzania; Bantu
In dementia screening tests, item selection for shortening an existing screening test can be achieved using multiple logistic regression. However, maximum likelihood estimates for such logistic regression models often experience serious bias or even non-existence because of separation and multicollinearity problems resulting from a large number of highly correlated items. Firth (1993, Biometrika, 80(1), 27–38) proposed a penalized likelihood estimator for generalized linear models and it was shown to reduce bias and the non-existence problems. The ridge regression has been used in logistic regression to stabilize the estimates in cases of multicollinearity. However, neither solves the problems for each other. In this paper, we propose a double penalized maximum likelihood estimator combining Firth’s penalized likelihood equation with a ridge parameter. We present a simulation study evaluating the empirical performance of the double penalized likelihood estimator in small to moderate sample sizes. We demonstrate the proposed approach using a current screening data from a community-based dementia study.
Logistic regression; maximum likelihood; penalized maximum likelihood; ridge regression; item selection
Accurate prediction of genomic breeding values (GEBVs) requires numerous markers. However, predictive accuracy can be enhanced by excluding markers with no effects or with inconsistent effects among crosses that can adversely affect the prediction of GEBVs.
We present three different approaches for pre-selecting markers prior to predicting GEBVs using four different BLUP methods, including ridge regression and three spatial models. Performances of the models were evaluated using 5-fold cross-validation.
Results and conclusions
Ridge regression and the spatial models gave essentially similar fits. Pre-selecting markers was evidently beneficial since excluding markers with inconsistent effects among crosses increased the correlation between GEBVs and true breeding values of the non-phenotyped individuals from 0.607 (using all markers) to 0.625 (using pre-selected markers). Moreover, extension of the ridge regression model to allow for heterogeneous variances between the most significant subset and the complementary subset of pre-selected markers increased predictive accuracy (from 0.625 to 0.648) for the simulated dataset for the QTL-MAS 2010 workshop.
The problem of simultaneous covariate selection and parameter inference for spatial regression models is considered. Previous research has shown that failure to take spatial correlation into account can influence the outcome of standard model selection methods. A Markov chain Monte Carlo (MCMC) method is investigated for the calculation of parameter estimates and posterior model probabilities for spatial regression models. The method can accommodate normal and non-normal response data and a large number of covariates. Thus the method is very flexible and can be used to fit spatial linear models, spatial linear mixed models, and spatial generalized linear mixed models (GLMMs). The Bayesian MCMC method also allows a priori unequal weighting of covariates, which is not possible with many model selection methods such as Akaike's information criterion (AIC). The proposed method is demonstrated on two data sets. The first is the whiptail lizard data set which has been previously analyzed by other researchers investigating model selection methods. Our results confirmed the previous analysis suggesting that sandy soil and ant abundance were strongly associated with lizard abundance. The second data set concerned pollution tolerant fish abundance in relation to several environmental factors. Results indicate that abundance is positively related to Strahler stream order and a habitat quality index. Abundance is negatively related to percent watershed disturbance.
The success of genome-wide selection (GS) approaches will depend crucially on the availability of efficient and easy-to-use computational tools. Therefore, approaches that can be implemented using mixed models hold particular promise and deserve detailed study. A particular class of mixed models suitable for GS is given by geostatistical mixed models, when genetic distance is treated analogously to spatial distance in geostatistics.
We consider various spatial mixed models for use in GS. The analyses presented for the QTL-MAS 2009 dataset pay particular attention to the modelling of residual errors as well as of polygenetic effects.
It is shown that geostatistical models are viable alternatives to ridge regression, one of the common approaches to GS. Correlations between genome-wide estimated breeding values and true breeding values were between 0.879 and 0.889. In the example considered, we did not find a large effect of the residual error variance modelling, largely because error variances were very small. A variance components model reflecting the pedigree of the crosses did not provide an improved fit.
We conclude that geostatistical models deserve further study as a tool to GS that is easily implemented in a mixed model package.
Valid research on neglect rehabilitation demands a statistical approach commensurate with the characteristics of neglect rehabilitation data: neglect arises from impairment in distinct brain networks leading to large between-subject variability in baseline symptoms and recovery trajectories. Studies enrolling medically ill, disabled patients, may suffer from missing, unbalanced data, and small sample sizes. Finally, assessment of rehabilitation requires a description of continuous recovery trajectories. Unfortunately, the statistical method currently employed in most studies of neglect treatment [repeated measures analysis of variance (ANOVA), rANOVA] does not well-address these issues. Here we review an alternative, mixed linear modeling (MLM), that is more appropriate for assessing change over time. MLM better accounts for between-subject heterogeneity in baseline neglect severity and in recovery trajectory. MLM does not require complete or balanced data, nor does it make strict assumptions regarding the data structure. Furthermore, because MLM better models between-subject heterogeneity it often results in increased power to observe treatment effects with smaller samples. After reviewing current practices in the field, and the assumptions of rANOVA, we provide an introduction to MLM. We review its assumptions, uses, advantages, and disadvantages. Using real and simulated data, we illustrate how MLM may improve the ability to detect effects of treatment over ANOVA, particularly with the small samples typical of neglect research. Furthermore, our simulation analyses result in recommendations for the design of future rehabilitation studies. Because between-subject heterogeneity is one important reason why studies of neglect treatments often yield conflicting results, employing statistical procedures that model this heterogeneity more accurately will increase the efficiency of our efforts to find treatments to improve the lives of individuals with neglect.
spatial neglect; rehabilitation; mixed linear modeling; statistical methods; power simulation; type I error simulation
This paper proposes an automatic algorithm for the montage of OCT data sets, which produces a composite 3D OCT image over a large field of view out of several separate, partially overlapping OCT data sets. First the OCT fundus images (OFIs) are registered, using blood vessel ridges as the feature of interest and a two step iterative procedure to minimize the distance between all matching point pairs over the set of OFIs. Then the OCT data sets are merged to form a full 3D montage using cross-correlation. The algorithm was tested using an imaging protocol consisting of 8 OCT images for each eye, overlapping to cover a total retinal region of approximately 50x35 degrees. The results for 3 normal eyes and 3 eyes with retinal degeneration are analyzed, showing registration errors of 1.5±0.3 and 2.0±0.8 pixels respectively.
(110.4500) Optical coherence tomography; (100.0100) Image processing; (170.4460) Ophthalmic optics and devices; (170.5755) Retina scanning
Missing covariate data is common in observational studies of time to an event, especially when covariates are repeatedly measured over time. Failure to account for the missing data can lead to bias or loss of efficiency, especially when the data are non-ignorably missing. Previous work has focused on the case of fixed covariates rather than those that are repeatedly measured over the follow-up period, so here we present a selection model that allows for proportional hazards regression with time-varying covariates when some covariates may be non-ignorably missing. We develop a fully Bayesian model and obtain posterior estimates of the parameters via the Gibbs sampler in WinBUGS. We illustrate our model with an analysis of post-diagnosis weight change and survival after breast cancer diagnosis in the Long Island Breast Cancer Study Project (LIBCSP) follow-up study. Our results indicate that post-diagnosis weight gain is associated with lower all-cause and breast cancer specific survival among women diagnosed with new primary breast cancer. Our sensitivity analysis showed only slight differences between models with different assumptions on the missing data mechanism yet the complete case analysis yielded markedly different results.
proportional hazards regression; non-ignorably missing data; missing covariates; selection model
Axonal growth cone collapse is accompanied by a reduction in filopodial F-actin. We demonstrate here that semaphorin 3A (Sema3A) induces a coordinated rearrangement of Sema3A receptors and F-actin during growth cone collapse. Differential interference contrast microscopy reveals that some sites of Sema3A-induced F-actin reorganization correlate with discrete vacuoles, structures involved in endocytosis. Endocytosis of FITC-dextran by the growth cone is enhanced during Sema3A treatment, and sites of dextran accumulation colocalize with actin-rich vacuoles and ridges of membrane. Furthermore, the Sema3A receptor proteins, neuropilin-1 and plexin, and the Sema3A signaling molecule, rac1, also reorganize to vacuoles and membrane ridges after Sema3A treatment. These data support a model whereby Sema3A stimulates endocytosis by focal and coordinated rearrangement of receptor and cytoskeletal elements. Dextran accumulation is also increased in retinal ganglion cell (RGC) growth cones, in response to ephrin A5, and in RGC and DRG growth cones, in response to myelin and phorbol-ester. Therefore, enhanced endocytosis may be a general principle of physiologic growth cone collapse. We suggest that growth cone collapse is mediated by both actin filament rearrangements and alterations in membrane dynamics.
membrane dynamics; ephrins; dextran uptake; axon guidance; axon repulsion