Estimation of genetic covariance matrices for multivariate problems comprising more than a few traits is inherently problematic, since sampling variation increases dramatically with the number of traits. This paper investigates the efficacy of regularized estimation of covariance components in a maximum likelihood framework, imposing a penalty on the likelihood designed to reduce sampling variation. In particular, penalties that "borrow strength" from the phenotypic covariance matrix are considered.
An extensive simulation study was carried out to investigate the reduction in average 'loss', i.e. the deviation of estimated matrices from the population values, and the accompanying bias for a range of parameter values and sample sizes. A number of penalties are examined, penalizing either the canonical eigenvalues or the genetic covariance or correlation matrices. In addition, several strategies to determine the amount of penalization to be applied, i.e. to estimate the appropriate tuning factor, are explored.
It is shown that substantial reductions in loss for estimates of genetic covariance matrices can be achieved for small to moderate sample sizes. While no penalty performed best overall, penalizing the variance among the estimated canonical eigenvalues on the logarithmic scale or shrinking the genetic towards the phenotypic correlation matrix appeared most advantageous. Estimating the tuning factor using cross-validation resulted in a loss reduction 10 to 15% less than that obtained if population values were known. Applying a mild penalty, chosen so that the deviation in likelihood from the maximum was non-significant, performed as well as, if not better than, cross-validation and can be recommended as a pragmatic strategy.
Penalized maximum likelihood estimation provides the means to 'make the most' of limited and precious data and facilitates more stable estimation for multi-dimensional analyses. It should become part of our everyday toolkit for multivariate estimation in quantitative genetics.
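As a concrete illustration of the eigenvalue penalty favoured above, the sketch below computes the variance among the log canonical eigenvalues of a genetic/phenotypic matrix pair; the names (psi for the tuning factor, G, P) and the toy matrices are illustrative assumptions, not the authors' implementation, and the penalty would be subtracted (scaled by psi/2) from the REML log-likelihood.

```python
# A minimal sketch of the penalty on log canonical eigenvalues, assuming a
# penalized objective of the form logL(G, E) - (psi / 2) * penalty(G, P).
import numpy as np
from scipy.linalg import eigh

def log_canonical_eigenvalue_penalty(G, P):
    """Variance among the log canonical eigenvalues of (G, P).

    Canonical eigenvalues are the eigenvalues of P^{-1} G, i.e. the
    generalized eigenvalues of G with respect to the phenotypic matrix P.
    Penalizing their variance on the log scale shrinks them toward their
    mean, regularizing the estimate of G.
    """
    lam = eigh(G, P, eigvals_only=True)   # generalized eigenvalues of (G, P)
    return np.var(np.log(lam))

# Toy example: a 3-trait genetic (G) and phenotypic (P) covariance matrix.
G = np.array([[1.0, 0.3, 0.2], [0.3, 0.8, 0.1], [0.2, 0.1, 0.5]])
P = G + np.eye(3)                          # P = G + E, with E = I here
psi = 5.0                                  # tuning factor (chosen e.g. by CV)
penalty = psi * log_canonical_eigenvalue_penalty(G, P)
```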
This paper proposes a two-stage maximum likelihood (ML) approach to normal mixture structural equation modeling (SEM), and develops statistical inference that allows for distributional misspecification. Saturated means and covariances are estimated at stage 1, together with a sandwich-type covariance matrix. These are used to evaluate structural models at stage 2. Techniques accumulated in the conventional SEM literature for model diagnosis and evaluation can be used to study the model structure for each component. Examples show that the two-stage ML approach leads to correct or nearly correct models even when the normal mixture assumptions are violated and initial models are misspecified. Compared to single-stage ML, two-stage ML avoids the confounding effect of model specification and the number of components, and is computationally more efficient. Monte Carlo results indicate that two-stage ML loses only minimal efficiency under the conditions where single-stage ML performs best. They also indicate that the commonly used model selection criterion BIC is more robust to distribution violations for the saturated model than for a structural model at moderate sample sizes. The proposed two-stage ML approach is also extremely flexible in modeling different components with different models. Potential new developments in the mixture modeling literature can easily be adapted to study issues with normal mixture SEM.
Asymptotics; efficiency; distribution violation; model misspecification; model modification; model evaluation; sandwich-type covariance matrix
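The "sandwich-type covariance matrix" referenced above has the generic bread-meat-bread form A^{-1} B A^{-1}. As a minimal, hedged illustration of that form only (using OLS as the estimating equation rather than the paper's stage-1 saturated model), the data and names below are invented for the example:

```python
# Sandwich covariance sketch: A is the average derivative of the score
# ("bread"), B the average outer product of scores ("meat").
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_t(df=3, size=n)  # non-normal errors

beta = np.linalg.solve(X.T @ X, X.T @ y)
r = y - X @ beta
A = X.T @ X / n                                   # "bread"
B = (X * r[:, None]).T @ (X * r[:, None]) / n     # "meat"
sandwich = np.linalg.inv(A) @ B @ np.linalg.inv(A) / n   # robust cov of beta
```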
Estimation of the covariance structure of longitudinal data poses significant challenges because the data are usually collected at irregular time points. A viable semiparametric model for covariance matrices was proposed in Fan, Huang and Li (2007) that allows one to estimate the variance function nonparametrically and to estimate the correlation function parametrically by aggregating information from irregular and sparse data points within each subject. However, the asymptotic properties of their quasi-maximum likelihood estimator (QMLE) of the parameters in the covariance model are largely unknown. In the current work, we address this problem in the context of more general models for the conditional mean function, including parametric, nonparametric, and semiparametric specifications. We also consider the possibility of a rough mean regression function and introduce a difference-based method to reduce bias in the context of varying-coefficient partially linear mean regression models. This provides a more robust estimator of the covariance function under a wider range of situations. Under some technical conditions, consistency and asymptotic normality are obtained for the QMLE of the parameters in the correlation function. Simulation studies and a real data example are used to illustrate the proposed approach.
Correlation structure; difference-based estimation; quasi-maximum likelihood; varying-coefficient partially linear model
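The difference-based idea can be shown in its simplest form: first differences of responses at adjacent time points cancel a smooth (or even rough) mean function, leaving a nearly mean-free quantity from which the error variance can be estimated. The sketch below is the classical first-order difference (Rice-type) estimator for equally spaced data, an illustration of the principle rather than the estimator developed in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 500)
y = np.sin(6 * np.pi * t) + rng.normal(scale=0.3, size=t.size)  # true sigma = 0.3

d = np.diff(y)                       # differencing removes the smooth mean
sigma2_hat = np.mean(d ** 2) / 2.0   # E[(y_{i+1} - y_i)^2] ~ 2 * sigma^2
print(sigma2_hat)                    # close to 0.09
```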
Variance component (VC) approaches based on restricted maximum likelihood (REML) are an attractive method for the positioning of quantitative trait loci (QTL). Linkage disequilibrium (LD) information can easily be implemented in the covariance structure among QTL effects (e.g. a genotype relationship matrix) and mapping resolution appears to be high. Because of the use of LD information, the covariance structure becomes much richer and denser than with linkage information alone. This makes an average information (AI) REML algorithm based on mixed model equations and sparse matrix techniques less useful. In addition, (near-)singularity problems often occur at the high marker densities common in fine-mapping, causing numerical problems in AIREML based on mixed model equations. The present study investigates the direct use of the variance-covariance matrix of all observations in AIREML for LD mapping with a general complex pedigree. The method presented is more efficient than the usual approach based on mixed model equations and robust to numerical problems caused by near-singularity due to closely linked markers. It is also feasible to fit multiple QTL simultaneously with the proposed method, whereas doing so would drastically increase computing time with mixed model equation-based methods.
quantitative trait loci; fine-mapping; linkage disequilibrium; average information; genotype relationship matrix
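Working directly with the n x n covariance matrix V of the observations means evaluating the REML log-likelihood from V itself rather than through mixed model equations. A hedged sketch of that evaluation follows, with V assumed to be assembled elsewhere (e.g. as sigma2_q * Z Gq Z' + sigma2_u * A + sigma2_e * I for a dense LD-based relationship matrix Gq); all names are placeholders:

```python
import numpy as np

def reml_loglik(y, X, V):
    """REML log-likelihood (up to a constant) computed directly from V.

    The AI algorithm would use P and P @ y to form average-information
    updates of the variance components; only the likelihood is shown here.
    """
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    # P = V^{-1} - V^{-1} X (X' V^{-1} X)^{-1} X' V^{-1}
    P = Vi - Vi @ X @ np.linalg.solve(XtViX, X.T @ Vi)
    _, logdetV = np.linalg.slogdet(V)
    _, logdetXVX = np.linalg.slogdet(XtViX)
    return -0.5 * (logdetV + logdetXVX + y @ P @ y)
```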
The selection of random effects in linear mixed models is an important yet challenging problem in practice. We propose a robust and unified framework for automatically selecting random effects and estimating covariance components in linear mixed models. A moment-based loss function is first constructed for estimating the covariance matrix of random effects. Two types of shrinkage penalties, a hard thresholding operator and a new sandwich-type soft-thresholding penalty, are then imposed for sparse estimation and random effects selection. Compared with existing approaches, the new procedure does not require any distributional assumption on the random effects and error terms. We establish the asymptotic properties of the resulting estimator in terms of its consistency in both random effects selection and variance component estimation. Optimization strategies are suggested to tackle the computational challenges involved in estimating the sparse variance-covariance matrix. Furthermore, we extend the procedure to incorporate the selection of fixed effects as well. Numerical results show promising performance of the new approach in selecting both random and fixed effects and, consequently, improving the efficiency of estimating model parameters. Finally, we apply the approach to a data set from the Amsterdam Growth and Health study.
Hard thresholding; Linear mixed model; Shrinkage estimation; Variance component selection
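To make the selection step concrete, the sketch below hard-thresholds an initial (e.g. moment-based) estimate of the random-effects covariance matrix and projects the result back to the positive semidefinite cone; the thresholding rule and names are illustrative, not the paper's exact operator:

```python
import numpy as np

def hard_threshold_cov(D_hat, tau):
    """Zero out small entries of a (symmetric) covariance estimate D_hat."""
    D = np.where(np.abs(D_hat) >= tau, D_hat, 0.0)
    # A zeroed diagonal entry removes that random effect entirely.
    keep = np.diag(D) > 0
    D[~keep, :] = 0.0
    D[:, ~keep] = 0.0
    # Eigenvalue clipping keeps the estimate positive semidefinite.
    w, U = np.linalg.eigh((D + D.T) / 2)
    return U @ np.diag(np.clip(w, 0.0, None)) @ U.T
```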
We present a novel method for estimating tree-structured covariance matrices directly from observed continuous data. Specifically, we estimate a covariance matrix from observations of p continuous random variables encoding a stochastic process over a tree with p leaves. A representation of this class of matrices as linear combinations of rank-one matrices indicating object partitions is used to formulate estimation as instances of well-studied numerical optimization problems.
In particular, our estimates are based on projection, where the covariance estimate is the nearest tree-structured covariance matrix to an observed sample covariance matrix. The problem is posed as a linear or quadratic mixed-integer program (MIP) where a setting of the integer variables in the MIP specifies a set of tree topologies of the structured covariance matrix. We solve these problems to optimality using efficient and robust existing MIP solvers.
We present a case study in phylogenetic analysis of gene expression and a simulation study comparing our method to distance-based tree estimation procedures.
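The rank-one representation used above can be made tangible with a toy tree: a tree-structured covariance is a nonnegative combination of outer products of indicator vectors, one per branch, where each indicator marks the leaves below that branch. The tree ((1,2),3) and branch lengths below are invented for illustration:

```python
import numpy as np

p = 3
branches = [                               # (leaf set below branch, branch length)
    ({0}, 0.4), ({1}, 0.3), ({2}, 0.9),    # pendant branches
    ({0, 1}, 0.5),                         # internal branch above leaves 1 and 2
]
Sigma = np.zeros((p, p))
for leaves, w in branches:
    u = np.array([1.0 if i in leaves else 0.0 for i in range(p)])
    Sigma += w * np.outer(u, u)            # rank-one partition-indicator term

# Sigma[0, 1] = 0.5 (shared path length); Sigma[0, 2] = Sigma[1, 2] = 0.
```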
Increasingly, scientific studies yield functional data, in which the ideal units of observation are curves and the observed data consist of sets of curves that are sampled on a fine grid. We present new methodology that generalizes the linear mixed model to the functional mixed model framework, with model fitting done by using a Bayesian wavelet-based approach. This method is flexible, allowing functions of arbitrary form and the full range of fixed effects structures and between-curve covariance structures that are available in the mixed model framework. It yields nonparametric estimates of the fixed and random-effects functions as well as the various between-curve and within-curve covariance matrices. The functional fixed effects are adaptively regularized as a result of the non-linear shrinkage prior that is imposed on the fixed effects' wavelet coefficients, and the random-effects functions experience a form of adaptive regularization because of the separately estimated variance components for each wavelet coefficient. Because we have posterior samples for all model quantities, we can perform pointwise or joint Bayesian inference or prediction on any of them. The adaptiveness of the method makes it especially appropriate for modelling irregular functional data that are characterized by numerous local features like peaks.
Bayesian methods; Functional data analysis; Mixed models; Model averaging; Nonparametric regression; Proteomics; Wavelets
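The adaptive regularization above works through shrinkage of wavelet coefficients. As a loose frequentist analogue only, this sketch soft-thresholds the detail coefficients of a noisy curve using PyWavelets (pywt, an assumed dependency); the Bayesian method instead shrinks via a nonlinear prior with per-coefficient variance components:

```python
import numpy as np
import pywt  # assumed available; pip install PyWavelets

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 512)
y = np.exp(-((t - 0.3) ** 2) / 0.002) + rng.normal(scale=0.1, size=t.size)  # spiky curve

coeffs = pywt.wavedec(y, 'db4', level=5)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745      # noise scale from finest level
thresh = sigma * np.sqrt(2 * np.log(t.size))        # universal threshold
coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode='soft') for c in coeffs[1:]]
y_smooth = pywt.waverec(coeffs, 'db4')              # adaptively denoised curve
```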
A heteroskedastic random coefficients model was described for analyzing weight performances between the 100th and the 650th days of age of Maine-Anjou beef cattle. This model contained fixed effects, random linear regressions and heterogeneous variance components. The objective of this study was to analyze the difference in growth curves between animals born as twin and single bull calves. The method was based on log-linear models for residual and individual variances expressed as functions of explanatory variables. An expectation-maximization (EM) algorithm was proposed for calculating restricted maximum likelihood (REML) estimates of the residual and individual components of variances and covariances. Likelihood ratio tests were used to assess hypotheses about parameters of this model. Growth of Maine-Anjou cattle was described by a third-order regression on age for the mean growth curve, two correlated random effects for the individual variability and independent errors. Three sources of heterogeneity of residual variances were detected. The difference in weight performance between bulls born as single and twin calves was estimated to be about 15 kg for the growth period considered.
heteroskedastic random coefficient model; EM-REML; robust estimators; growth curve; Maine-Anjou breed
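The log-linear variance idea above models residual variances as exp(w_i' delta) for explanatory variables w_i. The toy version below fits delta by direct numerical maximum likelihood with a single mean, standing in for the paper's EM-REML within a full random coefficient model; all names and data are invented:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 400
W = np.column_stack([np.ones(n), rng.integers(0, 2, size=n)])  # e.g. twin/single code
delta_true = np.array([np.log(100.0), 0.6])
y = 500.0 + rng.normal(scale=np.exp(W @ delta_true / 2))

def negloglik(theta):
    mu, delta = theta[0], theta[1:]
    s2 = np.exp(W @ delta)                 # log-linear residual variances
    return 0.5 * np.sum(np.log(s2) + (y - mu) ** 2 / s2)

fit = minimize(negloglik, x0=np.array([y.mean(), np.log(y.var()), 0.0]))
mu_hat, delta_hat = fit.x[0], fit.x[1:]
```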
Length-biased sampling is well recognized in economics, industrial reliability, etiology, epidemiology, genetics and cancer screening studies. Length-biased right-censored data have a unique structure different from traditional survival data, and the nonparametric and semiparametric estimation and inference methods for traditional survival data are not directly applicable to them. We propose new expectation-maximization algorithms for estimation based on full likelihoods involving infinite-dimensional parameters under three settings for length-biased data: estimating the nonparametric distribution function; estimating the nonparametric hazard function under an increasing failure rate constraint; and jointly estimating the baseline hazard function and covariate coefficients under the Cox proportional hazards model. Extensive simulation studies show that the maximum likelihood estimators perform well with moderate sample sizes and are more efficient than estimating equation approaches. The proposed estimates are also more robust to various right-censoring mechanisms. We prove strong consistency of the estimators, and establish asymptotic normality of the semiparametric maximum likelihood estimators under the Cox model using modern empirical process theory. We apply the proposed methods to a prevalent cohort medical study. Supplemental materials are available online.
Cox regression model; EM algorithm; Increasing failure rate; Non-parametric likelihood; Profile likelihood; Right-censored data
We establish a general asymptotic theory for nonparametric maximum likelihood estimation in semiparametric regression models with right censored data. We identify a set of regularity conditions under which the nonparametric maximum likelihood estimators are consistent, asymptotically normal, and asymptotically efficient with a covariance matrix that can be consistently estimated by the inverse information matrix or the profile likelihood method. The general theory allows one to obtain the desired asymptotic properties of the nonparametric maximum likelihood estimators for any specific problem by verifying a set of conditions rather than by proving technical results from first principles. We demonstrate the usefulness of this powerful theory through a variety of examples.
Counting process; empirical process; multivariate failure times; nonparametric likelihood; profile likelihood; survival data
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.
regression; classification; n ≪ p; covariance regularization
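A hedged sketch of the covariance-regularized recipe above: first obtain a shrunken inverse covariance of the features by L1-penalized normal likelihood (here via scikit-learn's GraphicalLasso, one member of the family the paper describes), then form regression coefficients from that estimate and the feature-response covariance. The data and the alpha value are illustrative only:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

Xc, yc = X - X.mean(0), y - y.mean()
theta = GraphicalLasso(alpha=0.1).fit(Xc).precision_   # shrunken inverse covariance
beta = theta @ (Xc.T @ yc / n)                         # regress response on features
```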
Principal component analysis is a widely used 'dimension reduction' technique, albeit generally at the phenotypic level. It is shown that we can estimate genetic principal components directly through a simple reparameterisation of the usual linear mixed model. This is applicable to any analysis fitting multiple, correlated genetic effects, whether effects for individual traits or sets of random regression coefficients to model trajectories. Depending on the magnitude of genetic correlations, a subset of the principal components generally suffices to capture the bulk of genetic variation. Corresponding estimates of genetic covariance matrices are more parsimonious, have reduced rank and are smoothed, with the number of parameters required to model the dispersion structure reduced from k(k + 1)/2 to m(2k - m + 1)/2 for k effects and m principal components. Estimation of these parameters, the leading eigenvalues and corresponding eigenvectors of the genetic covariance matrix, via restricted maximum likelihood using derivatives of the likelihood is described. It is shown that reduced rank estimation can substantially reduce the computational requirements of multivariate analyses. An application to the analysis of eight traits recorded via live ultrasound scanning of beef cattle is given.
covariances; principal components; restricted maximum likelihood; reduced rank
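The parameter saving quoted above is easy to verify: with m leading principal components, G is modelled as Q Q' for a k x m matrix Q with a triangle of entries fixed at zero for identifiability, giving m(2k - m + 1)/2 free parameters instead of k(k + 1)/2. A quick check for the eight-trait application:

```python
k, m = 8, 3                              # 8 traits, rank-3 genetic covariance
full = k * (k + 1) // 2                  # 36 parameters for unstructured G
reduced = m * (2 * k - m + 1) // 2       # 21 parameters at rank 3
print(full, reduced)                     # 36 21
```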
An observer performing a detection task analyzes an image and produces a single number, a test statistic, for that image. This test statistic represents the observer's “confidence” that a signal (e.g., a tumor) is present. The linear observer that maximizes the test-statistic SNR is known as the Hotelling observer. Generally, computation of the Hotelling SNR, or Hotelling trace, requires the inverse of a large covariance matrix. Recent developments have resulted in methods for the estimation and inversion of these large covariance matrices from relatively small numbers of images. The estimation and inversion of these matrices is made possible by a covariance-matrix decomposition that splits the full covariance matrix into an average detector-noise component and a background-variability component. Because the average detector-noise component is often diagonal and/or easily estimated, a full-rank, invertible covariance matrix can be produced with few images. We have studied the bias of estimates of the Hotelling trace using this decomposition for high-detector-noise and low-detector-noise situations. In extremely low-noise situations, this covariance decomposition may result in a significant bias. We present a theoretical evaluation of the Hotelling-trace bias, as well as extensive simulation studies.
Image quality; Hotelling observer; bias
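A minimal sketch of the decomposition described above: K = K_noise + K_background, where the diagonal detector-noise part keeps K invertible even when the background part, estimated from only a few images, is rank-deficient. The Hotelling SNR is then SNR^2 = ds' K^{-1} ds for mean signal ds. Dimensions and noise levels below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
npix, nimg = 64, 10
backgrounds = rng.normal(size=(nimg, npix))     # signal-absent sample images
ds = np.zeros(npix); ds[30:34] = 1.0            # mean signal profile

K_noise = 0.5 * np.eye(npix)                    # average detector noise (diagonal)
B = backgrounds - backgrounds.mean(0)
K_bg = B.T @ B / (nimg - 1)                     # rank <= nimg - 1 on its own
K = K_noise + K_bg                              # full rank thanks to K_noise
snr2 = ds @ np.linalg.solve(K, ds)              # Hotelling SNR squared
```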
The additive relationship matrix plays an important role in mixed model prediction of breeding values. For genotype matrix X (loci in columns), the product XX′ is widely used as a realized relationship matrix, but the scaling of this matrix is ambiguous. Our first objective was to derive a proper scaling such that the mean diagonal element equals 1+f, where f is the inbreeding coefficient of the current population. The result is a formula involving the covariance matrix for sampling genomic loci, which must be estimated with markers. Our second objective was to investigate whether shrinkage estimation of this covariance matrix can improve the accuracy of breeding value (GEBV) predictions with low-density markers. Using an analytical formula for shrinkage intensity that is optimal with respect to mean-squared error, simulations revealed that shrinkage can significantly increase GEBV accuracy in unstructured populations, but only for phenotyped lines; there was no benefit for unphenotyped lines. The accuracy gain from shrinkage increased with heritability, but at high heritability (> 0.6) this benefit was irrelevant because phenotypic accuracy was comparable. These trends were confirmed in a commercial pig population with progeny-test-estimated breeding values. For an anonymous trait where phenotypic accuracy was 0.58, shrinkage increased the average GEBV accuracy from 0.56 to 0.62 (SE < 0.00) when using random sets of 384 markers from a 60K array. We conclude that when moderate-accuracy phenotypes and low-density markers are available for the candidates of genomic selection, shrinkage estimation of the relationship matrix can improve genetic gain.
realized relationship matrix; genomic selection; breeding value prediction; shrinkage estimation; GenPred; Shared Data Resources
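The two steps above (scaling, then shrinkage) can be sketched with hedged choices: the scaling constant shown, 2 * sum p_j (1 - p_j), is one common choice standing in for the paper's covariance-based derivation, and delta is a fixed shrinkage intensity rather than the analytically optimal one:

```python
import numpy as np

rng = np.random.default_rng(6)
nind, nmrk = 50, 384
M = rng.integers(0, 3, size=(nind, nmrk)).astype(float)  # 0/1/2 genotype codes

p = M.mean(0) / 2.0
X = M - 2.0 * p                                  # center at allele frequencies
G = X @ X.T / (2.0 * np.sum(p * (1.0 - p)))      # scaled realized relationships

delta = 0.3                                      # shrinkage intensity in [0, 1]
T = np.eye(nind) * np.mean(np.diag(G))           # simple shrinkage target
G_shrunk = delta * T + (1.0 - delta) * G
```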
We consider methods for estimating the effect of a covariate on a disease onset distribution when the observed data structure consists of right-censored data on diagnosis times and current status data on onset times amongst individuals who have not yet been diagnosed. Dunson and Baird (2001, Biometrics 57, 396–403) approached this problem using maximum likelihood, under the assumption that the ratio of the diagnosis and onset distributions is monotonic nondecreasing. As an alternative, we propose a two-step estimator, an extension of the approach of van der Laan, Jewell, and Petersen (1997, Biometrika 84, 539–554) in the single sample setting, which is computationally much simpler and requires no assumptions on this ratio. A simulation study is performed comparing estimates obtained from these two approaches, as well as that from a standard current status analysis that ignores diagnosis data. Results indicate that the Dunson and Baird estimator outperforms the two-step estimator when the monotonicity assumption holds, but the reverse is true when the assumption fails. The simple current status estimator loses only a small amount of precision in comparison to the two-step procedure but requires monitoring time information for all individuals. In the data that motivated this work, a study of uterine fibroids and chemical exposure to dioxin, the monotonicity assumption is seen to fail. Here, the two-step and current status estimators both show no significant association between the level of dioxin exposure and the hazard for onset of uterine fibroids; the two-step estimator of the relative hazard associated with increasing levels of exposure has the least estimated variance amongst the three estimators considered.
Current status data; Proportional hazards; Uterine fibroids
Spatial data with covariate measurement errors are commonly observed in public health studies. Existing work concentrates mainly on parameter estimation using Gibbs sampling, and no work has been conducted to understand and quantify the theoretical impact of ignoring measurement error on spatial data analysis, in the form of the asymptotic biases in regression coefficients and variance components. Plausible frequentist implementations of maximum likelihood estimation in spatial covariate measurement error models are also elusive. In this paper, we propose a new class of linear mixed models for spatial data in the presence of covariate measurement errors. We show that, if measurement error is ignored, the naive estimators of the regression coefficients are attenuated while the naive estimators of the variance components are inflated. We further develop a structural modeling approach to obtaining the maximum likelihood estimator by accounting for the measurement error. We study the large sample properties of the proposed maximum likelihood estimator, and propose an EM algorithm to draw inference. All the asymptotic properties are shown under the increasing-domain asymptotic framework. We illustrate the method by analyzing the Scottish lip cancer data, and evaluate its performance through a simulation study, both of which elucidate the importance of adjusting for covariate measurement errors.
Measurement error; Spatial data; Structural modeling; Variance components; Asymptotic bias; Consistency and asymptotic normality; Increasing domain asymptotics; EM algorithm
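A quick numerical illustration of the attenuation result stated above, in the simplest non-spatial special case: regressing y on an error-prone covariate W = X + U attenuates the slope by roughly var(X) / (var(X) + var(U)). The numbers below are invented for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta = 10_000, 2.0
x = rng.normal(scale=1.0, size=n)
w = x + rng.normal(scale=0.5, size=n)            # measurement error, var(U) = 0.25
y = beta * x + rng.normal(scale=1.0, size=n)

Cwy = np.cov(w, y)
beta_naive = Cwy[0, 1] / Cwy[0, 0]
print(beta_naive)          # about 2.0 * 1 / (1 + 0.25) = 1.6, attenuated
```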
This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the application, sparsity may occur a priori in the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order (s_n log p_n / n)^{1/2}, where s_n is the number of nonzero elements, p_n is the size of the covariance matrix and n is the sample size. This explicitly spells out that the contribution of high dimensionality is merely a logarithmic factor. The conditions on the rate at which the tuning parameter λ_n goes to 0 are made explicit and compared under different penalties. As a result, for the L1-penalty, to guarantee sparsistency and the optimal rate of convergence, the number of nonzero elements should be small: s_n' = O(p_n) at most, among O(p_n^2) parameters, for estimating a sparse covariance or correlation matrix, a sparse precision or inverse correlation matrix, or a sparse Cholesky factor, where s_n' is the number of nonzero off-diagonal elements. In contrast, the SCAD and hard-thresholding penalty functions impose no such restriction.
Covariance matrix; high dimensionality; consistency; nonconcave penalized likelihood; sparsistency; asymptotic normality
This paper presents a spatio-temporal framework for estimating single-trial response latencies and amplitudes from evoked response MEG/EEG data. Spatial and temporal bases are employed to capture the aspects of the evoked response that are consistent across trials. Trial amplitudes are assumed independent but share the same underlying normal distribution with unknown mean and variance. The trial latency is assumed to be deterministic but unknown. We assume the noise is spatially correlated with an unknown covariance matrix. We introduce a generalized expectation-maximization algorithm called TriViAL (Trial Variability in Amplitude and Latency), which computes the maximum likelihood (ML) estimates of the amplitudes, latencies, basis coefficients, and noise covariance matrix. The proposed approach also performs ML source localization by scanning the TriViAL algorithm over spatial bases corresponding to different locations on the cortical surface. Source locations are identified as the locations with large likelihood values.
The effectiveness of the TriViAL algorithm is demonstrated using simulated data and human evoked response experiments. The localization performance is validated using tactile stimulation of the finger. The efficacy of the algorithm in estimating latency variability is shown using the known dependence of the M100 auditory response latency on stimulus tone frequency. We also demonstrate that estimation of the response amplitude is improved when latency is included in the signal model.
Amplitude and/or latency variability; Evoked-response MEG/EEG; Expectation-maximization; Independent response; Maximum likelihood
In longitudinal and repeated measures data analysis, the goal is often to determine the effect of a treatment or exposure on a particular outcome (e.g., disease progression). We consider a semiparametric repeated measures regression model, where the parametric component models the effect of the variable of interest and any modification by other covariates. The expectation of this parametric component over the other covariates is a measure of variable importance. Here, we present a targeted maximum likelihood estimator of the finite-dimensional regression parameter, which is easily computed using standard software for generalized estimating equations.
The targeted maximum likelihood method provides doubly robust and locally efficient estimates of the variable importance parameters, with inference based on the influence curve. We demonstrate these properties through simulation under correct and incorrect model specification, and apply our method in practice to estimating transcription factor (TF) activity over the cell cycle in yeast. We specifically target the importance of SWI4, SWI6, MBP1, MCM1, ACE2, FKH2, NDD1, and SWI5.
The semiparametric model allows us to determine the importance of a TF at specific time points by specifying time indicators as potential effect modifiers of the TF. Our results are promising, showing significant importance trends during the expected time periods. This methodology can also be used as a variable importance analysis tool to assess the effect of a large number of variables such as gene expressions or single nucleotide polymorphisms.
targeted maximum likelihood; semiparametric; repeated measures; longitudinal; transcription factors
We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method’s close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.
Concave-convex procedure; Covariance graph; Covariance matrix; Generalized gradient descent; Lasso; Majorization-minimization; Regularization; Sparsity
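The abstract notes that simple elementwise thresholding of the empirical covariance matrix is a competitive baseline for recovering the sparsity pattern. A sketch of that baseline follows; unlike the penalized likelihood estimator, it does not guarantee positive definiteness:

```python
import numpy as np

def soft_threshold_cov(S, lam):
    """Soft-threshold off-diagonal entries of a sample covariance matrix."""
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(T, np.diag(S))        # leave variances untouched
    return T

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
S = np.cov(X, rowvar=False)
S_sparse = soft_threshold_cov(S, lam=0.1)
```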
In this article we introduce a new procedure for estimating population parameters under inequality constraints (known as order restrictions) when the unrestricted maximum likelihood estimator (UMLE) is multivariate normally distributed with a known covariance matrix. Furthermore, a Dunnett-type test procedure along with the corresponding simultaneous confidence intervals is proposed for drawing inferences on elementary contrasts of population parameters under order restrictions. The proposed methodology is motivated by estimation and testing problems encountered in the analysis of covariance models. It is well known that the restricted maximum likelihood estimator (RMLE) may perform poorly in terms of quadratic loss under certain conditions, for example when the UMLE follows a multivariate normal distribution with means satisfying a simple tree order restriction and the dimension of the population mean vector is large. We investigate the performance of the proposed estimator analytically as well as through computer simulations and find that the proposed method does not fail in the situations where the RMLE fails. We illustrate the proposed methodology by re-analyzing a recently published rat uterotrophic bioassay data set.
Linear model; linked parameters; nodal parameter; order restrictions; restricted maximum likelihood estimators (RMLE); simple order; simple tree order; umbrella order
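For orientation, the simplest setting covered by the framework above can be sketched directly: for independent normal UMLEs with known variances under a simple order m_1 <= ... <= m_k, the RMLE is the weighted isotonic regression of the UMLE (here via scikit-learn's IsotonicRegression). The paper's proposal addresses richer orders and correlated UMLEs; the numbers below are invented:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

umle = np.array([0.2, 0.9, 0.5, 1.4, 1.1])      # unrestricted estimates
var = np.array([0.1, 0.2, 0.1, 0.3, 0.2])       # known sampling variances

iso = IsotonicRegression()                      # pool-adjacent-violators
rmle = iso.fit_transform(np.arange(umle.size), umle, sample_weight=1.0 / var)
```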
Pooling DNA samples from multiple individuals has been advocated as a method to reduce genotyping costs. Under such a scheme, only the allele counts at each locus, not the haplotype information, are observed. We develop a systematic way of handling such data by formulating the problem in terms of contingency tables, where pooled allele counts are expressed as the margins and the haplotype counts correspond to the unobserved cell counts. We show that the cell frequencies can be uniquely determined from the marginal frequencies under the usual Hardy-Weinberg equilibrium (HWE) assumption and that the maximum likelihood estimates of haplotype frequencies are consistent and asymptotically normal as the number of pools increases. The limiting covariance matrix is shown to be closely related to the extended hypergeometric distribution. Our results are used to derive Wald-type tests for the linkage disequilibrium coefficient using pooled data. It turns out that pooling is not efficient for testing weak linkage disequilibrium despite its efficiency in estimating haplotype frequencies. We also show by simulations that the proposed linkage disequilibrium tests are robust to slight deviations from HWE and to minor genotyping errors. Applications to two real angiotensinogen gene data sets are also provided.
Contingency table; DNA pooling; haplotype; hypergeometric distribution; linkage disequilibrium
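The contingency-table formulation above lends itself to a small EM sketch for the two-locus case: each pool's allele counts (a, b) are the margins of a 2 x 2 table whose AB-haplotype count x is unobserved, and the E-step averages over the feasible x with extended-hypergeometric-type weights. This toy implementation, with invented pool data, illustrates the structure rather than reproducing the paper's estimator:

```python
import numpy as np
from scipy.special import gammaln

def em_pooled_haplotypes(pools, n_per_pool, iters=200):
    """EM for two-locus haplotype frequencies from pooled allele counts.

    pools: (a, b) allele counts per pool, each out of 2 * n_per_pool
    haplotypes; returns (p_AB, p_Ab, p_aB, p_ab) under HWE.
    """
    m = 2 * n_per_pool
    p = np.full(4, 0.25)
    for _ in range(iters):
        counts = np.zeros(4)
        for a, b in pools:
            x = np.arange(max(0, a + b - m), min(a, b) + 1)   # feasible AB counts
            cells = np.stack([x, a - x, b - x, m - a - b + x], axis=1)
            logw = (-np.sum(gammaln(cells + 1), axis=1)       # multinomial coeff.
                    + cells @ np.log(p))
            w = np.exp(logw - logw.max()); w /= w.sum()
            counts += w @ cells                               # E-step: expected cells
        p = counts / counts.sum()                             # M-step
    return p

# Example: five pools of 20 individuals (40 haplotypes each).
pools = [(22, 18), (25, 24), (19, 21), (28, 26), (17, 15)]
p_hat = em_pooled_haplotypes(pools, n_per_pool=20)
```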
Improving the efficiency of regression coefficient estimation and predicting trajectories of individuals are two important aspects of the analysis of longitudinal data. Both involve estimation of the covariance function. Yet challenges arise in estimating the covariance function of longitudinal data collected at irregular time points. A class of semiparametric models for the covariance function is proposed by imposing a parametric correlation structure while allowing a nonparametric variance function. A kernel estimator is developed for estimating the nonparametric variance function. Two methods, a quasi-likelihood approach and a minimum generalized variance method, are proposed for estimating the parameters in the correlation structure. We introduce a semiparametric varying-coefficient partially linear model for longitudinal data and propose an estimation procedure for the model coefficients using a profile weighted least squares approach. Sampling properties of the proposed estimation procedures are studied and asymptotic normality of the resulting estimators is established. Finite sample performance of the proposed procedures is assessed by Monte Carlo simulation studies. The proposed methodology is illustrated by the analysis of a real data example.
Kernel regression; local linear regression; profile weighted least squares; semiparametric varying coefficient model
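The kernel variance-function step above can be sketched simply: squared residuals from a fitted mean are smoothed over time with a Nadaraya-Watson estimator to give a nonparametric variance function, after which the correlation structure is fitted parametrically on top. The Gaussian kernel, bandwidth, and data below are illustrative choices:

```python
import numpy as np

def kernel_variance(t_obs, resid2, t_grid, h):
    """Nadaraya-Watson smooth of squared residuals over time."""
    K = np.exp(-0.5 * ((t_grid[:, None] - t_obs[None, :]) / h) ** 2)
    return (K @ resid2) / K.sum(axis=1)

rng = np.random.default_rng(9)
t_obs = rng.uniform(0, 1, 300)                       # irregular observation times
sd = 0.5 + t_obs                                     # true standard deviation function
resid2 = (sd * rng.normal(size=t_obs.size)) ** 2     # squared residuals
var_hat = kernel_variance(t_obs, resid2, np.linspace(0, 1, 50), h=0.1)
```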
Motivated by the spatial modeling of aberrant crypt foci (ACF) in colon carcinogenesis, we consider binary data with probabilities modeled as the sum of a nonparametric mean plus a latent Gaussian spatial process that accounts for short-range dependencies. The mean is modeled in a general way using regression splines. The mean function can be viewed as a fixed effect and is estimated with a penalty for regularization. With the latent process viewed as another random effect, the model becomes a generalized linear mixed model. In our motivating data set and other applications, the sample size is too large to easily accommodate maximum likelihood or restricted maximum likelihood (REML) estimation, so pairwise likelihood, a special case of composite likelihood, is used instead. We develop an asymptotic theory for models that are sufficiently general to be used in a wide variety of applications, including, but not limited to, the problem that motivated this work. The splines have penalty parameters that must converge to zero asymptotically; we derive theory for this along with a data-driven method for selecting the penalty parameter, a method that is shown in simulations to improve greatly upon standard devices such as likelihood cross-validation. Finally, we apply the methods to the ACF data from our experiment. We discover an unexpected location for peak formation of ACF.
Aberrant crypt foci; Colon carcinogenesis; Composite likelihood; Generalized linear mixed models; Longitudinal data; Pairwise likelihood; Partially linear model; Semiparametric regression; Single index models; Spatial statistics
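A pairwise likelihood for binary data with a latent Gaussian field can be sketched in a probit variant: with Y_i = 1{f_i + b_i + e_i > 0}, e_i ~ N(0, 1) independent and b ~ N(0, C), each pair contributes a bivariate normal orthant probability. This simplified stand-in for the paper's spline-plus-spatial-process specification loops over all pairs; in practice only nearby pairs would be used:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def pairwise_loglik(y, f, C):
    s = np.sqrt(1.0 + np.diag(C))
    a = f / s                                   # marginal probit thresholds
    ll, n = 0.0, y.size
    for i in range(n):
        for j in range(i + 1, n):
            rho = C[i, j] / (s[i] * s[j])
            p11 = multivariate_normal(cov=[[1, rho], [rho, 1]]).cdf([a[i], a[j]])
            p1, p2 = norm.cdf(a[i]), norm.cdf(a[j])
            probs = {(1, 1): p11, (1, 0): p1 - p11,
                     (0, 1): p2 - p11, (0, 0): 1 - p1 - p2 + p11}
            ll += np.log(probs[(y[i], y[j])])
    return ll

# Tiny usage: 5 locations on a line with an exponential spatial covariance.
coords = np.linspace(0, 1, 5)
C = 0.8 * np.exp(-np.abs(coords[:, None] - coords[None, :]) / 0.3)
print(pairwise_loglik(np.array([1, 1, 0, 0, 1]), np.zeros(5), C))
```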
A generalized self-consistency approach to maximum likelihood estimation (MLE) and model building was developed by Tsodikov (2003) and applied to a survival analysis problem. We extend the framework to obtain second-order results such as the information matrix and properties of the variance. The multinomial model motivates the paper and is used throughout as an example. Computational challenges with the multinomial likelihood motivated Baker (1994) to develop the Multinomial-Poisson (MP) transformation for a large variety of regression models with a multinomial likelihood kernel. Multinomial regression is transformed into a Poisson regression at the cost of augmenting the model parameters and restricting the problem to discrete covariates. Imposing the normalization restrictions by means of Lagrange multipliers (Lang, 1996) justifies the approach. Using the self-consistency framework, we develop an alternative solution to multinomial model fitting that does not require augmenting parameters while allowing for a Poisson likelihood and arbitrary covariate structures. Normalization restrictions are imposed by averaging over artificial “missing data” (a fake mixture). The lack of a probabilistic interpretation at the “complete-data” level makes the generalized self-consistency machinery essential.
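A numeric sketch of the MP transformation referenced above: treating multinomial cell counts as independent Poisson counts with unconstrained means lam_j recovers the multinomial MLE, since the implicit total mu = sum(lam) absorbs the normalization. The counts below are invented; the paper's self-consistency approach reaches a Poisson likelihood without the augmented parameter:

```python
import numpy as np
from scipy.optimize import minimize

counts = np.array([30.0, 50.0, 20.0])

def neg_poisson_loglik(eta):          # lam_j = exp(eta_j); mu = sum(lam) implicit
    lam = np.exp(eta)
    return np.sum(lam - counts * np.log(lam))

fit = minimize(neg_poisson_loglik, x0=np.log(counts.mean()) * np.ones(3))
lam_hat = np.exp(fit.x)               # equals counts at the optimum
p_hat = lam_hat / lam_hat.sum()       # equals counts / counts.sum()
print(p_hat)                          # [0.3, 0.5, 0.2]
```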