Rationale: Previous studies of chronic obstructive pulmonary disease (COPD) have suggested that genetic factors play an important role in the development of disease. However, single-nucleotide polymorphisms that are associated with COPD in genome-wide association studies have been shown to account for only a small percentage of the genetic variance in phenotypes of COPD, such as spirometry and imaging variables. These phenotypes are highly predictive of disease, and family studies have shown that spirometric phenotypes are heritable.
Objectives: To assess the heritability and coheritability of four major COPD-related phenotypes (measurements of FEV1, FEV1/FVC, percent emphysema, and percent gas trapping), and COPD affection status in smokers of non-Hispanic white and African American descent using a population design.
Methods: Single-nucleotide polymorphisms from genome-wide association studies chips were used to calculate the relatedness of pairs of individuals and a mixed model was adopted to estimate genetic variance and covariance.
Measurements and Main Results: In the non-Hispanic whites, estimated heritabilities of FEV1 and FEV1/FVC were both about 37%, consistent with estimates in the literature from family-based studies. For chest computed tomography scan phenotypes, estimated heritabilities were both close to 25%. Heritability of COPD affection status was estimated as 37.7% in both populations.
Conclusions: This study suggests that a large portion of the genetic risk of COPD is yet to be discovered and gives rationale for additional genetic studies of COPD. The estimates of coheritability (genetic covariance) for pairs of the phenotypes suggest considerable overlap of causal genetic loci.
missing heritability; pleiotropy; pulmonary function; imaging phenotypes; chromosomal partition
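The Methods of the abstract above describe using genome-wide SNPs to calculate the relatedness of pairs of individuals before fitting a mixed model. The abstract does not say which relatedness estimator was used, so the following is only a minimal sketch of one widely used choice, the average cross-product of standardized genotypes. The function name and the simulated genotypes are hypothetical and are not taken from the study.

```python
import numpy as np

def genetic_relationship_matrix(genotypes):
    """Sketch of a SNP-based relatedness matrix (assumed estimator, not the study's).

    genotypes: (n_individuals, n_snps) array of allele counts coded 0, 1 or 2.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                  # sample allele frequency per SNP
    keep = (p > 0.0) & (p < 1.0)              # monomorphic SNPs carry no information
    g, p = g[:, keep], p[keep]
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))   # standardize each SNP
    return z @ z.T / z.shape[1]               # average over SNPs

# Simulated genotypes for illustration only (made-up data).
rng = np.random.default_rng(0)
freqs = rng.uniform(0.1, 0.5, size=500)
geno = rng.binomial(2, freqs, size=(20, 500))
grm = genetic_relationship_matrix(geno)
```

In a mixed-model analysis, a matrix of this kind would serve as the covariance structure of the genetic random effect; the variance-component estimation itself is not shown here.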
We previously reported genome-wide significant evidence for linkage between chromosome 6q and bipolar I disorder (BPI) by performing a meta-analysis of original genotype data from 11 genome scan linkage studies. We now present follow-up linkage disequilibrium mapping of the linked region utilizing 3,047 single nucleotide polymorphism (SNP) markers in a case–control sample (N = 530 cases, 534 controls) and family-based sample (N = 256 nuclear families, 1,301 individuals). The strongest single SNP result (rs6938431, P = 6.72 × 10−5) was observed in the case–control sample, near the solute carrier family 22, member 16 gene (SLC22A16). In a replication study, we genotyped 151 SNPs in an independent sample (N = 622 cases, 1,181 controls) and observed further evidence of association between variants at SLC22A16 and BPI. Although consistent evidence of association with any single variant was not seen across samples, SNP-wise and gene-based test results in the three samples provided convergent evidence for association with SLC22A16, a carnitine transporter, implicating this gene as a novel candidate for BPI risk. Further studies in larger samples are warranted to clarify which, if any, genes in the 6q region confer risk for bipolar disorder.
bipolar disorder; genetic; association; SLC22A16; 6q
The revolution in next-generation sequencing has made obtaining both common and rare high-quality sequence variants across the entire genome feasible. Because researchers are now faced with the analytical challenges of handling a massive amount of genetic variant information from sequencing studies, numerous methods have been developed to assess the impact of both common and rare variants on disease traits. In this report, whole genome sequencing data from Genetic Analysis Workshop 18 was used to compare the power of several methods, considering both family-based and population-based designs, to detect association with variants in the MAP4 gene region and on chromosome 3 with blood pressure. To prioritize variants across the genome for testing, variants were first functionally assessed using prediction algorithms and expression quantitative trait loci (eQTLs) data. Four set-based tests in the family-based association tests (FBAT) framework--FBAT-v, FBAT-lmm, FBAT-m, and FBAT-l--were used to analyze 20 pedigrees, and 2 variance component tests, sequence kernel association test (SKAT) and genome-wide complex trait analysis (GCTA), were used with 142 unrelated individuals in the sample. Both set-based and variance-component-based tests had high power and an adequate type I error rate. Of the various FBATs, FBAT-l demonstrated superior performance, indicating the potential for it to be used in rare-variant analysis. The updated FBAT package is available at: http://www.hsph.harvard.edu/fbat/.
Anorexia nervosa and bulimia nervosa (BN) are rare, but eating disorders not otherwise specified (EDNOS) are relatively common among female participants. Our objective was to evaluate whether BN and subtypes of EDNOS are predictive of developing adverse outcomes.
This study comprised a prospective analysis of 8594 female participants from the ongoing Growing Up Today Study. Questionnaires were sent annually from 1996 through 2001, then biennially through 2007 and 2008. Participants who were 9 to 15 years of age in 1996 and completed at least 2 consecutive questionnaires between 1996 and 2008 were included in the analyses. Participants were classified as having BN (≥weekly binge eating and purging), binge eating disorder (BED; ≥weekly binge eating, infrequent purging), purging disorder (PD; ≥weekly purging, infrequent binge eating), other EDNOS (binge eating and/or purging monthly), or nondisordered.
BN affected ∼1% of adolescent girls; 2% to 3% had PD and another 2% to 3% had BED. Girls with BED were almost twice as likely as their nondisordered peers to become overweight or obese (odds ratio [OR]: 1.9 [95% confidence interval: 1.0–3.5]) or develop high depressive symptoms (OR: 2.3 [95% confidence interval: 1.0–5.0]). Female participants with PD had a significantly increased risk of starting to use drugs (OR: 1.7) and starting to binge drink frequently (OR: 1.8).
PD and BED are common and predict a range of adverse outcomes. Primary care clinicians should be made aware of these disorders, which may be underrepresented in eating disorder clinic samples. Efforts to prevent eating disorders should focus on cases of subthreshold severity.
adolescents; eating disorders; epidemiology; obesity; substance use
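The classification rules given in the methods of the abstract above can be sketched as a small decision function. This is an illustration only: the function name, the use of episodes per month as input, and the thresholds (4 or more per month standing in for "at least weekly", anything below that for "infrequent") are assumptions, not the study's operational definitions.

```python
def classify_eating_behavior(binge_per_month, purge_per_month):
    """Assign one of the abstract's categories from monthly episode counts.

    Assumption: "at least weekly" is approximated as >= 4 episodes per month,
    and "infrequent" as anything below that threshold.
    """
    weekly = 4
    binge_weekly = binge_per_month >= weekly
    purge_weekly = purge_per_month >= weekly
    if binge_weekly and purge_weekly:
        return "BN"                # weekly binge eating and purging
    if binge_weekly:
        return "BED"               # weekly binge eating, infrequent purging
    if purge_weekly:
        return "PD"                # weekly purging, infrequent binge eating
    if binge_per_month >= 1 or purge_per_month >= 1:
        return "other EDNOS"       # binge eating and/or purging monthly
    return "nondisordered"
```

Checking the two weekly conditions first keeps the categories mutually exclusive, which matches how the abstract lists them.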
To identify predictors of becoming eating disordered among adolescents.
Prospective cohort study.
Girls (n=6916) and boys (n=5618), aged 9 to 15 years at baseline, in the ongoing Growing Up Today Study (GUTS).
Parent, peer, and media influences.
Main Outcome Measures
Onset of starting to binge eat or purge (ie, vomiting or using laxatives) at least weekly.
During 7 years of follow-up, 4.3% of female subjects and 2.3% of male subjects (hereafter referred to as “females” and “males”) started to binge eat and 5.3% of females and 0.8% of males started to purge to control their weight. Few participants started to both binge eat and purge. Rates and risk factors varied by sex and age group (<14 vs ≥14 years). Females younger than 14 years whose mothers had a history of an eating disorder were nearly 3 times more likely than their peers to start purging at least weekly (odds ratio, 2.8; 95% confidence interval, 1.3–5.9); however, maternal history of an eating disorder was unrelated to risk of starting to binge eat or purge in older adolescent females. Frequent dieting and trying to look like persons in the media were independent predictors of binge eating in females of all ages. In males, negative comments about weight by fathers were predictive of starting to binge at least weekly.
Risk factors for the development of binge eating and purging differ by sex and by age group in females. Maternal history of an eating disorder is a risk factor only in younger adolescent females.
We previously reported that asthmatic children with GSTM1 null genotype may be more susceptible to the acute effect of ozone on the small airways and might benefit from antioxidant supplementation. This study aims to assess the acute effect of ozone on lung function (FEF25-75) in asthmatic children according to dietary intake of vitamin C and the number of putative risk alleles in three antioxidant genes: GSTM1, GSTP1 (rs1695), and NQO1 (rs1800566).
257 asthmatic children from two cohort studies conducted in Mexico City were included. Stratified linear mixed models with random intercepts and random slopes on ozone were used. Potential confounding by ethnicity was assessed. Analyses were conducted under single gene and genotype score approaches.
The change in FEF25-75 per interquartile range (60 ppb) of ozone in persistent asthmatic children with low vitamin C intake and GSTM1 null was −91.2 ml/s (p = 0.06). Persistent asthmatic children with 4 to 6 risk alleles and low vitamin C intake showed an average decrement in FEF25-75 of 97.2 ml/s per 60 ppb of ozone (p = 0.03). In contrast, in children with 1 to 3 risk alleles, acute effects of ozone on FEF25-75 did not differ by vitamin C intake.
Our results provide further evidence that asthmatic children predicted to have compromised antioxidant defense by virtue of genetic susceptibility combined with deficient antioxidant intake may be at increased risk of adverse effects of ozone on pulmonary function.
Air pollution; Asthmatic children; Antioxidant genes; Mexico City; Vitamin C
It is useful to have robust gene-environment interaction tests that can utilize a variety of family structures in an efficient way. This paper focuses on tests for gene-environment interaction in the presence of main genetic and environmental effects. The objective is to develop powerful tests that can combine trio data with parental genotypes and discordant sibships when parental genotypes are missing. We first make a modest improvement on a method for discordant sibs (discordant on phenotype), but the approach does not allow one to use families when all offspring are affected, e.g. trios. We then make a modest improvement on a Mendelian transmission-based approach that is inefficient when discordant sibs are available, but can be applied to any nuclear family. Finally, we propose a hybrid approach that utilizes the most efficient method for a specific family type, then combines over families. We utilize this hybrid approach to analyze a chronic obstructive pulmonary disorder dataset to test for gene-environment interaction in the Serpine2 gene with smoking. The methods are freely available in the R package fbati.
Gene-Environment Interaction; Family-Based Association Tests; Candidate Gene Analysis; Binary Trait; COPD; Serpine2
Compositional epistasis is said to be present when the effect of a genetic factor at one locus is masked by a variant at another locus. Although such compositional epistasis is not equivalent to the presence of an interaction in a statistical model, non-standard tests can sometimes be used to detect compositional epistasis. In this paper we consider empirical tests for compositional epistasis under models for the joint effect of two genetic factors which place no restrictions on the main effects of each factor but constrain the interactive effects of the two factors so as to be captured by a single parameter in the model. We describe the implications of these tests for cohort, case-control, case-only and family-based study designs and we illustrate the methods using an example of gene-gene interaction already reported in the literature.
In clinical trials multiple outcomes are often used to assess treatment interventions. This paper presents an evaluation of likelihood-based methods for jointly testing treatment effects in clinical trials with multiple continuous outcomes. Specifically, we compare the power of joint tests of treatment effects obtained from joint models for the multiple outcomes with univariate tests based on modelling the outcomes separately. We also consider the power and bias of tests when data are missing, a common feature of many trials, especially in psychiatry. Our results suggest that joint tests capitalize on the correlation of multiple outcomes and are more powerful than standard univariate methods, especially when outcomes are missing completely at random. When outcomes are missing at random, test procedures based on correctly specified joint models are unbiased, while standard univariate procedures are not. Results of a simulation study are reported, and the methods are illustrated in an example from the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) for schizophrenia.
joint tests; multiple outcomes; power; missing data; psychiatry
Investigators sometimes use information about a given variable obtained from multiple informants. We focus on estimating the effect of a predictor on a continuous outcome, when that predictor cannot be observed directly but is measured by two informants. We describe various approaches to using information from two informants to estimate a regression or correlation coefficient for the effect of the (true) predictor on the outcome. These approaches include methods we refer to as single informant, simple average, optimal weighted average, principal components analysis, and classical measurement error. Each of these five methods effectively uses a weighted average of the informants' reports as a proxy for the true predictor in calculating the correlation or regression coefficient. We compare the performance of these methods in simulation experiments that assume a rounded congeneric measurement model for the relationship between the informants' reports and a true predictor that is a mixture of zeros and positively-distributed continuous values. We also compare the methods' performance in a real data example: the relationship between vigorous physical activity (the predictor) and body mass index (the continuous outcome). The results of the simulations and the example suggest that the simple average is a reasonable choice when there are only two informants.
Recent advances in next-generation sequencing technologies have made it possible to generate large amounts of sequence data with rare variants in a cost-effective way. Statistical methods that test variants individually are underpowered to detect rare variants, so it is desirable to perform association analysis of rare variants by combining the information from all variants. In this study, we use a Bayesian regression method to model all variants simultaneously to identify rare variants in a data set from Genetic Analysis Workshop 17. We studied the association between the quantitative risk traits Q1, Q2, and Q4 and the single-nucleotide polymorphisms and identified several positive single-nucleotide polymorphisms for traits Q1 and Q2. However, the model also generated several apparent false positives and missed many true positives, suggesting that there is room for improvement in this model.
In this article, we propose and explore a multivariate logistic regression model for analyzing multiple binary outcomes with incomplete covariate data where auxiliary information is available. The auxiliary data are extraneous to the regression model of interest but predictive of the covariate with missing data. Horton and Laird (2001) describe how the auxiliary information can be incorporated into a regression model for a single binary outcome with missing covariates, and hence the efficiency of the regression estimators can be improved. We consider extending the method of Horton and Laird (2001) to the case of a multivariate logistic regression model for multiple correlated outcomes, and with missing covariates and completely observed auxiliary information. We demonstrate that in the case of moderate to strong associations among the multiple outcomes, one can achieve considerable gains in efficiency from estimators in a multivariate model as compared to the marginal estimators of the same parameters.
Asymptotic relative efficiency; Auxiliary information; Incomplete data; Logistic regression model; Missing covariates; Multiple outcomes
The recent emergence of massively parallel sequencing technologies has enabled an increasing number of human genome re-sequencing studies, notable among them being the 1000 Genomes Project. The main aim of these studies is to identify the yet unknown genetic variants in a genomic region, mostly low frequency variants (frequency less than 5%). We propose here a set of statistical tools that address how to optimally design such studies in order to increase the number of genetic variants we expect to discover. Within this framework, the tradeoff between lower coverage for more individuals and higher coverage for fewer individuals can be naturally solved.
The methods here are also useful for estimating the number of genetic variants missed in a discovery study performed at low coverage.
We show applications to simulated data based on coalescent models and to sequence data from the ENCODE project. In particular, we show the extent to which combining data from multiple populations in a discovery study may increase the number of genetic variants identified relative to studies on single populations.
species problem; variant discovery studies; sequencing technologies
The fgui R package is designed for developers of R packages, to help rapidly, and sometimes fully automatically, create a graphical user interface for a command line R package. The interface is built upon the Tcl/Tk graphical interface included in R. The package further assists the developer by loading in the help files from the command line functions to provide context sensitive help to the user with no additional effort from the developer. Passing a function as the argument to the routines in the fgui package creates a graphical interface for the function, and further options are available to tweak this interface for those who want more flexibility.
GUI; interface; fgui
When testing for genetic effects, failure to account for a gene-environment interaction can mask the true association effects of a genetic marker with disease. Family-based association tests are popular because they are completely robust to population substructure and model misspecification. However, when testing for an interaction, failure to model the main genetic effect correctly can lead to spurious results. Here we propose a family-based test for interaction that is robust to model misspecification, but still sensitive to an interaction effect, and can handle continuous covariates and missing parents. We extend the FBAT-I gene-environment interaction test for dichotomous traits to using both trios and sibships. We then compare this extension to joint tests of gene and gene-environment interaction, and compare the joint test additionally to the main effects test of the gene. Lastly we apply these three tests to a group of nuclear families ascertained according to affection with Bipolar Disorder.
genetic association; genetic interaction; family-based test; FBAT-I
We introduce a method of estimating disease prevalence from case-control family study data. Case-control family studies are performed to investigate the familial aggregation of disease; families are sampled via either a case or a control proband, and the resulting data contain information on disease status and covariates for the probands and their relatives. Here, we introduce estimators for overall prevalence and for covariate-stratum-specific (e.g., sex-specific) prevalence. These estimators combine the proportion of affected relatives of control probands with the proportion of affected relatives of case probands and are designed to yield approximately unbiased estimates of their population counterparts under certain commonly-made assumptions. We also introduce corresponding confidence intervals designed to have good coverage properties even for small prevalences. Next, we describe simulation experiments where our estimators and intervals were applied to case-control family data sampled from fictional populations with various levels of familial aggregation. At all aggregation levels, the resulting estimates varied closely and symmetrically around their population counterparts, and the resulting intervals had good coverage properties, even for small sample sizes. Finally, we discuss the assumptions required for our estimators to be approximately unbiased, highlighting situations where an alternative estimator based only on relatives of control probands may perform better.
Case-control family study; Population prevalence; Proband; Propositus method
Rapid advances in sequencing technologies set the stage for the large-scale medical sequencing efforts to be performed in the near future, with the goal of assessing the importance of rare variants in complex diseases. The discovery of new disease susceptibility genes requires powerful statistical methods for rare variant analysis. The low frequency and the expected large number of such variants pose great difficulties for the analysis of these data. We propose here a robust and powerful testing strategy to study the role rare variants may play in affecting susceptibility to complex traits. The strategy is based on assessing whether rare variants in a genetic region collectively occur at significantly higher frequencies in cases compared with controls (or vice versa). A main feature of the proposed methodology is that, although it is an overall test assessing a possibly large number of rare variants simultaneously, the disease variants can be both protective and risk variants, with moderate decreases in statistical power when both types of variants are present. Using simulations, we show that this approach can be powerful under complex and general disease models, as well as in larger genetic regions where the proportion of disease susceptibility variants may be small. Comparisons with previously published tests on simulated data show that the proposed approach can have better power than the existing methods. An application to a recently published study on Type-1 Diabetes finds rare variants in gene IFIH1 to be protective against Type-1 Diabetes.
Risk to common diseases, such as diabetes, heart disease, etc., is influenced by a complex interaction among genetic and environmental factors. Most of the disease-association studies conducted so far have focused on common variants, widely available on genotyping platforms. However, recent advances in sequencing technologies pave the way for large-scale medical sequencing studies with the goal of elucidating the role rare variants may play in affecting susceptibility to complex traits. The large number of rare variants and their low frequencies pose great challenges for the analysis of these data. We present here a novel testing strategy, based on a weighted-sum statistic, that is less sensitive than existing methods to the presence of both risk and protective variants in the genetic region under investigation. We show applications to simulated data and to a real dataset on Type-1 Diabetes.
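The two abstracts above describe testing whether rare variants in a region collectively occur more often in cases than in controls. The sketch below is not the authors' weighted-sum statistic. It is a deliberately simpler collapsing (burden) test with a permutation p-value, included only to show the general idea of pooling rare variants. The function name and the simulated genotypes are hypothetical.

```python
import numpy as np

def burden_permutation_test(genotypes, is_case, n_perm=2000, seed=0):
    """Simple burden test (a swapped-in illustration, not the proposed method).

    genotypes: (n_individuals, n_variants) rare-allele counts coded 0, 1 or 2.
    is_case:   boolean array marking cases.
    Returns the case-minus-control difference in mean burden and a
    two-sided permutation p-value.
    """
    burden = np.asarray(genotypes).sum(axis=1)     # rare alleles per individual
    labels = np.asarray(is_case, dtype=bool)

    def stat(lab):
        return burden[lab].mean() - burden[~lab].mean()

    observed = stat(labels)
    rng = np.random.default_rng(seed)
    null = np.array([stat(rng.permutation(labels)) for _ in range(n_perm)])
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p_value

# Made-up data: rare alleles are more frequent in cases than in controls.
rng = np.random.default_rng(1)
cases = rng.binomial(2, 0.03, size=(100, 30))
controls = rng.binomial(2, 0.01, size=(100, 30))
genotypes = np.vstack([cases, controls])
is_case = np.array([True] * 100 + [False] * 100)
observed, p_value = burden_permutation_test(genotypes, is_case)
```

A plain sum like this loses power when risk and protective variants cancel each other, which is the situation the abstracts say their strategy is designed to tolerate.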
We introduce a stepwise approach for family-based designs for selecting a set of markers in a gene that are independently associated with the disease. The approach is based on testing the effect of a set of markers conditional on another set of markers. Several likelihood-based approaches have been proposed for special cases, but no model-free based tests have been proposed. We propose two types of tests in a family-based framework that are applicable to arbitrary family structures and completely robust to population stratification. We propose methods for ascertained dichotomous traits and unascertained quantitative traits. We first propose a completely model-free extension of the FBAT main genetic effect test. Then, for power issues, we introduce two model-based tests, one for dichotomous traits and one for continuous traits. Lastly, we utilize these tests to analyze a continuous lung function phenotype as a proxy for asthma in the Childhood Asthma Management Program. The methods are implemented in the free R package fbati.
Binary trait; Candidate gene analysis; Family-based association tests; FBAT-C; Linkage disequilibrium (LD); Model-based test; Model-free test; Nuclear families; Quantitative trait
Longitudinal studies provide an important tool for analyzing traits that change over time depending on the individual characteristics and the environmental exposures. Complex quantitative traits, such as lung function, may change over time and appear to depend on both genetic and environmental factors, as well as on potential gene-environment interactions. There is a growing interest in modeling both marginal genetic effects and gene-environment interactions. In an admixed population, the use of traditional statistical models may fail to adjust for confounding by ethnicity, leading to bias in the genetic effect estimates. A variety of methods have been developed to account for genetic substructure of human populations. Family-based designs provide an important resource for avoiding confounding due to admixture. However to date, most genetic analyses have been applied to cross-sectional designs. In this paper we propose a methodology which aims to improve the assessment of main and gene-environment interaction effects by combining the advantages of both longitudinal studies for continuous phenotypes, and the family-based designs. This approach is based on an extension of Ordinary Linear Mixed Models for quantitative phenotypes which incorporates information from a case-parent design.
Our results indicate that using this method permits both main genetic and gene-environment interaction effects to be estimated without bias, even in the presence of population substructure.
gene-environment interaction; longitudinal phenotypes; power; bias; population substructure
Longitudinal studies are an important tool for analysing traits that change over time, depending on individual characteristics and environmental exposures. Complex quantitative traits, such as lung function, may change over time and appear to depend on genetic and environmental factors, as well as on potential gene-environment interactions. There is a growing interest in modelling both marginal genetic effects and gene-environment interactions. In an admixed population, the use of traditional statistical models may fail to adjust for confounding by ethnicity, leading to bias in the genetic effect estimates. A variety of methods have been developed to account for the genetic substructure of human populations. Family-based designs provide an important resource for avoiding confounding due to admixture. To date, however, most genetic analyses have been applied to cross-sectional designs. In this paper, we propose a methodology which aims to improve the assessment of main genetic effect and gene-environment interaction effects by combining the advantages of both longitudinal studies for continuous phenotypes, and the family-based designs. This approach is based on an extension of ordinary linear mixed models for quantitative phenotypes, which incorporates information from a case-parent design. Our results indicate that use of this method allows both main genetic and gene-environment interaction effects to be estimated without bias, even in the presence of population substructure.
gene-environment interaction; longitudinal phenotypes; power; bias; population substructure
Several family-based approaches for testing genetic association with traits obtained from longitudinal or repeated measurement studies have been previously proposed. These approaches utilize the multivariate data more efficiently by using estimated optimal weights to combine univariate tests. We show that these FBAT approaches are still robust against hidden population stratification, but their power can be heavily affected since the estimated weights might provide a poor approximation of the true theoretical optimal weights in the presence of population stratification. We introduce a permutation-based approach, FBAT-MinP, and an equal combination approach, FBAT-EW, neither of which involves the use of estimated weights. Through simulation studies, FBAT-MinP and FBAT-EW are shown to be powerful even in the presence of population stratification, when other approaches may substantially lose their power. An application of these approaches to the Childhood Asthma Management Program (CAMP) study data for testing an association between body mass index and a previously reported candidate SNP is given as an example.
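The abstract above mentions combining univariate tests with equal weights instead of estimated weights. The abstract does not give the formula for FBAT-EW, so the following only sketches one generic way to combine correlated Z statistics with equal weights. The function name and inputs are hypothetical, and the null correlation matrix is assumed to be known.

```python
import math
import numpy as np

def equal_weight_combination(z_scores, null_corr):
    """Combine K univariate Z statistics with equal weights (generic sketch).

    z_scores:  length-K array of univariate test statistics.
    null_corr: (K, K) correlation matrix of the statistics under the null
               (assumed known here; in practice it must be estimated).
    Returns the combined statistic and its two-sided normal p-value.
    """
    z = np.asarray(z_scores, dtype=float)
    corr = np.asarray(null_corr, dtype=float)
    ones = np.ones_like(z)
    z_combined = (ones @ z) / math.sqrt(ones @ corr @ ones)
    p_value = math.erfc(abs(z_combined) / math.sqrt(2.0))
    return z_combined, p_value
```

Because the weights are fixed in advance, nothing in this combination is estimated from data, so there are no weights for population stratification to distort.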
Motivation: Estimating the frequency distribution of copy number variants (CNVs) is an important aspect of the effort to characterize this new type of genetic variation. Currently, most studies report a strong skew toward low-frequency CNVs. In this article, our goal is to investigate the frequencies of CNVs. We employ a two-step procedure for the CNV frequency estimation process. We use family information a posteriori to select only the most reliable CNV regions, i.e. those showing high rates of Mendelian transmission.
Results: Our results suggest that the current skew toward low-frequency CNVs may not be representative of the true frequency distribution, but may be due, among other reasons, to the non-negligible false negative rates that characterize CNV detection methods. Moreover, false positives are also likely, as low-frequency CNVs are hard to detect with small sample sizes and technologies that are not ideally suited for their detection. Without appropriate validation methods, such as incorporation of biologically relevant information (for example, in our case, the transmission of heritable CNVs from parents to offspring), it is difficult to assess the validity of specific CNVs, and even harder to obtain reliable frequency estimates.
Availability: Software implementing the methods described in this article is available for download at the following address: http://www.isites.harvard.edu/icb/icb.do?keyword=k36162
Supplementary information: Supplementary data are available at Bioinformatics online.
This study concerns the question of whether obese subjects in a community sample experience depression in a different way from the non-obese, especially whether they over-eat to the point of gaining weight during periods of depression.
A representative sample of adults was interviewed regarding depression and obesity.
The sample consisted of 1396 subjects whose interviews were studied for relationships between obesity and depression. Of these, 114 had experienced a Major Depressive Episode at some point in their lives and provided information about the symptoms experienced during the worst or only episode of Major Depression.
The Diagnostic Interview Schedule (DIS) was used to identify Major Depressive Episodes. Information was also derived from the section on Depression and Anxiety (DPAX) of the Stirling Study Schedule. Obesity was calculated as a Body Mass Index (BMI) >30. Logistic regressions were employed to assess relationships, controlling for age and gender, by means of Odds Ratios and 95% Confidence Intervals.
In the sample as a whole, obesity was not related to depression although it was associated with the symptom of hopelessness. Among those who had ever experienced a Major Depressive Episode, obese persons were 5 times more likely than the non-obese to over-eat leading to weight gain during a period of depression (p <0.002). These obese subjects, compared to the non-obese, also experienced longer episodes of depression, a larger number of episodes, and were more preoccupied with death during such episodes.
Depression among obese subjects in a community sample tends to be more severe than among the non-obese. Gaining weight while depressed is an important marker of that severity. Further research is needed to understand and possibly prevent the associations, sequences, and outcomes among depression, obesity, weight gain, and other adversities.
Obesity; Major Depression; Over-eating; Gaining Weight; Atypical Depression
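As a rough illustration of the odds ratios and 95% confidence intervals mentioned in the methods above, the following sketch computes an unadjusted odds ratio with a Wald interval from a 2x2 table. It is not the study's analysis: the study used logistic regression controlling for age and gender, and the function names and the counts below are hypothetical. Only the BMI >30 threshold is taken from the abstract.

```python
import math


def is_obese(weight_kg, height_m):
    """Apply the BMI > 30 threshold; BMI is weight (kg) / height (m) squared."""
    return weight_kg / height_m ** 2 > 30


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to exercise the function.
or_hat, lo, hi = odds_ratio_ci(20, 10, 10, 20)
print(f"OR = {or_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

An adjusted odds ratio, as reported in the abstract, would instead come from exponentiating the obesity coefficient of a fitted logistic regression that includes age and gender as covariates.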
Recent technological advances in continuous biological monitoring and personal exposure assessment have led to the collection of subject-specific functional data. A primary goal in such studies is to assess the relationship between the functional predictors and the functional responses. The historical functional linear model (HFLM) can be used to model such dependencies of the response on the history of the predictor values. An estimation procedure for the regression coefficients that uses a variety of regularization techniques is proposed. An approximation of the regression surface relating the predictor to the outcome by a finite-dimensional basis expansion is used, followed by penalization of the coefficients of the neighboring basis functions by restricting the size of the coefficient differences to be small. Penalties based on the absolute values of the basis function coefficient differences (corresponding to the LASSO) and the squares of these differences (corresponding to the penalized spline methodology) are studied. The fits are compared using an extension of the Akaike Information Criterion that combines the error variance estimate, degrees of freedom of the fit and the norm of the basis function coefficients. The performance of the proposed methods is evaluated via simulations. The LASSO penalty applied to the linearly transformed coefficients yields sparser representations of the estimated regression surface, while the quadratic penalty provides solutions with the smallest L2-norm of the basis function coefficients. Finally, the new estimation procedure is applied to the analysis of the effects of occupational particulate matter (PM) exposure on the heart rate variability (HRV) in a cohort of boilermaker workers. Results suggest that the strongest association between PM exposure and HRV in these workers occurs as a result of point exposures to the increased levels of particulate matter corresponding to smoking breaks.
environmental assessment; functional data; heart rate variability; LASSO; penalized regression splines
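The quadratic penalty on differences of neighboring basis coefficients described above can be sketched in a simplified, one-dimensional form. This is an assumption-laden illustration rather than the authors' procedure: the HFLM regression surface is two-dimensional, whereas here the coefficients form a single sequence, the design matrix and data are synthetic, and the function names are hypothetical. The estimate minimizes ||y - B b||^2 + lam * ||D b||^2, where D takes first differences of adjacent coefficients.

```python
import numpy as np


def difference_matrix(k):
    """First-order difference matrix D of shape (k-1, k): (D @ b)[j] = b[j+1] - b[j]."""
    return np.diff(np.eye(k), axis=0)


def fit_quadratic_difference_penalty(B, y, lam):
    """Minimize ||y - B b||^2 + lam * ||D b||^2 via the penalized normal equations."""
    D = difference_matrix(B.shape[1])
    return np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)


# Synthetic data (not from the study): 50 observations, 8 basis coefficients.
rng = np.random.default_rng(0)
B = rng.normal(size=(50, 8))
true_b = np.linspace(0.0, 1.0, 8)
y = B @ true_b + rng.normal(scale=0.5, size=50)

D = difference_matrix(8)
unpenalized = fit_quadratic_difference_penalty(B, y, lam=0.0)
penalized = fit_quadratic_difference_penalty(B, y, lam=100.0)

# A larger lam shrinks the differences between neighboring coefficients.
print("sum of squared differences, lam=0:  ", np.sum((D @ unpenalized) ** 2))
print("sum of squared differences, lam=100:", np.sum((D @ penalized) ** 2))
```

The LASSO variant replaces the squared differences with their absolute values. It has no closed-form solution and needs an iterative solver, which is omitted here.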
Information from multiple informants is frequently used to assess psychopathology. We consider marginal regression models with multiple informants as discrete predictors and a time to event outcome. We fit these models to data from the Stirling County Study; specifically, the models predict mortality from self report of psychiatric disorders and also predict mortality from physician report of psychiatric disorders. Previously, Horton et al. found little relationship between self and physician reports of psychopathology, but that the relationship of self report of psychopathology with mortality was similar to that of physician report of psychopathology with mortality. Generalized estimating equations (GEE) have been used to fit marginal models with multiple informant covariates; here we develop a maximum likelihood (ML) approach and show how it relates to the GEE approach. In a simple setting using a saturated model, the ML approach can be constructed to provide estimates that match those found using GEE. We extend the ML technique to consider multiple informant predictors with missingness and compare the method to using inverse probability weighted (IPW) GEE. Our simulation study illustrates that IPW GEE loses little efficiency compared with ML in the presence of monotone missingness. Our example data has non-monotone missingness; in this case, ML offers a modest decrease in variance compared with IPW GEE, particularly for estimating covariates in the marginal models. In more general settings, e.g. categorical predictors and piecewise exponential models, the likelihood parameters from the ML technique do not have the same interpretation as the GEE. Thus, the GEE is recommended to fit marginal models for its flexibility, ease of interpretation and comparable efficiency to ML in the presence of missing data.
multiple informants; censored survival data; maximum likelihood; generalized estimating equations; inverse probability weights
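The inverse probability weighting idea behind IPW GEE can be illustrated on a much simpler problem, estimating a mean when some values are missing. This sketch is not the estimator used in the article: it assumes missingness depends only on a fully observed discrete stratum, it estimates the observation probability as the observed fraction within each stratum, and the function name and records are hypothetical.

```python
from collections import defaultdict


def ipw_mean(records):
    """Inverse-probability-weighted mean.

    records: list of (stratum, value) pairs, with value set to None when missing.
    Each observed value is weighted by 1 / (estimated probability of being
    observed in its stratum), so strata with more missingness count for more.
    Assumes every stratum has at least one observed value.
    """
    n_total = defaultdict(int)
    n_observed = defaultdict(int)
    for stratum, value in records:
        n_total[stratum] += 1
        if value is not None:
            n_observed[stratum] += 1

    weighted_sum = 0.0
    weight_total = 0.0
    for stratum, value in records:
        if value is None:
            continue
        weight = n_total[stratum] / n_observed[stratum]  # 1 / P(observed | stratum)
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical records: half of stratum "a" is missing, stratum "b" is complete.
records = [("a", 1.0), ("a", 1.0), ("a", None), ("a", None),
           ("b", 3.0), ("b", 3.0), ("b", 3.0), ("b", 3.0)]

observed = [v for _, v in records if v is not None]
print("complete-case mean:", sum(observed) / len(observed))
print("IPW mean:", ipw_mean(records))
```

In this toy example the complete-case mean over-represents stratum "b", while the weighted mean restores each stratum to its share of the full sample. In IPW GEE the same kind of weight multiplies each subject's contribution to the estimating equations.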