We propose a method for testing gene–environment (G × E) interactions on a complex trait in family-based studies in which a phenotypic ascertainment criterion has been imposed. This novel approach employs G-estimation, a semiparametric estimation technique from the causal inference literature, to avoid modeling of the association between the environmental exposure and the phenotype, to gain robustness against unmeasured confounding due to population substructure, and to acknowledge the ascertainment conditions. The proposed test allows for incomplete parental genotypes. It is compared by simulation studies to an analogous conditional likelihood–based approach and to the QBAT-I test, which also invokes the G-estimation principle but ignores ascertainment. We apply our approach to a study of chronic obstructive pulmonary disorder.
Causal inference; COPD; Family-based association; G-estimation; Gene–environment interaction
Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution. In the Bayesian framework, model selection is typically performed through the evaluation of a Bayes factor, the ratio of two marginal likelihoods (one for each model). Recently introduced techniques to estimate (log) marginal likelihoods, such as path sampling and stepping-stone sampling, offer increased accuracy over the traditional harmonic mean estimator at an increased computational cost. Most often, each model’s marginal likelihood will be estimated individually, which leads the resulting Bayes factor to suffer from errors associated with each of these independent estimation processes.
We here assess the original ‘model-switch’ path sampling approach for direct Bayes factor estimation in phylogenetics, as well as an extension that uses more samples, to construct a direct path between two competing models, thereby eliminating the need to calculate each model’s marginal likelihood independently. Further, we provide a competing Bayes factor estimator using an adaptation of the recently introduced stepping-stone sampling algorithm and set out to determine appropriate settings for accurately calculating such Bayes factors, with context-dependent evolutionary models as an example. While we show that modest efforts are required to roughly identify the increase in model fit, only drastically increased computation times ensure the accuracy needed to detect more subtle details of the evolutionary process.
We show that our adaptation of stepping-stone sampling for direct Bayes factor calculation outperforms the original path sampling approach as well as an extension that exploits more samples. Our proposed approach for Bayes factor estimation also has preferable statistical properties over the use of individual marginal likelihood estimates for both models under comparison. Assuming a sigmoid function to determine the path between two competing models, we provide evidence that a single well-chosen sigmoid shape value requires less computational effort to approximate the true value of the (log) Bayes factor than the original approach. We show that the (log) Bayes factors calculated using path sampling and stepping-stone sampling differ drastically from those estimated using either of the harmonic mean estimators, supporting earlier claims that the latter systematically overestimate the performance of high-dimensional models, which we show can lead to erroneous conclusions. Based on our results, we argue that highly accurate estimation of differences in model fit for high-dimensional models requires much more computational effort than suggested in recent studies on marginal likelihood estimation.
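To make the idea of direct Bayes factor estimation concrete, the following minimal sketch applies stepping-stone sampling along a model-switch path in a toy Gaussian setting where the answer is known in closed form. Everything in it is an assumption made for illustration and is not taken from the study: the two toy models (known-variance normal likelihoods sharing one mean parameter and one normal prior), the linear spacing of the path (the text above considers a sigmoid path), the exact sampling from each intermediate distribution (a phylogenetic application would use MCMC), and all function names.

```python
import math
import random

def log_lik(mu, ys, var):
    """Gaussian log-likelihood of ys with mean mu and known variance var."""
    n = len(ys)
    sq = sum((y - mu) ** 2 for y in ys)
    return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * sq / var

def exact_log_marginal(ys, var, tau2):
    """Closed-form log marginal likelihood when mu ~ N(0, tau2)."""
    n, s, ss = len(ys), sum(ys), sum(y * y for y in ys)
    quad = (ss - tau2 * s * s / (var + n * tau2)) / var
    log_det = (n - 1) * math.log(var) + math.log(var + n * tau2)
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

def direct_log_bf(ys, var0, var1, tau2, n_stones=16, n_draws=2000, seed=1):
    """Stepping-stone estimate of log(Z1 / Z0) along the path
    q_beta(mu) = prior(mu) * L1(mu)^beta * L0(mu)^(1 - beta)."""
    rng = random.Random(seed)
    n, s = len(ys), sum(ys)
    betas = [k / n_stones for k in range(n_stones + 1)]  # linear path (assumption)
    log_bf = 0.0
    for b_prev, b_next in zip(betas, betas[1:]):
        # In this toy model the intermediate distribution at b_prev is Gaussian.
        w = b_prev / var1 + (1.0 - b_prev) / var0
        prec = 1.0 / tau2 + n * w
        mean, sd = w * s / prec, prec ** -0.5
        terms = []
        for _ in range(n_draws):
            mu = rng.gauss(mean, sd)
            diff = log_lik(mu, ys, var1) - log_lik(mu, ys, var0)
            terms.append((b_next - b_prev) * diff)
        # Log of the Monte Carlo average, computed stably (log-sum-exp).
        top = max(terms)
        log_bf += top + math.log(sum(math.exp(t - top) for t in terms) / n_draws)
    return log_bf

ys = [0.3, -1.2, 2.1, 0.8, -0.5, 1.7, -2.4, 0.9]  # made-up data
estimate = direct_log_bf(ys, var0=1.0, var1=4.0, tau2=10.0)
exact = exact_log_marginal(ys, 4.0, 10.0) - exact_log_marginal(ys, 1.0, 10.0)
print(round(estimate, 3), round(exact, 3))
```

Because the path runs directly from one model to the other, a single set of runs yields the log Bayes factor, rather than two separately estimated marginal likelihoods whose errors would both enter the ratio.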
It is useful to have robust gene-environment interaction tests that can utilize a variety of family structures in an efficient way. This paper focuses on tests for gene-environment interaction in the presence of main genetic and environmental effects. The objective is to develop powerful tests that can combine trio data with parental genotypes and discordant sibships when parental genotypes are missing. We first make a modest improvement on a method for discordant sibs (discordant on phenotype), but the approach does not allow one to use families in which all offspring are affected, e.g. trios. We then make a modest improvement on a Mendelian transmission-based approach that is inefficient when discordant sibs are available, but can be applied to any nuclear family. Finally, we propose a hybrid approach that utilizes the most efficient method for a specific family type, then combines over families. We utilize this hybrid approach to analyze a chronic obstructive pulmonary disorder dataset to test for gene-environment interaction in the Serpine2 gene with smoking. The methods are freely available in the R package fbati.
Gene-Environment Interaction; Family-Based Association Tests; Candidate Gene Analysis; Binary Trait; COPD; Serpine2
Estimates of additive interaction from case-control data are often obtained by logistic regression; such models can also be used to adjust for covariates. This approach to estimating additive interaction has come under some criticism because of possible misspecification of the logistic model: If the underlying model is linear, the logistic model will be misspecified. The authors propose an inverse probability of treatment weighting approach to causal effects and additive interaction in case-control studies. Under the assumption of no unmeasured confounding, the approach amounts to fitting a marginal structural linear odds model. The approach allows for the estimation of measures of additive interaction between dichotomous exposures, such as the relative excess risk due to interaction, using case-control data without having to rely on modeling assumptions for the outcome conditional on the exposures and covariates. Rather than using conditional models for the outcome, models are instead specified for the exposures conditional on the covariates. The approach is illustrated by assessing additive interaction between genetic and environmental factors using data from a case-control study.
case-control studies; interaction; linear model; structural model; synergism; weighting
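As a small worked illustration of the additive-interaction measure named in the abstract above, the relative excess risk due to interaction (RERI) for two dichotomous exposures is RERI = RR11 - RR10 - RR01 + 1, where each relative risk is taken against the doubly unexposed group. The sketch below only evaluates this standard formula; it does not implement the weighting approach described above, and the function name and input values are hypothetical.

```python
def reri(rr10, rr01, rr11):
    """Relative excess risk due to interaction for two binary exposures.

    rr10: relative risk with only the first exposure present
    rr01: relative risk with only the second exposure present
    rr11: relative risk with both exposures present
    All three are relative to the doubly unexposed group.
    """
    return rr11 - rr10 - rr01 + 1.0

# Hypothetical values: 2.0 + 3.0 - 1 = 4.0 would be expected under additivity,
# so a joint relative risk of 7.0 leaves an excess of 3.0.
print(reri(2.0, 3.0, 7.0))
```

A value of zero corresponds to exact additivity of the two effects on the risk scale; positive values indicate a joint effect larger than additive. With case-control data, odds ratios are commonly substituted for the relative risks when the outcome is rare.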
For dichotomous outcomes, the authors discuss when the standard approaches to mediation analysis used in epidemiology and the social sciences are valid, and they provide alternative mediation analysis techniques when the standard approaches will not work. They extend definitions of controlled direct effects and natural direct and indirect effects from the risk difference scale to the odds ratio scale. A simple technique to estimate direct and indirect effect odds ratios by combining logistic and linear regressions is described that applies when the outcome is rare and the mediator continuous. Further discussion is given as to how this mediation analysis technique can be extended to settings in which data come from a case-control study design. For the standard mediation analysis techniques used in the epidemiologic and social science literatures to be valid, an assumption of no interaction between the effects of the exposure and the mediator on the outcome is needed. The approach presented here, however, will apply even when there are interactions between the effect of the exposure and the mediator on the outcome.
case-control studies; causal inference; decomposition; dichotomous response; epidemiologic methods; interaction; logistic regression; odds ratio
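The technique of combining a logistic regression for the outcome with a linear regression for the mediator can be sketched as follows. The sketch assumes the working models logit P(Y=1 | a, m, c) = theta0 + theta1·a + theta2·m + theta3·a·m + theta4'c and E[M | a, c] = beta0 + beta1·a + beta2'c with normally distributed errors of variance sigma2, a rare outcome, and the usual no-unmeasured-confounding conditions; the closed-form expressions are the ones commonly given for this setting. The function name, argument names, and example coefficients are hypothetical, and `beta2_c` stands for the already-evaluated term beta2'c.

```python
import math

def mediation_odds_ratios(theta1, theta2, theta3, beta0, beta1, beta2_c,
                          sigma2, a, a_star, m):
    """Approximate direct/indirect effect odds ratios (rare outcome,
    continuous mediator) comparing exposure level a with a_star."""
    # Controlled direct effect with the mediator fixed at m.
    log_cde = (theta1 + theta3 * m) * (a - a_star)
    # Natural direct effect.
    log_nde = ((theta1 + theta3 * (beta0 + beta1 * a_star + beta2_c
                                   + theta2 * sigma2)) * (a - a_star)
               + 0.5 * theta3 ** 2 * sigma2 * (a ** 2 - a_star ** 2))
    # Natural indirect effect.
    log_nie = (theta2 * beta1 + theta3 * beta1 * a) * (a - a_star)
    return {"cde": math.exp(log_cde),
            "nde": math.exp(log_nde),
            "nie": math.exp(log_nie)}

# Hypothetical coefficients with an exposure-mediator interaction (theta3 != 0).
effects = mediation_odds_ratios(theta1=0.4, theta2=0.3, theta3=0.1,
                                beta0=1.0, beta1=0.5, beta2_c=0.0,
                                sigma2=1.0, a=1, a_star=0, m=1.0)
print(effects)
```

Setting theta3 to zero recovers the familiar no-interaction case, in which the natural direct effect odds ratio is exp(theta1·(a − a_star)) and the natural indirect effect odds ratio is exp(theta2·beta1·(a − a_star)).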
Crucial connections between sexual network structure and the distribution of HIV remain inadequately understood, especially in regard to the role of concurrency and age disparity in relationships, and how these network characteristics correlate with each other and other risk factors. Social desirability bias and inaccurate recall are obstacles to obtaining valid, detailed information about sexual behaviour and relationship histories. Therefore, this study aims to use novel research methods in order to determine whether HIV status is associated with age-disparity and sexual connectedness as well as establish the primary behavioural and socio-demographic predictors of the egocentric and community sexual network structures.
We will conduct a cross-sectional survey that uses a questionnaire exploring one-year sexual histories, with a focus on timing and age disparity of relationships, as well as other risk factors such as unprotected intercourse and the use of alcohol and recreational drugs. The questionnaire will be administered in a safe and confidential mobile interview space, using audio computer-assisted self-interview (ACASI) technology on touch screen computers. The ACASI features a choice of languages and visual feedback of temporal information. The survey will be administered in three peri-urban disadvantaged communities in the greater Cape Town area with a high burden of HIV. The study communities participated in a previous TB/HIV study, from which HIV test results will be anonymously linked to the survey dataset. Statistical analyses of the data will include descriptive statistics, linear mixed-effects models for the inter- and intra-subject variability in the age difference between sexual partners, survival analysis for correlated event times to model concurrency patterns, and logistic regression for association of HIV status with age disparity and sexual connectedness.
This study design is intended to facilitate more accurate recall of sensitive sexual history data and has the potential to provide substantial insights into the relationship between key sexual network attributes and additional risk factors for HIV infection. This will help to inform the design of context-specific HIV prevention programmes.
Accurate modelling of substitution processes in protein-coding sequences is often hampered by the computational burdens associated with full codon models. Lately, codon partition models have been proposed as a viable alternative, mimicking the substitution behaviour of codon models at a low computational cost. Such codon partition models however impose independent evolution of the different codon positions, which is overly restrictive from a biological point of view. Given that empirical research has provided indications of context-dependent substitution patterns at four-fold degenerate sites, we take those indications into account in this paper.
We present so-called context-dependent codon partition models to assess previous empirical claims that the evolution of four-fold degenerate sites is strongly dependent on the composition of its two flanking bases. To this end, we have estimated and compared various existing independent models, codon models, codon partition models and context-dependent codon partition models for the atpB and rbcL genes of the chloroplast genome, which are frequently used in plant systematics. Such context-dependent codon partition models employ a full dependency scheme for four-fold degenerate sites, whilst maintaining the independence assumption for the first and second codon positions.
We show that, both in the atpB and rbcL alignments of a collection of land plants, these context-dependent codon partition models significantly improve model fit over existing codon partition models. Using Bayes factors based on thermodynamic integration, we show that in both datasets the same context-dependent codon partition model yields the largest increase in model fit compared to an independent evolutionary model. Context-dependent codon partition models hence perform closer to codon models than existing codon partition models do. Codon models remain the best-performing models, but at a drastically increased computational cost, so that context-dependent codon partition models remain computationally attractive alternatives. Finally, we observe that the substitution patterns in both datasets are drastically different, leading to the conclusion that combined analysis of these two genes using a single model may not be advisable from a context-dependent point of view.
A primary focus of an increasing number of scientific studies is to determine whether two exposures interact in the effect that they produce on an outcome of interest. Interaction is commonly assessed by fitting regression models in which the linear predictor includes the product between those exposures. When the main interest lies in the interaction, this approach is not entirely satisfactory because it is prone to (possibly severe) bias when the main exposure effects or the association between outcome and extraneous factors are misspecified. In this article, we therefore consider conditional mean models with identity or log link which postulate the statistical interaction in terms of a finite-dimensional parameter, but which are otherwise unspecified. We show that estimation of the interaction parameter is often not feasible in this model because it would require nonparametric estimation of auxiliary conditional expectations given high-dimensional variables. We thus consider ‘multiply robust estimation’ under a union model that assumes at least one of several working submodels holds. Our approach is novel in that it makes use of information on the joint distribution of the exposures conditional on the extraneous factors in making inferences about the interaction parameter of interest. In the special case of a randomized trial or a family-based genetic study in which the joint exposure distribution is known by design or by Mendelian inheritance, the resulting multiply robust procedure leads to asymptotically distribution-free tests of the null hypothesis of no interaction on an additive scale. We illustrate the methods via simulation and the analysis of a randomized follow-up study.
Double robustness; Gene-environment interaction; Gene-gene interaction; Longitudinal data; Semiparametric inference
When testing for genetic effects, failure to account for a gene-environment interaction can mask the true association effects of a genetic marker with disease. Family-based association tests are popular because they are completely robust to population substructure and model misspecification. However, when testing for an interaction, failure to model the main genetic effect correctly can lead to spurious results. Here we propose a family-based test for interaction that is robust to model misspecification, but still sensitive to an interaction effect, and can handle continuous covariates and missing parents. We extend the FBAT-I gene-environment interaction test for dichotomous traits to use both trios and sibships. We then compare this extension to joint tests of gene and gene-environment interaction, and compare the joint test additionally to the main effects test of the gene. Lastly, we apply these three tests to a group of nuclear families ascertained according to affection with Bipolar Disorder.
genetic association; genetic interaction; family-based test; FBAT-I
We develop a locally efficient test for (multiplicative) gene–environment interaction in family studies that collect genotypic information and environmental exposures for affected offspring along with genotypic information for their parents or relatives. The proposed test does not require modeling the effects of environmental exposures and is doubly robust in the sense of being valid if either a model for the main genetic effect holds or a model for the expected environmental exposure (given the offspring affection status and parental mating types) but not necessarily both. It extends the FBAT-I to allow for missing parental mating types and families of arbitrary size. Simulation studies and the analysis of an Alzheimer's disease study confirm the adequate performance of the proposed test.
Causal inference; Double robustness; Effect modification; Gene–environment interaction; Genetic association; Nuclear family; Semiparametric interaction model
Sufficient cause interactions concern cases in which there is a particular causal mechanism for some outcome that requires the presence of 2 or more specific causes to operate. Empirical conditions have been derived to test for sufficient cause interactions. However, when regression outcome models are used to control for confounding variables in tests for sufficient cause interactions, the outcome models impose restrictions on the relation between the confounding variables and certain unidentified background causes within the sufficient cause framework; often, these assumptions are implausible. By using marginal structural models, rather than outcome regression models, to test for sufficient cause interactions, modeling assumptions are instead made on the relation between the causes of interest and the confounding variables; these assumptions will often be more plausible. The use of marginal structural models also allows for testing for sufficient cause interactions in the presence of time-dependent confounding. Such time-dependent confounding may arise in cases in which one factor of interest affects both the second factor of interest and the outcome. It is furthermore shown that marginal structural models can be used not only to test for sufficient cause interactions but also to give lower bounds on the prevalence of such sufficient cause interactions.
causal inference; interaction; marginal structural models; sufficient causes; synergism; weighting
We introduce a stepwise approach for family-based designs for selecting a set of markers in a gene that are independently associated with the disease. The approach is based on testing the effect of a set of markers conditional on another set of markers. Several likelihood-based approaches have been proposed for special cases, but no model-free tests have been proposed. We propose two types of tests in a family-based framework that are applicable to arbitrary family structures and completely robust to population stratification. We propose methods for ascertained dichotomous traits and unascertained quantitative traits. We first propose a completely model-free extension of the FBAT main genetic effect test. Then, to improve power, we introduce two model-based tests, one for dichotomous traits and one for continuous traits. Lastly, we utilize these tests to analyze a continuous lung function phenotype as a proxy for asthma in the Childhood Asthma Management Program. The methods are implemented in the free R package fbati.
Binary trait; Candidate gene analysis; Family-based association tests; FBAT-C; Linkage disequilibrium (LD); Model-based test; Model-free test; Nuclear families; Quantitative trait
Recent approaches for context-dependent evolutionary modelling assume that the evolution of a given site depends upon its ancestor and that ancestor's immediate flanking sites. Because such a dependency pattern cannot be imposed on the root sequence, we consider the use of different orders of Markov chains to model dependence at the ancestral root sequence. Root distributions which are coupled to the context-dependent model across the underlying phylogenetic tree are deemed more realistic than decoupled Markov chain models, as the evolutionary process is responsible for shaping the composition of the ancestral root sequence.
We find strong support, in terms of Bayes Factors, for using a second-order Markov chain at the ancestral root sequence along with a context-dependent model throughout the remainder of the phylogenetic tree in an ancestral repeats dataset, and for using a first-order Markov chain at the ancestral root sequence in a pseudogene dataset. Relaxing the assumption of a single context-independent set of independent model frequencies, as presented in previous work, yields a further drastic increase in model fit. We show that the substitution rates associated with the CpG-methylation-deamination process can be modelled through context-dependent model frequencies and that their accuracy depends on the (order of the) Markov chain imposed at the ancestral root sequence. In addition, we provide evidence that this approach (which assumes that root distribution and evolutionary model are decoupled) outperforms an approach inspired by the work of Arndt et al., where the root distribution is coupled to the evolutionary model. We show that the continuous-time approximation of Hwang and Green has stronger support in terms of Bayes Factors, but the parameter estimates show minimal differences.
We show that the combination of a dependency scheme at the ancestral root sequence and a context-dependent evolutionary model across the remainder of the tree allows for accurate estimation of the model's parameters. The different assumptions tested in this manuscript clearly show that designing accurate context-dependent models is a complex process, with many different assumptions that require validation. Further, these assumptions are shown to change across different datasets, making the search for an adequate model for a given dataset quite challenging.
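To illustrate what imposing a Markov chain on a root sequence means in practice, the sketch below compares the log-probability of a short nucleotide sequence under an independent-sites (zero-order) distribution and under a first-order Markov chain, in which each base depends on its left neighbour. It is a minimal illustration only: the function names, the toy sequence, and all probabilities are hypothetical, and the root distributions used in the study (including the second-order chain) are not reproduced here.

```python
import math

def log_prob_independent(seq, freqs):
    """Zero-order model: every site is drawn independently from freqs."""
    return sum(math.log(freqs[base]) for base in seq)

def log_prob_first_order(seq, initial, transition):
    """First-order Markov chain: each base depends on the preceding base.

    initial[b] is the probability of the first base; transition[p][b] is the
    probability of base b given that the previous base is p.
    """
    log_p = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(transition[prev][cur])
    return log_p

uniform = {b: 0.25 for b in "ACGT"}
# Hypothetical transition matrix in which G is rare after C (CpG depletion);
# all other rows are uniform.
transition = {p: dict(uniform) for p in "ACGT"}
transition["C"] = {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30}

seq = "ACGTCGCA"
print(log_prob_independent(seq, uniform))
print(log_prob_first_order(seq, uniform, transition))
```

A sequence containing CG dinucleotides receives a lower probability under the CpG-depleted chain than under the independent model, which is the kind of neighbour dependence a zero-order root distribution cannot express.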
Instrumental variables (IV) estimators are well established to correct for measurement error on exposure in a broad range of fields. In a distinct prominent stream of research, IVs are becoming increasingly popular for estimating causal effects of exposure on outcome since they allow for unmeasured confounders which are hard to avoid. Because many causal questions emerge from data which suffer severe measurement error problems, we combine both IV approaches in this article to correct IV-based causal effect estimators in linear (structural mean) models for possibly systematic measurement error on the exposure. The estimators rely on the presence of a baseline measurement which is associated with the observed exposure and known not to modify the target effect. Simulation studies and the analysis of a small blood pressure reduction trial (n = 105) with treatment noncompliance confirm the adequate performance of our estimators in finite samples. Our results also demonstrate that incorporating limited prior knowledge about a weakly identified parameter (such as the error mean) in a frequentist analysis can yield substantial improvements.
causal inference; instrumental variables; measurement error; noncompliance; prior information; two-stage least squares estimators; weak identifiability
Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided to model context-dependent evolution, a large increase in model dimensionality is only justified when accompanied by careful model-building strategies that guard against overfitting. An increased dimensionality leads to a heavier numerical burden when evaluating the models, increased convergence times in Bayesian Markov chain Monte Carlo algorithms and even more tedious Bayes Factor calculations.
We have developed two model-search algorithms which reduce the number of Bayes Factor calculations by clustering posterior densities to decide on the equality of substitution behavior in different contexts. The selected model's fit is evaluated using a Bayes Factor, which we calculate via model-switch thermodynamic integration. To reduce computation time and to increase the precision of this integration, we propose to split the calculations over different computers and to appropriately calibrate the individual runs. Using the proposed strategies, we find, in a dataset of primate Ancestral Repeats, that careful modeling of context-dependent evolution may increase model fit considerably and that the combination of a context-dependent model with the assumption of varying rates across sites offers even larger improvements in terms of model fit. Using a smaller nuclear SSU rRNA dataset, we show that context-dependence may only become detectable upon applying model-building strategies.
While context-dependent evolutionary models can increase the model fit over traditional independent evolutionary models, such complex models will often contain too many parameters. Justification for the added parameters is thus required so that only those parameters that model evolutionary processes previously unaccounted for are added to the evolutionary model. To obtain an optimal balance between the number of parameters in a context-dependent model and the performance in terms of model fit, we have designed two parameter-reduction strategies and we have shown that model fit can be greatly improved by reducing the number of parameters in a context-dependent evolutionary model.
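For readers unfamiliar with model-switch thermodynamic integration, the identity underlying it can be written out as below. This is the standard path sampling identity, stated under the assumption that both models are defined on a common parameter space $\theta$; the symbols $L_i$, $p_i$, $Z_i$, $q_\beta$ and the grid $\beta_0,\dots,\beta_K$ are notation introduced here for illustration and are not taken from the study.

```latex
% Path between the two models (unnormalised), for 0 <= beta <= 1:
\[
  q_\beta(\theta) \;=\; \bigl[L_1(\theta)\,p_1(\theta)\bigr]^{\beta}\,
                        \bigl[L_0(\theta)\,p_0(\theta)\bigr]^{1-\beta},
  \qquad Z_\beta = \int q_\beta(\theta)\,d\theta .
\]
% Differentiating log Z_beta with respect to beta and integrating over [0,1]:
\[
  \log B_{10} \;=\; \log\frac{Z_1}{Z_0}
  \;=\; \int_0^1 \mathbb{E}_{\beta}\!\left[
      \log\bigl\{L_1(\theta)\,p_1(\theta)\bigr\}
    - \log\bigl\{L_0(\theta)\,p_0(\theta)\bigr\}\right] d\beta ,
\]
% where the expectation is taken under the normalised density q_beta / Z_beta.
% A trapezoidal approximation on a grid 0 = beta_0 < ... < beta_K = 1:
\[
  \log B_{10} \;\approx\; \sum_{k=1}^{K}
    \frac{\beta_k - \beta_{k-1}}{2}\,\bigl(\bar U_{k} + \bar U_{k-1}\bigr),
  \qquad
  \bar U_k = \text{MCMC average of }
    \log\frac{L_1(\theta)\,p_1(\theta)}{L_0(\theta)\,p_0(\theta)}
    \text{ at } \beta_k .
\]
```

Because the integral is approximated as a sum over grid points, the averages $\bar U_k$ can be computed in separate runs, which is what makes it possible to split the calculation over different computers.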
Objectives To assess gestational length and prevalence of preterm birth among medically and naturally conceived twins; to establish the role of zygosity and chorionicity in assessing gestational length in twins born after subfertility treatment.
Design Population based cohort study.
Setting Collaborative network of 19 maternity facilities in East Flanders, Belgium (East Flanders prospective twin survey).
Participants 4368 twin pairs born between 1976 and 2002, including 2915 spontaneous twin pairs, 710 twin pairs born after ovarian stimulation, and 743 twin pairs born after in vitro fertilisation or intracytoplasmic sperm injection.
Main outcome measures Gestational length and prevalence of preterm birth.
Results Compared with naturally conceived twins, twins resulting from subfertility treatment had on average a slightly decreased gestational age at birth (mean difference 4.0 days, 95% confidence interval 2.7 to 5.2), corresponding to an odds ratio of 1.6 (1.4 to 1.8) for preterm birth, albeit confined to mild preterm birth (34-36 weeks). The adjusted odds ratios of preterm birth after subfertility treatment were 1.3 (1.1 to 1.5) when controlled for birth year, maternal age, and parity and 1.6 (1.3 to 1.8) with additional control for fetal sex, caesarean section, zygosity, and chorionicity. Although an increased risk of preterm birth was therefore seen among twins resulting from subfertility treatment, the risk was largely caused by a first birth effect among subfertile couples; conversely, the risk of prematurity was substantially levelled off by the protective effect of dizygotic twinning.
Conclusions Twins resulting from subfertility treatment have an increased risk of preterm birth, but the risk is limited to mild preterm birth, primarily by virtue of dizygotic twinning.