In the accompanying commentary, Rose and van der Laan (Am J Epidemiol. 2014;179(6):663–669) criticize the relative excess risk due to interaction (RERI) measure, the use of additive interaction, and the weighting approach we developed to assess RERI with case-control data. In this commentary, we note some of the advantages of using additive measures of interaction, such as RERI, in making decisions about targeting interventions toward certain subgroups and in assessing mechanistic interaction. We discuss the relationship between Rose and van der Laan's estimator for case-control data and the one we had previously proposed. We also develop a new doubly robust estimator for determining the RERI with case-control data when the prevalence or incidence of the outcome is known.
doi:10.1093/aje/kwt316
PMCID: PMC3939844
PMID: 24488514
additive interaction; case-control studies; doubly robust estimator; subgroup analysis; synergism
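As a small illustration of the measure discussed in this abstract, the RERI on the ratio scale is conventionally computed as RR11 - RR10 - RR01 + 1, where each ratio compares one exposure pattern with the doubly unexposed group. The sketch below is not the weighting or doubly robust estimator described above; the function name and the numeric inputs are hypothetical and chosen only for illustration.

```python
def reri(rr10: float, rr01: float, rr11: float) -> float:
    """Relative excess risk due to interaction on the ratio scale.

    rr10, rr01 and rr11 are risk ratios (or odds ratios when the outcome
    is rare) comparing each exposure pattern with the doubly unexposed group.
    """
    return rr11 - rr10 - rr01 + 1.0


# Hypothetical ratios, not taken from any study.
example = reri(rr10=2.0, rr01=3.0, rr11=7.0)
print(example)  # a positive value indicates super-additive interaction
```

A value of zero corresponds to exact additivity of the two exposure effects, and a negative value to sub-additive interaction.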
Methods from causal mediation analysis have generalized the traditional approach to direct and indirect effects in the epidemiologic and social science literature by allowing for interaction and non-linearities. However, the methods from the causal inference literature have themselves been subject to a major limitation in that the so-called natural direct and indirect effects that are employed are not identified from data whenever there is a variable that is affected by the exposure, which also confounds the relationship between the mediator and the outcome. In this paper we describe three alternative approaches to effect decomposition that give quantities that can be interpreted as direct and indirect effects, and that can be identified from data even in the presence of an exposure-induced mediator-outcome confounder. We describe a simple weighting-based estimation method for each of these three approaches, illustrated with data from perinatal epidemiology. The methods described here can shed insight into pathways and questions of mediation even when an exposure-induced mediator-outcome confounder is present.
doi:10.1097/EDE.0000000000000034
PMCID: PMC4214081
PMID: 24487213
Recent simulation studies have pointed to the higher power of the test for the mediated effect vs. the test for the total effect, even in the presence of a direct effect. This has motivated applied researchers to investigate mediation in settings where there is no evidence of a total effect. In this paper we provide analytical insight into the circumstances under which higher power of the test for the mediated effect vs. the test for the total effect can be expected in the absence of a direct effect. We argue that the acclaimed power gain is somewhat deceptive and comes with a big price. On the basis of the results, we recommend that when the primary interest lies in mediation only, a significant test for the total effect should not be used as a prerequisite for the test for the indirect effect. However, because the test for the indirect effect is vulnerable to bias when common causes of mediator and outcome are not measured or not accounted for, it should be evaluated in a sensitivity analysis.
doi:10.3389/fpsyg.2014.01549
PMCID: PMC4290592
PMID: 25628585
mediation analysis; power; indirect effect; type I error; confounding; sensitivity analysis
A sufficient cause interaction between two exposures signals the presence of individuals for whom the outcome would occur only under certain values of the two exposures. When the outcome is dichotomous and all exposures are categorical, then under certain no confounding assumptions, empirical conditions for sufficient cause interactions can be constructed based on the sign of linear contrasts of conditional outcome probabilities between differently exposed subgroups, given confounders. It is argued that logistic regression models are unsatisfactory for evaluating such contrasts, and that Bernoulli regression models with linear link are prone to misspecification. We therefore develop semiparametric tests for sufficient cause interactions under models which postulate probability contrasts in terms of a finite-dimensional parameter, but which are otherwise unspecified. Estimation is often not feasible in these models because it would require nonparametric estimation of auxiliary conditional expectations given high-dimensional variables. We therefore develop ‘multiply robust tests’ under a union model that assumes at least one of several working submodels holds. In the special case of a randomized experiment or a family-based genetic study in which the joint exposure distribution is known by design or Mendelian inheritance, the procedure leads to asymptotically distribution-free tests of the null hypothesis of no sufficient cause interaction.
doi:10.1111/j.1467-9868.2011.01011.x
PMCID: PMC4280915
PMID: 25558182
Double robustness; Effect modification; Gene-environment interaction; Gene-gene interaction; Semiparametric inference; Sufficient cause; Synergism
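The "sign of linear contrasts" mentioned in this abstract can be made concrete with a short sketch. Within a confounder stratum, write p_ab for the outcome probability under exposure levels a and b. A commonly stated condition is that p11 - p10 - p01 > 0 indicates a sufficient cause interaction, and that p11 - p10 - p01 + p00 > 0 suffices when both exposures are assumed to have monotonic effects. The code only evaluates these contrasts for given probabilities; it ignores sampling uncertainty and is not the semiparametric test developed in the paper. The function name and the probabilities are hypothetical.

```python
def sufficient_cause_contrasts(p00: float, p01: float, p10: float, p11: float) -> dict:
    """Evaluate the two linear contrasts of stratum-specific outcome probabilities."""
    return {
        # Positive value: evidence for sufficient cause interaction
        # without assuming monotonic exposure effects.
        "without_monotonicity": p11 - p10 - p01,
        # Positive value: evidence under the (assumed) monotonicity of both exposures.
        "with_monotonicity": p11 - p10 - p01 + p00,
    }


# Hypothetical outcome probabilities within one confounder stratum.
contrasts = sufficient_cause_contrasts(p00=0.05, p01=0.10, p10=0.15, p11=0.40)
for name, value in contrasts.items():
    print(name, round(value, 3), "positive" if value > 0 else "not positive")
```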
Missing values in covariates of regression models are a pervasive problem in empirical research. Popular approaches for analyzing partially observed datasets include complete case analysis (CCA), multiple imputation (MI), and inverse probability weighting (IPW). In the case of missing covariate values, these methods (as typically implemented) are valid under different missingness assumptions. In particular, CCA is valid under missing not at random (MNAR) mechanisms in which missingness in a covariate depends on the value of that covariate, but is conditionally independent of outcome. In this paper, we argue that in some settings such an assumption is more plausible than the missing at random assumption underpinning most implementations of MI and IPW. When the former assumption holds, although CCA gives consistent estimates, it does not make use of all observed information. We therefore propose an augmented CCA approach which makes the same conditional independence assumption for missingness as CCA, but which improves efficiency through specification of an additional model for the probability of missingness, given the fully observed variables. The new method is evaluated using simulations and illustrated through application to data on reported alcohol consumption and blood pressure from the US National Health and Nutrition Examination Survey, in which data are likely MNAR independent of outcome.
doi:10.1093/biostatistics/kxu023
PMCID: PMC4173105
PMID: 24907708
Complete case analysis; Missing covariates; Missing not at random; Multiple imputation
We consider statistical methods for benchmarking clinical centers based on a dichotomous outcome indicator. Borrowing ideas from the causal inference literature, we aim to reveal how the entire study population would have fared under the current care level of each center. To this end, we evaluate direct standardization based on fixed versus random center effects outcome models that incorporate patient-specific baseline covariates to adjust for differential case-mix. We explore fixed effects (FE) regression with Firth correction and normal mixed effects (ME) regression to maintain convergence in the presence of very small centers. Moreover, we study doubly robust FE regression to avoid outcome model extrapolation. Simulation studies show that shrinkage following standard ME modeling can result in substantial power loss relative to the considered alternatives, especially for small centers. Results are consistent with findings in the analysis of 30-day mortality risk following acute stroke across 90 centers in the Swedish Stroke Register.
doi:10.1093/biostatistics/kxu019
PMCID: PMC4173104
PMID: 24812420
Causal inference; Double robustness; Firth correction; Profiling center performance; Propensity score; Quality of care; Random and fixed effects
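The idea of direct standardization described in this abstract, namely asking how the entire study population would have fared under each center's care level, can be sketched as follows. The example assumes a logistic outcome model with a center-specific intercept and a single centred covariate whose coefficients are already available; all names and numbers are hypothetical, and neither the Firth correction nor the doubly robust variant is shown.

```python
import math


def expit(z: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def standardized_risk(center_effect: float, beta: float, covariates: list) -> float:
    """Average predicted risk if every patient were treated at the given center."""
    return sum(expit(center_effect + beta * x) for x in covariates) / len(covariates)


# Hypothetical inputs: centred, scaled covariate for the whole study population,
# assumed center intercepts, and an assumed case-mix coefficient.
population = [-1.0, -0.5, 0.0, 0.5, 1.0]
center_effects = {"A": -2.0, "B": -1.5}
beta = 0.8

risks = {c: standardized_risk(a, beta, population) for c, a in center_effects.items()}
for center, risk in risks.items():
    print(center, round(risk, 4))
```

Because each center's risk is averaged over the same population, differences between the standardized risks reflect the center effects rather than differences in case-mix, provided the outcome model is adequate.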
Background
The negative impact of rising progesterone levels on pregnancy rates is well known, but data on mature oocyte yield are conflicting. We examined whether delaying the oocyte maturation trigger in IVF/ICSI affected the number of mature oocytes and investigated the potential influence of serum progesterone levels in this process.
Methods
Between January 31, 2011, and December 31, 2011, 262 consecutive patients were monitored using ultrasound plus hormonal evaluation. Those with ≥3 follicles with a mean diameter of ≥18 mm were divided into 2 groups depending on their serum progesterone levels. In cases with a progesterone level ≤1 ng/ml, which was observed in 59 patients, 30-50% of their total number of follicles (only counting those larger than 10 mm) were at least 18 mm in diameter. These patients were randomised into 2 groups: in one group, final oocyte maturation was triggered the same day; for the other, maturation was triggered 24 hours later. Seventy-two patients with progesterone levels >1 ng/ml were randomised in the same manner, irrespective of the percentage of larger follicles (≥18 mm). The number of metaphase II oocytes was our primary outcome variable. Because some patients were included more than once, correction for duplicate patients was performed.
Results
In the study arm with low progesterone (≤1 ng/ml), the mean number of metaphase II oocytes (±SD) was 10.29 (±6.35) in the group with delayed administration of the oocyte maturation trigger versus 7.64 (±3.26) in the control group. After adjusting for age, the mean difference was 2.41 (95% CI: 0.22 to 4.61; p = 0.031). In the study arm with elevated progesterone (>1 ng/ml), the mean numbers of metaphase II oocytes (±SD) were 11.81 (±9.91) and 12.03 (±7.09) for the delayed and control groups, respectively. After adjusting for PCOS (polycystic ovary syndrome) and female pathology, the mean difference was -0.44 (95% CI: -3.65 to 2.78; p = 0.79).
Conclusions
Delaying oocyte maturation in patients with low progesterone levels yields greater numbers of mature oocytes.
Trial registration
B67020108975 (Belgian registration) and NCT01980563 (ClinicalTrials.gov).
doi:10.1186/1477-7827-12-31
PMCID: PMC4008411
PMID: 24758641
Summary
We define natural direct and indirect effects on the exposed. We show that these allow for effect decomposition under weaker identification conditions than population natural direct and indirect effects. When no confounders of the mediator-outcome association are affected by the exposure, identification is possible under essentially the same conditions as for controlled direct effects. Otherwise, identification is still possible with additional knowledge on a nonidentifiable selection-bias function which measures the dependence of the mediator effect on the observed exposure within confounder levels, and which evaluates to zero in a large class of realistic data-generating mechanisms. We argue that natural direct and indirect effects on the exposed are of intrinsic interest in various applications. We moreover show that they coincide with the corresponding population natural direct and indirect effects when the exposure is randomly assigned. In such settings, our results are thus also of relevance for assessing population natural direct and indirect effects in the presence of exposure-induced mediator-outcome confounding, which existing methodology has not been able to address.
doi:10.1111/j.1541-0420.2012.01777.x
PMCID: PMC3989894
PMID: 22989075
Causal inference; Direct effect; Indirect effect; Mediation; Pathway; Time-varying confounding
In genome-wide association studies (GWAS), family-based studies tend to have less power to detect genetic associations than population-based studies, such as case-control studies. This can be an issue when testing if genes in a family-based GWAS have a direct effect on the phenotype of interest over and above their possible indirect effect through a secondary phenotype. When multiple SNPs are tested for a direct effect in the family-based study, a screening step can be used to minimize the burden of multiple comparisons in the causal analysis. We propose a 2-stage screening step that can be incorporated into the family-based association test (FBAT) approach, similar to the conditional mean model approach in the Van Steen algorithm (Van Steen et al., 2005). Simulations demonstrate that the type I error is preserved and that this method is advantageous when multiple markers are tested. This method is illustrated by an application to the Framingham Heart Study.
doi:10.3389/fgene.2013.00243
PMCID: PMC3836057
PMID: 24312120
family-based association analysis; causal inference; genetic pathway; mediation; pleiotropy
We propose a method for testing gene–environment (G × E) interactions on a complex trait in family-based studies in which a phenotypic ascertainment criterion has been imposed. This novel approach employs G-estimation, a semiparametric estimation technique from the causal inference literature, to avoid modeling of the association between the environmental exposure and the phenotype, to gain robustness against unmeasured confounding due to population substructure, and to acknowledge the ascertainment conditions. The proposed test allows for incomplete parental genotypes. It is compared by simulation studies to an analogous conditional likelihood–based approach and to the QBAT-I test, which also invokes the G-estimation principle but ignores ascertainment. We apply our approach to a study of chronic obstructive pulmonary disorder.
doi:10.1093/biostatistics/kxr035
PMCID: PMC3372944
PMID: 22084302
Causal inference; COPD; Family-based association; G-estimation; Gene–environment interaction
Background
Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution. In the Bayesian framework, model selection is typically performed through the evaluation of a Bayes factor, the ratio of two marginal likelihoods (one for each model). Recently introduced techniques to estimate (log) marginal likelihoods, such as path sampling and stepping-stone sampling, offer increased accuracy over the traditional harmonic mean estimator at an increased computational cost. Most often, each model’s marginal likelihood will be estimated individually, which leads the resulting Bayes factor to suffer from errors associated with each of these independent estimation processes.
Results
We here assess the original ‘model-switch’ path sampling approach for direct Bayes factor estimation in phylogenetics, as well as an extension that uses more samples, to construct a direct path between two competing models, thereby eliminating the need to calculate each model’s marginal likelihood independently. Further, we provide a competing Bayes factor estimator using an adaptation of the recently introduced stepping-stone sampling algorithm and set out to determine appropriate settings for accurately calculating such Bayes factors, with context-dependent evolutionary models as an example. While we show that modest efforts are required to roughly identify the increase in model fit, only drastically increased computation times ensure the accuracy needed to detect more subtle details of the evolutionary process.
Conclusions
We show that our adaptation of stepping-stone sampling for direct Bayes factor calculation outperforms the original path sampling approach as well as an extension that exploits more samples. Our proposed approach for Bayes factor estimation also has preferable statistical properties over the use of individual marginal likelihood estimates for both models under comparison. Assuming a sigmoid function to determine the path between two competing models, we provide evidence that a single well-chosen sigmoid shape value requires less computational effort to approximate the true value of the (log) Bayes factor compared to the original approach. We show that the (log) Bayes factors calculated using path sampling and stepping-stone sampling differ drastically from those estimated using either of the harmonic mean estimators, supporting earlier claims that the latter systematically overestimate the performance of high-dimensional models, which we show can lead to erroneous conclusions. Based on our results, we argue that highly accurate estimation of differences in model fit for high-dimensional models requires much more computational effort than suggested in recent studies on marginal likelihood estimation.
doi:10.1186/1471-2105-14-85
PMCID: PMC3651733
PMID: 23497171
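To make the stepping-stone idea referred to in this abstract concrete, the sketch below estimates a log marginal likelihood for a toy conjugate model (observations N(mu, 1) with prior mu ~ N(0, 1)), where every power posterior can be sampled exactly and the true answer is available in closed form. Everything here is an assumption made for illustration: the toy model, the cubic spacing of the powers, the sample sizes, and the function names. It estimates one model's marginal likelihood and then forms a log Bayes factor against a fixed-mean model, which is the "individual marginal likelihood" route; the direct model-switch path studied in the paper is not implemented.

```python
import math
import random


def log_likelihood(mu, data):
    """Log-likelihood of i.i.d. N(mu, 1) observations."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in data)


def exact_log_marginal(data):
    """Closed-form log marginal likelihood under the prior mu ~ N(0, 1)."""
    n, s, ss = len(data), sum(data), sum(y * y for y in data)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n)
            - 0.5 * (ss - s * s / (1 + n)))


def stepping_stone_log_marginal(data, n_stones=32, n_draws=2000, seed=0):
    """Sum of log ratios of normalising constants between adjacent power posteriors."""
    rng = random.Random(seed)
    n, s = len(data), sum(data)
    # Powers from 0 (prior) to 1 (posterior), concentrated near the prior.
    betas = [(k / n_stones) ** 3 for k in range(n_stones + 1)]
    total = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        # Power posterior at b_prev is normal for this conjugate toy model.
        precision = 1.0 + b_prev * n
        mean, sd = b_prev * s / precision, precision ** -0.5
        terms = [(b_next - b_prev) * log_likelihood(rng.gauss(mean, sd), data)
                 for _ in range(n_draws)]
        top = max(terms)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(t - top) for t in terms) / n_draws)
    return total


# Simulated data, for illustration only.
data_rng = random.Random(1)
data = [data_rng.gauss(0.5, 1.0) for _ in range(20)]

estimate = stepping_stone_log_marginal(data)
exact = exact_log_marginal(data)
# Log Bayes factor of the free-mean model against a model with mu fixed at 0.
log_bf = estimate - log_likelihood(0.0, data)
print(round(estimate, 3), round(exact, 3), round(log_bf, 3))
```

In realistic phylogenetic models the power posteriors must be sampled by MCMC rather than exactly, which is where the computational cost discussed in the abstract arises.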
Summary
It is useful to have robust gene-environment interaction tests that can utilize a variety of family structures in an efficient way. This paper focuses on tests for gene-environment interaction in the presence of main genetic and environmental effects. The objective is to develop powerful tests that can combine trio data with parental genotypes and discordant sibships when parental genotypes are missing. We first make a modest improvement on a method for discordant sibs (discordant on phenotype), but the approach does not allow one to use families when all offspring are affected, e.g. trios. We then make a modest improvement on a Mendelian transmission-based approach that is inefficient when discordant sibs are available, but can be applied to any nuclear family. Finally, we propose a hybrid approach that utilizes the most efficient method for a specific family type, then combines over families. We utilize this hybrid approach to analyze a chronic obstructive pulmonary disorder dataset to test for gene-environment interaction in the Serpine2 gene with smoking. The methods are freely available in the R package fbati.
doi:10.1111/j.1541-0420.2011.01581.x
PMCID: PMC3120904
PMID: 21401569
Gene-Environment Interaction; Family-Based Association Tests; Candidate Gene Analysis; Binary Trait; COPD; Serpine2
Estimates of additive interaction from case-control data are often obtained by logistic regression; such models can also be used to adjust for covariates. This approach to estimating additive interaction has come under some criticism because of possible misspecification of the logistic model: If the underlying model is linear, the logistic model will be misspecified. The authors propose an inverse probability of treatment weighting approach to causal effects and additive interaction in case-control studies. Under the assumption of no unmeasured confounding, the approach amounts to fitting a marginal structural linear odds model. The approach allows for the estimation of measures of additive interaction between dichotomous exposures, such as the relative excess risk due to interaction, using case-control data without having to rely on modeling assumptions for the outcome conditional on the exposures and covariates. Rather than using conditional models for the outcome, models are instead specified for the exposures conditional on the covariates. The approach is illustrated by assessing additive interaction between genetic and environmental factors using data from a case-control study.
doi:10.1093/aje/kwr334
PMCID: PMC3246690
PMID: 22058231
case-control studies; interaction; linear model; structural model; synergism; weighting
For dichotomous outcomes, the authors discuss when the standard approaches to mediation analysis used in epidemiology and the social sciences are valid, and they provide alternative mediation analysis techniques when the standard approaches will not work. They extend definitions of controlled direct effects and natural direct and indirect effects from the risk difference scale to the odds ratio scale. A simple technique to estimate direct and indirect effect odds ratios by combining logistic and linear regressions is described that applies when the outcome is rare and the mediator continuous. Further discussion is given as to how this mediation analysis technique can be extended to settings in which data come from a case-control study design. For the standard mediation analysis techniques used in the epidemiologic and social science literatures to be valid, an assumption of no interaction between the effects of the exposure and the mediator on the outcome is needed. The approach presented here, however, will apply even when there are interactions between the effect of the exposure and the mediator on the outcome.
doi:10.1093/aje/kwq332
PMCID: PMC2998205
PMID: 21036955
case-control studies; causal inference; decomposition; dichotomous response; epidemiologic methods; interaction; logistic regression; odds ratio
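The technique of combining a logistic regression for the outcome with a linear regression for the mediator, described in this abstract, is usually summarised by closed-form expressions for the direct and indirect effect odds ratios. The sketch below assumes the outcome model logit P(Y=1 | a, m, c) = theta0 + theta1*a + theta2*m + theta3*a*m + theta4'c, the mediator model E[M | a, c] = beta0 + beta1*a + beta2'c with residual variance sigma2, and a rare outcome. The formulas are written as they are commonly stated for this setting and should be checked against the full paper; parameter names and numeric values are hypothetical.

```python
import math


def mediation_odds_ratios(theta1, theta2, theta3, beta0, beta1, beta2_c, sigma2,
                          a=1.0, a_star=0.0):
    """Natural direct and indirect effect odds ratios for exposure a versus a_star.

    beta2_c is the covariate contribution beta2'c at the chosen covariate values.
    theta3 is the exposure-mediator interaction coefficient (0 if no interaction).
    """
    log_nde = ((theta1 + theta3 * (beta0 + beta1 * a_star + beta2_c + theta2 * sigma2))
               * (a - a_star)
               + 0.5 * theta3 ** 2 * sigma2 * (a ** 2 - a_star ** 2))
    log_nie = (theta2 * beta1 + theta3 * beta1 * a) * (a - a_star)
    return math.exp(log_nde), math.exp(log_nie)


# Hypothetical coefficient values, for illustration only.
or_nde, or_nie = mediation_odds_ratios(
    theta1=0.4, theta2=0.3, theta3=0.1,
    beta0=1.0, beta1=0.5, beta2_c=0.2, sigma2=1.5,
)
print(round(or_nde, 3), round(or_nie, 3))
```

With theta3 set to zero the expressions reduce to exp(theta1) and exp(theta2*beta1) for a one-unit change in exposure, which matches the standard product-of-coefficients reasoning when there is no exposure-mediator interaction.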
Background
Crucial connections between sexual network structure and the distribution of HIV remain inadequately understood, especially in regard to the role of concurrency and age disparity in relationships, and how these network characteristics correlate with each other and other risk factors. Social desirability bias and inaccurate recall are obstacles to obtaining valid, detailed information about sexual behaviour and relationship histories. Therefore, this study aims to use novel research methods in order to determine whether HIV status is associated with age-disparity and sexual connectedness as well as establish the primary behavioural and socio-demographic predictors of the egocentric and community sexual network structures.
Method/Design
We will conduct a cross-sectional survey that uses a questionnaire exploring one-year sexual histories, with a focus on timing and age disparity of relationships, as well as other risk factors such as unprotected intercourse and the use of alcohol and recreational drugs. The questionnaire will be administered in a safe and confidential mobile interview space, using audio computer-assisted self-interview (ACASI) technology on touch screen computers. The ACASI features a choice of languages and visual feedback of temporal information. The survey will be administered in three peri-urban disadvantaged communities in the greater Cape Town area with a high burden of HIV. The study communities participated in a previous TB/HIV study, from which HIV test results will be anonymously linked to the survey dataset. Statistical analyses of the data will include descriptive statistics, linear mixed-effects models for the inter- and intra-subject variability in the age difference between sexual partners, survival analysis for correlated event times to model concurrency patterns, and logistic regression for association of HIV status with age disparity and sexual connectedness.
Discussion
This study design is intended to facilitate more accurate recall of sensitive sexual history data and has the potential to provide substantial insights into the relationship between key sexual network attributes and additional risk factors for HIV infection. This will help to inform the design of context-specific HIV prevention programmes.
doi:10.1186/1471-2458-11-616
PMCID: PMC3161892
PMID: 21810237
Background
Accurate modelling of substitution processes in protein-coding sequences is often hampered by the computational burdens associated with full codon models. Lately, codon partition models have been proposed as a viable alternative, mimicking the substitution behaviour of codon models at a low computational cost. Such codon partition models however impose independent evolution of the different codon positions, which is overly restrictive from a biological point of view. Given that empirical research has provided indications of context-dependent substitution patterns at four-fold degenerate sites, we take those indications into account in this paper.
Results
We present so-called context-dependent codon partition models to assess previous empirical claims that the evolution of four-fold degenerate sites is strongly dependent on the composition of its two flanking bases. To this end, we have estimated and compared various existing independent models, codon models, codon partition models and context-dependent codon partition models for the atpB and rbcL genes of the chloroplast genome, which are frequently used in plant systematics. Such context-dependent codon partition models employ a full dependency scheme for four-fold degenerate sites, whilst maintaining the independence assumption for the first and second codon positions.
Conclusions
We show that, both in the atpB and rbcL alignments of a collection of land plants, these context-dependent codon partition models significantly improve model fit over existing codon partition models. Using Bayes factors based on thermodynamic integration, we show that in both datasets the same context-dependent codon partition model yields the largest increase in model fit compared to an independent evolutionary model. Context-dependent codon partition models hence perform closer to codon models, which remain the best performing models at a drastically increased computational cost, compared to codon partition models, but remain computationally interesting alternatives to codon models. Finally, we observe that the substitution patterns in both datasets are drastically different, leading to the conclusion that combined analysis of these two genes using a single model may not be advisable from a context-dependent point of view.
doi:10.1186/1471-2148-11-145
PMCID: PMC3126739
PMID: 21619569
A primary focus of an increasing number of scientific studies is to determine whether two exposures interact in the effect that they produce on an outcome of interest. Interaction is commonly assessed by fitting regression models in which the linear predictor includes the product between those exposures. When the main interest lies in the interaction, this approach is not entirely satisfactory because it is prone to (possibly severe) bias when the main exposure effects or the association between outcome and extraneous factors are misspecified. In this article, we therefore consider conditional mean models with identity or log link which postulate the statistical interaction in terms of a finite-dimensional parameter, but which are otherwise unspecified. We show that estimation of the interaction parameter is often not feasible in this model because it would require nonparametric estimation of auxiliary conditional expectations given high-dimensional variables. We thus consider ‘multiply robust estimation’ under a union model that assumes at least one of several working submodels holds. Our approach is novel in that it makes use of information on the joint distribution of the exposures conditional on the extraneous factors in making inferences about the interaction parameter of interest. In the special case of a randomized trial or a family-based genetic study in which the joint exposure distribution is known by design or by Mendelian inheritance, the resulting multiply robust procedure leads to asymptotically distribution-free tests of the null hypothesis of no interaction on an additive scale. We illustrate the methods via simulation and the analysis of a randomized follow-up study.
doi:10.1198/016214508000001084
PMCID: PMC3097121
PMID: 21603124
Double robustness; Gene-environment interaction; Gene-gene interaction; Longitudinal data; Semiparametric inference
When testing for genetic effects, failure to account for a gene-environment interaction can mask the true association effects of a genetic marker with disease. Family-based association tests are popular because they are completely robust to population substructure and model misspecification. However, when testing for an interaction, failure to model the main genetic effect correctly can lead to spurious results. Here we propose a family-based test for interaction that is robust to model misspecification, but still sensitive to an interaction effect, and can handle continuous covariates and missing parents. We extend the FBAT-I gene-environment interaction test for dichotomous traits to use both trios and sibships. We then compare this extension to joint tests of gene and gene-environment interaction, and compare the joint test additionally to the main effects test of the gene. Lastly, we apply these three tests to a group of nuclear families ascertained according to affection with Bipolar Disorder.
doi:10.1002/gepi.20421
PMCID: PMC3082448
PMID: 19365860
genetic association; genetic interaction; family-based test; FBAT-I
We develop a locally efficient test for (multiplicative) gene–environment interaction in family studies that collect genotypic information and environmental exposures for affected offspring along with genotypic information for their parents or relatives. The proposed test does not require modeling the effects of environmental exposures and is doubly robust in the sense of being valid if either a model for the main genetic effect holds or a model for the expected environmental exposure (given the offspring affection status and parental mating types) but not necessarily both. It extends the FBAT-I to allow for missing parental mating types and families of arbitrary size. Simulation studies and the analysis of an Alzheimer's disease study confirm the adequate performance of the proposed test.
doi:10.1093/biostatistics/kxp061
PMCID: PMC2830581
PMID: 20154305
Causal inference; Double robustness; Effect modification; Gene–environment interaction; Genetic association; Nuclear family; Semiparametric interaction model
Sufficient cause interactions concern cases in which there is a particular causal mechanism for some outcome that requires the presence of 2 or more specific causes to operate. Empirical conditions have been derived to test for sufficient cause interactions. However, when regression outcome models are used to control for confounding variables in tests for sufficient cause interactions, the outcome models impose restrictions on the relation between the confounding variables and certain unidentified background causes within the sufficient cause framework; often, these assumptions are implausible. By using marginal structural models, rather than outcome regression models, to test for sufficient cause interactions, modeling assumptions are instead made on the relation between the causes of interest and the confounding variables; these assumptions will often be more plausible. The use of marginal structural models also allows for testing for sufficient cause interactions in the presence of time-dependent confounding. Such time-dependent confounding may arise in cases in which one factor of interest affects both the second factor of interest and the outcome. It is furthermore shown that marginal structural models can be used not only to test for sufficient cause interactions but also to give lower bounds on the prevalence of such sufficient cause interactions.
doi:10.1093/aje/kwp396
PMCID: PMC2877448
PMID: 20067916
causal inference; interaction; marginal structural models; sufficient causes; synergism; weighting
We introduce a stepwise approach for family-based designs for selecting a set of markers in a gene that are independently associated with the disease. The approach is based on testing the effect of a set of markers conditional on another set of markers. Several likelihood-based approaches have been proposed for special cases, but no model-free tests have been proposed. We propose two types of tests in a family-based framework that are applicable to arbitrary family structures and completely robust to population stratification. We propose methods for ascertained dichotomous traits and unascertained quantitative traits. We first propose a completely model-free extension of the FBAT main genetic effect test. Then, to improve power, we introduce two model-based tests, one for dichotomous traits and one for continuous traits. Lastly, we utilize these tests to analyze a continuous lung function phenotype as a proxy for asthma in the Childhood Asthma Management Program. The methods are implemented in the free R package fbati.
doi:10.1159/000264447
PMCID: PMC2956011
PMID: 19996607
Binary trait; Candidate gene analysis; Family-based association tests; FBAT-C; Linkage disequilibrium (LD); Model-based test; Model-free test; Nuclear families; Quantitative trait
Background
Recent approaches for context-dependent evolutionary modelling assume that the evolution of a given site depends upon its ancestor and that ancestor's immediate flanking sites. Because such a dependency pattern cannot be imposed on the root sequence, we consider the use of different orders of Markov chains to model dependence at the ancestral root sequence. Root distributions which are coupled to the context-dependent model across the underlying phylogenetic tree are deemed more realistic than decoupled Markov chain models, as the evolutionary process is responsible for shaping the composition of the ancestral root sequence.
Results
We find strong support, in terms of Bayes Factors, for using a second-order Markov chain at the ancestral root sequence along with a context-dependent model throughout the remainder of the phylogenetic tree in an ancestral repeats dataset, and for using a first-order Markov chain at the ancestral root sequence in a pseudogene dataset. Relaxing the assumption of a single context-independent set of independent model frequencies, as presented in previous work, yields a further drastic increase in model fit. We show that the substitution rates associated with the CpG-methylation-deamination process can be modelled through context-dependent model frequencies and that their accuracy depends on the (order of the) Markov chain imposed at the ancestral root sequence. In addition, we provide evidence that this approach (which assumes that root distribution and evolutionary model are decoupled) outperforms an approach inspired by the work of Arndt et al., where the root distribution is coupled to the evolutionary model. We show that the continuous-time approximation of Hwang and Green has stronger support in terms of Bayes Factors, but the parameter estimates show minimal differences.
Conclusions
We show that the combination of a dependency scheme at the ancestral root sequence and a context-dependent evolutionary model across the remainder of the tree allows for accurate estimation of the model's parameters. The different assumptions tested in this manuscript clearly show that designing accurate context-dependent models is a complex process, with many different assumptions that require validation. Further, these assumptions are shown to change across different datasets, making the search for an adequate model for a given dataset quite challenging.
doi:10.1186/1471-2148-10-244
PMCID: PMC2928787
PMID: 20698960
Instrumental variables (IV) estimators are well established to correct for measurement error on exposure in a broad range of fields. In a distinct, prominent stream of research, IVs are becoming increasingly popular for estimating causal effects of exposure on outcome, since they allow for unmeasured confounders, which are hard to avoid. Because many causal questions emerge from data which suffer from severe measurement error problems, we combine both IV approaches in this article to correct IV-based causal effect estimators in linear (structural mean) models for possibly systematic measurement error on the exposure. The estimators rely on the presence of a baseline measurement which is associated with the observed exposure and known not to modify the target effect. Simulation studies and the analysis of a small blood pressure reduction trial (n = 105) with treatment noncompliance confirm the adequate performance of our estimators in finite samples. Our results also demonstrate that incorporating limited prior knowledge about a weakly identified parameter (such as the error mean) in a frequentist analysis can yield substantial improvements.
PMCID: PMC2743431
PMID: 20046952
causal inference; instrumental variables; measurement error; noncompliance; prior information; two-stage least squares estimators; weak identifiability
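For orientation, the uncorrected starting point that the abstract above builds on can be sketched as follows. This is the textbook single-instrument two-stage least squares (ratio) estimator for a linear model, not the measurement-error-corrected estimator of the article. The function name and the variable names `z` (instrument), `x` (exposure) and `y` (outcome) are illustrative assumptions.

```python
def two_stage_least_squares(z, x, y):
    """Single-instrument 2SLS slope for the effect of x on y.

    With one instrument and an intercept, regressing x on z and then y on
    the fitted x reduces to the ratio cov(z, y) / cov(z, x).
    """
    n = len(z)
    if not (n == len(x) == len(y)) or n < 2:
        raise ValueError("z, x and y must have the same length (at least 2)")
    mean_z = sum(z) / n
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_zy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    cov_zx = sum((zi - mean_z) * (xi - mean_x) for zi, xi in zip(z, x))
    if cov_zx == 0:
        raise ValueError("instrument is uncorrelated with the exposure")
    return cov_zy / cov_zx


# Toy data: u is an unmeasured confounder, balanced across levels of z.
z = [0, 0, 1, 1]
u = [-1, 1, -1, 1]
x = [zi + ui for zi, ui in zip(z, u)]          # exposure depends on z and u
y = [2 * xi + 3 * ui for xi, ui in zip(x, u)]  # true effect of x on y is 2
iv_estimate = two_stage_least_squares(z, x, y)
```

In this toy example the exposure is measured without error. The abstract's concern is that measurement error on the observed exposure distorts an estimator of this kind, which motivates the correction based on a baseline measurement.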
Background
Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided to model context-dependent evolution, a large increase in model dimensionality is only justified when accompanied by careful model-building strategies that guard against overfitting. An increased dimensionality leads to more numerical computation for the models, longer convergence times in Bayesian Markov chain Monte Carlo algorithms and even more tedious Bayes Factor calculations.
Results
We have developed two model-search algorithms which reduce the number of Bayes Factor calculations by clustering posterior densities to decide on the equality of substitution behavior in different contexts. The selected model's fit is evaluated using a Bayes Factor, which we calculate via model-switch thermodynamic integration. To reduce computation time and to increase the precision of this integration, we propose to split the calculations over different computers and to appropriately calibrate the individual runs. Using the proposed strategies, we find, in a dataset of primate Ancestral Repeats, that careful modeling of context-dependent evolution may increase model fit considerably and that the combination of a context-dependent model with the assumption of varying rates across sites offers even larger improvements in terms of model fit. Using a smaller nuclear SSU rRNA dataset, we show that context-dependence may only become detectable upon applying model-building strategies.
Conclusion
While context-dependent evolutionary models can increase the model fit over traditional independent evolutionary models, such complex models will often contain too many parameters. Justification for the added parameters is thus required so that only those parameters that model evolutionary processes previously unaccounted for are added to the evolutionary model. To obtain an optimal balance between the number of parameters in a context-dependent model and the performance in terms of model fit, we have designed two parameter-reduction strategies and we have shown that model fit can be greatly improved by reducing the number of parameters in a context-dependent evolutionary model.
doi:10.1186/1471-2148-9-87
PMCID: PMC2695821
PMID: 19405957
Objectives To assess gestational length and prevalence of preterm birth among medically and naturally conceived twins; to establish the role of zygosity and chorionicity in assessing gestational length in twins born after subfertility treatment.
Design Population based cohort study.
Setting Collaborative network of 19 maternity facilities in East Flanders, Belgium (East Flanders prospective twin survey).
Participants 4368 twin pairs born between 1976 and 2002, including 2915 spontaneous twin pairs, 710 twin pairs born after ovarian stimulation, and 743 twin pairs born after in vitro fertilisation or intracytoplasmic sperm injection.
Main outcome measures Gestational length and prevalence of preterm birth.
Results Compared with naturally conceived twins, twins resulting from subfertility treatment had on average a slightly decreased gestational age at birth (mean difference 4.0 days, 95% confidence interval 2.7 to 5.2), corresponding to an odds ratio of 1.6 (1.4 to 1.8) for preterm birth, albeit confined to mild preterm birth (34-36 weeks). The adjusted odds ratios of preterm birth after subfertility treatment were 1.3 (1.1 to 1.5) when controlled for birth year, maternal age, and parity and 1.6 (1.3 to 1.8) with additional control for fetal sex, caesarean section, zygosity, and chorionicity. Although an increased risk of preterm birth was therefore seen among twins resulting from subfertility treatment, the risk was largely caused by a first birth effect among subfertile couples; conversely, the risk of prematurity was substantially levelled off by the protective effect of dizygotic twinning.
Conclusions Twins resulting from subfertility treatment have an increased risk of preterm birth, but the risk is limited to mild preterm birth, primarily by virtue of dizygotic twinning.
doi:10.1136/bmj.38625.685706.AE
PMCID: PMC1285094
PMID: 16249191