# Related Articles

Background

In trials designed to estimate rates of perinatal mother to child transmission of HIV, HIV assays are scheduled at multiple points in time. Still, infection status for some infants at some time points may be unknown, particularly when interim analyses are conducted.

Methods

Logistic regression models are commonly used to estimate covariate-adjusted transmission rates, but their methods for handling missing data may be inadequate. Here we propose using coarsened multinomial regression models to estimate cumulative and conditional rates of HIV transmission. Through simulation, we compare the proposed models to standard logistic models in terms of bias, mean squared error, coverage probability, and power. We consider a range of treatment effect and visit process scenarios, while including imperfect sensitivity of the assay and contamination of the endpoint due to early breastfeeding transmission. We illustrate the approach through analysis of data from a clinical trial designed to prevent perinatal transmission.

Results

The proposed cumulative and conditional models performed well when compared to their logistic counterparts. Performance of the proposed cumulative model was particularly strong under scenarios where treatment was assumed to increase the risk of in utero transmission but decrease the risk of intrapartum and overall perinatal transmission and under scenarios designed to represent interim analyses. Power to estimate intrapartum and perinatal transmission was consistently higher for the proposed models.

Conclusion

Coarsened multinomial regression models are preferred to standard logistic models for estimation of perinatal mother to child transmission of HIV, particularly when assays are missing or occur off-schedule for some infants.

doi:10.1186/1471-2288-8-46

PMCID: PMC2515333
PMID: 18627627

Background

The absence of a gold standard, i.e., a diagnostic reference standard having perfect sensitivity and specificity, is a common problem in clinical practice and in diagnostic research studies. There is a need for methods to estimate the incremental value of a new, imperfect test in this context.

Methods

We use a Bayesian approach to estimate the probability of the unknown disease status via a latent class model and extend two commonly-used measures of incremental value based on predictive values [difference in the area under the ROC curve (AUC) and integrated discrimination improvement (IDI)] to the context where no gold standard exists. The methods are illustrated using simulated data and applied to the problem of estimating the incremental value of a novel interferon-gamma release assay (IGRA) over the tuberculin skin test (TST) for latent tuberculosis (TB) screening. We also show how to estimate the incremental value of IGRAs when decisions are based on observed test results rather than predictive values.
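As a point of reference, the integrated discrimination improvement (IDI) in its usual form, where disease status is known, can be sketched as follows. This is only the conventional gold-standard version; the extension to an unknown disease status described above is not attempted here, and the function name and toy numbers are illustrative assumptions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def idi(p_old, p_new, disease):
    """Conventional IDI when true disease status is known.

    p_old, p_new: predicted disease probabilities without and with the new test.
    disease: 1 for diseased, 0 for non-diseased (assumed known in this sketch).
    """
    old_dis = [p for p, d in zip(p_old, disease) if d == 1]
    new_dis = [p for p, d in zip(p_new, disease) if d == 1]
    old_non = [p for p, d in zip(p_old, disease) if d == 0]
    new_non = [p for p, d in zip(p_new, disease) if d == 0]
    # Gain in mean predicted risk among the diseased, minus the
    # change in mean predicted risk among the non-diseased.
    return (mean(new_dis) - mean(old_dis)) - (mean(new_non) - mean(old_non))


# Toy data, made up for illustration only.
disease = [1, 1, 0, 0]
p_old = [0.6, 0.5, 0.4, 0.3]
p_new = [0.8, 0.7, 0.2, 0.1]
idi_value = idi(p_old, p_new, disease)
print(f"IDI = {idi_value:.2f}")
```

A positive value indicates that adding the new test moves predicted probabilities up for diseased subjects and down for non-diseased subjects.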

Results

We showed that the incremental value is greatest when both sensitivity and specificity of the new test are better and that conditional dependence between the tests reduces the incremental value. The incremental value of the IGRA depends on the sensitivity and specificity of the TST, as well as the prevalence of latent TB, and may thus vary in different populations.

Conclusions

Even in the absence of a gold standard, incremental value statistics may be estimated and can aid decisions about the practical value of a new diagnostic test.

doi:10.1186/1471-2288-14-67

PMCID: PMC4077291
PMID: 24886359

Area under the curve; Bayesian estimation; Incremental value; Informative priors; Integrated discrimination improvement; Imperfect diagnostic tests; Latent class models; Tuberculosis

Summary

We extend the standard multivariate mixed model by incorporating a smooth time effect and relaxing distributional assumptions. We propose a semiparametric Bayesian approach to multivariate longitudinal data using a mixture of Polya trees prior distribution. Usually, the distribution of random effects in a longitudinal data model is assumed to be Gaussian. However, the normality assumption may be suspect, particularly if the estimated longitudinal trajectory parameters exhibit multimodality and skewness. In this paper we propose a mixture of Polya trees prior density to address the limitations of the parametric random effects distribution. We illustrate the methodology by analyzing data from a recent HIV-AIDS study.

doi:10.1111/j.1467-842X.2010.00581.x

PMCID: PMC3127550
PMID: 21731424

Conditional predictive ordinate; Longitudinal data; Mixture of Polya trees; Penalized spline

Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.

doi:10.1198/jasa.2011.ap10058

PMCID: PMC3324040
PMID: 22505787

Factor analysis; Latent variables; Mixture model; Model-based clustering; Nested Dirichlet process; Order restriction; Random probability measure; Stick breaking

Missing covariate data is common in observational studies of time to an event, especially when covariates are repeatedly measured over time. Failure to account for the missing data can lead to bias or loss of efficiency, especially when the data are non-ignorably missing. Previous work has focused on the case of fixed covariates rather than those that are repeatedly measured over the follow-up period, so here we present a selection model that allows for proportional hazards regression with time-varying covariates when some covariates may be non-ignorably missing. We develop a fully Bayesian model and obtain posterior estimates of the parameters via the Gibbs sampler in WinBUGS. We illustrate our model with an analysis of post-diagnosis weight change and survival after breast cancer diagnosis in the Long Island Breast Cancer Study Project (LIBCSP) follow-up study. Our results indicate that post-diagnosis weight gain is associated with lower all-cause and breast cancer specific survival among women diagnosed with new primary breast cancer. Our sensitivity analysis showed only slight differences between models with different assumptions on the missing data mechanism yet the complete case analysis yielded markedly different results.

doi:10.1002/sim.4076

PMCID: PMC3253577
PMID: 20960582

proportional hazards regression; non-ignorably missing data; missing covariates; selection model

Background

The Spectrum program is used to estimate key HIV indicators from the trends in incidence and prevalence estimated by the Estimation and Projection Package or the Workbook. These indicators include the number of people living with HIV, new infections, AIDS deaths, AIDS orphans, the number of adults and children needing treatment, the need for prevention of mother-to-child transmission and the impact of antiretroviral treatment on survival. The UNAIDS Reference Group on Estimates, Models and Projections regularly reviews new data and information needs, and recommends updates to the methodology and assumptions used in Spectrum.

Methods

The latest update to Spectrum was used in the 2009 round of global estimates. This update contains new procedures for estimating: the age and sex distribution of adult incidence, new child infections occurring around delivery or through breastfeeding, the survival of children by timing of infection and the number of double orphans.

doi:10.1136/sti.2010.044222

PMCID: PMC3173821
PMID: 21106510

HIV; modelling; AIDS; estimates; epidemiology

Becquet, Renaud | Marston, Milly | Dabis, François | Moulton, Lawrence H. | Gray, Glenda | Coovadia, Hoosen M. | Essex, Max | Ekouevi, Didier K. | Jackson, Debra | Coutsoudis, Anna | Kilewo, Charles | Leroy, Valériane | Wiktor, Stefan Z. | Nduati, Ruth | Msellati, Philippe | Zaba, Basia | Ghys, Peter D. | Newell, Marie-Louise | Bhutta, Zulfiqar A.
Background

Assumptions about survival of HIV-infected children in Africa without antiretroviral therapy need to be updated to inform ongoing UNAIDS modelling of paediatric HIV epidemics among children. Improved estimates of infant survival by timing of HIV-infection (perinatally or postnatally) are thus needed.

Methodology/Principal Findings

A pooled analysis was conducted of individual data of all available intervention cohorts and randomized trials on prevention of HIV mother-to-child transmission in Africa. Studies were right-censored at the time of infant antiretroviral initiation. Overall mortality rate per 1000 child-years of follow-up was calculated by selected maternal and infant characteristics. The Kaplan-Meier method was used to estimate survival curves by child's HIV infection status and timing of HIV infection. Individual data from 12 studies were pooled, with 12,112 children of HIV-infected women. Mortality rates per 1,000 child-years follow-up were 39.3 and 381.6 for HIV-uninfected and infected children respectively. One year after acquisition of HIV infection, an estimated 26% postnatally and 52% perinatally infected children would have died; and 4% uninfected children by age 1 year. Mortality was independently associated with maternal death (adjusted hazard ratio 2.2, 95%CI 1.6–3.0), maternal CD4<350 cells/ml (1.4, 1.1–1.7), postnatal (3.1, 2.1–4.1) or peri-partum HIV-infection (12.4, 10.1–15.3).
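The two summary quantities used above, a mortality rate per 1,000 child-years and a Kaplan-Meier survival curve with right-censoring, can be illustrated with a minimal sketch. The follow-up times and event indicators below are invented toy values, not data from the pooled analysis, and the function names are assumptions made for the example.

```python
def mortality_rate_per_1000(times_years, events):
    """Deaths per 1,000 child-years of follow-up."""
    return 1000.0 * sum(events) / sum(times_years)


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at each death time.

    times: follow-up time for each child.
    events: 1 if death was observed, 0 if follow-up was right-censored
            (for example, at antiretroviral initiation).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Toy data (years of follow-up, death indicator), for illustration only.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
rate = mortality_rate_per_1000(times, events)
curve = kaplan_meier(times, events)
print(rate, curve)
```

Censored children contribute to the number at risk until their follow-up ends but never trigger a drop in the curve, which is how right-censoring at treatment initiation enters the estimate.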

Conclusions/Results

These results update previous work and inform future UNAIDS modelling by providing survival estimates for HIV-infected untreated African children by timing of infection. We highlight the urgent need for the prevention of peri-partum and postnatal transmission and timely assessment of HIV infection in infants to initiate antiretroviral care and support for HIV-infected children.

doi:10.1371/journal.pone.0028510

PMCID: PMC3285615
PMID: 22383946

Bayesian Poisson log-linear multilevel models scalable to epidemiological studies are proposed to investigate population variability in sleep state transition rates. Hierarchical random effects are used to account for pairings of subjects and repeated measures within those subjects, as comparing diseased to non-diseased subjects while minimizing bias is of importance. Essentially, non-parametric piecewise constant hazards are estimated and smoothed, allowing for time-varying covariates and segment of the night comparisons. The Bayesian Poisson regression is justified through a re-derivation of a classical algebraic likelihood equivalence of Poisson regression with a log(time) offset and survival regression assuming exponentially distributed survival times. Such re-derivation allows synthesis of two methods currently used to analyze sleep transition phenomena: stratified multi-state proportional hazards models and log-linear models with GEE for transition counts. An example data set from the Sleep Heart Health Study is analyzed. Supplementary material includes the analyzed data set as well as the code for a reproducible analysis.
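The likelihood equivalence mentioned above can be written out for the simplest case of a single constant hazard per subject; the piecewise constant version applies the same argument within each time segment. The notation is an assumption made for this sketch: subject $i$ has follow-up time $t_i$, event indicator $d_i \in \{0, 1\}$, covariates $x_i$, and rate $\lambda_i = \exp(x_i^\top \beta)$.

```latex
\begin{align*}
\ell_{\text{exp}}(\beta)
  &= \sum_i \left[ d_i \log \lambda_i - \lambda_i t_i \right],
  \qquad \lambda_i = \exp(x_i^\top \beta), \\
\ell_{\text{Pois}}(\beta)
  &= \sum_i \left[ d_i \log(\lambda_i t_i) - \lambda_i t_i - \log d_i! \right] \\
  &= \ell_{\text{exp}}(\beta) + \sum_i d_i \log t_i .
\end{align*}
```

Here $\ell_{\text{Pois}}$ is the log-likelihood of $d_i \sim \text{Poisson}(\mu_i)$ with $\log \mu_i = \log t_i + x_i^\top \beta$, and $\log d_i! = 0$ because $d_i$ is 0 or 1. The two log-likelihoods differ only by a term that does not involve $\beta$, so they yield the same maximum likelihood estimates and, under a common prior, the same posterior.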

doi:10.1002/sim.4457

PMCID: PMC3774038
PMID: 22241689

multi-state models; recurrent event; competing risks; survival analysis; frailties; sleep; hypnogram

Tamhane, M. | Gautney, B. | Shiu, C. | Segaren, N. | Jeannis, L. | Eustache, C. | Simeon-Fadois, Y. | Chen, Y. H. | De, D. | Irivinti, S. | Tamma, P. | Thompson, C. B. | Khamadi, S. | Siberry, G.K. | Persaud, D.
Background

Nucleic-acid-testing (NAT) to diagnose HIV infection in children under age 18 months presents a barrier to HIV-testing in exposed children from resource-constrained settings. The ultrasensitive HIV-p24-antigen (Up24) assay is cheaper and easier to perform and is sensitive (84–98%) and specific (98–100%). The cut-point optical density (OD) selected for discriminating between positive and negative samples may need assessment due to regional differences in mother-to-child HIV-transmission rates.

Objectives

We used receiver operating characteristic (ROC) curves and logistic regression analyses to assess the effect of various cut-points on the diagnostic performance of Up24 for HIV-infection status among HIV-exposed children. Positive and negative predictive values at different rates of disease prevalence were also estimated.

Study design

A study of Up24 testing on dried blood spot (DBS) samples collected from 278 HIV-exposed Haitian children, 3–24 months of age, in whom HIV-infection status was determined by NAT on the same DBS card.

Results

The sensitivity and specificity of Up24 varied by the cut-point-OD value selected. At a cut-point-OD of 8-fold the standard deviation of the negative control (NCSD), sensitivity and specificity of Up24 were maximized [87.8% (95% CI, 83.9–91.6) and 92% (95% CI, 88.8–95.2), respectively]. In lower prevalence settings (5%), positive and negative predictive values of Up24 were maximal (75.9% and 98.8%, respectively) at a cut-point-OD that was 15-fold the NCSD.
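The dependence of predictive values on prevalence follows from Bayes' theorem, and a short sketch makes the arithmetic concrete. The example applies the sensitivity and specificity reported at the 8-fold NCSD cut-point to a 5% prevalence setting; that combination is an illustration chosen here, not a figure reported in the abstract, and the function name is an assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity 87.8% and specificity 92% (8-fold NCSD cut-point), 5% prevalence.
ppv, npv = predictive_values(0.878, 0.92, 0.05)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under these assumed inputs the positive predictive value is only about 0.37, because false positives from the large uninfected group outnumber true positives. This is consistent with the finding that a higher, more specific cut-point performs better in low prevalence settings.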

Conclusions

In low prevalence settings, a high degree of specificity can be achieved with Up24 testing of HIV-exposed children when a higher cut-point OD is used; a feature that may facilitate more frequent use of Up24 antigen testing for HIV-exposed children.

doi:10.1016/j.jcv.2011.01.012

PMCID: PMC3065028
PMID: 21330193

We propose Bayesian parametric and semiparametric partially linear regression methods to analyze the outcome-dependent follow-up data when the random time of a follow-up measurement of an individual depends on the history of both observed longitudinal outcomes and previous measurement times. We begin with the investigation of the simplifying assumptions of Lipsitz, Fitzmaurice, Ibrahim, Gelber, and Lipshultz, and present a new model for analyzing such data by allowing subject-specific correlations for the longitudinal response and by introducing a subject-specific latent variable to accommodate the association between the longitudinal measurements and the follow-up times. An extensive simulation study shows that our Bayesian partially linear regression method facilitates accurate estimation of the true regression line and the regression parameters. We illustrate our new methodology using data from a longitudinal observational study.

doi:10.1198/00

PMCID: PMC2288578
PMID: 18392118

Bayesian cubic smoothing spline; Latent variable; Partially linear model

Understanding temporal change in human behavior and psychological processes is a central issue in the behavioral sciences. With technological advances, intensive longitudinal data (ILD) are increasingly generated by studies of human behavior that repeatedly administer assessments over time. ILD offer unique opportunities to describe temporal behavioral changes in detail and identify related environmental and psychosocial antecedents and consequences. Traditional analytical approaches impose strong parametric assumptions about the nature of change in the relationship between time-varying covariates and outcomes of interest. This paper introduces time-varying effect models (TVEM) that explicitly model changes in the association between ILD covariates and ILD outcomes over time in a flexible manner. In this article, we describe unique research questions that the TVEM addresses, outline the model-estimation procedure, share a SAS macro for implementing the model, demonstrate model utility with a simulated example, and illustrate model applications in ILD collected as part of a smoking-cessation study to explore the relationship between smoking urges and self-efficacy during the course of the pre- and post-cessation period.

doi:10.1037/a0025814

PMCID: PMC3288551
PMID: 22103434

intensive longitudinal data; time-varying effect model; non-parametric; P-spline; applications

SUMMARY

A Bayesian multivariate hierarchical transformation model (BMHTM) is developed for receiver operating characteristic (ROC) curve analysis based on clustered continuous diagnostic outcome data with covariates. Two special features of this model are that it incorporates non-linear monotone transformations of the outcomes and that multiple correlated outcomes may be analysed. The mean, variance, and transformation components are all modelled parametrically, enabling a wide range of inferences. The general framework is illustrated by focusing on two problems: (1) analysis of the diagnostic accuracy of a covariate-dependent univariate test outcome requiring a Box–Cox transformation within each cluster to map the test outcomes to a common family of distributions; (2) development of an optimal composite diagnostic test using multivariate clustered outcome data. In the second problem, the composite test is estimated using discriminant function analysis and compared to the test derived from logistic regression analysis where the gold standard is a binary outcome. The proposed methodology is illustrated on prostate cancer biopsy data from a multi-centre clinical trial.

doi:10.1002/sim.2187

PMCID: PMC1540405
PMID: 16217836

Bayesian methods; hierarchical models; multivariate analysis; receiver operating characteristic (ROC) curve; Box–Cox transformation

Background

Estimates of the sensitivity and specificity for new diagnostic tests based on evaluation against a known gold standard are imprecise when the accuracy of the gold standard is imperfect. Bayesian latent class models (LCMs) can be helpful under these circumstances, but the necessary analysis requires expertise in computational programming. Here, we describe open-access web-based applications that allow non-experts to apply Bayesian LCMs to their own data sets via a user-friendly interface.

Methods/Principal Findings

Applications for Bayesian LCMs were constructed on a web server using R and WinBUGS programs. The models provided (http://mice.tropmedres.ac) include two Bayesian LCMs: the two-tests in two-population model (Hui and Walter model) and the three-tests in one-population model (Walter and Irwig model). Both models are available with simplified and advanced interfaces. In the former, all settings for Bayesian statistics are fixed as defaults. Users input their data set into a table provided on the webpage. Disease prevalence and accuracy of diagnostic tests are then estimated using the Bayesian LCM, and provided on the web page within a few minutes. With the advanced interfaces, experienced researchers can modify all settings in the models as needed. These settings include correlation among diagnostic test results and prior distributions for all unknown parameters. The web pages provide worked examples with both models using the original data sets presented by Hui and Walter in 1980, and by Walter and Irwig in 1988. We also illustrate the utility of the advanced interface using the Walter and Irwig model on a data set from a recent melioidosis study. The results obtained from the web-based applications were comparable to those published previously.
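The likelihood underlying the two-tests in two-population (Hui and Walter) model can be sketched in a few lines, assuming the tests are conditionally independent given disease status. This is only the likelihood piece, not the web application's R and WinBUGS implementation, which also places prior distributions on the parameters. Function names, parameter values, and counts are hypothetical.

```python
import math
from itertools import product


def cell_probabilities(prevalence, se, sp):
    """P(T1 = t1, T2 = t2) in one population under conditional independence.

    se, sp: sensitivities and specificities of the two tests, as pairs.
    Returns a dict keyed by (t1, t2), with 1 = positive and 0 = negative.
    """
    probs = {}
    for t1, t2 in product((1, 0), repeat=2):
        p_dis = 1.0  # probability of this result pattern among the diseased
        p_non = 1.0  # probability of this result pattern among the non-diseased
        for t, s, c in zip((t1, t2), se, sp):
            p_dis *= s if t == 1 else 1.0 - s
            p_non *= 1.0 - c if t == 1 else c
        probs[(t1, t2)] = prevalence * p_dis + (1.0 - prevalence) * p_non
    return probs


def log_likelihood(counts_by_population, prevalences, se, sp):
    """Multinomial log-likelihood (up to a constant) summed over populations."""
    total = 0.0
    for counts, prev in zip(counts_by_population, prevalences):
        probs = cell_probabilities(prev, se, sp)
        for cell, n in counts.items():
            total += n * math.log(probs[cell])
    return total


# Hypothetical parameter values and cross-classified counts.
se, sp = (0.9, 0.8), (0.95, 0.9)
probs = cell_probabilities(0.2, se, sp)
counts = [
    {(1, 1): 15, (1, 0): 8, (0, 1): 10, (0, 0): 67},
    {(1, 1): 40, (1, 0): 10, (0, 1): 12, (0, 0): 38},
]
loglik = log_likelihood(counts, prevalences=(0.2, 0.5), se=se, sp=sp)
print(probs, loglik)
```

Two populations with different prevalences supply six free cell probabilities for six unknowns (two prevalences, two sensitivities, two specificities), which is why the model needs both populations to be identifiable.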

Conclusions

The newly developed web-based applications are open-access and provide an important new resource for researchers worldwide to evaluate new diagnostic tests.

doi:10.1371/journal.pone.0079489

PMCID: PMC3827152
PMID: 24265775

Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.

doi:10.1093/biostatistics/kxr020

PMCID: PMC3276270
PMID: 21856650

Diagnostic test; Model misspecification; Propensity score; Sensitivity; Specificity

The proportional odds model may serve as a useful alternative to the Cox proportional hazards model to study association between covariates and their survival functions in medical studies. In this article, we study an extended proportional odds model that incorporates the so-called “external” time-varying covariates. In the extended model, regression parameters have a direct interpretation of comparing survival functions, without specifying the baseline survival odds function. Semiparametric and maximum likelihood estimation procedures are proposed to estimate the extended model. Our methods are demonstrated by Monte-Carlo simulations, and applied to a landmark randomized clinical trial of a short course Nevirapine (NVP) for mother-to-child transmission (MTCT) of human immunodeficiency virus type-1 (HIV-1). Additional application includes analysis of the well-known Veterans Administration (VA) Lung Cancer Trial.

doi:10.1080/01621459.2012.656021

PMCID: PMC3420072
PMID: 22904583

Counting process; Estimating function; HIV/AIDS; Maximum likelihood estimation; Semiparametric model; Time-varying covariate

We propose a new general Bayesian latent class model for evaluation of the performance of multiple diagnostic tests in situations in which no gold standard test exists based on a computationally intensive approach. The modeling represents an interesting and suitable alternative to models with complex structures that involve the general case of several conditionally independent diagnostic tests, covariates, and strata with different disease prevalences. The technique of stratifying the population according to different disease prevalence rates does not add further marked complexity to the modeling, but it makes the model more flexible and interpretable. To illustrate the general model proposed, we evaluate the performance of six diagnostic screening tests for Chagas disease considering some epidemiological variables. Serology at the time of donation (negative, positive, inconclusive) was considered as a factor of stratification in the model. The general model with stratification of the population performed better in comparison with its competitors without stratification. The group formed by the testing laboratory Biomanguinhos FIOCRUZ-kit (c-ELISA and rec-ELISA) is the best option in the confirmation process, presenting a false-negative rate of 0.0002% from the serial scheme. We are 100% sure that the donor is healthy when these two tests have negative results and he is chagasic when they have positive results.

doi:10.1155/2012/487502

PMCID: PMC3419444
PMID: 22919430

In this article, we propose generalized Bayesian dynamic factor models for jointly modeling mixed-measurement time series. The framework allows mixed-scale measurements associated with each time series, with different measurements having different distributions in the exponential family conditionally on time-varying latent factor(s). Efficient Bayesian computational algorithms are developed for posterior inference on both the latent factors and model parameters, based on a Metropolis Hastings algorithm with adaptive proposals. The algorithm relies on a Greedy Density Kernel Approximation (GDKA) and parameter expansion with latent factor normalization. We tested the framework and algorithms in simulated studies and applied them to the analysis of intertwined credit and recovery risk for Moody’s rated firms from 1982–2008, illustrating the importance of jointly modeling mixed-measurement time series. The article has supplemental materials available online.

doi:10.1080/10618600.2012.729986

PMCID: PMC4004613
PMID: 24791133

Adaptive Metropolis Hastings; Bayesian; Dynamic Factor Model; Exponential Family; Mixed-Measurement Time Series

In order to make a missing at random (MAR) or ignorability assumption realistic, auxiliary covariates are often required. However, the auxiliary covariates are not desired in the model for inference. Typical multiple imputation approaches do not assume that the imputation model marginalizes to the inference model. This has been termed ‘uncongenial’ (Meng, 1994). In order to make the two models congenial (or compatible), we would rather not assume a parametric model for the marginal distribution of the auxiliary covariates, but we typically do not have enough data to estimate the joint distribution well non-parametrically. In addition, when the imputation model uses a non-linear link function (e.g., the logistic link for a binary response), the marginalization over the auxiliary covariates to derive the inference model typically results in a difficult-to-interpret form for the effect of covariates. In this article, we propose a fully Bayesian approach that ensures the models are compatible for incomplete longitudinal data by embedding an interpretable inference model within an imputation model, and that also addresses the two complications described above. We evaluate the approach via simulations and implement it on a recent clinical trial.

doi:10.1111/biom.12121

PMCID: PMC4007313
PMID: 24571539

Congenial imputation; Multiple imputation; Marginalized models; Auxiliary variable MAR

Tustin, Aaron W. | Small, Dylan S. | Delgado, Stephen | Neyra, Ricardo Castillo | Verastegui, Manuela R. | Ancca Juárez, Jenny M. | Quispe Machaca, Víctor R. | Gilman, Robert H. | Bern, Caryn | Levy, Michael Z.
Statistical methods such as latent class analysis can estimate the sensitivity and specificity of diagnostic tests when no perfect reference test exists. Traditional latent class methods assume a constant disease prevalence in one or more tested populations. When the risk of disease varies in a known way, these models fail to take advantage of additional information that can be obtained by measuring risk factors at the level of the individual. We show that by incorporating complex field-based epidemiologic data, in which the disease prevalence varies as a continuous function of individual-level covariates, our model produces more accurate sensitivity and specificity estimates than previous methods. We apply this technique to a simulated population and to actual Chagas disease test data from a community near Arequipa, Peru. Results from our model estimate that the first-line enzyme-linked immunosorbent assay has a sensitivity of 78% (95% CI: 62–100%) and a specificity of 100% (95% CI: 99–100%). The confirmatory immunofluorescence assay is estimated to be 73% sensitive (95% CI: 65–81%) and 99% specific (95% CI: 96–100%).

doi:10.1515/2161-962X.1005

PMCID: PMC3785942
PMID: 24083130

Chagas disease; latent class analysis; Trypanosoma cruzi

Summary

Most current Bayesian SEIR models either use exponentially distributed latent and infectious periods, allow for a single distribution on the latent and infectious period, or make strong assumptions regarding the quantity of information available regarding time distributions, particularly the time spent in the exposed compartment. Many infectious diseases require a more realistic assumption on the latent and infectious periods. In this paper, we provide an alternative model allowing general distributions to be utilized for both the exposed and infectious compartments, while avoiding the need for full latent time data. The alternative formulation is a path-specific SEIR (PS SEIR) model that follows individual paths through the exposed and infectious compartments, thereby removing the need for an exponential assumption on the latent and infectious time distributions. We show how the PS SEIR model is a stochastic analog to a general class of deterministic SEIR models. We then demonstrate the improvement of this PS SEIR model over more common population averaged models via simulation results and perform a new analysis of the Iowa mumps epidemic from 2006.

doi:10.1111/j.1541-0420.2012.01809.x

PMCID: PMC3622117
PMID: 23323602

Bayesian; epidemic; exponential assumption; infectious; Iowa; mumps; latent; MCMC; SEIR; SIR

We consider the estimation of the parameters indexing a parametric model for the conditional distribution of a diagnostic marker given covariates and disease status. Such models are useful for the evaluation of whether and to what extent a marker’s ability to accurately detect or discard disease depends on patient characteristics. A frequent problem that complicates the estimation of the model parameters is that estimation must be conducted from observational studies. Often, in such studies not all patients undergo the gold standard assessment of disease. Furthermore, the decision as to whether a patient undergoes verification is not controlled by study design. In such scenarios, maximum likelihood estimators based on subjects with observed disease status are generally biased. In this paper, we propose estimators for the model parameters that adjust for selection to verification that may depend on measured patient characteristics and additionally adjust for an assumed degree of residual association. Such estimators may be used as part of a sensitivity analysis for plausible degrees of residual association. We describe a doubly robust estimator that has the attractive feature of being consistent if either a model for the probability of selection to verification or a model for the probability of disease among the verified subjects (but not necessarily both) is correct.

doi:10.1016/j.csda.2008.06.021

PMCID: PMC3475507
PMID: 23087495

Missing at Random; Nonignorable; Missing Covariate; Sensitivity Analysis; Semiparametric; Diagnosis

Researchers modeling historical heights have typically relied on the restrictive assumption of a normal distribution, only the mean of which is affected by age, income, nutrition, disease, and similar influences. To avoid these restrictive assumptions, we develop a new semiparametric approach in which covariates are allowed to affect the entire distribution without imposing any parametric shape. We apply our method to a new database of height distributions for Italian provinces, drawn from conscription records, of unprecedented length and geographical disaggregation. Our method allows us to standardize distributions to a single age and calculate moments of the distribution that are comparable through time. Our method also allows us to generate counterfactual distributions for a range of ages, from which we derive age-height profiles. These profiles reveal how the adolescent growth spurt (AGS) distorts the distribution of stature, and they document the earlier and earlier onset of the AGS as living conditions improved over the second half of the nineteenth century. Our new estimates of provincial mean height also reveal a previously unnoticed “regime switch” from regional convergence to divergence in this period.

PMCID: PMC2831262
PMID: 19348106

Summary

The timing of mother-to-child transmission (MTCT) of HIV is critical to understanding the dynamics of MTCT, and it has important implications for developing effective treatment or prevention strategies. In this paper, we develop an imputation method to analyze censored MTCT timing in the presence of auxiliary information. Specifically, we first propose a statistical model based on the hazard functions of the MTCT timing to reflect three MTCT modes: in utero, during delivery and via breastfeeding, with different shapes of the baseline hazard that vary between infants. The model also allows for the possibility that the majority of infants are immune to MTCT of HIV. The model is then fitted by MCMC to explore marginal inferences via multiple imputation. Moreover, we propose a simple and straightforward approach to account for imperfect assay sensitivity in the imputation step, and we study appropriate censoring techniques to account for weaning. Our method is assessed by simulations and applied to a large trial designed to assess the use of antibiotics in preventing MTCT of HIV.

doi:10.2202/1948-4690.1018

PMCID: PMC3419597
PMID: 22905281

HIV/AIDS; mixture models; mother to child transmission of HIV; multiple imputation

Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian non-parametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian non-parametrics. The survey covers the use of Bayesian non-parametrics for modelling unknown functions, density estimation, clustering, time-series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically, it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman’s coalescent, Dirichlet diffusion trees and Wishart processes.

doi:10.1098/rsta.2011.0553

PMCID: PMC3538441
PMID: 23277609

probabilistic modelling; Bayesian statistics; non-parametrics; machine learning

We present a semi-parametric deconvolution estimator for the density function of a random variable X that is measured with error, a common challenge in many epidemiological studies. Traditional deconvolution estimators rely only on assumptions about the distribution of X and the error in its measurement, and ignore information available in auxiliary variables. Our method assumes the availability of a covariate vector statistically related to X by a mean–variance function regression model, where regression errors are normally distributed and independent of the measurement errors. Simulations suggest that the estimator achieves a much lower integrated squared error than the observed-data kernel density estimator when models are correctly specified and the assumption of normal regression errors is met. We illustrate the method using anthropometric measurements of newborns to estimate the density function of newborn length.

doi:10.1002/sim.4186

PMCID: PMC3307103
PMID: 21284016

density estimation; measurement error; mean–variance function model
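
The "observed-data kernel density estimator" that the preceding abstract (PMID 21284016) uses as its comparator can be sketched in a few lines. This is only the naive benchmark, not the semi-parametric deconvolution estimator itself; the function name `kernel_density`, the fixed bandwidth, and the simulated lengths and error variance are all assumptions chosen for illustration.

```python
import numpy as np

def kernel_density(samples, grid, bandwidth):
    """Gaussian kernel density estimate evaluated at each point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

# Hypothetical data: true values X and error-contaminated measurements W = X + U.
rng = np.random.default_rng(0)
x_true = rng.normal(50.0, 2.0, size=5000)
w_obs = x_true + rng.normal(0.0, 2.0, size=5000)

grid = np.linspace(30.0, 70.0, 801)
f_x = kernel_density(x_true, grid, bandwidth=0.5)  # unavailable in practice
f_w = kernel_density(w_obs, grid, bandwidth=0.5)   # observed-data estimator

print(f"peak height using X: {f_x.max():.3f}, using W: {f_w.max():.3f}")
```

Because W carries the extra variance of the measurement error, the observed-data estimate is flatter and wider than the density of X, which is the distortion a deconvolution estimator is meant to undo.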