Testing for Hardy–Weinberg equilibrium is ubiquitous and has traditionally been carried out via frequentist approaches. However, the discreteness of the sample space means that uniformity of p-values under the null cannot be assumed, with enumeration of all possible counts, conditional on the minor allele count, offering a computationally expensive means of p-value calibration. In addition, the interpretation of the resulting p-values and the choice of significance threshold depend critically on sample size, because equilibrium will always be rejected at conventional levels with large sample sizes. We argue for a Bayesian approach using both Bayes factors and the examination of posterior distributions. We describe simple conjugate approaches, and methods based on importance sampling Monte Carlo. The former are convenient because they yield closed-form expressions for Bayes factors, which allow their application to a large number of single nucleotide polymorphisms (SNPs), in particular in genome-wide contexts. We also describe straightforward direct sampling methods for examining posterior distributions of parameters of interest. For large numbers of alleles at a locus we resort to Markov chain Monte Carlo. We discuss a number of possibilities for prior specification, and apply the suggested methods to a number of real datasets.
Bayes factors; Exact test; Genome-wide association studies; Importance sampling; Prior choice; Significance level
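To make the closed-form conjugate Bayes factor concrete, the following is a minimal sketch for a single biallelic SNP. It assumes a Beta prior on the allele frequency under equilibrium and a Dirichlet prior on the three genotype frequencies under the alternative; these priors, their default hyperparameters, and the function names are illustrative assumptions and need not match the specification used in the paper.

```python
from math import lgamma, log


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_multivariate_beta(alpha):
    """Log of the Dirichlet normalising constant B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def log_bf_hwe(n_aa, n_ab, n_bb,
               beta_prior=(1.0, 1.0),
               dirichlet_prior=(1.0, 1.0, 1.0)):
    """Log Bayes factor of H0 (HWE) against H1 (unrestricted genotypes).

    The multinomial coefficient appears in both marginal likelihoods
    and cancels, so it is omitted.
    """
    a, b = beta_prior
    # H0: genotype probabilities (p^2, 2p(1-p), (1-p)^2), p ~ Beta(a, b).
    log_m0 = (n_ab * log(2.0)
              + log_beta(a + 2 * n_aa + n_ab, b + 2 * n_bb + n_ab)
              - log_beta(a, b))
    # H1: genotype probabilities ~ Dirichlet(dirichlet_prior).
    posterior = [al + n for al, n in zip(dirichlet_prior, (n_aa, n_ab, n_bb))]
    log_m1 = (log_multivariate_beta(posterior)
              - log_multivariate_beta(dirichlet_prior))
    return log_m0 - log_m1


# Hypothetical genotype counts, chosen only for illustration.
in_equilibrium = log_bf_hwe(25, 50, 25)   # counts close to HWE proportions
no_heterozygotes = log_bf_hwe(50, 0, 50)  # strong heterozygote deficit
```

A positive log Bayes factor favours equilibrium and a negative one favours departure from it. Because the expression involves only log-gamma functions, it can be evaluated cheaply for each SNP in a genome-wide scan.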
Small area estimation (SAE) is an important endeavor in many fields and is used for resource allocation by both public health and government organizations. Often, complex surveys are carried out within areas, in which case it is common for the data to consist only of the response of interest and an associated sampling weight, reflecting the design. While it is appealing to use spatial smoothing models, and many approaches have been suggested for this endeavor, it is rare for spatial models to incorporate the weighting scheme, leaving the analysis potentially subject to bias. To examine the properties of various approaches to estimation we carry out a simulation study, looking at bias due to both non-response and non-random sampling. We also carry out SAE of smoking prevalence in Washington State, at the zip code level, using data from the 2006 Behavioral Risk Factor Surveillance System. The computation times for the methods we compare are short, and all approaches are implemented in R using currently available packages.
Complex surveys; Design-based inference; Intrinsic CAR models; Random effects models; Weighting
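As a point of reference for the weighting issue, the sketch below computes a design-weighted (Hájek-type) prevalence estimate per area from records consisting only of an area label, a binary response, and a sampling weight. It is not one of the spatial smoothing models compared in the paper; the function name, the area labels, and the data are hypothetical.

```python
from collections import defaultdict


def weighted_prevalence_by_area(records):
    """Hajek-type estimate per area: sum(w * y) / sum(w).

    records is an iterable of (area, y, weight) with y in {0, 1}.
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for area, y, weight in records:
        numerator[area] += weight * y
        denominator[area] += weight
    return {area: numerator[area] / denominator[area] for area in denominator}


# Hypothetical survey records: (area label, smoker indicator, weight).
records = [
    ("area_1", 1, 10.0), ("area_1", 0, 30.0), ("area_1", 0, 60.0),
    ("area_2", 1, 50.0), ("area_2", 1, 25.0), ("area_2", 0, 25.0),
]
weighted = weighted_prevalence_by_area(records)
# Setting every weight to 1 gives the unweighted sample proportion.
unweighted = weighted_prevalence_by_area((a, y, 1.0) for a, y, _ in records)
```

In this toy example the weighted and unweighted estimates differ noticeably in the first area, which is the kind of discrepancy that can arise when the weighting scheme is ignored.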
Hierarchical modeling has been used extensively for small area estimation. However, design weights that are required to reflect complex surveys are rarely considered in these models. We develop computationally efficient, Bayesian spatial smoothing models that acknowledge the design weights. Computation is carried out using the integrated nested Laplace approximation, which is fast. A simulation study is presented that considers the effects of non-response and non-random selection of individuals. We examine the impact of ignoring the design weights and the benefits of spatial smoothing. The results show that, when compared with standard approaches, mean squared error can be greatly reduced with the proposed models. Bias reduction occurs through the inclusion of the design weights, with variance reduction being achieved through hierarchical smoothing. We analyze data from the Washington State 2006 Behavioral Risk Factor Surveillance System. The models are easily and quickly fitted within the R environment, using existing packages.
Bayesian methods; integrated nested Laplace approximation; sample surveys; spatial statistics
Chromatin accessibility is an important functional genomics phenotype that influences transcription factor binding and gene expression. Genome-scale technologies allow chromatin accessibility to be mapped with high resolution, facilitating detailed analyses into the genetic architecture and evolution of chromatin structure within and between species. We performed Formaldehyde-Assisted Isolation of Regulatory Elements sequencing (FAIRE-Seq) to map chromatin accessibility in two parental haploid yeast species, Saccharomyces cerevisiae and Saccharomyces paradoxus, and their diploid hybrid. We show that although broad-scale characteristics of the chromatin landscape are well conserved between these species, accessibility is significantly different for 947 regions upstream of genes that are enriched for GO terms such as intracellular transport and protein localization. We also develop new statistical methods to investigate the genetic architecture of variation in chromatin accessibility between species, and find that cis effects are more common and of greater magnitude than trans effects. Interestingly, we find that cis and trans effects at individual genes are often negatively correlated, suggesting widespread compensatory evolution to stabilize levels of chromatin accessibility. Finally, we demonstrate that the relationship between chromatin accessibility and gene expression levels is complex, and a significant proportion of differences in chromatin accessibility might be functionally benign.
Inside the nucleus of a cell, DNA is associated with proteins to form a complex three-dimensional structure referred to as chromatin. The structure of chromatin influences how accessible specific DNA sequences are to transcription factors, and therefore chromatin accessibility is an important determinant of gene expression. To better understand how patterns of chromatin accessibility change over time, we quantitatively measured levels of chromatin accessibility in two yeast species and their diploid hybrid. We show that significant differences in chromatin accessibility exist between these two species and occur upstream of genes that are enriched for specific biological functions. We also develop new statistical methods to understand the genetics of variation in chromatin accessibility. Finally, we show that the relationship between chromatin accessibility and gene expression is complex, and many of the observed differences in chromatin accessibility between these two species may not influence gene expression levels. Thus, our work highlights the need to develop additional experimental and statistical methods to distinguish between functionally significant and benign changes in chromatin accessibility.
In this paper, we consider two-phase sampling in the situation in which all covariates are categorical. Two-phase designs are appealing from an efficiency perspective since, if carefully implemented, they allow sampling to be concentrated in informative cells. A number of likelihood-based methods have been developed for the analysis of two-phase data, but we describe a Bayesian approach which has previously been unavailable. The methods are first compared with existing approaches via a simulation study, and are then applied to data collected on Wilms tumour. The benefits of a Bayesian approach include relaxation of the reliance on asymptotic inference, particularly in sparse data situations, and the potential to model data with complex dependencies, for example, via the introduction of random effects. The sparse data situation is illustrated via a simulated example.
Contingency tables; Efficiency; Markov chain Monte Carlo; Outcome-dependent sampling
Water-pipe and smokeless tobacco use have been associated with several adverse health outcomes. However, little information is available on the association between water-pipe use and heart disease (HD). Therefore, we investigated the association of smoking water-pipe and chewing nass (a mixture of tobacco, lime, and ash) with prevalent HD.
Baseline data (collected in 2004–2008) from a prospective population-based study in Golestan Province, Iran.
50,045 residents of Golestan (40–75 years old; 42.4% male).
Main outcome measures
ORs and 95% CIs from multivariate logistic regression models for the association of water-pipe and nass use with HD prevalence.
A total of 3051 (6.1%) participants reported a history of HD, and 525 (1.1%) and 3726 (7.5%) reported ever water-pipe or nass use, respectively. Heavy water-pipe smoking was significantly associated with HD prevalence (highest level of cumulative use versus never use, OR= 3.75; 95% CI 1.52 – 9.22; P for trend= 0.04). This association persisted when using different cutoff points, when restricting HD to those taking nitrate compound medications, and among never cigarette smokers. There was no significant association between nass use and HD prevalence (highest category of use versus never use, OR= 0.91; 95% CI 0.69 – 1.20).
Our study suggests a significant association between HD and heavy water-pipe smoking. Although the existing evidence suggesting similar biological consequences of water-pipe and cigarette smoking makes this association plausible, results of our study were based on a modest number of water-pipe users and need to be replicated in further studies.
hookah; ischemic heart disease; nass; tobacco; water-pipe
Large outbreaks of hand, foot and mouth disease (HFMD) were observed in both 2008 and 2009 in China.
Using the national surveillance data since May 2, 2008, epidemiological characteristics of the outbreaks were summarized, and the transmissibility of the disease and the effects of potential risk factors were evaluated via a susceptible-infectious-recovered transmission model.
Children of 1.0–2.9 years were the group most susceptible to HFMD (odds ratios [OR] > 2.3 as compared to other age groups). Infant cases had the highest incidences of severe disease (ORs > 1.4) and death (ORs > 2.4), as well as the longest delay from symptom onset to diagnosis (2.3 days). Males were more susceptible to HFMD than females (OR=1.56 [95% confidence interval=1.56, 1.57]). A one-day delay in diagnosis was associated with increases in the odds of severe disease by 40.3% [38.7%, 41.9%] and in the odds of death by 53.7% [43.6%, 64.5%]. Compared to Coxsackie A16, enterovirus (EV) 71 was more strongly associated with severe disease (OR=15.6 [13.4, 18.1]) and death (OR=40.7 [13.0, 127.3]). The estimated local effective reproductive numbers among prefectures ranged from 1.4 to 1.6 (median=1.4) in spring and stayed below 1.2 in other seasons. A higher risk of transmission was associated with temperatures in the range of 70–80°F, higher relative humidity, wind speed, precipitation, population density, and the periods in which schools were open.
HFMD is a moderately transmittable infectious disease, mainly among pre-school children. EV71 was responsible for most severe cases and fatalities. Mixing of asymptomatically infected children in schools might have contributed to the spread of HFMD. Timely diagnosis may be a key to reducing the high mortality rate in infants.
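For readers unfamiliar with the susceptible-infectious-recovered framework, the following is a minimal deterministic sketch showing how an effective reproductive number relates to whether infections grow or decline. It is not the model fitted to the surveillance data; the transmission rate, recovery rate, population size, and function names are hypothetical values chosen for illustration.

```python
def simulate_sir(beta, gamma, s0, i0, r0=0.0, days=100, dt=0.1):
    """Euler integration of a closed-population SIR model.

    beta is the transmission rate and gamma the recovery rate, both
    per day. Returns a list of (S, I, R) tuples, one per day.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    steps_per_day = int(round(1.0 / dt))
    trajectory = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


def effective_reproductive_number(beta, gamma, s, n):
    """R_eff = (beta / gamma) * (S / N) for this simple model."""
    return beta / gamma * s / n


# Hypothetical rates: beta / gamma = 1.4 (growth) and 0.8 (decline).
growing = simulate_sir(beta=0.35, gamma=0.25, s0=9990, i0=10, days=30)
declining = simulate_sir(beta=0.20, gamma=0.25, s0=9990, i0=10, days=30)
```

In this sketch, infections increase over the first day when the effective reproductive number exceeds 1 and decrease when it is below 1.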
Ecological data are available at the level of the group, rather than at the level of the individual. The use of ecological data in spatial epidemiological investigations is particularly common. Though the computational methods described are more generally applicable, this paper concentrates on the situation in which the margins of 2 × 2 tables are observed in each of n geographical areas, with a Bayesian approach to inference. We consider auxiliary schemes that impute the missing data, and compare with a previously suggested normal approximation. The analysis of ecological data is subject to ecological bias, with the only reliable means of removing such bias being the addition of auxiliary individual-level information. Various schemes have been suggested for this supplementation, and we illustrate how the computational methods may be applied to the analysis of such enhanced data. The methods are illustrated using simulated data and two examples. In the first example the ecological data are supplemented with a simple random sample of individual-level data, and in this example the normal approximation fails. In the second example case-control sampling provides the additional information.
Auxiliary data; Case-control sampling; Ecological bias; Markov chain Monte Carlo
Genome-wide association studies (GWAS) require large sample sizes to obtain adequate statistical power, but it may be possible to increase the power by incorporating complementary data. In this study we investigated the feasibility of automatically retrieving information from the medical literature and leveraging this information in GWAS.
We developed a method that searches through PubMed abstracts for pre-assigned keywords and key concepts, and uses this information to assign prior probabilities of association for each single nucleotide polymorphism (SNP) with the phenotype of interest - the Adjusting Association Priors with Text (AdAPT) method. Association results from a GWAS can subsequently be ranked in the context of these priors using the Bayes False Discovery Probability (BFDP) framework. We initially tested AdAPT by comparing rankings of known susceptibility alleles in a previous lung cancer GWAS, and subsequently applied it in a two-phase GWAS of oral cancer.
Known lung cancer susceptibility SNPs were consistently ranked higher by AdAPT BFDPs than by p-values. In the oral cancer GWAS, we sought to replicate the top five SNPs as ranked by AdAPT BFDPs, of which rs991316, located in the ADH gene region of 4q23, displayed a statistically significant association with oral cancer risk in the replication phase (per-rare-allele log additive p-value [ptrend] = 2.5×10⁻³). The combined OR for having one additional rare allele was 0.83 (95% CI: 0.76–0.90), and this association was independent of previously identified susceptibility SNPs that are associated with overall upper aerodigestive tract (UADT) cancer in this gene region. We also investigated if rs991316 was associated with other cancers of the UADT, but no additional association signal was found.
This study highlights the potential utility of systematically incorporating prior knowledge from the medical literature in genome-wide analyses using the AdAPT methodology. AdAPT is available online (url: http://services.gate.ac.uk/lld/gwas/service/config).
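To illustrate how a prior probability of association changes a ranking statistic, the sketch below computes a BFDP from an asymptotic approximate Bayes factor with a normal prior on the log odds ratio, a form commonly used with the BFDP framework. The standard error is backed out of the reported combined OR and 95% CI (0.83, 0.76–0.90), which is only an approximation; the prior standard deviation and the two prior probabilities are hypothetical, and the function names are illustrative rather than part of AdAPT.

```python
from math import exp, log, sqrt


def approx_bayes_factor(beta_hat, se, prior_sd):
    """Approximate Bayes factor for H0 (no association) against H1.

    Assumes beta_hat ~ N(beta, se^2) and, under H1, beta ~ N(0, prior_sd^2).
    """
    v = se ** 2
    w = prior_sd ** 2
    z = beta_hat / se
    return sqrt((v + w) / v) * exp(-0.5 * z * z * w / (v + w))


def bfdp(abf, prior_null):
    """Bayesian false discovery probability given the prior P(H0)."""
    prior_odds = prior_null / (1.0 - prior_null)
    return abf * prior_odds / (1.0 + abf * prior_odds)


# Log OR and approximate standard error from the reported OR and 95% CI.
beta_hat = log(0.83)
se = (log(0.90) - log(0.76)) / (2 * 1.96)
# Hypothetical prior: 95% prior belief that the OR lies within 1/1.5 to 1.5.
prior_sd = log(1.5) / 1.96

abf = approx_bayes_factor(beta_hat, se, prior_sd)
# Hypothetical prior probabilities of association: default versus text-boosted.
bfdp_default = bfdp(abf, prior_null=1 - 1e-4)
bfdp_boosted = bfdp(abf, prior_null=1 - 1e-2)
```

For a fixed Bayes factor, a larger prior probability of association gives a smaller BFDP, which is how literature-derived priors can move a SNP up the ranking.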
Incidence of myelodysplastic syndromes (MDS) has been described in the United States since its inclusion in the Surveillance, Epidemiology, and End Results program in 2001, and the Seattle-Puget Sound region of Washington State has among the highest rates across the registries. In this investigation, we described small-scale incidence patterns of MDS within the Seattle-Puget Sound region from 2002 to 2006 and identified potential spatial clusters to inform planning of future studies of MDS etiology.
We used a spatial disease mapping model to estimate smoothed relative risks for each census tract and to describe the spatial component of variability in the incidence rates. We also used two methods to describe the location of potential MDS clusters: the approach of Besag and Newell and the Kulldorff spatial scan statistic.
Our findings from all three approaches indicated the most likely areas of increased MDS incidence were located on Whidbey Island in Island County.
Interpretation is limited because our data are based on the residential location of the MDS case only at the time of diagnosis. Nevertheless, inclusion of identified cluster regions in future population-based research and investigation of individual-level exposures could shed light on environmental risk factors for MDS.
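The core quantity behind a Poisson spatial scan statistic can be sketched as follows: each candidate zone is scored by a log likelihood ratio comparing observed and expected cases inside and outside the zone, and the highest-scoring zone is the most likely cluster. This sketch assumes expected counts are scaled to sum to the total number of cases and omits the Monte Carlo step needed to assess significance; the zone names and counts are hypothetical, not the study data.

```python
from math import log


def poisson_scan_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio for one candidate zone (elevated risk only).

    Assumes expected counts over the whole region sum to total_cases
    and that expected_in is strictly between 0 and total_cases.
    """
    if cases_in <= expected_in:
        return 0.0  # no excess inside the zone
    cases_out = total_cases - cases_in
    expected_out = total_cases - expected_in
    llr = cases_in * log(cases_in / expected_in)
    if cases_out > 0:
        llr += cases_out * log(cases_out / expected_out)
    return llr


# Hypothetical candidate zones: (observed cases, expected cases).
candidate_zones = {
    "zone_a": (12, 5.0),
    "zone_b": (7, 6.5),
    "zone_c": (3, 4.0),
}
total_cases = 100
scores = {
    name: poisson_scan_llr(observed, expected, total_cases)
    for name, (observed, expected) in candidate_zones.items()
}
most_likely_cluster = max(scores, key=scores.get)
```

A full analysis would evaluate many overlapping zones of varying size and compare the maximum score with its distribution under simulated null datasets.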
Generalized linear mixed models (GLMMs) continue to grow in popularity due to their ability to directly acknowledge multiple levels of dependency and model different data types. For small sample sizes especially, likelihood-based inference can be unreliable, with variance components being particularly difficult to estimate. A Bayesian approach is appealing but has been hampered by the lack of a fast implementation, and by the difficulty in specifying prior distributions, with variance components again being particularly problematic. Here, we briefly review previous approaches to computation in Bayesian implementations of GLMMs and illustrate in detail the use of integrated nested Laplace approximations in this context. We consider a number of examples, carefully specifying prior distributions on meaningful quantities in each case. The examples cover a wide range of data types including those requiring smoothing over time and a relatively complicated spline model for which we examine our prior specification in terms of the implied degrees of freedom. We conclude that Bayesian inference is now practically feasible for GLMMs and provides an attractive alternative to likelihood-based approaches such as penalized quasi-likelihood. As with likelihood-based approaches, great care is required in the analysis of clustered binary data since approximation strategies may be less accurate for such data.
Integrated nested Laplace approximations; Longitudinal data; Penalized quasi-likelihood; Prior specification; Spline models
Ecological inference is a problem of partial identification, and therefore reliable precise conclusions are rarely possible without the collection of individual level (identifying) data. Without such data, sensitivity analyses provide the only recourse. In this paper we review and critique approaches to ecological inference in the social sciences, and describe in detail hierarchical models, which allow both sensitivity analysis and the incorporation of individual level data into an ecological analysis. A crucial element of a sensitivity analysis in such models is prior specification, and we detail how this may be carried out. Furthermore, we demonstrate how the inclusion of a small amount of individual level data can dramatically improve the properties of such estimates.
With the advent of rapid and relatively cheap genotyping technologies there is now the opportunity to attempt to identify gene-environment and gene-gene interactions when the number of genes and environmental factors is potentially large. Unfortunately the dimensionality of the parameter space leads to a computational explosion in the number of possible interactions that may be investigated. The full model that includes all interactions and main effects can be unstable, with wide confidence intervals arising from the large number of estimated parameters. We describe a hierarchical mixture model that allows all interactions to be investigated simultaneously, but assumes the effects come from a mixture prior with two components, one that reflects small null effects and the second for epidemiologically significant effects. Effects from the former are effectively set to zero, hence increasing the power for the detection of real signals. The prior framework is very flexible, which allows substantive information to be incorporated into the analysis. We illustrate the methods first using simulation, and then on data from a case-control study of lung cancer in Central and Eastern Europe.
Hierarchical models; Informative prior distributions; Markov chain Monte Carlo; Mean-variance trade-off
Background Cancer registries in the 1970s showed that parts of Golestan Province in Iran had the highest rate of oesophageal squamous cell carcinoma (OSCC) in the world. More recent studies have shown that while rates are still high, they are approximately half of what they were before, which might be attributable to improved socio-economic status (SES) and living conditions in this area. We examined a wide range of SES indicators to investigate the association between different SES components and risk of OSCC in the region.
Methods Data were obtained from a population-based case–control study conducted between 2003 and 2007 with 300 histologically proven OSCC cases and 571 matched neighbourhood controls. We used conditional logistic regression to compare cases and controls for individual SES indicators, for a composite wealth score constructed using multiple correspondence analysis, and for factors obtained from factor analysis.
Results We found that various dimensions of SES, such as education, wealth and being married were all inversely related to OSCC. The strongest inverse association was found with education. Compared with no education, the adjusted odds ratios (95% confidence intervals) for primary education and high school or beyond were 0.52 (0.27–0.98) and 0.20 (0.06–0.65), respectively.
Conclusions The strong association of SES with OSCC after adjustment for known risk factors implies the presence of yet unidentified risk factors that are correlated with our SES measures; identification of these factors could be the target of future studies. Our results also emphasize the importance of using multiple SES measures in epidemiological studies.
Oesophageal cancer; socio-economic status; case–control; epidemiology; Iran; factor analysis; correspondence analysis
In this paper, we illustrate that combining ecological data with subsample data in situations in which a linear model is appropriate provides three main benefits. First, by including the individual level subsample data, the biases associated with linear ecological inference can be eliminated. Second, by supplementing the subsample data with ecological data, the information about parameters will be increased. Third, we can use readily available ecological data to design optimal subsampling schemes, so as to further increase the information about parameters. We present an application of this methodology to the classic problem of estimating the effect of a college degree on wages. We show that combining ecological data with subsample data provides precise estimates of this value, and that optimal subsampling schemes (conditional on the ecological data) can provide good precision with only a fraction of the observations.
Ecological bias; Combining information; Within-area confounding; Returns to education; Sample design
To investigate patterns of food and nutrient consumption in Golestan province, a high-incidence area for esophageal cancer (EC) in northern Iran.
Twelve 24-hour dietary recalls were administered during a one year period to 131 healthy participants in a pilot cohort study. We compare here nutrient intake in Golestan with Recommended Daily Allowances (RDAs) and Lowest Threshold Intakes (LTIs). We also compare the intake of 27 food groups and nutrients among several population subgroups, using mean values from the twelve recalls.
Rural women had a very low level of vitamin intake, which was even lower than LTIs (P < 0.01). Daily intake of vitamins A and C was lower than LTI in 67% and 73% of rural women, respectively. Among rural men, the vitamin intakes were not significantly different from LTIs. Among urban women, the vitamin intakes were significantly lower than RDAs, but were significantly higher than LTIs. Among urban men, the intakes were not significantly different from RDAs. Compared to urban dwellers, intake of most food groups and nutrients, including vitamins, was significantly lower among rural dwellers. In terms of vitamin intake, no significant difference was observed between Turkmen and non-Turkmen ethnic groups.
The severe deficiency in vitamin intake among women and rural dwellers and marked differences in nutrient intake between rural and urban dwellers may contribute to the observed epidemiological pattern of EC in Golestan, with high incidence rates among women and people with low socioeconomic status, and the highest incidence rate among rural women.
esophageal cancer; Iran; Caspian Littoral; Golestan; Turkmen; diet record
An automated method for counting spot-forming units in the ELISpot assay is described that uses a statistical model fit to training data that is based on counts from one or more experts. The method adapts to variable background intensities and provides considerable flexibility with respect to what image features can be used to model expert counts. Point estimates of spot counts are produced together with intervals that reflect the degree of uncertainty in the count. Finally, the approach is completely transparent and “open source” in contrast to methods embedded in current commercial software. An illustrative application to data from a study of the reactivity of T-cells from healthy human subjects to a pool of immunodominant peptides from CMV, EBV and flu is presented.
Automated Spot Counting; ELISpot Assay; Image Analysis; Generalized Linear Models
To investigate the risk of adverse birth outcomes associated with residence near landfill sites in Great Britain.
Geographical study of risks of adverse birth outcomes in populations living within 2 km of 9565 landfill sites operational at some time between 1982 and 1997 (from a total of 19 196 sites) compared with those living further away.
Over 8.2 million live births, 43 471 stillbirths, and 124 597 congenital anomalies (including terminations).
Main outcome measures
All congenital anomalies combined, some specific anomalies, and prevalence of low and very low birth weight (<2500 g and <1500 g).
For all anomalies combined, relative risk of residence near landfill sites (all waste types) was 0.92 (99% confidence interval 0.907 to 0.923) unadjusted, and 1.01 (1.005 to 1.023) adjusted for confounders. Adjusted risks were 1.05 (1.01 to 1.10) for neural tube defects, 0.96 (0.93 to 0.99) for cardiovascular defects, 1.07 (1.04 to 1.10) for hypospadias and epispadias (with no excess of surgical correction), 1.08 (1.01 to 1.15) for abdominal wall defects, 1.19 (1.05 to 1.34) for surgical correction of gastroschisis and exomphalos, and 1.05 (1.047 to 1.055) and 1.04 (1.03 to 1.05) for low and very low birth weight respectively. There was no excess risk of stillbirth. Findings for special (hazardous) waste sites did not differ systematically from those for non-special sites. For some specific anomalies, higher risks were found in the period before opening compared with after opening of a landfill site, especially hospital admissions for abdominal wall defects.
We found small excess risks of congenital anomalies and low and very low birth weight in populations living near landfill sites. No causal mechanisms are available to explain these findings, and alternative explanations include data artefacts and residual confounding. Further studies are needed to help differentiate between the various possibilities.
What is already known on this topic
Various studies have found excess risks of certain congenital anomalies and low birth weight near landfill sites
Risks up to two to three times higher have been reported
These studies have been difficult to interpret because of problems of exposure classification, small sample size, confounding, and reporting bias
What this study adds
Some 80% of the British population lives within 2 km of known landfill sites in Great Britain
By including all landfill sites in the country, we avoided the problem of selective reporting, and maximised statistical power
Although we found excess risks of congenital anomalies and low birth weight near landfill sites in Great Britain, they were smaller than in some other studies
Further work is needed to differentiate potential data artefacts and confounding effects from possible causal associations with landfill