Motivation: The data that put the ‘evidence’ into ‘evidence-based medicine’ are central to developments in public health, primary and hospital care. A fundamental challenge is to site such data in repositories that can easily be accessed under appropriate technical and governance controls which are effectively audited and are viewed as trustworthy by diverse stakeholders. This demands socio-technical solutions that may easily become enmeshed in protracted debate and controversy as they encounter the norms, values, expectations and concerns of diverse stakeholders. In this context, the development of what are called ‘Data Safe Havens’ has been crucial. Unfortunately, the origins and evolution of the term have led to a range of different definitions being assumed by different groups. There is, however, an intuitively meaningful interpretation that is often assumed by those who have not previously encountered the term: a repository in which useful but potentially sensitive data may be kept securely under governance and informatics systems that are fit-for-purpose and appropriately tailored to the nature of the data being maintained, and may be accessed and utilized by legitimate users undertaking work and research contributing to biomedicine, health and/or to ongoing development of healthcare systems.
Results: This review explores a fundamental question: ‘What are the specific criteria that ought reasonably to be met by a data repository if it is to be seen as consistent with this interpretation and viewed as worthy of being accorded the status of “Data Safe Haven” by key stakeholders?’ We propose 12 such criteria.
Motivation: Very large studies are required to provide sufficiently big sample sizes for adequately powered association analyses. This can be an expensive undertaking and it is important that an accurate sample size is identified. For more realistic sample size calculation and power analysis, the impact of unmeasured aetiological determinants and the quality of measurement of both outcome and explanatory variables should be taken into account. Conventional methods to analyse power use closed-form solutions that are not flexible enough to cater for all of these elements easily. They often result in a potentially substantial overestimation of the actual power.
Results: In this article, we describe the Estimating Sample-size and Power in R by Exploring Simulated Study Outcomes (ESPRESSO) tool, which allows assessment errors to be incorporated into power calculations under various biomedical scenarios. We also report a real-world analysis in which we used this tool to answer an important strategic question for an existing cohort.
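The simulation-based principle behind such a tool can be sketched in a few lines. This is a minimal illustration only, not the ESPRESSO code (which is written in R); the outcome model and classical error structure here are assumptions made for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulated_power(n, beta, error_sd, n_sims=1000, alpha=0.05):
    """Empirical power for an exposure-outcome association when the
    exposure is measured with classical (non-differential) error."""
    hits = 0
    for _ in range(n_sims):
        x_true = rng.normal(0.0, 1.0, n)                  # true exposure
        x_obs = x_true + rng.normal(0.0, error_sd, n)     # error-prone measure
        y = beta * x_true + rng.normal(0.0, 1.0, n)       # outcome model
        _, p = stats.pearsonr(x_obs, y)                   # analysis uses x_obs
        hits += p < alpha
    return hits / n_sims

# Measurement error attenuates the observed association and hence power,
# which closed-form calculations that ignore it will overestimate:
p_no_error = simulated_power(500, 0.15, error_sd=0.0)
p_with_error = simulated_power(500, 0.15, error_sd=1.0)
print(p_no_error, p_with_error)
```

Because power is estimated by counting significant results over simulated replicates, any feature of the design (unmeasured determinants, error in outcome or explanatory variables) can be built into the data-generating step rather than forced into a closed-form formula.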
Availability and implementation: The software is available for online calculation and downloads at http://espresso-research.org. The code is freely available at https://github.com/ESPRESSO-research.
Supplementary data are available at Bioinformatics online.
Asthma and chronic obstructive pulmonary disease (COPD) are heterogeneous diseases.
We sought to determine, in terms of their sputum cellular and mediator profiles, the extent to which they represent distinct or overlapping conditions supporting either the “British” or “Dutch” hypotheses of airway disease pathogenesis.
We compared the clinical and physiological characteristics and sputum mediators between 86 subjects with severe asthma and 75 with moderate-to-severe COPD. Biological subgroups were determined using factor and cluster analyses on 18 sputum cytokines. The subgroups were validated on independent severe asthma (n = 166) and COPD (n = 58) cohorts. Two techniques were used to assign the validation subjects to subgroups: linear discriminant analysis, or the best identified discriminator (single cytokine) in combination with subject disease status (asthma or COPD).
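The two-step derive-then-validate design described above can be sketched on synthetic data standing in for the 18 sputum cytokines. scikit-learn's KMeans and LinearDiscriminantAnalysis are used here as generic stand-ins for the factor/cluster and discriminant analyses of the study; the cluster structure is simulated, not real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)

# Synthetic stand-in for (log-transformed) 18-cytokine sputum profiles:
# three well-separated latent biological subgroups.
centers = rng.normal(0.0, 2.0, size=(3, 18))
train = np.vstack([c + rng.normal(0.0, 1.0, size=(60, 18)) for c in centers])

# Step 1: derive subgroups in the test cohort by cluster analysis.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(train)

# Step 2: train a linear discriminant rule on those subgroups and use it to
# assign independent validation subjects to the same subgroups.
lda = LinearDiscriminantAnalysis().fit(train, labels)
validation = centers[0] + rng.normal(0.0, 1.0, size=(20, 18))
pred = lda.predict(validation)
print(pred)   # validation subjects fall almost entirely into one subgroup
```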
Discriminant analysis distinguished severe asthma from COPD completely using a combination of clinical and biological variables. Factor and cluster analyses of the sputum cytokine profiles revealed 3 biological clusters: cluster 1: asthma predominant, eosinophilic, high TH2 cytokines; cluster 2: asthma and COPD overlap, neutrophilic; cluster 3: COPD predominant, mixed eosinophilic and neutrophilic. Validation subjects were classified into 3 subgroups using discriminant analysis, or disease status with a binary assessment of sputum IL-1β expression. Sputum cellular and cytokine profiles of the validation subgroups were similar to the subgroups from the test study.
Sputum cytokine profiling can determine distinct and overlapping groups of subjects with asthma and COPD, supporting both the British and Dutch hypotheses. These findings may contribute to improved patient classification to enable stratified medicine.
Asthma and COPD overlap; cytokines; factor and cluster analyses; COPD, Chronic obstructive pulmonary disease; ROC, Receiver operating characteristic; ROC AUC, Area under the receiver operating characteristic curve
Background: Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate and controversy relating to the UK’s proposed ‘care.data’ initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in providing researchers and other healthcare professionals with access to individual-level data.
Methods: Commands are sent from a central analysis computer (AC) to several data computers (DCs) storing the data to be co-analysed. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands transmitted back and forth between the DCs and the AC. This paper describes the technical implementation of DataSHIELD using a modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is controlled through a standard R environment at the AC.
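The round trip can be illustrated with the simplest possible non-disclosive analysis: each data computer returns only aggregate statistics, from which the analysis computer reconstructs a pooled quantity. This Python sketch is an illustration of the principle only; DataSHIELD itself is implemented in R against Opal, and the function names here are hypothetical:

```python
import numpy as np

# Hypothetical per-site summary; in DataSHIELD the equivalent code runs in R
# behind each data computer's firewall, and only these aggregates are returned.
def site_summary(values):
    v = np.asarray(values, dtype=float)
    return {"n": v.size, "sum": v.sum(), "sum_sq": np.square(v).sum()}

def pooled_mean_var(summaries):
    """Analysis computer: combine non-disclosive aggregates into the pooled
    mean and variance, without ever seeing individual-level values."""
    n = sum(s["n"] for s in summaries)
    mean = sum(s["sum"] for s in summaries) / n
    ss = sum(s["sum_sq"] for s in summaries)
    var = (ss - n * mean**2) / (n - 1)
    return mean, var

rng = np.random.default_rng(1)
sites = [rng.normal(5.0, 2.0, size=k) for k in (120, 340, 80)]  # three DCs
mean, var = pooled_mean_var([site_summary(s) for s in sites])

# The result matches an analysis of the (never assembled) combined data:
combined = np.concatenate(sites)
print(mean - combined.mean(), var - combined.var(ddof=1))
```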
Results: Based on this Opal/R implementation, DataSHIELD is currently used by the Healthy Obese Project and the Environmental Core Project (BioSHaRE-EU) for the federated analysis of 10 data sets across eight European countries, and this illustrates the opportunities and challenges presented by the DataSHIELD approach.
Conclusions: DataSHIELD facilitates important research in settings where: (i) a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prohibit the release or sharing of some of the required data, and/or render data access unacceptably slow; (ii) a research group (e.g. in a developing nation) is particularly vulnerable to loss of intellectual property—the researchers want to fully share the information held in their data with national and international collaborators, but do not wish to hand over the physical data themselves; and (iii) a data set is to be included in an individual-level co-analysis but the physical size of the data precludes direct transfer to a new site for analysis.
DataSHIELD; pooled analysis; ELSI; privacy; confidentiality; disclosure; distributed computing; intellectual property; bioinformatics
Background: Errors introduced through poor assessment of physical measurements, or through inconsistent or inappropriate standard operating procedures for collecting, processing, storing or analysing haematological and biochemistry analytes, have a negative impact on the power of association studies using the collected data. A dataset from UK Biobank was used to evaluate the impact of pre-analytical variability on the power of association studies.
Methods: First, we estimated the proportion of the variance in analyte concentration that may be attributed to delay in processing using variance component analysis. Then, we captured the proportion of heterogeneity between subjects that is due to variability in the rate of degradation of analytes, by fitting a mixed model. Finally, we evaluated the impact of delay in processing on the power of a nested case-control study using a power calculator that we developed and which takes into account uncertainty in outcome and explanatory variables measurements.
Results: The results showed that (i) the majority of the analytes investigated in our analysis were stable over a period of 36 h, and (ii) some analytes were unstable, and the resulting pre-analytical variation substantially decreased the power of the study under the settings we investigated.
Conclusions: It is important to specify a limited delay in processing for analytes that are very sensitive to delayed assay. If the rate of degradation of an analyte varies between individuals, any delay introduces a bias which increases with increasing delay. If pre-analytical variation occurring due to delays in sample processing is ignored, it adversely affects the power of the studies that use the data.
Biobank; Pre-analytical variation; Biosamples; Statistical power
Inter-individual variation in mean leukocyte telomere length (LTL) is associated with cancer and several age-associated diseases. Here, in a genome-wide meta-analysis of 37,684 individuals with replication of selected variants in a further 10,739 individuals, we identified seven loci, including five novel loci, associated with mean LTL (P < 5 × 10−8). Five of the loci contain genes (TERC, TERT, NAF1, OBFC1, RTEL1) that are known to be involved in telomere biology. Lead SNPs at two loci (TERC and TERT) associate with several cancers and other diseases, including idiopathic pulmonary fibrosis. Moreover, a genetic risk score analysis combining lead variants at all seven loci in 22,233 coronary artery disease (CAD) cases and 64,762 controls showed an association of the alleles associated with shorter LTL with increased risk of CAD (21% (95% CI: 5–35%) per standard deviation in LTL, P = 0.014). Our findings support a causal role of telomere length variation in some age-related diseases.
To further investigate susceptibility loci identified by genome-wide association studies, we genotyped 5,500 SNPs across 14 associated regions in 8,000 samples from a control group and 3 diseases: type 2 diabetes (T2D), coronary artery disease (CAD) and Graves’ disease. We defined, using Bayes theorem, credible sets of SNPs that were 95% likely, based on posterior probability, to contain the causal disease-associated SNPs. In 3 of the 14 regions, TCF7L2 (T2D), CTLA4 (Graves’ disease) and CDKN2A-CDKN2B (T2D), much of the posterior probability rested on a single SNP, and, in 4 other regions (CDKN2A-CDKN2B (CAD) and CDKAL1, FTO and HHEX (T2D)), the 95% sets were small, thereby excluding most SNPs as potentially causal. Very few SNPs in our credible sets had annotated functions, illustrating the limitations in understanding the mechanisms underlying susceptibility to common diseases. Our results also show the value of more detailed mapping to target sequences for functional studies.
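The credible-set construction is simple once per-SNP Bayes factors are in hand. The sketch below shows the generic calculation under the usual assumptions (a single causal variant in the region and equal priors across SNPs); the Bayes factors are hypothetical toy values, and computing them from summary statistics is a separate step not shown here:

```python
import numpy as np

def credible_set(bayes_factors, coverage=0.95):
    """Smallest set of SNPs whose posterior probabilities sum to >= coverage,
    assuming a single causal variant in the region and equal priors."""
    bf = np.asarray(bayes_factors, dtype=float)
    post = bf / bf.sum()                  # posterior probability per SNP
    order = np.argsort(post)[::-1]        # most probable SNPs first
    k = int(np.searchsorted(np.cumsum(post[order]), coverage)) + 1
    return order[:k], post

# Toy region of six SNPs with hypothetical Bayes factors: one variant
# dominates, so the 95% credible set contains a single SNP.
idx, post = credible_set([200.0, 5.0, 2.0, 1.0, 1.0, 1.0])
print(idx)   # -> [0]
```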
We conducted genome-wide association analyses of mean leukocyte telomere length in 2,917 subjects and follow-up replication analyses in 9,492 subjects, and identified a locus on 3q26 encompassing the telomerase RNA component TERC, with compelling evidence for association (rs12696304, combined P value 3.72×10−14). Each copy of the minor allele of rs12696304 was associated with ≈75 base pairs shorter mean telomere length, equivalent to ≈3.6 years of age-related attrition of mean telomere length.
Background In a recent paper by Homer et al. (Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008;4:e1000167), a method for detecting whether a given individual is a contributor to a particular genomic mixture was proposed. This prompted grave concern about the public dissemination of aggregate statistics from genome-wide association studies. It is of clear scientific importance that such data be shared widely, but the confidentiality of study participants must not be compromised. The issue of what summary genomic data can safely be posted on the web is only addressed satisfactorily when the theoretical underpinnings of the proposed method are clarified and its performance evaluated in terms of dependence on underlying assumptions.
Methods The original method raised a number of concerns and several alternatives have since been proposed, including a simple linear regression approach. In our proposed generalized estimating equation approach, we maintain the simplicity of the linear regression model but obtain inferences that are more robust to approximation of the variance/covariance structure and can accommodate linkage disequilibrium.
Results We affirm that, in principle, it is possible to determine that a ‘candidate’ individual has participated in a study, given a subset of aggregate statistics from that study. However, the methods depend critically on a number of key factors including: the ancestry of participants in the study; the absolute and relative numbers of cases and controls; and the number of single nucleotide polymorphisms.
Conclusions Simple guidelines for publication that are based on a single criterion are therefore unlikely to suffice. In particular, ‘directed’ summary statistics should not be posted openly on the web but could be protected by an internet-based access check as proposed by the P3G_Consortium et al. (Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet 2009;5:e1000665).
Identification; linear regression; generalized estimating equations; linkage disequilibrium; case–control genetic association studies
Background Proper understanding of the roles of, and interactions between, genetic, lifestyle, environmental and psycho-social factors in determining the risk of development and/or progression of chronic diseases requires access to very large high-quality databases. Because of the financial, technical and time burdens related to developing and maintaining very large studies, the scientific community is increasingly synthesizing data from multiple studies to construct large databases. However, the data items collected by individual studies must be inferentially equivalent to be meaningfully synthesized. The DataSchema and Harmonization Platform for Epidemiological Research (DataSHaPER; http://www.datashaper.org) was developed to enable the rigorous assessment of the inferential equivalence, i.e. the potential for harmonization, of selected information from individual studies.
Methods This article examines the value of using the DataSHaPER for retrospective harmonization of established studies. Using the DataSHaPER approach, the potential to generate 148 harmonized variables from the questionnaires and physical measures collected in 53 large population-based studies (6.9 million participants) was assessed. Variable and study characteristics that might influence the potential for data synthesis were also explored.
Results Out of all assessment items evaluated (148 variables for each of the 53 studies), 38% could be harmonized. Certain characteristics of variables (i.e. relative importance, individual targeted, reference period) and of studies (i.e. observational units, data collection start date and mode of questionnaire administration) were associated with the potential for harmonization. For example, for variables deemed to be essential, 62% of the paired assessment items could be harmonized.
Conclusion The current article shows that the DataSHaPER provides an effective and flexible approach for the retrospective harmonization of information across studies. To implement data synthesis, some additional scientific, ethico-legal and technical considerations must be addressed. The success of the DataSHaPER as a harmonization approach will depend on its continuing development and on the rigour and extent of its use. The DataSHaPER has the potential to take us closer to a truly collaborative epidemiology and offers the promise of enhanced research potential generated through synthesized databases.
Data synthesis; data quality; data pooling; harmonization; meta-analysis; DataSHaPER; retrospective harmonization
Rationale: Genomic loci are associated with FEV1 or the ratio of FEV1 to FVC in population samples, but their association with chronic obstructive pulmonary disease (COPD) has not yet been proven, nor have their combined effects on lung function and COPD been studied.
Objectives: To test association with COPD of variants at five loci (TNS1, GSTCD, HTR4, AGER, and THSD4) and to evaluate joint effects on lung function and COPD of these single-nucleotide polymorphisms (SNPs), and variants at the previously reported locus near HHIP.
Methods: By sampling from 12 population-based studies (n = 31,422), we obtained genotype data on 3,284 COPD case subjects and 17,538 control subjects for sentinel SNPs in TNS1, GSTCD, HTR4, AGER, and THSD4. In 24,648 individuals (including 2,890 COPD case subjects and 13,862 control subjects), we additionally obtained genotypes for rs12504628 near HHIP. Each allele associated with lung function decline at these six SNPs contributed to a risk score. We studied the association of the risk score to lung function and COPD.
Measurements and Main Results: Association with COPD was significant for three loci (TNS1, GSTCD, and HTR4) and the previously reported HHIP locus, and suggestive and directionally consistent for AGER and THSD4. Compared with the baseline group (7 risk alleles), carrying 10–12 risk alleles was associated with a reduction in FEV1 (β = –72.21 ml, P = 3.90 × 10−4) and FEV1/FVC (β = –1.53%, P = 6.35 × 10−6), and with COPD (odds ratio = 1.63, P = 1.46 × 10−5).
Conclusions: Variants in TNS1, GSTCD, and HTR4 are associated with COPD. Our highest risk score category was associated with a 1.6-fold higher COPD risk than the population average score.
FEV1; FVC; genome-wide association study; modeling risk
Genetic determinants of blood pressure are poorly defined. We undertook a large-scale gene-centric analysis to identify loci and pathways associated with ambulatory systolic and diastolic blood pressure.
We measured 24-hour ambulatory blood pressure in 2,020 individuals from 520 white European nuclear families (the GRAPHIC Study) and genotyped their DNA using the Illumina HumanCVD BeadChip array, which contains approximately 50,000 single nucleotide polymorphisms in >2,000 cardiovascular candidate loci. We found a strong association between the rs13306560 polymorphism in the promoter region of MTHFR and CLCN6 and mean 24-hour diastolic blood pressure: each minor allele copy of rs13306560 was associated with a 2.6 mmHg lower mean 24-hour diastolic blood pressure (P=1.2×10−8). rs13306560 was also associated with clinic diastolic blood pressure in a combined analysis of 8,129 subjects from the GRAPHIC Study, the CoLaus Study and the Silesian Cardiovascular Study (P=5.4×10−6). Additional analysis of associations between variants in Gene Ontology-defined pathways and mean 24-hour blood pressure in the GRAPHIC Study showed that cell survival control signalling cascades could play a role in blood pressure regulation. There was also a significant over-representation of rare variants (minor allele frequency <0.05) amongst polymorphisms showing at least nominal association with mean 24-hour blood pressure, indicating that a considerable proportion of its heritability may be explained by uncommon alleles.
Through a large-scale gene-centric analysis of ambulatory blood pressure, we identified an association of a novel variant at the MTHFR/CLCN6 locus with diastolic blood pressure and provided new insights into the genetic architecture of blood pressure.
gene; genetics; blood pressure; single nucleotide polymorphism; association; heritability
The promise of science lies in expectations of its benefits to societies and is matched by expectations of the realisation of the significant public investment in that science. In this paper, we undertake a methodological analysis of the science of biobanking and a sociological analysis of translational research in relation to biobanking. Part of global and local endeavours to translate raw biomedical evidence into practice, biobanks aim to provide a platform for generating new scientific knowledge to inform development of new policies, systems and interventions to enhance the public’s health. Effectively translating scientific knowledge into routine practice, however, involves more than good science. Although biobanks undoubtedly provide a fundamental resource for both clinical and public health practice, their potentiating ontology—that their outputs are perpetually a promise of scientific knowledge generation—renders translation rather less straightforward than drug discovery and treatment implementation. Biobanking science, therefore, provides a perfect counterpoint against which to test the bounds of translational research. We argue that translational research is a contextual and cumulative process: one that is necessarily dynamic and interactive and involves multiple actors. We propose a new multidimensional model of translational research which enables us to imagine a new paradigm: one that takes us from bench to bedside to backyard and beyond, that is, attentive to the social and political context of translational science, and is cognisant of all the players in that process be they researchers, health professionals, policy makers, industry representatives, members of the public or research participants, amongst others.
Lung function measures are heritable traits that predict population morbidity and mortality and are essential for the diagnosis of chronic obstructive pulmonary disease (COPD). Variations in many genes have been reported to affect these traits, but attempts at replication have provided conflicting results. Recently, we undertook a meta-analysis of Genome Wide Association Study (GWAS) results for lung function measures in 20,288 individuals from the general population (the SpiroMeta consortium).
To comprehensively analyse previously reported genetic associations with lung function measures, and to investigate whether single nucleotide polymorphisms (SNPs) in these genomic regions are associated with lung function in a large population sample.
We analysed association for SNPs tagging 130 genes and 48 intergenic regions (+/−10 kb), after conducting a systematic review of the literature in the PubMed database for genetic association studies reporting lung function associations.
The analysis included 16,936 genotyped and imputed SNPs. No loci showed overall significant association for FEV1 or FEV1/FVC traits using a carefully defined significance threshold of 1.3×10−5. The most significant loci associated with FEV1 include SNPs tagging MACROD2 (P = 6.81×10−5), CNTN5 (P = 4.37×10−4), and TRPV4 (P = 1.58×10−3). Among ever-smokers, SERPINA1 showed the most significant association with FEV1 (P = 8.41×10−5), followed by PDE4D (P = 1.22×10−4). The strongest association with FEV1/FVC ratio was observed with ABCC1 (P = 4.38×10−4), and ESR1 (P = 5.42×10−4) among ever-smokers.
Polymorphisms spanning previously associated lung function genes did not show strong evidence for association with lung function measures in the SpiroMeta consortium population. Common SERPINA1 polymorphisms may affect FEV1 among smokers in the general population.
The prevalence of hypertension in African Americans (AAs) is higher than in other US groups; yet, few genome-wide association studies (GWASs) have been performed in AAs. Among people of European descent, GWASs have identified genetic variants at 13 loci that are associated with blood pressure. It is unknown if these variants confer susceptibility in people of African ancestry. Here, we examined genome-wide and candidate gene associations with systolic blood pressure (SBP) and diastolic blood pressure (DBP) using the Candidate Gene Association Resource (CARe) consortium consisting of 8,591 AAs. Genotypes included genome-wide single-nucleotide polymorphism (SNP) data utilizing the Affymetrix 6.0 array with imputation to 2.5 million HapMap SNPs and candidate gene SNP data utilizing a 50K cardiovascular gene-centric array (ITMAT-Broad-CARe [IBC] array). For Affymetrix data, the strongest signal for DBP was rs10474346 (P = 3.6 × 10−8), located near GPR98 and ARRDC3. For SBP, the strongest signal was rs2258119 in C21orf91 (P = 4.7 × 10−8). The top IBC association for SBP was rs2012318 (P = 6.4 × 10−6) near SLC25A42, and for DBP it was rs2523586 (P = 1.3 × 10−6) near HLA-B. None of the top variants replicated in additional AA (n = 11,882) or European-American (n = 69,899) cohorts. We replicated previously reported European-American blood pressure SNPs in our AA samples (SH2B3, P = 0.009; TBX3-TBX5, P = 0.03; and CSK-ULK3, P = 0.0004). These genetic loci represent the best evidence of genetic influences on SBP and DBP in AAs to date. More broadly, this work supports the notion that blood pressure among AAs is a trait with genetic underpinnings but also with significant complexity.
HDL cholesterol (HDL-C) is an established marker of cardiovascular risk with significant genetic determination. However, HDL particles are not homogeneous, and refined HDL phenotyping may improve insight into the regulation of HDL metabolism. We therefore assessed HDL particles by NMR spectroscopy and conducted a large-scale candidate gene association analysis.
We measured plasma HDL-C and determined mean HDL particle size and particle number by NMR spectroscopy in 2,024 individuals from 512 British Caucasian families. Genotyping comprised 49,094 SNPs in >2,100 cardiometabolic candidate genes/loci, as represented on the HumanCVD BeadChip version 2. False discovery rates (FDR) were calculated to account for multiple testing. Analyses of classical HDL-C revealed significant associations (FDR < 0.05) only for CETP (cholesteryl ester transfer protein; lead SNP rs3764261: p = 5.6 × 10−15) and SGCD (sarcoglycan delta; rs6877118: p = 8.6 × 10−6). In contrast, analysis of HDL mean particle size yielded additional associations in LIPC (hepatic lipase; rs261332: p = 6.1 × 10−9), PLTP (phospholipid transfer protein; rs4810479: p = 1.7 × 10−8) and FBLN5 (fibulin-5; rs2246416: p = 6.2 × 10−6). The associations of SGCD and FBLN5 with HDL particle size could not be replicated in PROCARDIS (n = 3,078) and/or the Women's Genome Health Study (n = 23,170).
We show that refined HDL phenotyping by NMR spectroscopy can detect known genes of HDL metabolism better than analyses on HDL-C.
Elevated blood pressure is a common, heritable cause of cardiovascular disease worldwide. To date, identification of common genetic variants influencing blood pressure has proven challenging. We tested 2.5 million genotyped and imputed SNPs for association with systolic and diastolic blood pressure in 34,433 subjects of European ancestry from the Global BPgen consortium and followed up findings with direct genotyping (N ≤ 71,225 European ancestry, N = 12,889 Indian Asian ancestry) and in silico comparison (CHARGE consortium, N = 29,136). We identified associations between systolic or diastolic blood pressure and common variants in eight regions near the CYP17A1 (P = 7 × 10−24), CYP1A2 (P = 1 × 10−23), FGF5 (P = 1 × 10−21), SH2B3 (P = 3 × 10−18), MTHFR (P = 2 × 10−13), c10orf107 (P = 1 × 10−9), ZNF652 (P = 5 × 10−9) and PLCD3 (P = 1 × 10−8) genes. All variants associated with continuous blood pressure were also associated with dichotomous hypertension. These associations between common variants and blood pressure and hypertension offer mechanistic insights into the regulation of blood pressure and may point to novel targets for interventions to prevent cardiovascular disease.
Background Vast sample sizes are often essential in the quest to disentangle the complex interplay of the genetic, lifestyle, environmental and social factors that determine the aetiology and progression of chronic diseases. The pooling of information between studies is therefore of central importance to contemporary bioscience. However, there are many technical, ethico-legal and scientific challenges to be overcome if an effective, valid, pooled analysis is to be achieved. Perhaps most critically, any data that are to be analysed in this way must be adequately ‘harmonized’. This implies that the collection and recording of information and data must be done in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place.
Methods This conceptual article describes the origins, purpose and scientific foundations of the DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research; http://www.datashaper.org), which has been created by a multidisciplinary consortium of experts that was pulled together and coordinated by three international organizations: P3G (Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and CPT (Canadian Partnership for Tomorrow Project).
Results The DataSHaPER provides a flexible, structured approach to the harmonization and pooling of information between studies. Its two primary components, the ‘DataSchema’ and ‘Harmonization Platforms’, together support the preparation of effective data-collection protocols and provide a central reference to facilitate harmonization. The DataSHaPER supports both ‘prospective’ and ‘retrospective’ harmonization.
Conclusion It is hoped that this article will encourage readers to investigate the project further: the more the research groups and studies are actively involved, the more effective the DataSHaPER programme will ultimately be.
Data synthesis; data quality; data pooling; harmonization; meta-analysis; DataSHaPER; prospective harmonization; retrospective harmonization
Background Contemporary bioscience sometimes demands vast sample sizes and there is often then no choice but to synthesize data across several studies and to undertake an appropriate pooled analysis. This same need is also faced in health-services and socio-economic research. When a pooled analysis is required, analytic efficiency and flexibility are often best served by combining the individual-level data from all sources and analysing them as a single large data set. But ethico-legal constraints, including the wording of consent forms and privacy legislation, often prohibit or discourage the sharing of individual-level data, particularly across national or other jurisdictional boundaries. This leads to a fundamental conflict between competing public goods: individual-level analysis is desirable from a scientific perspective, but is prevented by ethico-legal considerations that are entirely valid.
Methods Data aggregation through anonymous summary statistics from harmonized individual-level databases (DataSHIELD) provides a simple approach to analysing pooled data that circumvents this conflict. This is achieved via parallelized analysis and modern distributed computing and, in one key setting, takes advantage of the properties of the updating algorithm for generalized linear models (GLMs).
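The GLM property being exploited is that each iteration of the standard fitting algorithm needs only the information matrix and score vector, which are sums over individuals and therefore sums over sites. The sketch below illustrates this for logistic regression on simulated data; it is an illustration of the principle only (DataSHIELD itself is implemented in R), and yields the same estimates as a pooled individual-level fit:

```python
import numpy as np

rng = np.random.default_rng(3)

def local_pieces(X, y, beta):
    """Per-site contribution to one update step of a logistic-regression
    GLM: only the information matrix and score vector leave the site."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    W = mu * (1.0 - mu)
    return X.T @ (W[:, None] * X), X.T @ (y - mu)   # X'WX and X'(y - mu)

def federated_logit(sites, p, n_iter=25):
    """Analysis computer: sum the non-disclosive pieces across sites and
    apply the standard Newton/IRLS update, exactly as in a pooled fit."""
    beta = np.zeros(p)
    for _ in range(n_iter):
        pieces = [local_pieces(X, y, beta) for X, y in sites]  # at the DCs
        info = sum(i for i, _ in pieces)
        score = sum(s for _, s in pieces)
        beta = beta + np.linalg.solve(info, score)
    return beta

def make_site(n, beta_true):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))
    return X, y

beta_true = np.array([-0.5, 0.8])
sites = [make_site(n, beta_true) for n in (400, 700, 900)]   # three DCs
beta_hat = federated_logit(sites, p=2)
print(beta_hat)   # close to beta_true, as in a single pooled analysis
```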
Results The conceptual use of DataSHIELD is illustrated in two different settings.
Conclusions As the study of the aetiological architecture of chronic diseases advances to encompass more complex causal pathways—e.g. to include the joint effects of genes, lifestyle and environment—sample size requirements will increase further and the analysis of pooled individual-level data will become ever more important. An aim of this conceptual article is to encourage others to address the challenges and opportunities that DataSHIELD presents, and to explore potential extensions, for example to its use when different data sources hold different data on the same individuals.
Pooling; analysis; meta-analysis; individual-level; study-level; generalized linear model; GLM; ethico-legal; ELSI; identification; disclosure; distributed computing; bioinformatics; information technology; IT
Pulmonary function measures are heritable traits that predict morbidity and mortality and define chronic obstructive pulmonary disease (COPD). We tested genome-wide association with forced expiratory volume in 1 s (FEV1) and the ratio of FEV1 to forced vital capacity (FVC) in the SpiroMeta consortium (n = 20,288 individuals of European ancestry). We conducted a meta-analysis of top signals with data from direct genotyping (n ≤ 32,184 additional individuals) and in silico summary association data from the CHARGE Consortium (n = 21,209) and the Health 2000 survey (n ≤ 883). We confirmed the reported locus at 4q31 and identified associations with FEV1 or FEV1/FVC and common variants at five additional loci: 2q35 in TNS1 (P = 1.11 × 10−12), 4q24 in GSTCD (P = 2.18 × 10−23), 5q33 in HTR4 (P = 4.29 × 10−9), 6p21 in AGER (P = 3.07 × 10−15) and 15q23 in THSD4 (P = 7.24 × 10−15). mRNA analyses showed expression of TNS1, GSTCD, AGER, HTR4 and THSD4 in human lung tissue. These associations offer mechanistic insight into pulmonary function regulation and indicate potential targets for interventions to alleviate respiratory disease.
Genome-wide association studies (GWAS) have led to a rapid increase in available data on common genetic variants and phenotypes and numerous discoveries of new loci associated with susceptibility to common complex diseases. Integrating the evidence from GWAS and candidate gene studies depends on concerted efforts in data production, online publication, database development, and continuously updated data synthesis. Here the authors summarize current experience and challenges on these fronts, which were discussed at a 2008 multidisciplinary workshop sponsored by the Human Genome Epidemiology Network. Comprehensive field synopses that integrate many reported gene-disease associations have been systematically developed for several fields, including Alzheimer's disease, schizophrenia, bladder cancer, coronary heart disease, preterm birth, and DNA repair genes in various cancers. The authors summarize insights from these field synopses and discuss remaining unresolved issues—especially in the light of evidence from GWAS, for which they summarize empirical P-value and effect-size data on 223 discovered associations for binary outcomes (142 with P < 10−7). They also present a vision of collaboration that builds reliable cumulative evidence for genetic associations with common complex diseases and a transparent, distributed, authoritative knowledge base on genetic variation and human health. As a next step in the evolution of Human Genome Epidemiology reviews, the authors invite investigators to submit field synopses for possible publication in the American Journal of Epidemiology.
association; database; encyclopedias; epidemiologic methods; genome, human; genome-wide association study; genomics; meta-analysis
Background Despite earlier doubts, a string of recent successes indicates that if sample sizes are large enough, it is possible—both in theory and in practice—to identify and replicate genetic associations with common complex diseases. But human genome epidemiology is expensive and, from a strategic perspective, it is still unclear what ‘large enough’ really means. This question has critical implications for governments, funding agencies, bioscientists and the tax-paying public. Difficult strategic decisions with imposing price tags and important opportunity costs must be taken.
Methods Conventional power calculations for case–control studies disregard many basic elements of analytic complexity—e.g. errors in clinical assessment, and the impact of unmeasured aetiological determinants—and can seriously underestimate true sample size requirements. This article describes, and applies, a rigorous simulation-based approach to power calculation that deals more comprehensively with analytic complexity and has been implemented on the web as ESPRESSO (www.p3gobservatory.org/powercalculator.htm).
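The core logic of simulation-based power estimation is straightforward: repeatedly simulate a study under a specified disease model, inject the sources of analytic complexity (here, imperfect clinical assessment of outcome status), run the planned test on each simulated data set, and report the proportion of simulations that reach significance. The sketch below illustrates this for a simple allelic chi-square test; it is a hedged toy model, not the ESPRESSO implementation, and all parameter names and defaults are invented for the example.

```python
import numpy as np

def simulate_power(n_cases, n_controls, maf, odds_ratio,
                   sensitivity=1.0, specificity=1.0,
                   n_sims=400, chi2_crit=3.84, baseline=0.1, seed=7):
    """Empirical power of a 1-df allelic chi-square test (3.84 ~ alpha = 0.05),
    allowing for imperfect assessment of disease status (toy parameters)."""
    rng = np.random.default_rng(seed)
    log_or = np.log(odds_ratio)
    base_logit = np.log(baseline / (1.0 - baseline))
    hits = 0
    for _ in range(n_sims):
        n_pop = 20 * (n_cases + n_controls)
        g = rng.binomial(2, maf, n_pop)              # additive genotype 0/1/2
        p = 1.0 / (1.0 + np.exp(-(base_logit + log_or * g)))
        true_d = rng.random(n_pop) < p
        # imperfect clinical assessment: misclassify the true status
        obs_d = np.where(true_d,
                         rng.random(n_pop) < sensitivity,
                         rng.random(n_pop) > specificity)
        cases, ctrls = g[obs_d][:n_cases], g[~obs_d][:n_controls]
        # 2x2 table of minor vs major allele counts in cases and controls
        tab = np.array([[cases.sum(), 2 * n_cases - cases.sum()],
                        [ctrls.sum(), 2 * n_controls - ctrls.sum()]], float)
        exp = np.outer(tab.sum(1), tab.sum(0)) / tab.sum()
        chi2 = ((tab - exp) ** 2 / exp).sum()
        hits += chi2 > chi2_crit
    return hits / n_sims
```

Running this with perfect versus degraded outcome assessment shows the effect the abstract describes: misclassification attenuates the observed association, so the power reported by a naive closed-form calculation that ignores it would be an overestimate.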
Results Using this approach, the article explores the realistic power profile of stand-alone and nested case–control studies in a variety of settings and provides a robust quantitative foundation for determining the required sample size both of individual biobanks and of large disease-based consortia. Despite universal acknowledgment of the importance of large sample sizes, our results suggest that contemporary initiatives are still, at best, at the lower end of the range of desirable sample size. Insufficient power remains particularly problematic for studies exploring gene–gene or gene–environment interactions.
Discussion Sample size calculation must be both accurate and realistic, and we must continue to strengthen national and international cooperation in the design, conduct, harmonization and integration of studies in human genome epidemiology.
Human genome epidemiology; biobank; sample size; statistical power; simulation studies; measurement error; reliability; aetiological heterogeneity
Nuala Sheehan and colleagues describe how Mendelian randomization provides an alternative way of dealing with the problems of observational studies, especially confounding.
The study of change in intermediate phenotypes over time is important in genetics. In this paper we explore a new approach to phenotype definition in the genetic analysis of longitudinal phenotypes. We utilized data from the longitudinal Framingham Heart Study Family Cohort to investigate the familial aggregation and evidence for linkage to change in systolic blood pressure (SBP) over time. We used Gibbs sampling to derive sigma-squared-A-random-effects (SSARs) for the longitudinal phenotype, and then used these as a new phenotype in subsequent genome-wide linkage analyses.
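The essential move here is to reduce each subject's repeated measurements to a single derived quantity (the subject-specific random effect for rate of change) and then treat that quantity as a phenotype in its own right. As a minimal stand-in for the Gibbs-sampled SSARs described above, the sketch below derives per-subject slopes of SBP on age by ordinary least squares; the simpler estimator and all names are assumptions made for illustration.

```python
import numpy as np

def derive_slope_phenotypes(times, values):
    """Per-subject OLS slope of SBP on age: a simple stand-in for the
    Gibbs-sampled random effects (SSARs) used as derived phenotypes.
    `times` and `values` are parallel lists, one array per subject."""
    slopes = []
    for t, y in zip(times, values):
        t, y = np.asarray(t, float), np.asarray(y, float)
        tc = t - t.mean()                     # centre age within subject
        slopes.append((tc @ (y - y.mean())) / (tc @ tc))
    return np.array(slopes)                   # one derived phenotype/subject
```

Each element of the returned array summarizes one subject's trajectory and can be fed into a conventional genome-wide linkage (or association) analysis exactly as a cross-sectional quantitative trait would be.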
Additive genetic effects (σ²A,time) were estimated to account for ~9.2% of the variance in the rate of change of SBP with age, while additive genetic effects (σ²A) were estimated to account for ~43.9% of the variance in SBP at the mean age. The linkage results suggested that one or more major loci regulating change in SBP over time may localize to chromosomes 2, 3, 4, 6, 10, 11, 17, and 19. The results also suggested that one or more major loci regulating level of SBP may localize to chromosomes 3, 8, and 14.
Our results support a genetic component to both SBP and change in SBP with age, and are consistent with a complex, multifactorial susceptibility to the development of hypertension. The use of SSARs derived from quantitative traits as input to a conventional linkage analysis appears to be valuable in the linkage analysis of genetically complex traits. We have now demonstrated in this paper the use of SSARs in the context of longitudinal family data.