Background: Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate and controversy relating to the UK’s proposed ‘care.data’ initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data.
Methods: Commands are sent from a central analysis computer (AC) to several data computers (DCs) storing the data to be co-analysed. The data sets are analysed simultaneously but in parallel. The separate parallelized analyses are linked by non-disclosive summary statistics and commands transmitted back and forth between the DCs and the AC. This paper describes the technical implementation of DataSHIELD using a modified R statistical environment linked to an Opal database deployed behind the computer firewall of each DC. Analysis is controlled through a standard R environment at the AC.
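The exchange between the AC and the DCs described above can be illustrated with a minimal Python sketch. It is not the Opal/R implementation: the class `DataComputer`, the function `pooled_mean` and the threshold `MIN_COUNT` are hypothetical names introduced here, and the small-group check is an assumed example of a non-disclosure rule rather than one stated in this abstract. Each DC releases only a count and a sum, and the AC combines them.

```python
from dataclasses import dataclass

MIN_COUNT = 5  # hypothetical disclosure threshold, not taken from the paper


@dataclass
class DataComputer:
    """A data computer (DC) that keeps individual-level values private."""
    name: str
    _values: list

    def summary(self):
        """Return aggregate statistics only; refuse if the group is too small."""
        n = len(self._values)
        if n < MIN_COUNT:
            raise ValueError(f"{self.name}: too few records to release a summary")
        return {"n": n, "sum": sum(self._values)}


def pooled_mean(dcs):
    """Analysis computer (AC) side: combine per-site summaries into one mean."""
    summaries = [dc.summary() for dc in dcs]
    total_n = sum(s["n"] for s in summaries)
    total_sum = sum(s["sum"] for s in summaries)
    return total_sum / total_n
```

The AC never sees the individual values, yet the pooled mean is identical to the one that would be obtained from a single combined data set.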
Results: Based on this Opal/R implementation, DataSHIELD is currently used by the Healthy Obese Project and the Environmental Core Project (BioSHaRE-EU) for the federated analysis of 10 data sets across eight European countries, and this illustrates the opportunities and challenges presented by the DataSHIELD approach.
Conclusions: DataSHIELD facilitates important research in settings where: (i) a co-analysis of individual-level data from several studies is scientifically necessary but governance restrictions prohibit the release or sharing of some of the required data, and/or render data access unacceptably slow; (ii) a research group (e.g. in a developing nation) is particularly vulnerable to loss of intellectual property—the researchers want to fully share the information held in their data with national and international collaborators, but do not wish to hand over the physical data themselves; and (iii) a data set is to be included in an individual-level co-analysis but the physical size of the data precludes direct transfer to a new site for analysis.
DataSHIELD; pooled analysis; ELSI; privacy; confidentiality; disclosure; distributed computing; intellectual property; bioinformatics
Not all obese subjects have an adverse metabolic profile predisposing them to developing type 2 diabetes or cardiovascular disease. The BioSHaRE-EU Healthy Obese Project aims to gain insights into the consequences of (healthy) obesity using data on risk factors and phenotypes across several large-scale cohort studies. The aim of this study was to describe the prevalence of obesity, metabolic syndrome (MetS) and metabolically healthy obesity (MHO) in ten participating studies.
Ten different cohorts in seven countries were combined, using data transformed into a harmonized format. All participants were of European origin, with age 18–80 years. They had participated in a clinical examination for anthropometric and blood pressure measurements. Blood samples had been drawn for analysis of lipids and glucose. Presence of MetS was assessed in those with obesity (BMI ≥ 30 kg/m²) based on the 2001 NCEP ATP III criteria, as well as an adapted set of less strict criteria. MHO was defined as obesity, having none of the MetS components, and no previous diagnosis of cardiovascular disease.
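The classification described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's harmonized algorithm: the cut-offs are the commonly cited 2001 NCEP ATP III thresholds converted to mmol/L, treatment with medication is ignored, the adapted less strict criteria are not modelled, and the names `Participant`, `mets_components` and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    sex: str              # "M" or "F"
    bmi: float            # kg/m^2
    waist_cm: float
    triglycerides: float  # mmol/L
    hdl: float            # mmol/L
    systolic: float       # mmHg
    diastolic: float      # mmHg
    glucose: float        # fasting, mmol/L
    prior_cvd: bool


def mets_components(p):
    """Count how many of the five NCEP ATP III components are present."""
    waist_cut = 102 if p.sex == "M" else 88      # assumed cut-offs in cm
    hdl_cut = 1.03 if p.sex == "M" else 1.29     # assumed cut-offs in mmol/L
    flags = [
        p.waist_cm > waist_cut,
        p.triglycerides >= 1.7,
        p.hdl < hdl_cut,
        p.systolic >= 130 or p.diastolic >= 85,
        p.glucose >= 6.1,
    ]
    return sum(flags)


def classify(p):
    """Label a participant following the definitions given in the text."""
    if p.bmi < 30:
        return "not obese"
    n = mets_components(p)
    if n == 0 and not p.prior_cvd:
        return "MHO"
    if n >= 3:
        return "obese with MetS"
    return "obese, other"
```

Obese participants with one or two components, or with no components but a previous cardiovascular diagnosis, fall into neither the MetS nor the MHO group in this sketch.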
Data for 163,517 individuals were available; 17% were obese (11,465 men and 16,612 women). The prevalence of obesity varied from 11.6% in the Italian CHRIS cohort to 26.3% in the German KORA cohort. The age-standardized percentage of obese subjects with MetS ranged in women from 24% in CHRIS to 65% in the Finnish Health2000 cohort, and in men from 43% in CHRIS to 78% in the Finnish DILGOM cohort, with elevated blood pressure the most frequently occurring factor contributing to the prevalence of the metabolic syndrome. The age-standardized prevalence of MHO varied in women from 7% in Health2000 to 28% in NCDS, and in men from 2% in DILGOM to 19% in CHRIS. MHO was more prevalent in women than in men, and decreased with age in both sexes.
Through a rigorous harmonization process, the BioSHaRE-EU consortium was able to compare key characteristics defining the metabolically healthy obese phenotype across ten cohort studies. There is considerable variability in the prevalence of healthy obesity across the different European populations studied, even when unified criteria were used to classify this phenotype.
Harmonization; Obesity; Metabolic syndrome; Cardiovascular disease; Metabolically healthy
Linkage of data collected by large Canadian cohort studies with provincially managed administrative health databases can offer very interesting avenues for multidisciplinary and cost-effective health research in Canada. Successfully co-analyzing cohort data and administrative health data (AHD) can lead to research results capable of improving the health and well-being of Canadians and enhancing the delivery of health care services. However, such an endeavour will require strong coordination and long-term commitment between all stakeholders involved. The challenges and opportunities of a pan-Canadian cohort-to-AHD data linkage program have been considered by cohort study investigators and data custodians from each Canadian province. Stakeholders acknowledge the important public health benefits of establishing such a program and have established an action plan to move forward.
PMID: 23823892 CAMSID: cams3806
Medical record linkage; data collection; public health
Individual-level data pooling of large population-based studies across research centres in international research projects faces many hurdles. The BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) project aims to address these issues by building a collaborative group of investigators and developing tools for data harmonization, database integration and federated data analyses.
Eight population-based studies in six European countries were recruited to participate in the BioSHaRE project. Through workshops, teleconferences and electronic communications, participating investigators identified a set of 96 variables targeted for harmonization to answer research questions of interest. Using each study’s questionnaires, standard operating procedures, and data dictionaries, harmonization potential was assessed. Whenever harmonization was deemed possible, processing algorithms were developed and implemented in an open-source software infrastructure to transform study-specific data into the target (i.e. harmonized) format. Harmonized datasets located on servers in each of the research centres across Europe were interconnected through a federated database system to perform statistical analysis.
Retrospective harmonization led to the generation of common format variables for 73% of matches considered (96 targeted variables across 8 studies). Authenticated investigators can now perform complex statistical analyses of harmonized datasets stored on distributed servers without actually sharing individual-level data using the DataSHIELD method.
New Internet-based networking technologies and database management systems are providing the means to support collaborative, multi-center research in an efficient and secure manner. The results from this pilot project show that, given a strong collaborative relationship between participating studies, it is possible to seamlessly co-analyse internationally harmonized research databases while allowing each study to retain full control over individual-level data. We encourage additional collaborative research networks in epidemiology, public health, and the social sciences to make use of the open source tools presented herein.
Biobanks can have a pivotal role in elucidating disease etiology, translation, and advancing public health. However, meeting these challenges hinges on a critical shift in the way science is conducted and requires biobank harmonization. There is growing recognition that a common strategy is imperative to develop biobanking globally and effectively. To help guide this strategy, we articulate key principles, goals, and priorities underpinning a roadmap for global biobanking to accelerate health science, patient care, and public health. The need to manage and share very large amounts of data has driven innovations on many fronts. Although technological solutions are allowing biobanks to reach new levels of integration, increasingly powerful data-collection tools, analytical techniques, and the results they generate raise new ethical and legal issues and challenges, necessitating a reconsideration of previous policies, practices, and ethical norms. These manifold advances and the investments that support them are also fueling opportunities for biobanks to ultimately become integral parts of health-care systems in many countries. International harmonization to increase interoperability and sustainability are two strategic priorities for biobanking. Tackling these issues requires an environment favorably inclined toward scientific funding and equipped to address socio-ethical challenges. Cooperation and collaboration must extend beyond systems to enable the exchange of data and samples to strategic alliances between many organizations, including governmental bodies, funding agencies, public and private science enterprises, and other stakeholders, including patients. A common vision is required and we articulate the essential basis of such a vision herein.
Background Proper understanding of the roles of, and interactions between genetic, lifestyle, environmental and psycho-social factors in determining the risk of development and/or progression of chronic diseases requires access to very large high-quality databases. Because of the financial, technical and time burdens related to developing and maintaining very large studies, the scientific community is increasingly synthesizing data from multiple studies to construct large databases. However, the data items collected by individual studies must be inferentially equivalent to be meaningfully synthesized. The DataSchema and Harmonization Platform for Epidemiological Research (DataSHaPER; http://www.datashaper.org) was developed to enable the rigorous assessment of the inferential equivalence, i.e. the potential for harmonization, of selected information from individual studies.
Methods This article examines the value of using the DataSHaPER for retrospective harmonization of established studies. Using the DataSHaPER approach, the potential to generate 148 harmonized variables from the questionnaires and physical measures collected in 53 large population-based studies (6.9 million participants) was assessed. Variable and study characteristics that might influence the potential for data synthesis were also explored.
Results Out of all assessment items evaluated (148 variables for each of the 53 studies), 38% could be harmonized. Certain characteristics of variables (i.e. relative importance, individual targeted, reference period) and of studies (i.e. observational units, data collection start date and mode of questionnaire administration) were associated with the potential for harmonization. For example, for variables deemed to be essential, 62% of assessment items paired could be harmonized.
Conclusion The current article shows that the DataSHaPER provides an effective and flexible approach for the retrospective harmonization of information across studies. To implement data synthesis, some additional scientific, ethico-legal and technical considerations must be addressed. The success of the DataSHaPER as a harmonization approach will depend on its continuing development and on the rigour and extent of its use. The DataSHaPER has the potential to take us closer to a truly collaborative epidemiology and offers the promise of enhanced research potential generated through synthesized databases.
Data synthesis; data quality; data pooling; harmonization; meta-analysis; DataSHaPER; retrospective harmonization
The rapid and continuing progress in gene discovery for complex diseases is fuelling interest in the potential application of genetic risk models for clinical and public health practice. The number of studies assessing the predictive ability is steadily increasing, but they vary widely in completeness of reporting and apparent quality. Transparent reporting of the strengths and weaknesses of these studies is important to facilitate the accumulation of evidence on genetic risk prediction. A multidisciplinary workshop sponsored by the Human Genome Epidemiology Network developed a checklist of 25 items recommended for strengthening the reporting of Genetic RIsk Prediction Studies (GRIPS), building on the principles established by prior reporting guidelines. These recommendations aim to enhance the transparency, quality and completeness of study reporting, and thereby to improve the synthesis and application of information from multiple studies that might differ in design, conduct or analysis.
Genetic; Risk prediction; Methodology; Guidelines; Reporting
Background Vast sample sizes are often essential in the quest to disentangle the complex interplay of the genetic, lifestyle, environmental and social factors that determine the aetiology and progression of chronic diseases. The pooling of information between studies is therefore of central importance to contemporary bioscience. However, there are many technical, ethico-legal and scientific challenges to be overcome if an effective, valid, pooled analysis is to be achieved. Perhaps most critically, any data that are to be analysed in this way must be adequately ‘harmonized’. This implies that the collection and recording of information and data must be done in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place.
Methods This conceptual article describes the origins, purpose and scientific foundations of the DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research; http://www.datashaper.org), which has been created by a multidisciplinary consortium of experts that was pulled together and coordinated by three international organizations: P3G (Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and CPT (Canadian Partnership for Tomorrow Project).
Results The DataSHaPER provides a flexible, structured approach to the harmonization and pooling of information between studies. Its two primary components, the ‘DataSchema’ and ‘Harmonization Platforms’, together support the preparation of effective data-collection protocols and provide a central reference to facilitate harmonization. The DataSHaPER supports both ‘prospective’ and ‘retrospective’ harmonization.
Conclusion It is hoped that this article will encourage readers to investigate the project further: the more the research groups and studies are actively involved, the more effective the DataSHaPER programme will ultimately be.
Data synthesis; data quality; data pooling; harmonization; meta-analysis; DataSHaPER; prospective harmonization; retrospective harmonization
Background Contemporary bioscience sometimes demands vast sample sizes and there is often then no choice but to synthesize data across several studies and to undertake an appropriate pooled analysis. This same need is also faced in health-services and socio-economic research. When a pooled analysis is required, analytic efficiency and flexibility are often best served by combining the individual-level data from all sources and analysing them as a single large data set. But ethico-legal constraints, including the wording of consent forms and privacy legislation, often prohibit or discourage the sharing of individual-level data, particularly across national or other jurisdictional boundaries. This leads to a fundamental conflict in competing public goods: individual-level analysis is desirable from a scientific perspective, but is prevented by ethico-legal considerations that are entirely valid.
Methods Data aggregation through anonymous summary-statistics from harmonized individual-level databases (DataSHIELD) provides a simple approach to analysing pooled data that circumvents this conflict. This is achieved via parallelized analysis and modern distributed computing and, in one key setting, takes advantage of the properties of the updating algorithm for generalized linear models (GLMs).
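The property of the GLM updating algorithm referred to above is that each iteration needs only sums over individuals, and these sums can be accumulated site by site. The following Python sketch illustrates this for logistic regression using Newton-Raphson updates (equivalent to iteratively reweighted least squares for the canonical link). It is a sketch under assumptions, not the DataSHIELD code: the function names `site_contribution`, `solve` and `fit_federated` are hypothetical, and no disclosure checks are applied.

```python
import math


def site_contribution(X, y, beta):
    """Run at one data source: return the information matrix and score vector.

    Only these aggregate sums leave the site, never the rows of X or y.
    """
    p = len(beta)
    info = [[0.0] * p for _ in range(p)]
    score = [0.0] * p
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))
        mu = 1.0 / (1.0 + math.exp(-eta))
        w = mu * (1.0 - mu)
        for j in range(p):
            score[j] += xi[j] * (yi - mu)
            for k in range(p):
                info[j][k] += w * xi[j] * xi[k]
    return info, score


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_federated(sites, p, max_iter=25, tol=1e-10):
    """Central side: add up per-site summaries and update the coefficients."""
    beta = [0.0] * p
    for _ in range(max_iter):
        info = [[0.0] * p for _ in range(p)]
        score = [0.0] * p
        for X, y in sites:
            site_info, site_score = site_contribution(X, y, beta)
            for j in range(p):
                score[j] += site_score[j]
                for k in range(p):
                    info[j][k] += site_info[j][k]
        step = solve(info, score)
        beta = [b + d for b, d in zip(beta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return beta
```

Because the per-site sums add up to exactly the sums that a single pooled data set would give, the federated fit reproduces the individual-level pooled estimates up to floating-point rounding.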
Results The conceptual use of DataSHIELD is illustrated in two different settings.
Conclusions As the study of the aetiological architecture of chronic diseases advances to encompass more complex causal pathways—e.g. to include the joint effects of genes, lifestyle and environment—sample size requirements will increase further and the analysis of pooled individual-level data will become ever more important. An aim of this conceptual article is to encourage others to address the challenges and opportunities that DataSHIELD presents, and to explore potential extensions, for example to its use when different data sources hold different data on the same individuals.
Pooling; analysis; meta-analysis; individual-level; study-level; generalized linear model; GLM; ethico-legal; ELSI; identification; disclosure; distributed computing; bioinformatics; information technology; IT
Making sense of rapidly evolving evidence on genetic associations is crucial to making genuine advances in human genomics and the eventual integration of this information in the practice of medicine and public health. Assessment of the strengths and weaknesses of this evidence, and hence the ability to synthesize it, has been limited by inadequate reporting of results. The STrengthening the REporting of Genetic Association studies (STREGA) initiative builds on the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement and provides additions to 12 of the 22 items on the STROBE checklist. The additions concern population stratification, genotyping errors, modeling haplotype variation, Hardy–Weinberg equilibrium, replication, selection of participants, rationale for choice of genes and variants, treatment effects in studying quantitative traits, statistical methods, relatedness, reporting of descriptive and outcome data, and the volume of data issues that are important to consider in genetic association studies. The STREGA recommendations do not prescribe or dictate how a genetic association study should be designed but seek to enhance the transparency of its reporting, regardless of choices made during design, conduct, or analysis.
Gene–disease associations; Genetics; Gene–environment interaction; Systematic review; Meta analysis; Reporting recommendations; Epidemiology; Genome-wide association
Genome-wide association studies (GWAS) have led to a rapid increase in available data on common genetic variants and phenotypes and numerous discoveries of new loci associated with susceptibility to common complex diseases. Integrating the evidence from GWAS and candidate gene studies depends on concerted efforts in data production, online publication, database development, and continuously updated data synthesis. Here the authors summarize current experience and challenges on these fronts, which were discussed at a 2008 multidisciplinary workshop sponsored by the Human Genome Epidemiology Network. Comprehensive field synopses that integrate many reported gene-disease associations have been systematically developed for several fields, including Alzheimer's disease, schizophrenia, bladder cancer, coronary heart disease, preterm birth, and DNA repair genes in various cancers. The authors summarize insights from these field synopses and discuss remaining unresolved issues—especially in the light of evidence from GWAS, for which they summarize empirical P-value and effect-size data on 223 discovered associations for binary outcomes (142 with P < 10⁻⁷). They also present a vision of collaboration that builds reliable cumulative evidence for genetic associations with common complex diseases and a transparent, distributed, authoritative knowledge base on genetic variation and human health. As a next step in the evolution of Human Genome Epidemiology reviews, the authors invite investigators to submit field synopses for possible publication in the American Journal of Epidemiology.
association; database; encyclopedias; epidemiologic methods; genome, human; genome-wide association study; genomics; meta-analysis
Persons exposed to residential traffic have increased rates of respiratory morbidity and mortality. As poverty is an important determinant of ill health, some have argued that these associations may relate to the lower socioeconomic status of those living along major roads.
The objective was to evaluate the association between traffic intensity at home and hospital admissions for respiratory disease among Montreal residents of 60 years and older.
Case hospitalisations were those with respiratory diagnoses and control hospitalisations were those where the primary discharge diagnosis was non‐respiratory. Morning peak traffic estimates from the EMME/2 Montreal traffic model (MOTREM98) were used as an indicator of exposure to road traffic outside the homes of those hospitalised. The crude association between traffic intensity and hospitalisation for respiratory disease was adjusted by an area based estimate of the appraised value of patients' residences, expressed as a dollar average over a small segment of road (lodging value). This indicator of socioeconomic status, as calculated from the Montreal property assessment database, is available at a finer geographical scale than the neighbourhood socioeconomic indicators accessible from the Canadian census.
Increased odds of being hospitalised for a respiratory compared with a control diagnosis were associated with higher levels of estimated road traffic near patients' homes, even after adjustment for lodging value (crude OR 1.35, 95% CI 1.22 to 1.49; adjusted OR 1.18, 95% CI 1.06 to 1.31 for >3160 vehicles passing during the three-hour morning traffic peak compared with secondary roads off network).
The results suggest that road traffic intensity itself may affect the respiratory health of elderly residents of a large Canadian city, an association that is not solely a reflection of socioeconomic status.
traffic intensity; hospital admissions; respiratory diagnoses; housing; socioeconomic status
Julian Little and colleagues present the STREGA recommendations, which are aimed at improving the reporting of genetic association studies.
gene-disease associations; genetics; gene-environment interaction; systematic review; meta analysis; reporting recommendations; epidemiology; genome-wide association
Background Despite earlier doubts, a string of recent successes indicates that if sample sizes are large enough, it is possible—both in theory and in practice—to identify and replicate genetic associations with common complex diseases. But human genome epidemiology is expensive and, from a strategic perspective, it is still unclear what ‘large enough’ really means. This question has critical implications for governments, funding agencies, bioscientists and the tax-paying public. Difficult strategic decisions with imposing price tags and important opportunity costs must be taken.
Methods Conventional power calculations for case–control studies disregard many basic elements of analytic complexity—e.g. errors in clinical assessment, and the impact of unmeasured aetiological determinants—and can seriously underestimate true sample size requirements. This article describes, and applies, a rigorous simulation-based approach to power calculation that deals more comprehensively with analytic complexity and has been implemented on the web as ESPRESSO: (www.p3gobservatory.org/powercalculator.htm).
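The idea of a simulation-based power calculation that allows for measurement error can be illustrated with a minimal Python sketch. It is not the ESPRESSO algorithm: it assumes a single binary exposure, an unmatched case-control design, a Wald test on the log odds ratio at a fixed threshold of 1.96, and non-differential misclassification governed by `sensitivity` and `specificity`. The function name `simulate_power` and all parameter names are hypothetical.

```python
import math
import random


def simulate_power(n_cases, n_controls, p_exposed_controls, odds_ratio,
                   sensitivity=1.0, specificity=1.0, n_sims=300, seed=1):
    """Estimate power as the share of simulated studies with |z| > 1.96."""
    rng = random.Random(seed)
    odds0 = p_exposed_controls / (1 - p_exposed_controls)
    odds1 = odds0 * odds_ratio
    p_exposed_cases = odds1 / (1 + odds1)

    def observed_exposed(n, p_true):
        """Count subjects recorded as exposed after misclassification."""
        count = 0
        for _ in range(n):
            truly_exposed = rng.random() < p_true
            p_recorded = sensitivity if truly_exposed else 1 - specificity
            if rng.random() < p_recorded:
                count += 1
        return count

    hits = 0
    for _ in range(n_sims):
        exposed_cases = observed_exposed(n_cases, p_exposed_cases)
        exposed_controls = observed_exposed(n_controls, p_exposed_controls)
        # Add 0.5 to every cell so the log odds ratio is always defined.
        a = exposed_cases + 0.5
        b = n_cases - exposed_cases + 0.5
        c = exposed_controls + 0.5
        d = n_controls - exposed_controls + 0.5
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        if abs(log_or / se) > 1.96:
            hits += 1
    return hits / n_sims
```

Calling the function twice with the same design, once with perfect measurement and once with, say, `sensitivity=0.7` and `specificity=0.7`, shows how exposure misclassification attenuates the observed odds ratio and lowers power relative to a calculation that ignores it.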
Results Using this approach, the article explores the realistic power profile of stand-alone and nested case–control studies in a variety of settings and provides a robust quantitative foundation for determining the required sample size both of individual biobanks and of large disease-based consortia. Despite universal acknowledgment of the importance of large sample sizes, our results suggest that contemporary initiatives are still, at best, at the lower end of the range of desirable sample size. Insufficient power remains particularly problematic for studies exploring gene–gene or gene–environment interactions.
Discussion Sample size calculation must be both accurate and realistic, and we must continue to strengthen national and international cooperation in the design, conduct, harmonization and integration of studies in human genome epidemiology.
Human genome epidemiology; biobank; sample size; statistical power; simulation studies; measurement error; reliability; aetiological heterogeneity