Background Vast sample sizes are often essential in the quest to disentangle the complex interplay of the genetic, lifestyle, environmental and social factors that determine the aetiology and progression of chronic diseases. The pooling of information between studies is therefore of central importance to contemporary bioscience. However, there are many technical, ethico-legal and scientific challenges to be overcome if an effective, valid, pooled analysis is to be achieved. Perhaps most critically, any data that are to be analysed in this way must be adequately ‘harmonized’. This implies that the collection and recording of information and data must be done in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place.
Methods This conceptual article describes the origins, purpose and scientific foundations of the DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research; http://www.datashaper.org), which has been created by a multidisciplinary consortium of experts that was pulled together and coordinated by three international organizations: P3G (Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and CPT (Canadian Partnership for Tomorrow Project).
Results The DataSHaPER provides a flexible, structured approach to the harmonization and pooling of information between studies. Its two primary components, the ‘DataSchema’ and ‘Harmonization Platforms’, together support the preparation of effective data-collection protocols and provide a central reference to facilitate harmonization. The DataSHaPER supports both ‘prospective’ and ‘retrospective’ harmonization.
Conclusion It is hoped that this article will encourage readers to investigate the project further: the more research groups and studies that are actively involved, the more effective the DataSHaPER programme will ultimately be.
Scientific developments in the wake of the Human Genome1–3 and HapMap4,5 projects are helping to shape the future of public health and clinical medicine.6–8 However, while dramatic progress has been made in detecting genetic associations with complex diseases,9,10 the role of genetic determinants is only a part of a much larger picture. The role of lifestyle, environmental and social factors in modulating the risk and/or progression of chronic diseases has been recognized and explored for many years.11,12 This is entirely logical even from the perspective of functional genomics: the concept of ‘fitness’ that is central to natural selection and human evolution has, as its fundamental basis, the interaction between prevailing environment and the genome.13 This implies that causal pathways leading to disease should be ‘expected’ to involve gene–environment interactions.14,15
It is therefore clear that bioscience needs access to studies that incorporate social, environmental and lifestyle factors as well as genetic determinants.7,15 Provided that the quality of the information that such studies generate is adequate and that the statistical power of key analyses can be rendered sufficient,15–18 it will then be possible to successfully pursue a comprehensive investigation of the direct and interactive effects of a broader range of relevant classes of aetiological determinants. However, in the real world, the attainment of adequate statistical power presents a serious challenge. For example, when appropriate account is taken of assessment errors in both determinants and outcomes, sample-size estimates for analyses involving gene–environment interactions comparable in magnitude with the direct genetic effects that have so far been replicated,9,10,16 typically indicate a requirement for ‘tens of thousands of cases’.16,17,19 This means that even the largest16,20–22 and best measured18,23 of contemporary studies will only be able to generate enough cases—or subjects—for the commonest of complex diseases.16,22 This in turn implies that the analysis of synthesized data across several studies is set to become increasingly important.15,24 Such harmonization may be used to support targeted scientific projects,25–27 and to facilitate synthesis of information among studies28–34 or data portals.35–39
Fortunately, extensive experience already exists in the synthesis of epidemiological studies.33,40–42 For example, data synthesis was pivotal to the success of the EPIC study (the European Prospective Investigation into Cancer and Nutrition) which, starting in the 1990s, recruited more than 500 000 participants via (initially) 22 centres across nine European countries.28,43 EPIC’s focus on nutrition placed heavy demands on sample size, and effective data synthesis across all centres was therefore critical to many of its principal analyses. Although EPIC was designed prospectively as a coordinated consortium of studies, centre-specific questionnaires were used.28,44 In such a setting, data synthesis was constrained by the quality18 of the underlying data and by their compatibility.45 One of the important achievements of the EPIC project was the development of methods and tools (e.g. EPIC SOFT43) to enable calibration and pooling of data that had been collected under different protocols in different centres, so that data synthesis was rendered valid.
However, in common with other major epidemiological consortia—e.g. GenomEUtwin project30 and EURALIM project41—EPIC demonstrated that information synthesis is far from easy. It demands time, resources and rigour.40,43,44,46 Furthermore, as scientific ambitions and capacities have extended, the sample-size challenge continues to grow,9,15–18 and the requirement for effective data synthesis has now become a regular necessity.15,24 Moreover, as different sets of outcome and exposure variables are required for different analyses—and no single study can afford to capture ‘all’ desired measures—individual studies are necessarily being pooled with different combinations of other studies—as demonstrated, e.g. by the number of different consortia involving studies such as Avon Longitudinal Study of Parents and Children (ALSPAC), EPIC-Norfolk and the 1958 Birth Cohort.25,26,47,48 This implies that it would be beneficial to supplement consortium-specific approaches to harmonization, calibration and synthesis29,30,34,41,43 with more generic methods.49–52
However, the scientific utility of data synthesis is always constrained by the quantity and quality of the underlying data,18,53 and by their compatibility between studies.45,54 The latter implies that the collection and recording of information and data must be carried out in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place. When this is so, ‘harmonization’ may be said to exist.53 The fundamental challenge might therefore be viewed as being to increase sample size by synthesizing over an adequate number of studies, but to restrict that synthesis to those studies that are satisfactorily harmonized for the specific outcomes, genetic, environmental and lifestyle factors targeted.42,54 Two complementary approaches may be adopted to support effective data synthesis. The first one principally targets ‘what’ is to be synthesized, whereas the other one focuses on ‘how’ to collect the required information. Thus: (i) core sets of information may be identified to serve as the foundation for a flexible approach to harmonization51,52,55–57; or (ii) standard collection devices (questionnaires and standard operating procedures) may be suggested as a required basis for collection of information.49,58–61
It is with all of these considerations in mind that the DataSHaPER project (DataSchema and Harmonization Platform for Epidemiological Research) has been launched. The DataSHaPER (http://www.datashaper.org) offers free access to questionnaires and core sets of variables that can be used to support the development of data-collection tools for emerging studies or to serve as a central reference for harmonization between pre-existing studies. The DataSHaPER is an international project that is being developed under the joint umbrellas of P3G (the Public Population Project in Genomics50,62), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe63) and CPT (Canadian Partnership for Tomorrow64), in collaboration with more than 50 major studies from around the world. This conceptual article describes the motivation, aims and scientific foundation of the DataSHaPER project.
Standardization is a sine qua non of information pooling. However, scientific, technological, ethical, cultural and other constraints make it difficult to impose identical infrastructures and uniform procedures across studies. Furthermore, it is important to recognize that it is not always necessary to use precisely the same methods and tools for data collection in order to achieve valid data integration across studies. Rather, what ‘is’ crucial is that the information conveyed by each data set is ‘inferentially equivalent’. If the ‘quality’ of the data to be integrated is also adequate, inferential equivalence greatly increases the potential for collaboration between studies and, therefore, the scientific opportunities. The definition of equivalence will vary with the scientific context and must take into account both the primary information collected (e.g. serum cholesterol level) and the qualifying factors that can influence the interpretation of that information (e.g. whether the participant had been fasting prior to sample collection). In some situations, even a small change in the way information is collected can substantially modify scientific compatibility, whereas in others, considerable flexibility can be allowed. Formally, a valid balance must be struck between the use of precisely uniform specifications that render pooling straightforward (e.g. identical questions asked under identical conditions), and the acceptance of greater flexibility and diversity that may be appropriate and more realistic in a collaborative context (e.g. similar questions, but asked by an interviewer in one study and completed by the participant in another).
In an ideal world, information would be ‘prospectively harmonized’: emerging studies would make use, where possible, of harmonized questionnaires and standard operating procedures.53,65 This enhances the potential for future pooling but entails significant challenges—ahead of time—in developing and agreeing to common assessment protocols. However, at the same time, it is important to increase the utility of existing studies by ‘retrospectively harmonizing’ data that have already been collected, to optimize the subset of information that may legitimately be pooled.30,53,65 Here, the quantity and quality of information that can be pooled is limited by the heterogeneity intrinsic to the pre-existing differences in study design and conduct.
The DataSHaPER is both a scientific approach and a practical tool. Originally, the plan had been to develop a standardized questionnaire (or set of questionnaires) with the primary aim of facilitating prospective harmonization of future biobanks. But after some months of work, it became clear that complete standardization was too restrictive and would be of limited applicability to retrospective harmonization. This resulted in a fundamental change of approach that led to the piloting of the concept that is now known as the DataSHaPER. In order to understand the DataSHaPER, an important distinction must be drawn between core ‘variables’—the primary units of interest in a statistical analysis—and the specific ‘assessment items’ that are collected by individual studies (e.g. questions in questionnaires). It is a pre-defined set of ‘variables’ that serves as the reference for harmonization between studies. This approach provides an appropriate level of flexibility, because a given variable may potentially be constructed using different assessment items in different studies. It is important to note that this does not imply a reduction in scientific rigour; the specific information collected by a given study can only be viewed as harmonized to a particular DataSHaPER variable if the assessment items in that study can be used to generate a ‘valid’ equivalent to the required variable. This entails a formal scientific evaluation and validation process.
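The distinction between a DataSchema variable and study-specific assessment items can be made concrete with a minimal sketch. The field names, question wordings and the `current_smoker` variable below are purely hypothetical illustrations, not part of any actual DataSchema: two studies ask different questions, yet each can apply its own processing algorithm to generate an inferentially equivalent variable.

```python
# Hypothetical sketch: two studies collect smoking status with different
# assessment items, but both can generate the same harmonized variable
# 'current smoker (yes/no)'. Field names and mappings are illustrative only.

def study_a_to_current_smoker(record):
    # Study A asks directly: "Do you currently smoke?" (yes/no)
    return record["smokes_now"] == "yes"

def study_b_to_current_smoker(record):
    # Study B asks frequency: "How many cigarettes do you smoke per day?"
    # Any positive daily count is treated as a current smoker.
    return record["cigs_per_day"] > 0

# Both processing algorithms yield the same harmonized variable.
print(study_a_to_current_smoker({"smokes_now": "yes"}))  # True
print(study_b_to_current_smoker({"cigs_per_day": 0}))    # False
```

Whether such a mapping is acceptable is exactly what the formal evaluation and validation process must decide in each case.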
Access to the DataSHaPER application and content is open and free (http://www.datashaper.org). The public website presents the published DataSchemas and offers links to their ontology files. However, to access DataSchemas under development, as well as results generated by pre-existing Harmonization Platforms, users must meet specific criteria, be authenticated by the DataSHaPER Team and use a username and password.
A DataSchema is a hierarchical structure composed of variables nested within domains, themes and modules (Figure 1). Each DataSchema on the Platform is made up of variables that may be derived from: interview administration; health and risk-factor questionnaires; physical and cognitive measures; medical files; sample collection, handling, processing and banking; biochemical measures; registries (e.g. databases containing deaths, hospitalization episodes and environmental variables); and others. Variables may be of primary scientific interest in their own right or qualifying factors that contribute to the interpretation of other information of primary interest. A variable may be complete in itself [e.g. current smoker (yes/no) or measured weight] or it may derive from one or several others (e.g. body mass index).
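The module–theme–domain–variable nesting can be sketched as a simple data structure. The theme, domain and variable names below follow the 'household status' example described later in this article; the module label is an assumption added for illustration.

```python
# Illustrative nesting of a DataSchema fragment: variables sit inside
# domains, domains inside themes, themes inside modules. The module name
# is a hypothetical placeholder; the rest mirrors the article's example.
dataschema = {
    "module": "Health and risk-factor questionnaire",  # assumed label
    "themes": {
        "sociodemographic characteristics": {
            "household status": [
                "marital status (currently married; yes/no)",
                "living with a partner in a common household (yes/no)",
                "number of people living with the participant (number)",
            ],
        },
    },
}

# Count the variables under one theme.
theme = dataschema["themes"]["sociodemographic characteristics"]
n_vars = sum(len(variables) for variables in theme.values())
print(n_vars)  # 3
```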
The DataSchema platform on the DataSHaPER website contains a comprehensive description of available schemas. Each may include: a list of variables with their definitions and formats; links to relevant ontologies; and access to reference questions/questionnaires, indexes and operating procedures that have been selected58,68,69 or developed.70 Where possible, variables have been defined such that they can reliably be constructed from standard questionnaires and classifications (e.g. The International Physical Activity Questionnaire for physical activity58). Although a DataSchema aims primarily to provide a template for prospective and retrospective harmonization, it also provides a guide to help emerging projects select suitable assessment items and sample collection tools, even when data pooling is not planned.
The ‘Generic DataSchema’ is the first schema to have been developed under the DataSHaPER project. It is aimed at supporting the construction of general-purpose baseline questionnaires for use in large cohorts enrolling middle-aged participants. Its construction was a collaborative effort involving investigators from more than 25 international cohorts in 14 countries. Its structure and contents were determined at a series of international consensus workshops held over 2 years (2006–08) with iterative rounds of comments and feedback between meetings. The contents were chosen so as to provide a core data set with broad international applicability. Ethnic and cultural specificity was therefore minimized and the schema was chosen so as to be simple enough to encourage widespread use, yet comprehensive enough to support meaningful research. Detailed selection criteria for individual variables are listed in Box 1.
A fundamental aim was to restrict the Generic DataSchema to a limited number of variables identified as key by consensus.
The Generic DataSchema contains 3 modules, 13 themes, 45 domains and more than 180 variables. As an illustrative example of its content (Figure 1), the theme ‘sociodemographic characteristics’ contains the domain ‘household status’ (defined as a social unit comprised of one or more individuals living together in the same dwelling, all of whom need not be related) which in turn includes three variables: (i) ‘marital status (currently married; yes/no)’; (ii) ‘living with a partner in a common household (yes/no)’; and (iii) ‘number of people who live with the participant in the same household (number)’.
Early versions of the Generic DataSchema were used by several large population-based studies to help create their data-collection tools. These included the LifeLines71 (The Netherlands) and LifeGene72 (Sweden) Projects as well as the five cohorts in the CPT Project64 and the Canadian Longitudinal Study on Aging.73
It is the Harmonization Platform that enables a DataSchema to be used as a basis for harmonization in a specific scientific context. It provides a rigorous approach to a three-step process that entails: (i) the development of rules providing a formal assessment of the potential for each individual study to generate each of the variables in the DataSchema; (ii) the application of these rules to determine and tabulate the ability of each study to generate each variable, thereby identifying the information that ‘can’ be shared; (iii) where a variable can be constructed by a given study, the development and application of a processing algorithm enabling that study to generate the required variable in an appropriate form.
The compatibility of variables is formally assessed on a three-level scale of matching quality: ‘complete’, ‘partial’ or ‘impossible’ (e.g. see Table 1). This process is referred to as ‘pairing’. Rules generated for variable pairing are context specific and will vary according to each harmonization project. Rule creation and pairing are both systematic processes based on protocols involving iteration between domain experts, research assistants and a validation panel. The whole procedure is subject to appropriate quality assurance.
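Steps (i) and (ii) above can be sketched in miniature. The rule below loosely follows the 'measured weight' example discussed later in this article (measured weight gives a complete match, self-reported weight only a partial match); the study records and field names are hypothetical.

```python
from enum import Enum

# Three-level pairing scale used to assess each study/variable pair.
class Match(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    IMPOSSIBLE = "impossible"

# Hypothetical pairing rule for 'measured weight': a complete match
# requires weight measured at least once with a standard device;
# self-reported weight counts only as a partial match.
def pair_measured_weight(study):
    if study.get("weight_measured"):
        return Match.COMPLETE
    if study.get("weight_self_reported"):
        return Match.PARTIAL
    return Match.IMPOSSIBLE

# Step (ii): apply the rule across studies to tabulate what CAN be shared.
studies = {
    "Study A": {"weight_measured": True},
    "Study B": {"weight_self_reported": True},
    "Study C": {},
}
table = {name: pair_measured_weight(s) for name, s in studies.items()}
print(table["Study A"].value)  # complete
```

Step (iii), the processing algorithm that actually generates the variable in the required form, would then be written only for studies whose pairing status permits it.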
The first use of the Harmonization Platform was in association with the Generic DataSchema. Pairing rules were therefore developed for all the variables in that schema. As an illustrative example, Table 2 details the rules created for the variable ‘Current quantity of red wine consumed (number of glasses of red wine/week)’. Using such pairing rules, the potential to harmonize 50 large population-based studies (each including at least 10 000 healthy participants) has now been explored for ‘all’ variables in the DataSchema: additional studies joined the collaboration to enable this formal evaluation to take place. In combination, these 50 collaborating studies have recruited or plan to recruit a total of approximately 5.4 million participants.
The detailed results of the full pairing analysis will form the basis of a second paper to follow. For the purposes of the present conceptual article, we will therefore do no more than provide a brief illustration of the nature of the results to be anticipated. For example, using the specific variable considered in Table 2 (‘Average number of glasses of red wine consumed by the participant per week’), 7 (14%) of the 50 studies generated a complete match, 3 (6%) a partial match and 38 (76%) an impossible match. In the particular case being considered, therefore, information from approximately 873 900 participants might potentially be co-analysed for the variable of interest; i.e. from those studies that provide a complete match. In contrast, when the variable ‘Current quantity of wine consumed’ was considered (with no specification of red or white wine), 21 (42%) studies provided a complete match (1.8 million participants). As another example, when the variable ‘measured weight’ was investigated, 36 (72%) studies (3.6 million participants) provided a complete match. According to the pairing rules in this setting, in order that it might be considered a ‘complete match’, the weight of the participant had necessarily to be ‘measured’ at least once by a trained nurse/interviewer with a standard device. Where weight was ‘reported’ by the participant, it was viewed only as a ‘partial match’.
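The participant totals quoted above follow from simple arithmetic: only studies whose pairing status is 'complete' for the variable contribute to the poolable sample. A sketch, using invented study names and participant counts rather than the article's actual study-level data:

```python
# Sketch: only studies with a 'complete' pairing status for a given
# variable contribute participants to a pooled analysis of it.
# Study names and counts are illustrative, not the article's data.
studies = [
    {"name": "S1", "participants": 500_000, "red_wine": "complete"},
    {"name": "S2", "participants": 250_000, "red_wine": "partial"},
    {"name": "S3", "participants": 120_000, "red_wine": "complete"},
]

poolable = sum(s["participants"] for s in studies
               if s["red_wine"] == "complete")
print(poolable)  # 620000
```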
However, in order to answer a real scientific question, the pairing statuses of more than one variable must usually be considered simultaneously. For example, if harmonized information is required on ‘Current quantity of wine consumed’, ‘Body Mass Index’ and ‘Current Tobacco Smoker’, a total of 12 studies provide a complete match for all three variables (approximately 1 million participants). At the same time, additional issues must also be taken into account. These include ethico-legal constraints on access to data or samples, the compatibility of different study designs and protocols, and the distribution of missing data. Consideration of such issues is fundamental to scientific rigour in using the DataSHaPER.
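Selecting studies for a real analysis therefore amounts to an intersection over pairing statuses: a study qualifies only if it provides a complete match for every required variable. A sketch with invented status tables:

```python
# Sketch: a study is usable for a given analysis only if it provides a
# complete match for EVERY variable the analysis requires.
# Study names and statuses are illustrative.
status = {
    "S1": {"wine": "complete", "bmi": "complete", "smoker": "complete"},
    "S2": {"wine": "complete", "bmi": "partial",  "smoker": "complete"},
    "S3": {"wine": "complete", "bmi": "complete", "smoker": "complete"},
}

required = ["wine", "bmi", "smoker"]
usable = [name for name, statuses in status.items()
          if all(statuses[v] == "complete" for v in required)]
print(usable)  # ['S1', 'S3']
```

In practice this tabulation is only a first filter; as noted above, ethico-legal constraints, design compatibility and missing data must also be weighed before pooling.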
The DataSHaPER was originally launched under the P3G50,62 and PHOEBE63 initiatives in response to requests from the members of both consortia for guidelines to support the construction of questionnaires to facilitate prospective harmonization of large population-based studies. But the overall focus evolved, and rapidly subsumed the critical need for tools to support retrospective synthesis of information between existing/legacy studies. As the nascent project progressed, it became clear that one of the primary needs of the scientific community was to have access to comprehensive documentation of the potential to synthesize data across subgroups of studies. It was also recognized that such documentation needed to include descriptions of the procedures used to collect data and to target both generic and specialized data collected by studies using various designs. The network of DataSHaPER collaborators has therefore extended over time and now includes, e.g., scientists working in disease-oriented networks of studies such as Genecure (chronic kidney diseases).74 Clearly, the ongoing development of new DataSchemas and Harmonization Platforms will reflect the interests and needs of the scientific teams using and developing them. As illustrative examples, future DataSHaPERs may focus on particular conditions (e.g. stroke, type 2 diabetes), social and lifestyle factors (e.g. nutrition, environmental pollutants), or specific population subgroups (e.g. newborns, the elderly).
Documenting the potential to synthesize information across studies is critical and should foster collaboration, but it is only a step in the process leading to the final statistical analyses making use of synthesized data sets. In its recent development, the structure and web interface of the DataSHaPER is thus being consolidated in order to facilitate complementarity with other tools and approaches to harmonization, data access, processing, pooling and analysis (e.g. PhenX,49 dbGaP,36 DataSHIELD,75 OBiBa76 and SAIL77). It is access to such integrated suites of tools that will ultimately facilitate the generation of new scientific discoveries using large-scale synthesized data sets across networks of studies.
The question ‘What would constitute the ultimate proof of success or failure of the DataSHaPER approach?’ needs to be addressed. Such proof will necessarily accumulate over time, and will involve two fundamental elements: (i) ratification of the basic DataSHaPER approach; and (ii) confirmation of the quality of each individual DataSHaPER as it is developed and/or extended. An important indication of the former would be provided by the widespread use of our tools. However, the ultimate proof of principle will necessarily be based on the generation of replicable scientific findings by researchers using the approach. But, for such evidence to accumulate, it will be essential to assure the quality of each individual DataSHaPER (see Box 2). Even if the fundamental approach is sound, its success will depend critically on how individual DataSHaPERs are constructed and used. It seems likely that if consistency and quality are to be assured in the global development of the approach, it will be necessary for new DataSHaPERs to be formally endorsed by a central advisory team.
The novelty of the DataSHaPER is not in the scientific challenges or solutions being addressed and proposed: similar projects have been embarked upon before. However, the DataSHaPER provides access to useful tools (see Box 3) and has critical advantages. The approach is generic and flexible, and can be used both prospectively and retrospectively. Furthermore, the web interface can easily be updated as new DataSchemas and Harmonization Platforms are added and thus offers strong potential for continual improvement of the content. Finally, the DataSHaPER has emerged as a common approach to the concrete need to document the potential to synthesize data across biobanks and cohort studies. However, the scientific utility of any synthesized data set depends on the quality of the data to be pooled and on the rigour of the harmonization and synthesis process. The DataSHaPER can make a valuable contribution. However, if it is to be successful, it must continue to evolve and it must be used both widely and wisely.
[Box 3 lists DataSHaPER tools for emerging studies, and for networks of studies to be prospectively or retrospectively harmonized.]
Genome Canada and Genome Quebec (The Public Population Project in Genomics); Canadian Partnership Against Cancer (CPT); European FP6 (LSHG-CT-2006-518418 to Promoting Harmonization of Epidemiological Biobanks in Europe); Medical Research Council Project Grant (G0601625; methods programme in genetic epidemiology at the University of Leicester that focuses on genetic statistics and large-scale data harmonization and pooling); Wellcome Trust Supplementary Grant (086160/Z/08/A); Leverhulme Trust Research Fellowship (RF/9/RFG/2009/0062); National Institute for Health Research (Leicester Biomedical Research Unit in Cardiovascular Science); German Federal Ministry of Education and Research (BMBF) in the context of the German National Genome Research Network (NGFN-2 and NGFN-plus) (to E.W.); German Federal Ministry of Education and Research (BMBF) (Model attempt for networking in German research consortia: development of a common concept for biobanks); European Framework 7 (Biobanking and Biomolecular Resources Research Infrastructure); J.L. is a Canada Research Chair in Human Genome Epidemiology.
We would like to thank all of the additional studies and biobanking experts who provided advice and information on the development of the DataSHaPER, and who are now part of the ongoing collaboration that is taking the DataSHaPER project forward. The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript.
Conflict of interest: None declared.