Scientists are taking advantage of the Internet and collaborative web technology to accelerate discovery in a massively connected, participative environment, a phenomenon referred to by some as Science 2.0. As a new way of doing science, this phenomenon has the potential to push science forward more efficiently than was previously possible. The Grid-Enabled Measures (GEM) database has been conceptualized as an instantiation of Science 2.0 principles by the National Cancer Institute with two overarching goals: (1) Promote the use of standardized measures, which are tied to theoretically based constructs; and (2) Facilitate the ability to share harmonized data resulting from the use of standardized measures. This is done by creating an online venue connected to the Cancer Biomedical Informatics Grid (caBIG®) where a virtual community of researchers can collaborate and come to consensus on measures by rating, commenting on, and viewing meta-data about the measures and associated constructs. This paper will describe the web 2.0 principles on which the GEM database is based, describe its functionality, and discuss some of the important issues involved with creating the GEM database, such as the role of mutually agreed-on ontologies (i.e., knowledge categories and the relationships among these categories) for data sharing.
The digital and information age has transformed how we access information and interact with others, a transformation that has implications for the conduct of health science. For example, the NIH was one of the first scientific agencies to use networked information technologies (a prototype cyberinfrastructure) to coordinate input from literally hundreds of laboratories from around the world to document the 3 billion+ base pairs comprising the human genome.1 This technology-mediated effort virtually defined the concept of “team science”2 in an era of distributed computing. Likewise, current efforts to extract research value from the terabytes of data existing within electronic medical records hold the promise of reducing healthcare costs through comparative effectiveness studies,3 reducing health disparities by informing policy decisions at the systems level,4 and accelerating discovery by closing the gap in translational research.5–6
A limiting factor in taking full advantage of these networked information technologies, however, has been the organizational challenge of deriving agreement from scientific communities on the common terms, measurements, and data elements that will make up the content and structure of the interconnected data systems.7–9 This paper describes one solution to that problem: the GEM database. Whereas other systems have used consensus panels7 and psychometric analytic techniques10 to select common measures for very specific purposes, the GEM database is distinct in that it uses “web 2.0” functionality to solicit, comment on, vet, and select measures from the behavioral and population science communities in open and transparent ways. It is an example of what some National Science Foundation (NSF) grantees have referred to as “Science 2.0.”11–12
In 2004, computer publishing entrepreneur Tim O’Reilly hosted a conference of information technology specialists to identify characteristics of successful websites that seemed to be thriving in spite of the financial downturn associated with the “dot-com” implosion. He referred to this new generation of the web as “web 2.0” to provide a sense of “lessons learned” along with a forecast of rising trends. Among the most notable changes he foresaw in the emerging online ecosystem was a movement away from passive information dissemination (a hallmark of the mass media age) to active user participation. Wikis and blogs were beginning to dominate the Internet, as were sites for user-generated content such as YouTube®, Flickr®, MySpace®, Delicious® and later Facebook®, Foursquare®, and Twitter®. Commercial vendors began to incorporate usage and other types of data into user-friendly evaluations of their products, and the Internet community began to value the notion of “collective intelligence” as valid input when making decisions.
These changes in the social ecology of online information systems did not escape the notice of informatics researchers and health scientists. Health information seeking and active participation in health-related decision making rose substantially with diffusion of the web during the first decade of the new millennium.13–15 Frustrated by the slow movement of the traditional medical community, some patients began joining online participative sites, such as the site “patientslikeme.com,” in which they could freely share data on their own personal recoveries with others with similar diagnoses in an online environment.16 Advocacy groups began encouraging their members to participate actively in their own healthcare and to engage proactively in researchers’ efforts to recruit for clinical trials.17 Because of these noticeable changes, the Silicon Valley–based Institute for the Future predicted a trend toward “Open Health” in which biomedical science, healthcare practice, personal responsibility, and community advocacy merged.18 Gunther Eysenbach, editor of the online Journal of Medical Internet Research, concluded that medicine and health were moving toward a paradigm of “Medicine 2.0” or “Health 2.0” as patients and practitioners began to engage collectively in the new environs of web 2.0.19–20
Likewise, the nation’s science institutions began reflecting on how the change in Internet culture might influence the conduct of science itself. In spring of 2010, the NSF convened a series of technical meetings to explore the phenomenon of “Technology Mediated Social Participation.”21 “The World Wide Web is the ultimate open, global, collaborative communication medium,” explained a representative from Macmillan Publishers in summarizing implications of the meetings, and “science is the ultimate open, global, and collaborative human endeavor.”22 Nevertheless, the culture of science often moves more slowly than the technology it helped create. The change is occurring; it is just not happening as quickly as we may like.22
Still, the changes that are occurring are impressive. In 2010 the National Aeronautics and Space Administration demonstrated how it could effectively take advantage of crowd-sourcing (i.e., engaging a full community through the web to solve a particular problem) to solve some of the more elusive computational challenges posed by long-range space flight.23 Examples of “participatory research” sponsored by the NSF have illustrated how individuals equipped with cell phones could contribute in parallel to document environmental health hazards.24 Based on the potential of collaborative technologies to improve efficiencies in science, the President’s Council of Advisors on Science and Technology directed the Federal government to invest “in new multi-agency NIT [Network Information Technologies] initiatives in areas of particular importance to our national priorities,” beginning with health and healthcare.25
The need to marshal collective efforts in addressing the challenges of modern technology is readily apparent at the NIH. After succeeding in documenting the full human genome in 2003, the NIH community faced the task of identifying the precise connections between the DNA-sequenced base pairs and predictions for disease process. This link is necessary for advances in combating diseases such as cancer, cardiovascular disease, and neurologic disorders.26 This “needle-in-a-haystack”27 problem can be solved only by developing the cyberinfrastructure needed to connect the research community in focused, productive ways.28–29
To address the problem of data coordination at the National Cancer Institute, the largest institute within the NIH, biomedical informaticians embarked on an effort to create the next generation of computing infrastructure. Termed the “cancer Biomedical Informatics Grid,” or caBIG®, the purpose of this new cyberinfrastructure was to connect the cancer research community in secure, interoperable, standardized, and semantically cohesive ways.30–31 This new platform for cancer-related research would allow scientists to share protocols, operate in a harmonized way with standard measures, upload and share data, and obtain secure access to a virtual lattice of interconnected data sets. The spirit of the project matches the charge given to the Secretary of the DHHS by the Committee on Vital and Health Statistics in 2001: to “connect the dots” in biomedical and public health research.32–33 Through its connection to caBIG®, the GEM database is a tool to help connect these dots and increase scientific productivity.
The purpose of the GEM database is to serve as a portal for health scientists who wish to take advantage of the benefits of Science 2.0 to accelerate scientific discovery. The GEM database has two overarching goals: (1) Promote use of standardized measures which are tied to theoretically based constructs; and (2) Facilitate sharing of harmonized data resulting from the use of standardized measures. Although the process by which the GEM database achieves these goals is unique, the overall objectives are commensurate with other efforts such as consensus measures for phenotypes and eXposures (PhenX), Patient-Reported Outcomes Measurement Information System (PROMIS), and the NIH Toolbox (see The Changing Landscape at the end of this manuscript for more information).
The GEM database originated from strategic efforts to bring population science into the collective domain of Grid computing—that is, using an Internet-based research workspace—as supported by caBIG®. Admittedly, the standard classifications guiding much of molecular medicine had been determined in the years preceding the massive infrastructure undertaking. Therefore, bringing the components of molecular medicine into semantic agreement was primarily a matter of assuring compliance with existing vocabularies and ontologies.
Bringing the components of population science into harmonization, however, was more challenging. By its nature, population science is made up of multiple scientific disciplines aimed at different levels of analysis in solving health-related problems.34 Epidemiologists traditionally look at variables distributed across different contextual environments in order to identify risk factors for disease processes and to determine optimal treatment approaches in clinical or public health practice. Behavioral researchers often look “intrapsychically” to determine how cognitive processes, emotional responses, and situationally based knowledge may interact in the prevention or control of disease. Until recently, these disciplines felt little urgency to agree on and share standardized measures. The task is daunting and few incentives have been in place.35
The landscape, however, is beginning to change, especially for clinical researchers and those with a prevention focus. A good example is the ‘meaningful-use’ requirement for determining eligibility for financial incentives as mandated by Congress in the Health Information Technology for Economic and Clinical Health Act of 2009.36–37 Increasingly, standardized measures within the EMR will be used to assess patients’ prevention-related outcomes (smoking status, diet, nutrition, screening).38 Selecting the right measures, those that are both valid and practicable, will be a necessary first step in creating a new healthcare system that can be used to nudge healthier behavior by assessing system outcomes.39
The GEM database was designed to facilitate these standardization and sharing processes across scientific disciplines. Necessarily, the GEM database was built to be flexible enough to accommodate different types of measures (e.g., from self-report to biological). The system can accommodate independent and dependent variables across the health continuum, from prevention through diagnosis, treatment, and end-of-life issues, regardless of disease/wellness focus. Using principles of Science 2.0, GEM solicits community participation in contributing, vetting, and selecting measures for harmonizing data in a grid-enabled world. More importantly, it is a doorway, or portal, to the use of harmonized measures and creation of a more semantically integrated science.
As a result, the GEM database creates an environment for “prospective meta-analyses” in which research is designed for integration. Comparisons across studies can become the norm, rather than the result of the current, often awkward, retrofitting process. The purpose and structure of the GEM approach can be seen from the design of the online tool’s primary interface. The GEM database interface uses a tab-based architecture with three main tabs: constructs, measures, and data sets. Briefly, users can search for or add information about theoretically based constructs (including meta-data such as the definition and synonyms), find linked measures in a one-to-many relationship (along with associated meta-data such as author, reliability, and validity) and search for and share harmonized data that use any common measures (see Figure 1).
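The relationships among the three tabs can be pictured with a small data model. The Python sketch below is illustrative only: the class names, field names, and search function are assumptions made for this example and do not describe the GEM database's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Measure:
    """A measure and a few of its meta-data attributes."""
    name: str
    author: str = ""
    reliability: str = ""
    validity: str = ""


@dataclass
class Construct:
    """A theoretically based construct linked to many measures."""
    name: str
    definition: str = ""
    synonyms: List[str] = field(default_factory=list)
    measures: List[Measure] = field(default_factory=list)  # one-to-many


@dataclass
class DataSet:
    """A shared data set, described by the measures it used."""
    title: str
    measure_names: Set[str] = field(default_factory=set)


def find_constructs(constructs, term):
    """Return constructs whose name or any synonym equals the search term."""
    wanted = term.lower()
    return [
        c for c in constructs
        if wanted == c.name.lower()
        or wanted in [s.lower() for s in c.synonyms]
    ]
```

A search by synonym returns the construct, and the measures linked to it are then available through its `measures` list.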
The GEM database is based on four primary design principles: (1) Architecture for participation—barriers to participation (or collaboration, in the scientific realm) are removed—that is, it is easy to “plug and play” (think of how easy it is to upload videos to YouTube); (2) Data-driven decision making—decisions, whether to buy a stereo or choose a certain doctor, hospital, or self-report measure are based on objective data; (3) Wisdom of the masses—under certain circumstances (i.e., decentralization, diversity of opinion), the masses can make more intelligent decisions than an individual expert40; (4) Open access—the ability to access or manipulate data and make results available to those who need and can use it in a functional manner.
As its name implies, the GEM database is integrated with the caBIG® grid-computing infrastructure and its resources. caBIG® and GEM intersect in many ways. First, and most directly, the meta-data available in the GEM database (i.e., the information about the constructs, measures and data sets) are published through a publicly available caGrid data service. This data service was implemented to be fully compatible with the caBIG®/caGrid integrated framework (“the Grid”). This means the GEM meta-data can be used by other data systems and/or analytic web services connected to the Grid using a well defined architecture. While the caGrid object-oriented architecture ensures interoperability between systems, reconciling definitions of terms ensures semantic interoperability through a common vocabulary and concept mapping. Therefore, a system that understands the concepts associated with behavioral science measures and constructs can use the meta-data available through the GEM database service for its own purposes. This flexibility could help facilitate data sharing between systems that catalog scientific measures and constructs, among other things.
In addition to publishing its meta-data through a data service, the GEM database allows users to provide meta-data about research data sets (e.g., description, dates of data collection, sample characteristics) available as data services on the Grid. To work successfully, these data sets must be available as caGrid-compatible data services and must use at least one of the measures documented in the GEM database. The meta-data about each data set includes the information required to access the associated Grid data service. With the addition of this functionality, researchers can now find measures of interest and then immediately find data sets that include these measures. At a minimum, this function allows researchers to review the type of data produced by a measure of interest. Depending on a researcher’s objectives and technical capabilities, a query could then be executed to bring together data from several data services all using common constructs and associated measures. This interoperability could open the door to new data-sharing opportunities within the scientific community. Very likely, novel statistical techniques will need to be applied. Ultimately this approach to common data collection and resulting sharing has the potential to accelerate the accumulation of knowledge.
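The lookup described in this paragraph (find a measure, then find the data sets that include it) can be sketched as follows. This is a minimal illustration, not the caGrid query interface: the catalog structure, the study titles, the measure names, and the `example.org` service addresses are all placeholders invented for the example.

```python
from collections import defaultdict

# Hypothetical catalog of data set meta-data. Every value here is a
# placeholder, not a real GEM record or Grid service address.
CATALOG = [
    {"title": "Study 1", "measures": {"Measure A", "Measure B"},
     "service": "https://example.org/grid/study1"},
    {"title": "Study 2", "measures": {"Measure A"},
     "service": "https://example.org/grid/study2"},
    {"title": "Study 3", "measures": {"Measure C"},
     "service": "https://example.org/grid/study3"},
]


def datasets_using(measure, catalog):
    """Return the catalog entries whose data set includes the measure."""
    return [entry for entry in catalog if measure in entry["measures"]]


def measures_shared_across(catalog, minimum=2):
    """Map each measure used by at least `minimum` data sets to their titles."""
    index = defaultdict(list)
    for entry in catalog:
        for measure in sorted(entry["measures"]):
            index[measure].append(entry["title"])
    return {m: titles for m, titles in index.items() if len(titles) >= minimum}
```

Measures returned by `measures_shared_across` are the candidates for a combined query, because more than one data service reports data collected with them.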
From its inception, one of the core goals of GEM has been data sharing. As discussed previously, the GEM database shares meta-data via a public data service on the Grid and promotes the sharing of data through its “Datasets” tab. However, true data sharing is reciprocated. That is, the system shares its information but also receives information from outside sources. Since there are outside database systems that contain meta-data about measures and constructs, the creators of the GEM database wanted to make sure it could benefit from these existing repositories without requiring duplicate data entry. In the caBIG® world, the ideal solution would be to have the GEM database import these meta-data using another data service on the Grid. However, at this time, most outside sources of meta-data for measures and constructs are not connected to the Grid. These sources are primarily proprietary data belonging to individual investigators. Therefore, a file-based approach to importing data into the GEM database needed to be developed.
During the initial deployment phase, the GEM database supported the importation of information about constructs and measures via Microsoft Excel spreadsheets. Standardized import template spreadsheets were developed for both constructs and measures. Institutions wishing to contribute meta-data from an existing database used these templates to create import spreadsheets. These spreadsheets were then processed and the contents imported into the GEM database. Although the GEM database team successfully imported meta-data from several outside institutions using these Excel spreadsheets, automating the import process was challenging. The majority of the barriers to automation were related to handling data values that were not consistent with the database’s internal checks and balances. A future update will address these issues using XML (Extensible Markup Language), a markup language for encoding documents in machine-readable form.
In the simplest terms, a wiki is an open, collaborative website where anyone can contribute (for more information, see http://en.wikipedia.org/wiki/Wiki). It is in this spirit of enabling community-driven science that the GEM database was developed. It provides a means by which any registered user within the research community can contribute by adding meta-data about constructs, measures and research data. The contributor does not need to be the author of a construct or measure in order to document it in the GEM database. Any registered user can make additions or changes to any meta-data in the database, but not anonymously. To ensure a sense of accountability, the author of each addition or edit is identified.
Just as with other Wiki-like systems, the GEM database is dependent on the community of users to monitor the accuracy and validity of the information being contributed. Information found to be erroneous or suspect by the community can be discussed and subsequently corrected by the community itself. As a safeguard, the GEM database keeps a full audit trail so that the “who, what and when” associated with each update is always available to every user. Despite these safeguards, it may seem to those not familiar with wikis that having a site where anyone who has an authorized login may contribute and edit existing information would lead to chaos and inaccurate entries. However, a comparison published in Nature found that the accuracy of Wikipedia's science coverage was on par with that of the Encyclopedia Britannica.41
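The “who, what and when” audit trail, together with the rule that edits cannot be anonymous, can be sketched in a few lines of Python. The class and attribute names below are hypothetical and chosen for illustration; the source does not describe how the GEM database stores its audit records.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    user: str            # who made the change
    attribute: str       # what was changed
    old_value: str
    new_value: str
    timestamp: datetime  # when it was changed


class AuditedRecord:
    """A meta-data record that logs every edit (illustrative only)."""

    def __init__(self, **attributes):
        self._attributes = dict(attributes)
        self.history = []  # full audit trail, oldest entry first

    def get(self, attribute):
        return self._attributes[attribute]

    def update(self, user, attribute, new_value):
        # Reject anonymous edits so every change has a known author.
        if not user:
            raise ValueError("anonymous edits are not allowed")
        entry = AuditEntry(
            user=user,
            attribute=attribute,
            old_value=self._attributes.get(attribute, ""),
            new_value=new_value,
            timestamp=datetime.now(timezone.utc),
        )
        self._attributes[attribute] = new_value
        self.history.append(entry)
```

Because each entry keeps the previous value, a community member who spots a suspect edit can see exactly what was replaced and by whom.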
Although the GEM database promotes Wiki-like open authorship and editing of meta-data, it is more structured than a traditional Wiki. For example, all contributed information must be entered into a fixed set of attributes instead of an open narrative format. These attributes are specific to constructs, measures or data sets. The attributes associated with measures include name, description, construct measured, target population, validity, and reliability among others. Also, some of these attributes will be required before a construct or measure can be submitted for public review. This process ensures a minimum level of completeness before the content is visible to the larger research community.
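A completeness check of this kind can be sketched as below. The source says only that some attributes are required before public review; the particular set of required attributes used here is an assumption made for illustration.

```python
# Assumed set of required attributes; the source does not say which
# attributes the GEM database actually requires.
REQUIRED_MEASURE_ATTRIBUTES = ("name", "description", "construct")


def missing_attributes(record, required=REQUIRED_MEASURE_ATTRIBUTES):
    """Return the required attributes that are absent or blank."""
    return [
        attribute for attribute in required
        if not str(record.get(attribute, "")).strip()
    ]


def can_submit(record):
    """A record may go to public review only when nothing required is missing."""
    return not missing_attributes(record)
```

Optional attributes such as target population, validity, and reliability can still be added later by any registered user.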
An important aspect of community-driven websites is communication among various members of the community. The ability to provide feedback and comment on important concepts gives participants a voice and encourages them to fully engage in the project. The GEM database allows for community feedback in two ways. First, users are able to post comments about any aspect of the constructs, measures, and data sets. These comments are visible at the bottom of each detail page. Commenting can be used to provide feedback to content authors and potential users of measures and data sets. Second, the GEM database encourages researchers to rate measures using a 5-star rating system. The scale has no pre-specified meaning and instead assesses users’ evaluative gestalt of the measure. Users can always change their previous ratings, but each user can have only one active rating for a given measure, so there is less concern that someone may ‘pad’ the rating of their favorite measure. An average rating for each measure is calculated and made available through both the GEM database interface and the caGrid data service. This rating system allows a community of researchers to integrate data in the process of identifying measures that should become the standard.
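The one-active-rating rule can be illustrated with a short Python sketch. The class and method names are hypothetical; the only behavior taken from the text is that ratings run from 1 to 5 stars, that a new rating from the same user replaces the old one, and that an average is computed per measure.

```python
class MeasureRatings:
    """Stores at most one active star rating per user and measure."""

    def __init__(self):
        self._ratings = {}  # (user, measure) -> stars

    def rate(self, user, measure, stars):
        if stars not in range(1, 6):
            raise ValueError("rating must be a whole number from 1 to 5")
        # Re-rating overwrites the earlier value, so a user cannot
        # inflate an average by voting repeatedly.
        self._ratings[(user, measure)] = stars

    def average(self, measure):
        """Return the mean rating for a measure, or None if unrated."""
        values = [s for (_, m), s in self._ratings.items() if m == measure]
        return sum(values) / len(values) if values else None
```

Keying the store on the (user, measure) pair is what enforces the rule: changing a rating updates one entry rather than adding a second vote.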
Identification of one measure as better than others and a decision to promote for standard use should be data-driven. In the GEM database, users are given data to help with this process. These data include subjective outcomes such as averages and distributions of ratings, objective measures such as number of times a measure has been downloaded from the database, and more traditional psychometric information such as a measure’s reliability and validity.
Though crowd sourcing has its detractors and even proponents acknowledge that it doesn’t work effectively under all circumstances, the characteristics of the GEM database and its user community suggest that the conditions for its use are optimal40: (1) Diversity of opinion: the users of the database will likely come from a variety of academic disciplines and have different scientific interests; (2) Independence of members: users will not be collectively tied to any particular group or a particular way of seeing the world; (3) Decentralization: final decisions are not made by one or a few experts (top-down) but by the entire community (bottom-up); (4) A good method for aggregating opinions: all input from all users can be found, summarized, and utilized by the community of users to make decisions. Although these factors do not guarantee the best measures will be identified, the circumstances are very supportive of that outcome, far more so than reliance on dissemination of findings by peer-reviewed publication alone. And, as the number of users increases, their collective wisdom becomes more likely to advance the field efficiently.
Open access is an essential component of the philosophy underlying the GEM database. Users can easily find meta-data about measures and data sets. The measures themselves can be downloaded easily (where publicly available) and data can be identified and accessed in several different formats that should facilitate data sharing. However, the move toward open collaboration works only insofar as those working together are willing to share information or data and are willing to subordinate their own concerns to a collective benefit. To achieve this openness, the scientific community must break down the barriers, real or perceived, that often hinder open sharing of information.42 Concerns about sharing data such as confidentiality and HIPAA requirements need to be addressed. Some of these concerns can be addressed relatively easily by removing or obscuring any identifying information.
Other concerns, such as sharing data that scientists may consider their property or fears of others trumping their results, may take longer to overcome. These barriers are particularly strong in academic centers where people compete for limited resources and promotions. Some of these barriers can be overcome only by embracing a new collaborative philosophy.43
To meet the goals of the GEM database, common understanding of terms is essential. An ontology is a way of representing a knowledge domain to enable “a shared and common understanding that can be communicated between people and heterogeneous and distributed application systems.”44 Development of a useful ontology requires that the terms are defined with sufficient context so that if two people, or computer systems, use the same terms, there is complete and correct communication between them. A good example is the sharing of information through electronic medical records discussed earlier, where two record systems need to be measuring the same construct in the same way with complete understanding of terms. Differences in the definition of terms used can make the goal of merging across data systems untenable.
There currently exist many efforts to reconcile definitions of terms such as SNOMED CT® (Systematized Nomenclature of Medicine—Clinical Terms) and LOINC® (Logical Observation Identifiers Names and Codes). These systems can potentially interact with the GEM database to further facilitate sharing of information. For example, if there were a relationship between a measure in the GEM database and a LOINC code that describes a term, the reference to the LOINC code could be a piece of meta-data for that measure in the GEM database. In essence, this would standardize terms across the two systems and would ensure the compatibility in terms. And by extension, any other data systems that use the LOINC codes could share information with the GEM database.
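The idea of carrying a LOINC code as one more piece of meta-data can be sketched as a simple crosswalk. Everything in the example is hypothetical: the `loinc_code` field name, the record layouts, and the code values, which are deliberately written as placeholders rather than real LOINC identifiers.

```python
def link_by_code(gem_measures, external_items):
    """Pair each GEM measure with external items that share its code.

    Measures without a code, or with a code no external item uses,
    map to an empty list.
    """
    labels_by_code = {}
    for item in external_items:
        labels_by_code.setdefault(item["code"], []).append(item["label"])
    return {
        measure["name"]: labels_by_code.get(measure.get("loinc_code"), [])
        for measure in gem_measures
    }


# Placeholder records; "PLACEHOLDER-1" stands in for a real code.
gem_measures = [
    {"name": "Measure A", "loinc_code": "PLACEHOLDER-1"},
    {"name": "Measure B"},  # no code recorded yet
]
external_items = [
    {"code": "PLACEHOLDER-1", "label": "External item X"},
    {"code": "PLACEHOLDER-2", "label": "External item Y"},
]
links = link_by_code(gem_measures, external_items)
```

Here `links` pairs "Measure A" with "External item X" and leaves "Measure B" unmatched, which shows why the shared code, rather than the free-text name, is what makes the two systems compatible.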
In the domain of behavioral research, an additional kind of misunderstanding can occur when social scientists and clinicians use the same words to mean different things. For example, social and behavioral scientists often define depression as a self-reported mood state assessed with a self-report measure such as the Beck Depression Inventory. Conversely, for the clinician, “depression” may refer to a clinical state that is evaluated by a medical professional according to the DSM-IV criteria.
The solution to the problem of how to integrate such heterogeneous data has been the development of Semantic web technologies that can accomplish this integration by utilizing the semantics of the data (context, provenance, relationships to other data) that are made explicit through the use of ontologies. One example of this is the Open Biological and Biomedical Ontologies (http://www.obofoundry.org/) which is a consortium of scientists with the goal of “…creating a suite of orthogonal interoperable reference ontologies in the biomedical domain.” In order for collaborators in behavioral research to move forward most efficiently, a similar ontology of the respective terms is necessary. Professional organizations such as the American Psychological Association have attempted to achieve this end through development of a thesaurus of terms but this approach has limited usefulness if it is not adopted by a wider range of users.
Using standardized measures and sharing harmonized data are useful only if they ultimately result in more-efficient science and improved outcomes. As a tool, the GEM database can facilitate this improved way of doing science through its ability to combine shared constructs and associated measures across independent data sets. Scientific discovery can advance more quickly by combining data to create a cumulative knowledge base (standing on the shoulders of giants), as opposed to the traditional way of doing science where scientists work independently and do not integrate information across studies. These data sets can then be analyzed using statistical techniques that have been termed integrative data analysis (IDA).45
Integrative data analysis is similar to meta-analysis in that both have the goal of combining information across studies. However, in meta-analysis, only summary statistics of empirical research (e.g., effect sizes) are analyzed, whereas in IDA the actual data are combined, with the merged data set being analyzed as a whole. Generally, IDA is more likely to be hypothesis-driven. The GEM database facilitates both IDA and the use of other exploratory methods, such as data mining, which use advanced computational methods to look for unseen patterns within a large data set. All of these techniques have been termed ‘discovery informatics’ and make for meaningful use of secondary data.46–47
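The difference between the two approaches can be shown with a toy example in Python. The numbers are invented for illustration, and the two estimators are deliberately simple: a fixed-effect, inverse-variance weighted mean stands in for meta-analysis, and a plain mean of the merged participant-level records stands in for IDA. Real integrative analyses typically also model differences between studies, which this sketch does not attempt.

```python
from statistics import mean, variance

# Invented participant-level scores on one common measure, by study.
STUDIES = {
    "study_a": [10.0, 12.0, 14.0],
    "study_b": [20.0, 22.0, 24.0, 26.0],
}


def fixed_effect_estimate(studies):
    """Meta-analysis style: use only each study's summary statistics."""
    weights, means = [], []
    for values in studies.values():
        squared_se = variance(values) / len(values)  # variance of the mean
        weights.append(1.0 / squared_se)             # inverse-variance weight
        means.append(mean(values))
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)


def pooled_records(studies):
    """IDA style: merge the raw records, keeping each study label."""
    return [(name, v) for name, values in studies.items() for v in values]


def pooled_mean(studies):
    """Analyze the merged data set as a whole."""
    return mean(v for _, v in pooled_records(studies))
```

The two functions generally give different answers for the same studies, because the first weights each study by the precision of its mean while the second weights every participant equally. Pooling is only meaningful when the studies used a common measure, which is the situation the GEM database is meant to encourage.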
Scientists, especially those in academia, typically do not have institutional incentives to agree on standard measures and to share data with other scientists. Academia tends to reward researchers for individual scholarly productivity, not those who work collaboratively and share resources. Historically, there has been little support to engage with others outside of one’s own discipline.48 But this individualism leads only to a continued fractured scientific landscape. Using agreed-on measures and sharing data will require a change in thinking about how science is best conducted and a turn toward a collective mindset.
One way to promote this change is to appeal to the scientists’ interest in having a voice in setting the agenda for their respective fields. Using the GEM database can help scientists influence their colleagues to use a particular measure. Data sharing requires adopting a “you’ve got to give to get” mentality. While most scientists would be willing to receive data from others to enhance scientific discovery, requiring them to share data first may push them to think differently and more creatively. Many legitimate concerns exist; examples include concerns about participant confidentiality and other misuses of data. Under some circumstances sharing data would be irresponsible or even illegal.
Institutional review boards may raise concerns about the ability to get informed consent or to identify respondents and will need reassurance that study participants will not be harmed in any way. For the majority of situations, however, concerns can be addressed. Often, what is needed is to change scientific cultural norms and expectations about collaboration49 in addition to providing procedures and processes to address legal and/or ethical issues. For instance, dbGaP, a website for accessing data from genomewide association studies (see: http://www.ncbi.nlm.nih.gov/gap), requires users to receive permission from a Data Access Committee before allowing access to restricted data. Those who submit data must provide a statement that they have met all applicable laws and regulations. Similarly, the GEM database will require users to follow existing data-sharing procedures as delineated by the caBIG® Data Sharing & Intellectual Capital Workspace (see: https://cabig.nci.nih.gov/working_groups/DSIC_SLWG/).
The GEM database is certainly not the first tool of its kind to promote standardizing measures and sharing data. In fact, numerous other examples of both exist within and outside of government. Projects such as PROMIS,50 PhenX (see Schad et al., this issue), and the NIH Toolbox (for the assessment of neurologic and behavioral functioning; see: http://www.nihtoolbox.org/default.aspx) have similar goals but use different processes. The GEM database tends to focus on the use of broader community input to drive consensus (more bottom-up) rather than relying on a selected number of experts (more top-down) to make decisions. Despite these practical differences, the rapid proliferation of data-sharing and standardization projects indicates a clear paradigm shift in science. People are discovering the tremendous advantages of reaching consensus by sharing, as opposed to hoarding, data. The field is moving away from an NIH (Not Invented Here) attitude.
The irony of the concurrent and independent development of several systems to bring consensus is not lost here. As a step toward ameliorating this situation, efforts are underway to import measures from some of these systems (e.g., PROMIS) into the GEM database. Continued effort is needed so that the lack of integration—and thus lack of interoperability—is lessened and not perpetuated.
The Grid-Enabled Measures database has the potential to help facilitate a new way of doing science that supports collaboration and sharing of information and data. This new approach, however, has many challenges and requires scientists to move from their comfort zone into a new scientific arena where the necessary incentives, both personal and systemwide, are not well understood or tested. Time will tell whether scientists are eager and able to meet these challenges in the name of advancing scientific discovery.
Publication of this article was supported by the National Institutes of Health.
No financial disclosures were reported by the authors of this paper.