We were guided by several principles as we began the design of our system, in September 1998. First, in all stages of the design and implementation we needed to be guided by the needs of our primary intended audience, patients and other members of the public. Second, the only way to successfully effect a project of this scope was to get broad agreement on a common set of data elements with a standard syntax and semantics. Third, since we needed to build a system as quickly as possible, it seemed clear to us that our requirements would evolve over time as we gained additional experience. Therefore, it would be important for us to design the system in a modular and extensible way. Finally, since this would be a long-term effort involving many individuals and organizations with varying backgrounds, technical expertise, and data sets, it would be necessary to implement the project in phases.
To reach the broadest audience with the fewest barriers to access, we designed a Web-based system that would be easy even for a novice user to use and yet would have extensive functionality. The goal was to make it simple for users to formulate their queries and then obtain results that would guide them to further relevant, “just-in-time” information. We involved patients and patient advocates in the early testing of the system. We also identified several readily available accessibility tools and tested our site with them, and we made sure that the system performs reasonably on a wide range of Web browsers.
We decided to begin the project by working with our colleagues at the National Institutes of Health (NIH), making NIH-sponsored trials available first. This first phase has involved working with 21 NIH institutes, each of which has had its own approach to data management and collection and its own level of technical expertise. Some institutes have a large number of ongoing clinical trials, while others have only a few. When we began the project, some institutes had well-established databases for managing their clinical trials, others had Web pages that described their trials but no back-end database support, and yet others were still managing their data in paper form. As a first step, we convened representatives from all 21 institutes and discussed and then agreed on a common set of data elements for the clinical trials data. Several groups at NIH as well as other groups in the clinical trials community had already given a good deal of thought to this, and their insights, together with the requirements of the law, allowed us to arrive at a common set of elements in the first few months of the project. We decided on just over a dozen required data elements and another dozen or so optional elements. The elements fall into several high-level categories: descriptive information, such as titles and summaries; recruitment information, which lets patients know whether it is still possible to enroll in a trial; location and contact information, which lets patients and their doctors discuss further details with the persons who are actually conducting the trials; administrative data, such as trial sponsors and identification (ID) numbers; and optional supplementary information, such as literature references and key words. lists the required and optional data elements for the system.
Required and Optional Data Elements in Current System
The study ID number is a unique number assigned by the data provider, which is critical for tracking the trial in the system. In some cases our data providers already had developed methods for assigning IDs to their trials. Those who did not have IDs have since developed and assigned them. In addition to the primary ID, there may be secondary IDs, such as NIH grant or contract numbers, and these are also accommodated. Once a trial record comes into our system, we assign it a number that functions much like a MEDLINE unique identifier. Its form is “NCT” followed by eight digits.
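The identifier format just described ("NCT" followed by eight digits) is simple enough to check mechanically. The following Python fragment is a minimal sketch of such a check; the function name `is_valid_nct_id` and the sample values are hypothetical and are not taken from the system itself.

```python
import re

# "NCT" followed by exactly eight ASCII digits, as described in the text.
# The pattern name and the function below are illustrative assumptions.
NCT_PATTERN = re.compile(r"NCT[0-9]{8}")


def is_valid_nct_id(candidate: str) -> bool:
    """Return True if the whole string is 'NCT' plus eight digits."""
    return NCT_PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    # Made-up sample values, used only to exercise the check.
    for value in ("NCT00000102", "NCT123", "nct00000102"):
        print(value, is_valid_nct_id(value))
```

A check of this kind tests only the form of the identifier; it says nothing about whether a record with that number actually exists.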
The study sponsor is the primary institute, agency, or organization responsible for conducting and funding the clinical study. There may be additional sponsors, and these may also be listed in the database. Investigator names are included at the discretion of the data provider. Study titles and summaries are important because they give a patient or other user of the system a quick indication of the purpose of the trial. We have asked our data providers to provide us with brief, readily understood titles and summaries. The summaries should provide background information, including why the study is being performed, what drugs or other interventions are being studied, which populations are being targeted, how participants are assigned to a treatment design, and what primary and secondary outcomes are being examined for change (e.g., tumor size, weight gain, quality of life). More detailed descriptions may be provided, and these are often somewhat more technical descriptions of the clinical study intended for health professionals.
Location information includes geographic locations, contact information, and status of a clinical trial at a specific location. Many trials are being conducted at multiple locations, sometimes dozens of sites. It is important that the contact information and recruitment status for all sites be accurate and current. We have established six categories into which the recruitment status might fall—not yet recruiting (the investigators have designed the study but are not yet ready to recruit patients); recruiting (the study is ready to begin and is actively recruiting and enrolling subjects); no longer recruiting (the study is under way and has completed its recruiting and enrollment phase); completed (the study has ended, and the results have been determined); suspended (the study has stopped recruiting or enrolling subjects, but may resume recruiting); and terminated (the study has stopped enrolling subjects and there is no potential to resume recruiting). Sometimes information about the exact start and completion dates of the study is available and, if so, it is added to the status information. Contact information needs to be provided for each trial and includes the name of a contact person and a telephone number for further inquiries. For large multicenter trials, a single coordinating center may handle and then refer the calls.
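The six recruitment status categories form a small closed vocabulary, which could be represented as an enumeration. The Python sketch below is illustrative only: the class name, the member names, and the helper `may_still_enroll` are assumptions, and the helper encodes one possible reading of the definitions above (a site that is not yet recruiting, recruiting, or suspended might enroll participants now or later).

```python
from enum import Enum


class RecruitmentStatus(Enum):
    """The six status categories listed in the text (names are illustrative)."""
    NOT_YET_RECRUITING = "not yet recruiting"
    RECRUITING = "recruiting"
    NO_LONGER_RECRUITING = "no longer recruiting"
    COMPLETED = "completed"
    SUSPENDED = "suspended"
    TERMINATED = "terminated"


# Assumption: these are the statuses under which enrollment is open now
# or could open or resume later, according to the definitions above.
_POSSIBLY_ENROLLING = {
    RecruitmentStatus.NOT_YET_RECRUITING,
    RecruitmentStatus.RECRUITING,
    RecruitmentStatus.SUSPENDED,
}


def may_still_enroll(status: RecruitmentStatus) -> bool:
    """Return True if a site with this status might enroll participants."""
    return status in _POSSIBLY_ENROLLING


if __name__ == "__main__":
    for status in RecruitmentStatus:
        print(f"{status.value}: {may_still_enroll(status)}")
```

Because the status is recorded per location, a multicenter trial would carry one such value for each site rather than a single value for the whole study.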
Eligibility criteria are the conditions that an individual must meet to participate in a clinical study. Both inclusion and exclusion criteria are often relevant. For example, patients who enroll in the study must have a specific disease, may need to be in a certain age range (e.g., under 3 months or over 65 years old), and may need to have already undergone a specific therapy regimen, such as chemotherapy. Exclusion criteria are those conditions that may prevent an individual from participating in a clinical study. For example, in a study involving women, perhaps a participant cannot be pregnant or nursing. In other types of studies, a participant cannot, perhaps, have a history of heart disease.
While many clinical trials are designed to investigate new therapies, there are several other study types as well. We have categorized these into nine types—diagnostic, genetic, monitoring, natural history, prevention, screening, supportive care, training, and treatment. Study design types include the familiar randomized control trial as well as others whose usage and frequency we are in the process of reviewing. The current list includes terms for clinical trial and observational study designs as well as methods (e.g., double-blind method) and other descriptors (e.g., multi-center site).
We have required certain items as separate data elements specifically to ensure optimal search capabilities. These include the study phase, the condition under study, and the intervention being tested. The phase of the study is important information for patients who are considering enrolling in a particular trial. Phase I trials are the most preliminary and include the initial introduction of an investigational new drug into human use. Phase II trials include studies conducted to evaluate the effectiveness of drugs for particular indications and to determine common short-term side effects and risks. Phase III trials generally involve large numbers of patients and are performed after preliminary evidence suggesting effectiveness of a treatment has been obtained. Phase IV studies are generally post-market studies that seek to gain additional information about a drug's risks, benefits, and use. We have requested that data providers name the condition and intervention being studied using the Medical Subject Headings (MeSH) of the Unified Medical Language System (UMLS), if at all possible. Sometimes, of course, the investigational drug is too new to appear in MeSH, but in other cases the drug, procedure, or vaccine is already well established and the trial may be investigating new combinations of drugs or new uses of established procedures.
Some optional information that may be available for a particular study includes references for publications that either led to the design of a study or that report on the study results. In these cases, we have asked our data providers to provide us with a MEDLINE unique identifier (UI) so that we can link directly to a MEDLINE citation record. (In some cases, we have mapped the citations to UIs for our data providers.) A summary of the results can also be prepared specifically for inclusion in the database, and the use of MeSH keywords is also encouraged. Supplementary information may include URLs of Web sites related to the clinical trial. For example, a trial record on mild cognitive impairment, in addition to linking to NIH's National Institute on Aging, might also link to an Alzheimer's organization.
With the important step of agreeing on a common set of data elements completed by the end of 1998, we were able to devote the next six months to working with each institute individually on methods for receiving their data for inclusion in our centralized database at the National Library of Medicine (NLM). To move the project forward rapidly, we assisted those NIH data providers who had little technical infrastructure in a variety of ways. We developed a Web-based data entry system and offered it to anyone who preferred using it to developing a system of their own. If this system is used, the control of the data still resides with the institute, but we manage the process for them. In other cases, we assisted groups who already had databases by helping them redesign aspects of their databases for the purposes of this project or by writing scripts that would extract data from their databases and prepare them in the standard format. Some institutes were able to provide the data with minimal assistance from us, although some iteration was generally necessary before the data could be fully validated.
In all cases, data are sent to us in extensible markup language (XML) format according to a document type definition (DTD) that we have created. XML has been developed to address some of the deficiencies of HTML and at the same time provide a more streamlined version of SGML for use in Web applications.18,19
Its use has several advantages in our application. It is a standard, structured language that can be readily understood by both computers and people, and it provides a simple, verifiable method of exchanging data regardless of the underlying system that may have produced those data. Therefore, data providers are free to use whatever technology they prefer and can change their database and Web site designs at their discretion. The only requirement is that they capture the required data elements and that they can produce a report in the specified XML format. A portion of our DTD is shown in .
Portion of document type definition (DTD) in current system.
A study collection consists of one or more clinical study records, which themselves consist of a number of required and optional data elements, as described earlier. In the segment shown in , the study ID is required, but additional administrative numbers, such as an NIH grant number, are optional. Titles and summaries consist of free text (textblocks), while dates need to adhere to a standard date format. Notice that, for intervention names (and for disease names, although this is not shown in this portion of the DTD), we allow not just a single name, but also any available synonyms.
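To illustrate the structure described here, the Python sketch below parses a small study collection and reports which required elements are missing from each record. Every element name in it (`study_collection`, `clinical_study`, `study_id`, `secondary_id`, `brief_title`, `textblock`, `intervention`, `name`, `synonym`) is hypothetical and chosen only for the example; the names used in the actual DTD may differ. The standard-library parser used here checks only that the XML is well formed and does not validate against a DTD, so the required-element check is done by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the second record deliberately lacks a study ID.
SAMPLE = """<study_collection>
  <clinical_study>
    <study_id>EXAMPLE-001</study_id>
    <secondary_id>EXAMPLE-GRANT-1</secondary_id>
    <brief_title><textblock>Example trial title</textblock></brief_title>
    <intervention>
      <name>Example Drug</name>
      <synonym>Example Compound</synonym>
    </intervention>
  </clinical_study>
  <clinical_study>
    <brief_title><textblock>Record without an ID</textblock></brief_title>
  </clinical_study>
</study_collection>"""

# Assumed required children of each record; the real list is longer.
REQUIRED_ELEMENTS = ("study_id", "brief_title")


def missing_required(study: ET.Element) -> list:
    """Return the required element names absent from one study record."""
    return [tag for tag in REQUIRED_ELEMENTS if study.find(tag) is None]


def check_collection(xml_text: str) -> list:
    """Return one list of missing element names per study record."""
    root = ET.fromstring(xml_text)
    return [missing_required(s) for s in root.findall("clinical_study")]


if __name__ == "__main__":
    print(check_collection(SAMPLE))
```

In practice a validating XML parser would enforce such constraints directly from the DTD; the hand-written check above only mirrors the idea that some elements are required while others, such as secondary IDs and synonyms, are optional.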