The authors report on the development of the Cancer Tissue Information Extraction System (caTIES)—an application that supports collaborative tissue banking and text mining by leveraging existing natural language processing methods and algorithms, grid communication and security frameworks, and query visualization methods. The system fills an important need for text-derived clinical data in translational research such as tissue-banking and clinical trials. The design of caTIES addresses three critical issues for informatics support of translational research: (1) federation of research data sources derived from clinical systems; (2) expressive graphical interfaces for concept-based text mining; and (3) regulatory and security model for supporting multi-center collaborative research. Implementation of the system at several Cancer Centers across the country is creating a potential network of caTIES repositories that could provide millions of de-identified clinical reports to users. The system provides an end-to-end application of medical natural language processing to support multi-institutional translational research programs.
Translational research encompasses the dynamic cycle of laboratory studies, clinical studies and epidemiology in service of advancing clinical medicine. The development of informatics tools and infrastructure to support translational research has been the subject of several large-scale national projects.1–4 Translational sciences often require detailed clinical information, for example, to link molecular information to disease phenotype. Clinical expressions of disease, such as disease stage, disease severity, and response to treatment, also provide crucial information for case identification and correlative studies. Unfortunately, almost all clinical outcome information of this kind is stored as unstructured or semi-structured free-text rather than coded, structured data. Natural language processing (NLP) has been used by numerous investigators to code and extract information from clinical documents.5–7 Informatics tools that build on NLP methods are needed to support clinical and translational research within a multi-institutional environment. However, few systems of this kind currently exist.
Tissue specimens provide an extremely important resource for researchers and may be collected prospectively or retrospectively. Prospectively collected research specimens in tissue banks are usually only available in small numbers but may be highly annotated with manually extracted clinical information. In contrast, clinical remainders of tissues and fluids provide a much greater pool of possible translational research specimens, but are typically associated with few or no clinical annotations. Almost all information about these specimens must be derived from free-text clinical reports such as the surgical pathology report (SPR). Large volume archives of clinically derived tissues associated with information in the accompanying SPR could provide a rich resource for translational research, if the archive could be made searchable in a manner compliant with the Health Insurance Portability and Accountability Act (HIPAA).8
The caTIES system evolved from a text processing system that we originally developed for the Shared Pathology Informatics Network (SPIN), which proposed to develop a network of institutions sharing de-identified data and tissue through coded SPR.9 Although the vision of the SPIN network was not realized beyond a prototype linking the four contributing institutions, the goal of the project to enable translational research across institutions fostered foundational research in this area. The caTIES project continues this goal but extends the previous system by (1) integrating with the Cancer Biomedical Informatics Grid (caBIG)1 2 architecture, common object representation and controlled vocabulary, (2) providing graphical interfaces and methods for query based retrieval and selection of cases, and (3) implementing a regulatory policy for federated data and tissue sharing through an ‘honest broker’ mechanism.
Data sharing should use the federated model, enabling local authority over management of data. Data must be stripped of all 18 required patient identifiers, to ensure compliance with HIPAA ‘safe-harbor’ practices. To improve sensitivity and specificity of retrieval, documents must be preprocessed to create concept codes for present and absent diseases, pathologic findings, anatomic locations, surgical procedures and other important medical concepts.
The interface must balance the need for a simple and obvious concept-based search capability in most cases with greater expressivity in some cases. For some use cases, researchers must be able to find tissue and documents based on complex Boolean logic in addition to temporal relationships between documents.
Previous efforts towards development of an inter-institutional network of document archives for research purposes have used an open ‘airport’ model, requiring that institutions agree to provide data to all interested users across all institutions.9 In contrast, we considered it essential to (1) bind all requests for data to a local Institutional Review Board (IRB) protocol, and (2) enable institutions to decide to supply data to outside researchers on a study-by-study basis. Additionally, we sought to (3) provide sufficient policies and procedures regarding identity provisioning and auditing to mitigate the risk of sharing de-identified data, and (4) create a rigorous security infrastructure to promote trust among organizations.
The potential benefits of sharing data across organizations are even greater when researchers across organizations can work together to manage datasets of documents and tissues. The system should enable such virtual datasets unencumbered by organizational boundaries.
Informatics tools for translational research are typically deployed in resource limited environments. To ease adoption, customization, and long term maintenance, the system should be built on open-source frameworks, tools and algorithms wherever possible, and use freely available vocabularies for concept-coding.
The system should function within the context of caBIG to promote interoperability between caTIES and other cancer research systems.
caTIES is a suite of clients, services, and datastores connected by and implemented on caBIG architectural blueprints. The system establishes a set of caBIG services that sufficiently govern caTIES behavior. A caTIES service network may function autonomously or may connect to outside service subscribers, such as caBIG.
caTIES establishes a single logical data model sufficient to house all caTIES data (figure 1). At each datastore, some parts of the schema may remain unpopulated but the schema is deployed as a whole. caTIES uses three primary datastores: (1) the private datastore, (2) the research datastore, and (3) the Collaborative Tissue Resource Manager (CTRM) datastore (figure 2). Each organization hosts one private datastore and one research datastore. In the typical configuration, the private and research datastores reside on different machines. The caTIES network hosts a single publicly accessible CTRM for use by all organizations.
The private datastore is the recipient of data derived from clinical systems such as the Anatomic Pathology Laboratory Information System (AP-LIS). It contains identified free text as well as dates, patient medical record numbers and specimen accession numbers. It is only available for access by honest brokers within the organization hosting the specific private datastore.
The research datastore contains de-identified free text reports, along with other unrestricted information such as gender, and age if less than 90. The research datastore is also the target of the NLP Pipeline Service, which creates and stores conceptual annotations with each free-text report. The schema of this database includes the Consented High Performance Index and Retrieval of Pathology Specimens (CHIRPS) SPIN submission schema9 permitting interoperability between caTIES and SPIN.
The CTRM datastore manages the collaborative construction and manipulation of tissue studies. Researchers build tissue order sets and electronically interact with honest brokers at external organizations. Honest brokers are disinterested third-parties, who are responsible for determining availability of biospecimens, filling orders for biospecimens, and providing additional de-identified outcomes data.
caTIES uses hibernate object relational mapping technology, providing a flexible façade for multi-platform relational database management systems (RDBMS) access.
The data preparation phase runs as a series of operating system-based services that transform data from free-text documents stored in clinical systems to concept-annotated de-identified documents stored in the relational database. caTIES services run continually, release machine resources when not in use, and revive on machine restart.
Data preparation encompasses four tasks, performed by four corresponding services, in the following order: (1) acquisition, (2) de-identification, (3) concept-coding, and (4) indexing.
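The ordering of the four tasks can be illustrated with a minimal Python sketch. It is not the caTIES implementation (which runs as operating system services); the `Report` dictionary, `make_stage`, and `prepare` are hypothetical names, and each stage is a placeholder that only records that it ran.

```python
from typing import Any, Callable, Dict, List

Report = Dict[str, Any]              # hypothetical stand-in for one report row
Stage = Callable[[Report], Report]

# Order taken from the text: each task consumes the output of the previous one.
STAGE_ORDER = ["acquisition", "de-identification", "concept-coding", "indexing"]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(report: Report) -> Report:
        updated = dict(report)       # copy so the caller's record is not mutated
        updated["completed"] = report.get("completed", []) + [name]
        return updated
    return stage


PIPELINE: List[Stage] = [make_stage(name) for name in STAGE_ORDER]


def prepare(report: Report) -> Report:
    """Run one report through every stage in the fixed order."""
    for stage in PIPELINE:
        report = stage(report)
    return report
```

In a real deployment each placeholder would be replaced by the corresponding service; the point of the sketch is only that concept-coding and indexing never see a report that has not first been de-identified.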
Data may be transferred from AP-LIS or document repositories using a variety of acquisition services. Because of the heterogeneity of clinical systems, caTIES adopters are tasked with populating the private datastore before starting the caTIES services. Adopters may use existing tools provided by vendors, or may write their own data transfer mechanisms, targeting the caTIES logical schema.
To assist adopters, we currently support a data transfer mechanism based on a Cerner data warehouse product that extracts data from any of the three Cerner AP-LIS systems. Two of the four institutions collaborating in the caBIG caTIES pilot implemented this method of data transfer. The third institution wrote its own Health Level 7 (HL7) interface, which directly utilizes the institutional HL7 router feed. The fourth institution created database specific queries to upload identified data. Additional AP-LIS specific acquisition services are being considered for future development.
The caTIES de-identification service removes the 18 identifiers required by HIPAA, and creates and stores randomly generated Universally Unique Identifiers linked to the original identifiers, to support a method for re-identification that is permissible under HIPAA. At our institution this functionality is achieved using DeID, a commercially available de-identification system. However, caTIES is designed to permit easy uncoupling of the default de-identifier. Adopters can use any system providing similar functionality by implementing a simple Java interface. We have benchmarked this capability using the Harvard scrubber.10 The choice of a default commercial system was motivated by the need for a well-established, formally evaluated method for de-identification.11 12 As newer systems for de-identification mature, we expect that open-source de-identification will replace the default commercial system.
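The pluggable design and the identifier linkage can be sketched as follows. The actual extension point is a Java interface whose signature is not given here, so this Python version is only an analogy: `DeIdentifier`, `MrnScrubber`, `ReidentificationMap`, and `deidentify_report` are hypothetical names, and the toy scrubber masks a single identifier pattern rather than all 18 HIPAA identifiers.

```python
import re
import uuid
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class DeIdentifier(ABC):
    """Hypothetical plug-in contract: identified text in, scrubbed text out."""

    @abstractmethod
    def scrub(self, text: str) -> str:
        ...


class MrnScrubber(DeIdentifier):
    """Toy scrubber that masks only 'MRN: 1234567'-style numbers."""

    _MRN = re.compile(r"\bMRN[:\s]*\d+\b")

    def scrub(self, text: str) -> str:
        return self._MRN.sub("[MRN]", text)


class ReidentificationMap:
    """Links random UUIDs to original identifiers (assumed to live in the private datastore)."""

    def __init__(self) -> None:
        self._to_uuid: Dict[str, str] = {}
        self._to_original: Dict[str, str] = {}

    def pseudonym_for(self, original_id: str) -> str:
        # Reuse the same UUID for an identifier that has been seen before.
        if original_id not in self._to_uuid:
            new_id = str(uuid.uuid4())
            self._to_uuid[original_id] = new_id
            self._to_original[new_id] = original_id
        return self._to_uuid[original_id]

    def reidentify(self, pseudonym: str) -> str:
        return self._to_original[pseudonym]


def deidentify_report(text: str, accession: str,
                      scrubber: DeIdentifier,
                      id_map: ReidentificationMap) -> Tuple[str, str]:
    """Return (random identifier, scrubbed text) for storage in the research datastore."""
    return id_map.pseudonym_for(accession), scrubber.scrub(text)
```

Because `deidentify_report` depends only on the abstract contract, a different scrubber can be substituted without touching the identifier linkage, which is the property the text attributes to the caTIES design.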
The caTIES coding pipeline service (table 1) produces conceptual annotations on free-text documents. Coding is performed by a sequence of modular processing resources generally applied in the following order:
The core language-processing functionality of the system is achieved using the open-source General Architecture for Text Engineering platform.16 Implementation details of the coding pipeline service are provided in table 1.
For concept coding, caTIES uses MMTx pre-configured with the National Cancer Institute (NCI) Metathesaurus.17 Use of the NCI Metathesaurus is a condition of participation in the caBIG. However, users outside of caBIG may choose any other vocabulary or vocabulary subset that can be used with MMTx, by configuring MMTx differently prior to installation of caTIES.
caTIES coding services have been designed to run in parallel to take advantage of multiple processors available at an organization, greatly reducing the total time for coding massive document sets.
The caTIES indexing service creates a text search engine index for fast access to documents based on the characteristics of the document text and conceptual codes. This index must fulfill the requirement of fast substring searching independent of an underlying RDBMS. CaTIES uses Lucene 2.3 for its information retrieval engine.18
The caTIES SPR index is streamlined for temporally constrained, patient level query by mapping the composite primary key of patient unique identifier and SPR collection date and time to the range of long numbers. This mechanism requires additional bookkeeping time and space in the accompanying RDBMS but it is otherwise transparent to the user.
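The exact encoding used by caTIES is not described here, so the following Python sketch shows only one plausible way to map a (patient, collection time) pair onto a single long value. Everything in it is an assumption: the 34-bit time field, the `PatientOrdinals` bookkeeping table (standing in for the extra RDBMS storage mentioned above), and the function names.

```python
from datetime import datetime, timezone
from typing import Dict, Tuple

TIME_BITS = 34                             # assumption: whole seconds since 1970
TIME_MASK = (1 << TIME_BITS) - 1
MAX_ORDINAL = (1 << (63 - TIME_BITS)) - 1  # keep keys inside a signed 64-bit long


class PatientOrdinals:
    """Bookkeeping table that assigns each patient identifier a small integer."""

    def __init__(self) -> None:
        self._ordinals: Dict[str, int] = {}

    def ordinal(self, patient_uuid: str) -> int:
        if patient_uuid not in self._ordinals:
            self._ordinals[patient_uuid] = len(self._ordinals) + 1
        return self._ordinals[patient_uuid]


def pack_key(ordinal: int, collected: datetime) -> int:
    """Pack patient ordinal (high bits) and collection time (low bits) into one integer."""
    seconds = int(collected.timestamp())   # collected should be timezone-aware
    if not 0 <= seconds <= TIME_MASK:
        raise ValueError("collection time outside the supported range")
    if not 0 < ordinal <= MAX_ORDINAL:
        raise ValueError("patient ordinal outside the supported range")
    return (ordinal << TIME_BITS) | seconds


def unpack_key(key: int) -> Tuple[int, datetime]:
    """Recover the patient ordinal and collection time from a packed key."""
    return key >> TIME_BITS, datetime.fromtimestamp(key & TIME_MASK, tz=timezone.utc)


def patient_range(ordinal: int) -> Tuple[int, int]:
    """Lowest and highest key that can belong to one patient."""
    low = ordinal << TIME_BITS
    return low, low | TIME_MASK
```

With this layout, all reports for one patient fall in a contiguous numeric range ordered by collection time, so a patient-level, temporally constrained lookup reduces to a numeric range query.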
In addition to the conceptual document index, caTIES maintains an ancestor index that associates NCI Thesaurus concepts with their ancestry. Here ancestry is defined to be all concepts in the transitive closure along the reverse isa-relationship of the NCI Thesaurus. The ancestor index provides ancestors both at SPR index time and later during client query formulation.
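The transitive closure that defines ancestry can be sketched in a few lines of Python. The hierarchy below is a toy example with illustrative concept names, not actual NCI Thesaurus content, and `build_ancestor_index` is a hypothetical helper rather than the caTIES indexing code.

```python
from typing import Dict, FrozenSet, Iterable, Set


def build_ancestor_index(parents: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map every concept to all concepts reachable by following isa links upward."""
    index: Dict[str, Set[str]] = {}

    def ancestors(concept: str, path: FrozenSet[str]) -> Set[str]:
        if concept in index:               # already computed, reuse it
            return index[concept]
        if concept in path:
            raise ValueError(f"cycle in isa hierarchy at {concept!r}")
        found: Set[str] = set()
        for parent in parents.get(concept, ()):
            found.add(parent)
            found |= ancestors(parent, path | {concept})
        index[concept] = found
        return found

    for concept in list(parents):
        ancestors(concept, frozenset())
    return index


# Toy isa hierarchy (child -> direct parents); names are illustrative only.
TOY_PARENTS = {
    "Lobular Carcinoma In Situ": ["Carcinoma In Situ", "Breast Neoplasm"],
    "Carcinoma In Situ": ["Neoplasm"],
    "Breast Neoplasm": ["Neoplasm"],
}

ANCESTOR_INDEX = build_ancestor_index(TOY_PARENTS)
```

Storing the ancestors with each coded report means that a query for a general concept such as "Neoplasm" can match reports coded only with a more specific descendant, without walking the hierarchy at query time.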
For information retrieval across organizations, caTIES uses a grid service architecture based on the Open Grid Service Architecture (OGSA).19 Grid services are stateful webservices that provide more functionality than the basic webservices they are built upon. The caTIES client communicates with three services to search for and retrieve documents. All caTIES services are implemented using the Globus Toolkit Webservices Resource Framework (GT4)—a reference implementation of the OGSA specification.
The caTIES MMTx service derives conceptual search criteria on the client side, based on a user query string. Users may modify concepts interactively.
The caTIES search service communicates the search criteria (including Boolean logic, temporal relationships, and concepts) to the server. On the server side this request is converted from SPIN query XML to Lucene query language. Hits from the search are organized into a response payload that consists of report unique identifiers and some report header information. Subsequent drill down into report specifics occurs on future server requests.
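The server-side conversion step can be illustrated with a small Python sketch that turns a nested Boolean query into a string in the classic Lucene query syntax. The SPIN query XML format is not reproduced here, so a nested tuple stands in for the parsed request; the field name `concept` and the function `to_lucene` are assumptions made for illustration.

```python
from typing import Tuple, Union

Query = Union[Tuple[str, str], tuple]  # ("concept", name) or (operator, child, ...)


def escape_phrase(text: str) -> str:
    """Escape backslashes and double quotes so the text is safe inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_lucene(node: Query) -> str:
    """Render a nested Boolean query tree as a Lucene query string."""
    operator = node[0]
    if operator == "concept":
        return f'concept:"{escape_phrase(node[1])}"'
    if operator == "not":
        return f"NOT {to_lucene(node[1])}"
    if operator in ("and", "or"):
        joined = f" {operator.upper()} ".join(to_lucene(child) for child in node[1:])
        return f"({joined})"
    raise ValueError(f"unknown operator: {operator!r}")


example = ("and",
           ("concept", "Lobular Carcinoma in Situ"),
           ("not", ("concept", "Mastectomy")))
lucene_query = to_lucene(example)
```

Temporal relationships are deliberately left out of this sketch, since they relate pairs of reports rather than terms within one report and would need handling beyond a single query string.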
The caTIES OGSA-DAI service provides a Web services conduit for basic Structured Query Language (SQL) Data Manipulation Language and Data Definition Language (DDL) interactions with a data source. OGSA-DAI is an extension to the core functionality of the Globus Toolkit that provides access to a wide range of databases including MySQL, DB2, Oracle, Postgres, SQL Server, and Xindice, as well as indexed text files. Thus, caTIES may be implemented with any of these database management systems.
The caTIES user interface (UI) is composed of four role-based perspectives: researcher, preliminary user, administrator, and honest broker. At login, the caTIES client loads the appropriate perspective for the user. The user can switch between perspectives if she is registered with more than one role in the system.
The caTIES client is a Java application deployed using Java WebStart. Open source libraries used in the construction of the client application include (1) JGraph library20 for displaying the Diagram query view, (2) GlazedLists library21 for displaying the results table and (3) JFreeChart library22 for constructing pie/bar charts for the results.
The researcher perspective supports query construction and execution, and order management for the distribution protocol. caTIES supports both query by text and query by concept. Users can constrain queries by demographic variables such as age and gender. Standard Boolean constructs including AND, OR and NOT can be used to combine all of the above constraints. Additionally, users can formulate temporal queries based on the timing of diagnostic reports. An example of a temporal query is: “Find all females who had Lobular Carcinoma in Situ, followed by mastectomy within 1 year” (figure 3).
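The semantics of the example temporal query can be made concrete with a short Python sketch over in-memory records. It is not how caTIES evaluates queries (the system works against its search index); the `Report` class and `followed_by` function are hypothetical, "within 1 year" is taken to mean 365 days, and the second report is required to be strictly later than the first.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Report:
    patient_id: str
    gender: str
    collected: date
    concepts: FrozenSet[str]


def followed_by(reports: List[Report], first: str, second: str,
                within: timedelta,
                gender: Optional[str] = None) -> List[Tuple[Report, Report]]:
    """Return report pairs where `second` follows `first` for the same patient within the window."""
    by_patient: Dict[str, List[Report]] = {}
    for report in reports:
        if gender is None or report.gender == gender:
            by_patient.setdefault(report.patient_id, []).append(report)

    pairs: List[Tuple[Report, Report]] = []
    for patient_reports in by_patient.values():
        for earlier in patient_reports:
            if first not in earlier.concepts:
                continue
            for later in patient_reports:
                gap = later.collected - earlier.collected
                if second in later.concepts and timedelta(0) < gap <= within:
                    pairs.append((earlier, later))
    return pairs


LCIS, MASTECTOMY = "Lobular Carcinoma in Situ", "Mastectomy"
SAMPLE = [
    Report("p1", "female", date(2008, 1, 10), frozenset({LCIS})),
    Report("p1", "female", date(2008, 6, 1), frozenset({MASTECTOMY})),
    Report("p2", "female", date(2007, 1, 1), frozenset({LCIS})),
    Report("p2", "female", date(2009, 1, 1), frozenset({MASTECTOMY})),   # too late
    Report("p3", "male", date(2008, 1, 1), frozenset({LCIS})),
    Report("p3", "male", date(2008, 3, 1), frozenset({MASTECTOMY})),     # wrong gender
]
hits = followed_by(SAMPLE, LCIS, MASTECTOMY, timedelta(days=365), gender="female")
```

Each hit is a pair of reports rather than a single report, which matches the notion of a report set returned for temporal queries.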
Queries can be modeled using two views: dashboard and diagram. The dashboard view allows for simple text-box driven query construction. The diagram view permits more expressive nested Boolean query construction using a filter-flow metaphor (figure 3). Views are synchronized so that a query in the diagram view always matches the query in the dashboard view. However, since the diagram view is more expressive, not all queries modeled in the diagram view can be viewed in the dashboard view.
Results are visualized in tabular and tree format. In the tree format, they are hierarchically organized by owning organization, and then by patient. Selecting a report in this tree provides detailed document information and annotations (figure 4). The tabular view lists all reports by key criteria (eg, age, gender, concepts) and can be reorganized by sorting.
The preliminary user perspective is identical to the researcher perspective except that it returns only aggregate level data (histograms and pie charts). No record level data can be obtained. Preliminary users typically obtain access without IRB approval, to collect data preparatory to research.
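The difference between the two perspectives can be sketched as a role check applied to the same search hits. This is a hypothetical illustration in Python, not the caTIES client or server logic; the role strings, the `shape_results` function, and the dictionary layout are assumptions.

```python
from collections import Counter
from typing import Any, Dict, List

Record = Dict[str, Any]  # hypothetical de-identified report summary


def aggregate(records: List[Record], key: str) -> Dict[Any, int]:
    """Count records per value of one field, e.g. for a pie chart."""
    return dict(Counter(record[key] for record in records))


def shape_results(role: str, records: List[Record]) -> Dict[str, Any]:
    """Return aggregates for every role, but record-level data only to researchers."""
    summary = {"total": len(records), "by_gender": aggregate(records, "gender")}
    if role == "researcher":
        return {"summary": summary, "records": list(records)}
    if role == "preliminary user":
        return {"summary": summary, "records": []}   # aggregate-level data only
    raise ValueError(f"role {role!r} may not run searches in this sketch")


SAMPLE_HITS = [
    {"report_id": "r1", "gender": "female", "age": 54},
    {"report_id": "r2", "gender": "female", "age": 61},
    {"report_id": "r3", "gender": "male", "age": 47},
]
```

Filtering at the point where results are assembled, rather than in the display layer, keeps record-level data from ever reaching a preliminary user's client.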
The Administrator UI perspective is used by system administrators and honest brokers to accomplish administrative functions. It supports user account creation, registration of new IRB approved studies, registration of the institution as data provider or tissue provider to external IRB approved studies, and addition of researchers and honest brokers to studies from the administrator's local organization. In addition, it supports quality assurance of de-identification and concept coding. Reports flagged by users for potential errors in de-identification or coding may be reviewed by honest brokers using the Quality Assurance tab. Documents flagged for de-identification errors are quarantined and unavailable for subsequent use until the error is corrected or released.
The honest broker UI perspective enables impartial individuals such as tissue bankers and cancer registrars to assist researchers in filling requests for tissue or further clinical data. On login, the honest broker perspective provides a queue of unfilled requests. Honest brokers can view data from the private (identified) database of their own institution only, in order to fill orders or provide further data in a de-identified manner.
caTIES uses a protocol-based model for collaborative research across a network of organizations. The paradigm is based on a fundamental assumption that exchange of de-identified data and/or tissue between any repository and any researcher requires two IRB protocols—(1) by the organization establishing a de-identified repository for providing data or tissue to one or more researchers, and (2) by the researcher for searching a de-identified repository established at one or more organizations. Differences among IRBs in regulation of data-sharing and materials transfer create the requirement for maximal local control over participation. Thus, organizations who host caTIES nodes may agree to provide data or tissue on external protocols on a study-by-study basis. In previous work, we have validated these assumptions in interviews with IRB and regulatory officials at six US cancer centers.23 The model of privacy, security, and collaboration needs for a research grid derived from this interview study differs dramatically from the open (‘airport’) model of collaboration that has been previously used.9
Access to caTIES must occur within the context of a valid (time-sensitive) approved IRB protocol. All users are bound to one or more IRB-approved protocols at the time of user registration. When an administrator registers a protocol for a researcher seeking to obtain data or tissue, the administrator registers the home institution as a Data Consumer or Tissue Consumer, respectively. The home institution automatically becomes a Data Provider to this local IRB protocol. Likewise, if the administrator registers the organization as a Tissue Consumer, the home organization automatically becomes a Tissue Provider to this local IRB protocol.
Once the protocol is registered, other caTIES nodes may agree to participate on this study protocol. Honest brokers must determine whether a given protocol registered at an external organization meets the constraints of the repository IRB protocol for sharing data that has been approved at the providing organization. In previous work, we determined that many organizations may require only assurance that an external researcher has appropriate credentials and IRB protocol (which can be established at the time of provisioning), but that requirements for data sharing may be more stringent at some organizations.23 The approach we developed enables participation within the bounds of local regulatory requirements.
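The registration rules described above can be summarized in a short Python sketch. It reflects one reading of the text (the home institution always becomes a Data Provider, and becomes a Tissue Provider only when registered as a Tissue Consumer); the `Protocol` class and the function names are hypothetical and do not correspond to the caTIES data model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Set


@dataclass
class Protocol:
    irb_number: str
    home_org: str
    expires: date
    data_consumers: Set[str] = field(default_factory=set)
    tissue_consumers: Set[str] = field(default_factory=set)
    data_providers: Set[str] = field(default_factory=set)
    tissue_providers: Set[str] = field(default_factory=set)


def register_protocol(irb_number: str, home_org: str, expires: date,
                      seeks_tissue: bool = False) -> Protocol:
    """Register a protocol and apply the automatic home-institution provider rules."""
    protocol = Protocol(irb_number, home_org, expires)
    protocol.data_providers.add(home_org)          # automatic for the home institution
    if seeks_tissue:
        protocol.tissue_consumers.add(home_org)
        protocol.tissue_providers.add(home_org)    # automatic for tissue consumers
    else:
        protocol.data_consumers.add(home_org)
    return protocol


def opt_in(protocol: Protocol, org: str, tissue: bool = False) -> None:
    """An external node agrees, study by study, to provide data (and optionally tissue)."""
    protocol.data_providers.add(org)
    if tissue:
        protocol.tissue_providers.add(org)


def may_search(protocol: Protocol, provider_org: str, today: date) -> bool:
    """Data at a node is searchable only under an unexpired protocol the node has joined."""
    return today <= protocol.expires and provider_org in protocol.data_providers
```

The honest broker's review of an external protocol is a human decision and is represented here only by the explicit `opt_in` call; nothing becomes searchable at an external node until that call is made.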
Within the constraints of this model, caTIES has many features that support collaborative research between organizations hosting caTIES nodes. For example, researchers from different institutions can be a part of the same study protocol, and thus they may create queries and order sets that can be viewed and edited by other researchers on the protocol who may reside at different institutions.
caTIES uses a series of security enforcement layers (figure 5) to lock out unauthorized resource access. Security enforcement layers include:
Physical security of data is supported by the complete separation of de-identified and identified data (which reside on different machines in the typical configuration).
At the network layer, caTIES uses the security model of the Globus Toolkit and OGSA-DAI (figure 5). The Globus Toolkit uses Grid Security Infrastructure (GSI)24 for enabling secure authentication and communication over an open network. GSI provides a number of useful services for Grids, including mutual authentication and single sign on. GSI is based on public key encryption, X.509 certificates, and the Secure Sockets Layer communication protocol. Extensions to these standards have been added for single sign on and delegation. The Globus Toolkit's implementation of the GSI adheres to the Generic Security Service application programming interface,25 which is a standard interface for security systems promoted by the Internet Engineering Task Force.
All caTIES Grid Services are configured as secure grid services. CaTIES secure grid services authenticate and authorize based on a local Dorian installation. Dorian26 provides the caGrid Identity Provider and Identity Federation Service interfaces. Authentication is dependent on a valid Security Assertion Markup Language assertion, and generates a proxy certificate or grid identity. Entries for each authorized user are also stored in the grid maps at each caTIES node, providing an additional level of security.
At the application layer, caTIES maintains security mechanisms for restricting access based on the user's authorization attributes (figure 5). The caTIES CTRM client application dictates authorization using information in its CTRM datastore and embedded business logic. Users are granted restricted access based on their authorized resource set.
At the database layer, caTIES restricts access to data using RDBMS standard mechanisms (figure 5). RDBMS roles and their table access privileges mirror the high level authorization roles of caTIES: administrator, honest broker, researcher, and preliminary user.
Sharing of data between organizations requires agreements, policies and standard operating procedures among participants with regard to the adequacy of de-identification, provisioning of credentials, requirements for IRB review, and auditing of data and protocols. The caTIES project has developed a set of human processes and policies to support the functioning of a caTIES network, which are publicly available.27 The security policy was derived from an in depth interview-based study which used problem scenarios to elicit security and privacy requirements.23
The caTIES installer provides a common front end for installing and configuring all caTIES services and datastores. The caTIES website provides access to currently supported releases of caTIES, installation, administration and user manuals, demonstration videos, and other information to assist new users. The software is available on SourceForge28 which also hosts the caTIES user forums.29 caTIES is released under the caBIG open source license.
Previous evaluations of the early components of the caTIES pipeline have already been reported.30 31 The current evaluation focused on determining the deployed performance of the system using (1) studies of query response timing, and (2) metrics of basic information retrieval, using a set of 30 standard queries of clinical significance (tables 2a and 2b). Queries were invented for three general categories of complexity. Simple queries had no more than two concepts and no temporal relationships or negations. Moderate complexity queries had more than two concepts or negated concepts but no temporal relationships. Complex queries had more than two concepts or negated concepts with temporal relationships. We first tested the query set to determine the length of time to query completion in the deployed system at University of Pittsburgh. At the time of query response testing, the Pittsburgh repository contained more than 1.4 million documents, and was deployed on an IBM HS22 Blade Server with the following specifications: 2×Intel Xeon Processor X5550 (Quad Core), 24 GB Memory, 2×73 GB 15K SAS Drives (mirrored) internal disks, IBM DS3400 300 GB disk storage, 15K Serial Attached SCSI (SAS) drives, running VMWare vSphere 4.0 Standard Edition.
Results show mean and SD for three attempts to account for network traffic fluctuation (table 2). For simple and moderate queries, caTIES responds in sub-second time. Temporal queries do take substantially longer but still respond within 20 s on average and within 1 min in almost all cases.
Next, we tested the information retrieval aspects of the system. In this study, we determined only the precision of the system (table 2). Two authors of this manuscript, a pathologist (RC) and a knowledge engineer with expertise in pathology (MC), separately coded results of all 30 queries as true positive or false positive. All reports (or report sets for complex queries) were coded unless more than 50 reports or report sets were returned, in which case the judges coded only the first 50 reports or report sets returned by the system. Judges achieved an overall inter-rater reliability of 96% agreement. Results show high precision for simple and moderate queries (average 0.94–0.96), which drops slightly for the more complex temporal queries (average 0.88). Performance is expected to degrade for these queries since coding a true positive for temporal queries requires that both reports returned are true positive for each of the two clauses in the query.
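The two measures used in this evaluation, precision and percent agreement between judges, can be computed as in the following Python sketch. The function names are hypothetical and the judgment lists are made-up illustrations, not data from the study; the limit of 50 judged results follows the procedure described above.

```python
from typing import List, Sequence

JUDGED_LIMIT = 50  # judges coded at most the first 50 results per query


def judged_subset(results: Sequence[str], limit: int = JUDGED_LIMIT) -> List[str]:
    """Return the results that would be shown to the judges."""
    return list(results[:limit])


def precision(judgments: Sequence[bool]) -> float:
    """Fraction of judged results coded as true positive: TP / (TP + FP)."""
    if not judgments:
        raise ValueError("no judged results")
    return sum(judgments) / len(judgments)


def percent_agreement(judge_a: Sequence[bool], judge_b: Sequence[bool]) -> float:
    """Fraction of results on which the two judges gave the same code."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("judges must code the same non-empty result list")
    same = sum(a == b for a, b in zip(judge_a, judge_b))
    return same / len(judge_a)


# Illustrative codes only (True = true positive, False = false positive).
JUDGE_A = [True, True, True, False]
JUDGE_B = [True, True, False, False]
```

Because only returned results are judged, this procedure yields precision but says nothing about recall, which is consistent with the scope stated for the study.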
Error analysis (table 3) was performed on all reports marked as false positive by either judge. A total of 73 cases were analyzed. The most common errors related to retrieval of documents in which the search concept was erroneously coded by the system because a substring of the more complex concept was recognized by MMTx. In many cases, these errors occur because the more complex concepts are post-coordinated concepts (eg, “post-mastectomy scar”) and are not represented in the vocabulary. Another common source of errors related to erroneous clinical diagnoses: specimens are sometimes labeled with a clinical diagnosis, which is subsequently corrected by pathological examination. Diagnostic uncertainty was a third cause of error, and is a common problem in retrieving clinical documents. Other error categories observed (in decreasing frequency) include: initials incorrectly coded as abbreviations, concepts identified in the report that are in fact historical, conceptual relationships not properly scoped, and errors in negation detection. Of note, the majority of observed errors could be eliminated by (1) limiting search to specific sections of the report and by (2) extending the negation detection to include newer algorithms which account for uncertainty. Future versions of the system will include these modifications.
At University of Pittsburgh, caTIES is deployed as a production system, supported by the information systems help-desk and applications trainer. Deployment of caTIES at our institution is governed by the Health Sciences Tissue Bank which oversees the policy aspects of the system, using existing human honest broker systems approved by our Institutional Review Board. The system has met the security and privacy requirements of the University of Pittsburgh Medical Center (UPMC) to operate as a ‘UPMC approved clinical system’.
CaTIES has also been deployed at three other caBIG funded institutions including University of Pennsylvania, Thomas Jefferson University, and Washington University St Louis as part of the caBIG caTIES pilot. Additionally, caTIES has been deployed by a Midwestern stand-alone cancer center, a Midwestern university affiliated cancer center, and by members of a Western US health consortium, with minimal assistance from the developers. A growing number of other institutions are evaluating and deploying the system without our assistance.
The caTIES system provides an example of an end-to-end medical NLP application that could be used to support multi-institutional collaboration and translational science. The system has a strong policy foundation, expressive user interface, and builds on existing open-source tools and vocabularies. Results of our studies show that it retrieves documents and document-sets quickly, and operates with high precision.
The successful deployment of this translational research system required that we adopt the security and privacy practices of the more highly regulated health information environment. Acceptance of this repository at our institution took over 1 year and required substantial interaction with IRB, hospital privacy and security officers, honest broker services, and the health sciences tissue bank. The use of a data stewardship model was an important step in reaching consensus among stakeholders. Despite the fact that the data was de-identified, we determined that the system must achieve the same security status as any clinical system in our environment. Automated de-identification is not risk-free and the potential for unregulated use of data must be minimized.
Despite the successful deployment of the caTIES system across multiple individual institutions, including our own, one key functionality of the system has not been used beyond demonstration purposes—no institutions are currently using the grid communications system to support ongoing, multi-institutional data sharing. To reach this goal, we must have a trust fabric with suitable policies and processes for sharing data and tissue. The policy groundwork for such a federation has already been established for our system,23 and more general frameworks and national policies are emerging.32 33 But the practical implementation of such a data sharing network will likely require a great deal more work even after such frameworks become mature, available and widely accepted.
Future versions of the open-source caTIES software will include support for other relational database management systems and operating systems, and will enable individuals deploying the system to more easily specify vocabularies within the Unified Medical Language System. Enhanced methods for data transfer from clinical systems are planned for future releases. Additionally, we expect to provide similar capabilities for coding a select set of other document types, including radiology reports.
An important aspect of ongoing work is to establish a community of institutions committed to achieving a true data sharing network of caTIES nodes using existing national frameworks. The use of the system to support a nationwide virtual paraffin tissue bank is considered the key long term project goal.
We thank Lucy Cafeo at the University of Pittsburgh Department of Biomedical Informatics for expert preparation and review of the manuscript. We also thank the many collaborators, developers, and adopters who contributed to the ideas implemented in the current system, including: Jules Berman, Frank Manion, David Carell, Linda Schmandt, Aditya Nemlekar, Michael Becich, Mark Watson, Rakesh Najaragan, Michelle Bisceglia, Rajiv Dhir, Anil Parwani, Jack London, Ian Fore, George Komatsoulis, Lawrence Wright, John Quigley, Dave Fenstermacher, Qing Zeng, Gunther Schadow, David Berkowitz and Henry Chueh.
Funding: Work on the caTIES system has been funded by multiple sources including the National Cancer Institute R01 CA132672, caBIG program under the Tissue Bank and Pathology Tools Workspace task order to University of Pittsburgh (caBIG contract #79207CBS10), and also by a Clinical and Translational Sciences Award to the University of Pittsburgh (U54 RR023506-01). Earlier work was funded by the National Cancer Institute Shared Pathology Informatics Network (U01 CA 091343).
Competing interests: None.
Provenance and peer review: Not commissioned; externally peer reviewed.