caTIES is a suite of clients, services, and datastores connected by and implemented on caBIG architectural blueprints. The system establishes a set of caBIG services that sufficiently govern caTIES behavior. A caTIES service network may function autonomously or may connect to outside service subscribers, such as caBIG.
caTIES establishes a single logical data model sufficient to house all caTIES data. At each datastore, some parts of the schema may remain unpopulated, but the schema is deployed as a whole. caTIES uses three primary datastores: (1) the private datastore, (2) the research datastore, and (3) the Collaborative Tissue Resource Manager (CTRM) datastore. Each organization hosts one private datastore and one research datastore. In the typical configuration, the private and research datastores reside on different machines. The caTIES network hosts a single publicly accessible CTRM for use by all organizations.
Object model for private, de-identified and CTRM datastores.
Information architecture showing suite of services, datastores and clients.
The private datastore is the recipient of data derived from clinical systems such as the Anatomic Pathology Laboratory Information System (AP-LIS). It contains identified free text as well as dates, patient medical record numbers and specimen accession numbers. It is only available for access by honest brokers within the organization hosting the specific private datastore.
The research datastore contains de-identified free-text reports, along with other unrestricted information such as gender, and age if less than 90. The research datastore is also the target of the NLP Pipeline Service, which creates and stores conceptual annotations with each free-text report. The schema of this database includes the Consented High Performance Index and Retrieval of Pathology Specimens (CHIRPS) SPIN submission schema,9 permitting interoperability between caTIES and SPIN.
The CTRM datastore manages the collaborative construction and manipulation of tissue studies. Researchers build tissue order sets and electronically interact with honest brokers at external organizations. Honest brokers are disinterested third-parties, who are responsible for determining availability of biospecimens, filling orders for biospecimens, and providing additional de-identified outcomes data.
caTIES uses Hibernate object-relational mapping technology, providing a flexible façade for access to relational database management systems (RDBMS) on multiple platforms.
Data preparation services
The data preparation phase runs as a series of operating system-based services that transform data from free-text documents stored in clinical systems to concept-annotated de-identified documents stored in the relational database. caTIES services run continually, release machine resources when not in use, and revive on machine restart.
Data preparation encompasses four tasks, performed by four corresponding services, in the following order: (1) acquisition, (2) de-identification, (3) concept-coding, and (4) indexing.
Data may be transferred from AP-LIS or document repositories using a variety of acquisition services. Because of the heterogeneity of clinical systems, caTIES adopters are tasked with populating the private datastore before starting the caTIES services. Adopters may use existing tools provided by vendors, or may write their own data transfer mechanisms, targeting the caTIES logical schema.
To assist adopters, we currently support a data transfer mechanism based on a Cerner data warehouse product that extracts data from any of the three Cerner AP-LIS systems. Two of the four institutions collaborating in the caBIG caTIES pilot implemented this method of data transfer. The third institution wrote its own Health Level 7 (HL7) interface, which directly uses the institutional HL7 router feed. The fourth institution created database-specific queries to upload identified data. Additional AP-LIS-specific acquisition services are being considered for future development.
The caTIES de-identification service removes the 18 identifiers specified by HIPAA, and creates and stores randomly generated Universally Unique Identifiers linked to the original identifiers, to support a method of re-identification that is permissible under HIPAA. At our institution this functionality is achieved using DeID, a commercially available de-identification system. However, caTIES is designed to permit easy uncoupling of the default de-identifier. Adopters can use any system providing similar functionality by implementing a simple Java interface. We have benchmarked this capability using the Harvard scrubber.10 The choice of a default commercial system was motivated by the need for a well-established, formally evaluated method for de-identification.11 As newer systems for de-identification mature, we expect that open-source de-identification will replace the default commercial system.
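The linkage between random identifiers and the original identifiers can be illustrated with a minimal Python sketch. It is not the caTIES implementation (which is Java and uses DeID by default); the `Deidentifier` protocol, the toy `BracketScrubber`, the record fields, and the in-memory `link_table` are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Optional, Protocol


class Deidentifier(Protocol):
    """Hypothetical plug-in point: any scrubber exposing scrub() can be swapped in."""

    def scrub(self, text: str) -> str: ...


class BracketScrubber:
    """Toy stand-in for a real de-identifier: masks a known list of identifier strings."""

    def __init__(self, identifiers):
        self.identifiers = list(identifiers)

    def scrub(self, text: str) -> str:
        for ident in self.identifiers:
            text = text.replace(ident, "[REMOVED]")
        return text


@dataclass
class ResearchRecord:
    report_uuid: str
    text: str
    gender: str
    age: Optional[int]  # retained only when below 90


def deidentify(report: dict, scrubber: Deidentifier, link_table: dict) -> ResearchRecord:
    """Scrub the text and link a random UUID back to the original identifiers."""
    report_uuid = str(uuid.uuid4())
    # Assumption: the link is kept with the private (identified) data only.
    link_table[report_uuid] = (report["mrn"], report["accession"])
    age = report["age"] if report["age"] < 90 else None
    return ResearchRecord(report_uuid, scrubber.scrub(report["text"]), report["gender"], age)


link_table = {}
report = {
    "mrn": "MRN-0042",
    "accession": "S08-1234",
    "gender": "F",
    "age": 93,
    "text": "Patient MRN-0042, specimen S08-1234: invasive ductal carcinoma.",
}
record = deidentify(report, BracketScrubber(["MRN-0042", "S08-1234"]), link_table)
```

Because the scrubber is passed in as a parameter, replacing the default de-identifier in this sketch only requires another object with a `scrub` method, which mirrors the uncoupling described above.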
The caTIES coding pipeline service produces conceptual annotations on free-text documents. Coding is performed by a sequence of modular processing resources generally applied in the following order:
- Resetter: clears document, deletes existing annotations.
- Tokeniser: tokenizes words, numbers, punctuation and spaces.
- Chunker: parses reports into sections, parts, sentences, and phrases.
- Spell-checker (excluded by default): identifies erroneous spelling and suggests frequency-based corrections.13
- RegEx: annotates a pre-defined set of attribute and value pairs such as tumor grade and stage.
- Vocabulary concept tagger: maps fragments of free text to associated concepts from a controlled terminology using MetaMap Transfer (MMTx).14
- Semantic-type filter: removes concepts associated with unwanted semantic types.
- NegEx: implements the NegEx negation detection algorithm to tag explicitly negated concepts.15
- Semantic-type categorization: extracts body parts, procedures, diseases and findings based on vocabulary semantic types.
- Physical model deducer: uses rudimentary nearest-neighbor, discourse-level reasoning to arrange named entities into a decomposition and topologic hierarchy.
- Extractor: converts the hierarchy to valid Extensible Markup Language (XML) as defined by the CHIRPS XML schema definition.9
caTIES coding pipeline service components
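The idea of a sequence of modular processing resources applied in order can be sketched in a few lines of Python. This is an illustration only: the actual pipeline is built on GATE in Java, and the function names, annotation dictionaries, and the simplified grade pattern below are assumptions. Only four of the listed stages are imitated.

```python
import re


def resetter(doc):
    """Delete any existing annotations."""
    doc["annotations"] = []
    return doc


def tokeniser(doc):
    """Annotate words/numbers, punctuation marks, and whitespace runs."""
    for m in re.finditer(r"\w+|[^\w\s]|\s+", doc["text"]):
        doc["annotations"].append({"type": "Token", "start": m.start(), "end": m.end()})
    return doc


def chunker(doc):
    """Annotate sentences only (the real chunker also finds sections, parts, phrases)."""
    for m in re.finditer(r"[^.!?]+[.!?]?", doc["text"]):
        if m.group().strip():
            doc["annotations"].append({"type": "Sentence", "start": m.start(), "end": m.end()})
    return doc


# Simplified pattern for one attribute-value pair; real patterns would be richer.
GRADE = re.compile(r"grade\s+(?P<value>[1-4]|I{1,3}|IV)\b", re.IGNORECASE)


def regex_tagger(doc):
    """Annotate tumor grade as an attribute-value pair."""
    for m in GRADE.finditer(doc["text"]):
        doc["annotations"].append({
            "type": "AttributeValue",
            "attribute": "tumor grade",
            "value": m.group("value"),
            "start": m.start(),
            "end": m.end(),
        })
    return doc


PIPELINE = [resetter, tokeniser, chunker, regex_tagger]


def run_pipeline(doc, resources):
    """Apply each processing resource in order; each one adds to the same document."""
    for resource in resources:
        doc = resource(doc)
    return doc


doc = run_pipeline(
    {"text": "Invasive carcinoma, grade 2. No lymph node metastasis.",
     "annotations": [{"type": "Stale"}]},
    PIPELINE,
)
```

Keeping every stage behind the same call signature is what allows a stage such as the spell-checker to be excluded by default simply by leaving it out of the list.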
The core language-processing functionality of the system is achieved using the open-source General Architecture for Text Engineering platform.16
Implementation details of the coding pipeline service are provided in .
For concept coding, caTIES uses MMTx pre-configured with the National Cancer Institute (NCI) Metathesaurus.17 Use of the NCI Metathesaurus is a condition of participation in caBIG. However, users outside of caBIG may choose any other vocabulary or vocabulary subset that can be used with MMTx, by configuring MMTx differently prior to installation of caTIES.
caTIES coding services have been designed to run in parallel to take advantage of multiple processors available at an organization, greatly reducing the total time for coding massive document sets.
The caTIES indexing service creates a text search engine index for fast access to documents based on the characteristics of the document text and conceptual codes. This index must fulfill the requirement of fast substring searching independent of an underlying RDBMS. caTIES uses Lucene 2.3 for its information retrieval engine.18
The caTIES SPR index is streamlined for temporally constrained, patient-level queries by mapping the composite primary key (patient unique identifier plus SPR collection date and time) onto a range of long integers. This mechanism requires additional bookkeeping time and space in the accompanying RDBMS, but it is otherwise transparent to the user.
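The exact mapping used by caTIES is not described here, so the following Python sketch shows only one plausible way to fold a patient identifier and a collection timestamp into a single 64-bit value so that a patient's reports within a time window form a contiguous numeric range. The bit split, the `PatientKeyRegistry` class (standing in for the RDBMS bookkeeping), and all names are assumptions.

```python
from datetime import datetime, timezone

TIME_BITS = 32  # assumption: collection time stored as whole seconds since 1970 (UTC)


class PatientKeyRegistry:
    """Stand-in for the RDBMS bookkeeping table: patient identifier -> small integer."""

    def __init__(self):
        self._ordinals = {}

    def ordinal(self, patient_uuid: str) -> int:
        return self._ordinals.setdefault(patient_uuid, len(self._ordinals) + 1)


def to_long(ordinal: int, collected: datetime) -> int:
    """Pack (patient ordinal, collection time) into one value that fits a signed 64-bit long."""
    seconds = int(collected.timestamp())
    if not 0 <= seconds < (1 << TIME_BITS):
        raise ValueError("collection time outside the range this sketch supports")
    if not 0 < ordinal < (1 << 31):
        raise ValueError("patient ordinal out of range")
    return (ordinal << TIME_BITS) | seconds


def patient_range(ordinal: int, start: datetime, end: datetime):
    """Inclusive numeric range covering one patient's reports between start and end."""
    return to_long(ordinal, start), to_long(ordinal, end)


registry = PatientKeyRegistry()
p1 = registry.ordinal("patient-a")  # hypothetical identifiers
p2 = registry.ordinal("patient-b")

key_2005 = to_long(p1, datetime(2005, 6, 1, tzinfo=timezone.utc))
key_2006 = to_long(p1, datetime(2006, 6, 1, tzinfo=timezone.utc))
other = to_long(p2, datetime(2005, 6, 1, tzinfo=timezone.utc))
low, high = patient_range(
    p1,
    datetime(2005, 1, 1, tzinfo=timezone.utc),
    datetime(2005, 12, 31, tzinfo=timezone.utc),
)
```

With such a packing, a temporally constrained, patient-level lookup becomes a single numeric range test (`low <= key <= high`) rather than a comparison on two separate fields.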
In addition to the conceptual document index, caTIES maintains an ancestor index that associates NCI Thesaurus concepts with their ancestry. Here ancestry is defined to be all concepts in the transitive closure along the reverse isa-relationship of the NCI Thesaurus. The ancestor index provides ancestors both at SPR index time and later during client query formulation.
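As a minimal sketch of what an ancestor index contains, the Python below computes the transitive closure of a toy is-a hierarchy. The concept labels and parent links are invented for illustration and are not taken from the NCI Thesaurus; the function name is likewise an assumption.

```python
def build_ancestor_index(parents):
    """Map every concept to all concepts reachable by following is-a links upward.

    `parents` maps a concept to the set of its direct is-a parents.
    """
    concepts = set(parents) | {p for ps in parents.values() for p in ps}
    index = {}
    for concept in concepts:
        seen = set()
        stack = list(parents.get(concept, ()))
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(parents.get(parent, ()))
        index[concept] = seen
    return index


# Toy hierarchy (illustrative labels only).
IS_A = {
    "Lobular Carcinoma In Situ": {"Carcinoma In Situ", "Breast Neoplasm"},
    "Carcinoma In Situ": {"Neoplasm"},
    "Breast Neoplasm": {"Neoplasm"},
}

ancestor_index = build_ancestor_index(IS_A)


def matches(query_concept, report_concepts, index):
    """A report matches if it is coded with the query concept or any of its descendants."""
    return any(c == query_concept or query_concept in index.get(c, ())
               for c in report_concepts)
```

In this sketch, a query for "Neoplasm" matches a report coded only with "Lobular Carcinoma In Situ", which is the kind of lookup an ancestor index makes inexpensive at both index time and query time.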
Information retrieval services
For information retrieval across organizations, caTIES uses a grid service architecture based on the Open Grid Services Architecture (OGSA).19 Grid services are stateful web services that provide more functionality than the basic web services they are built upon. The caTIES client communicates with three services to search for and retrieve documents. All caTIES services are implemented using the Globus Toolkit Web Services Resource Framework (GT4), a reference implementation of the OGSA specification.
The caTIES MMTx service derives conceptual search criteria on the client side, based on a user query string. Users may modify concepts interactively.
The caTIES search service communicates the search criteria (including Boolean logic, temporal relationships, and concepts) to the server. On the server side this request is converted from SPIN query XML to Lucene query language. Hits from the search are organized into a response payload that consists of report unique identifiers and some report header information. Subsequent drill-down into report specifics occurs in later server requests.
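The SPIN query XML format is not reproduced here, so the sketch below starts from an already-parsed Boolean criteria tree and renders it in Lucene's classic query syntax. The tuple representation and the field names (`gender`, `concept`, `text`) are hypothetical; only the output syntax (`field:"phrase"`, `AND`, `OR`, `NOT`, parentheses) follows Lucene.

```python
def to_lucene(node):
    """Render a nested criteria tuple as a Lucene classic query string."""
    op = node[0]
    if op == "term":
        _, field, value = node
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{escaped}"'
    if op == "not":
        return f"NOT {to_lucene(node[1])}"
    if op in ("and", "or"):
        joined = f" {op.upper()} ".join(to_lucene(child) for child in node[1:])
        return f"({joined})"
    raise ValueError(f"unknown operator: {op!r}")


criteria = (
    "and",
    ("term", "gender", "F"),
    ("or",
     ("term", "concept", "Lobular Carcinoma In Situ"),
     ("term", "text", "LCIS")),
    ("not", ("term", "concept", "Invasive Carcinoma")),
)

query = to_lucene(criteria)
```

For the criteria above, `query` is `(gender:"F" AND (concept:"Lobular Carcinoma In Situ" OR text:"LCIS") AND NOT concept:"Invasive Carcinoma")`.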
OGSA-data access and integration (DAI) service
The caTIES OGSA-DAI service provides a Web services conduit for basic Structured Query Language (SQL) Data Manipulation Language and Data Definition Language (DDL) interactions with a data source. OGSA-DAI is an extension to the core functionality of the Globus Toolkit that provides access to a wide range of databases including MySQL, DB2, Oracle, Postgres, SQL Server, and Xindice, as well as indexed text files. Thus, caTIES may be implemented with any of these database management systems.
User interface, query and results visualization
The caTIES user interface (UI) is composed of four role-based perspectives: researcher, preliminary user, administrator, and honest broker. At login, the caTIES client loads the appropriate perspective for the user. The user can switch between perspectives if she is registered with more than one role in the system.
The caTIES client is a Java application deployed using Java WebStart. Open source libraries used in the construction of the client application include (1) JGraph library20 for displaying the Diagram query view, (2) GlazedLists library21 for displaying the results table, and (3) JFreeChart library22 for constructing pie/bar charts for the results.
The researcher perspective supports query construction and execution, and order management for the distribution protocol. caTIES supports both query by text and query by concept. Users can constrain queries by demographic variables such as age and gender. Standard Boolean constructs including AND, OR and NOT can be used to combine all of the above constraints. Additionally, users can formulate temporal queries based on the timing of diagnostic reports. An example of a temporal query is: “Find all females who had Lobular Carcinoma in Situ, followed by mastectomy within 1 year”.
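The semantics of such a temporal query can be made concrete with a small in-memory Python sketch. It is not how caTIES evaluates queries (which run against the Lucene index); the report dictionaries, concept labels, and the choices to treat "followed by" as "on a later date" and "1 year" as 365 days are assumptions.

```python
from datetime import date, timedelta


def followed_within(reports, first, then, window, gender=None):
    """Return ids of patients with a `first` report followed by a `then` report within `window`."""
    by_patient = {}
    for report in reports:
        by_patient.setdefault(report["patient"], []).append(report)

    hits = set()
    for patient, patient_reports in by_patient.items():
        if gender is not None and patient_reports[0]["gender"] != gender:
            continue
        first_dates = [r["date"] for r in patient_reports if first in r["concepts"]]
        then_dates = [r["date"] for r in patient_reports if then in r["concepts"]]
        # "Followed by" is read as strictly later; the window is inclusive.
        if any(timedelta(0) < t - f <= window for f in first_dates for t in then_dates):
            hits.add(patient)
    return hits


LCIS, MASTECTOMY = "Lobular Carcinoma In Situ", "Mastectomy"
reports = [
    {"patient": "p1", "gender": "F", "date": date(2004, 3, 1), "concepts": {LCIS}},
    {"patient": "p1", "gender": "F", "date": date(2004, 9, 15), "concepts": {MASTECTOMY}},
    {"patient": "p2", "gender": "F", "date": date(2003, 1, 10), "concepts": {LCIS}},
    {"patient": "p2", "gender": "F", "date": date(2005, 2, 1), "concepts": {MASTECTOMY}},
    {"patient": "p3", "gender": "M", "date": date(2004, 1, 1), "concepts": {LCIS}},
    {"patient": "p3", "gender": "M", "date": date(2004, 2, 1), "concepts": {MASTECTOMY}},
    {"patient": "p4", "gender": "F", "date": date(2004, 1, 5), "concepts": {MASTECTOMY}},
    {"patient": "p4", "gender": "F", "date": date(2004, 6, 1), "concepts": {LCIS}},
]

hits = followed_within(reports, LCIS, MASTECTOMY, timedelta(days=365), gender="F")
```

Only `p1` satisfies all three constraints: `p2` exceeds the one-year window, `p3` fails the gender constraint, and `p4` has the two events in the wrong order.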
User interface—diagram method for query construction.
Queries can be modeled using two views: dashboard and diagram. The dashboard view allows for simple text-box driven query construction. The diagram view permits more expressive nested Boolean query construction using a filter-flow metaphor. Views are synchronized so that a query in the diagram view always matches the query in the dashboard view. However, since the diagram view is more expressive, not all queries modeled in the diagram view can be viewed in the dashboard view.
Results are visualized in tabular and tree format. In the tree format, they are hierarchically organized by owning organization, and then by patient. Selecting a report in this tree provides detailed document information and annotations. The tabular view lists all reports by key criteria (e.g., age, gender, concepts) and can be reorganized by sorting.
User interface—results visualization.
The preliminary user perspective is identical to the researcher perspective except that it returns only aggregate-level data (histograms and pie charts). No record-level data can be obtained. Preliminary users typically obtain access without IRB approval, to collect data preparatory to research.
The Administrator UI perspective is used by system administrators and honest brokers to accomplish administrative functions. It supports user account creation, registration of new IRB-approved studies, registration of the institution as data provider or tissue provider to external IRB-approved studies, and addition of researchers and honest brokers to studies from the administrator's local organization. In addition, it supports quality assurance of de-identification and concept coding. Reports flagged by users for potential errors in de-identification or coding may be reviewed by honest brokers using the Quality Assurance tab. Documents flagged for de-identification errors are quarantined and unavailable for subsequent use until the error is corrected or the document is released.
Honest broker perspective
The honest broker UI perspective enables impartial individuals such as tissue bankers and cancer registrars to assist researchers in filling requests for tissue or further clinical data. On login, the honest broker perspective provides a queue of unfilled requests. Honest brokers can view data from the private (identified) database of their own institution only, in order to fill orders or provide further data in a de-identified manner.
Collaborative study management
caTIES uses a protocol-based model for collaborative research across a network of organizations. The paradigm is based on a fundamental assumption that exchange of de-identified data and/or tissue between any repository and any researcher requires two IRB protocols: (1) one held by the organization establishing a de-identified repository for providing data or tissue to one or more researchers, and (2) one held by the researcher for searching a de-identified repository established at one or more organizations. Differences among IRBs in regulation of data-sharing and materials transfer create the requirement for maximal local control over participation. Thus, organizations that host caTIES nodes may agree to provide data or tissue on external protocols on a study-by-study basis. In previous work, we have validated these assumptions in interviews with IRB and regulatory officials at six US cancer centers.23 The model of privacy, security, and collaboration needs for a research grid derived from this interview study differs dramatically from the open (‘airport’) model of collaboration that has been previously used.9
Access to caTIES must occur within the context of a valid (time-sensitive) approved IRB protocol. All users are bound to one or more IRB-approved protocols at the time of user registration. When a protocol is registered by an administrator for a researcher seeking to obtain data or tissue, the administrator registers the home institution as a Data Consumer or Tissue Consumer, respectively. The home institution automatically becomes a Data Provider to this local IRB protocol. If the administrator registers the organization as a Tissue Consumer, the home organization also automatically becomes a Tissue Provider to this local IRB protocol.
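The registration rules just described can be expressed as a short Python sketch. The class, function, and field names are hypothetical and the in-memory sets stand in for records in the CTRM datastore; the sketch encodes only the rules stated in the text, plus opt-in participation by other nodes.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class StudyProtocol:
    irb_number: str
    home_org: str
    expires: date
    data_consumers: set = field(default_factory=set)
    tissue_consumers: set = field(default_factory=set)
    data_providers: set = field(default_factory=set)
    tissue_providers: set = field(default_factory=set)

    def is_active(self, today: date) -> bool:
        """Protocols are time-sensitive: access is valid only until expiry."""
        return today <= self.expires


def register_protocol(irb_number, home_org, expires, consumer_role):
    """Register a protocol; `consumer_role` is 'data' or 'tissue'."""
    if consumer_role not in ("data", "tissue"):
        raise ValueError("consumer_role must be 'data' or 'tissue'")
    protocol = StudyProtocol(irb_number, home_org, expires)
    # The home institution always becomes a Data Provider to its own protocol.
    protocol.data_providers.add(home_org)
    if consumer_role == "data":
        protocol.data_consumers.add(home_org)
    else:
        protocol.tissue_consumers.add(home_org)
        # A Tissue Consumer also becomes a Tissue Provider automatically.
        protocol.tissue_providers.add(home_org)
    return protocol


def join_as_provider(protocol, org, data=False, tissue=False):
    """Another node opts in to this study, on a study-by-study basis."""
    if data:
        protocol.data_providers.add(org)
    if tissue:
        protocol.tissue_providers.add(org)


study = register_protocol("IRB-0001", "Org A", date(2030, 1, 1), "tissue")
join_as_provider(study, "Org B", data=True)
```

In this example "Org A" ends up as Tissue Consumer, Data Provider, and Tissue Provider, while "Org B" has agreed to provide data only.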
Once the protocol is registered, other caTIES nodes may agree to participate on this study protocol. Honest brokers must determine whether a given protocol registered at an external organization meets the constraints of the repository IRB protocol for sharing data that has been approved at the providing organization. In previous work, we determined that many organizations may require only assurance that an external researcher has appropriate credentials and IRB protocol (which can be established at the time of provisioning), but that requirements for data sharing may be more stringent at some organizations.23 The approach we developed enables participation within the bounds of local regulatory requirements.
Within the constraints of this model, caTIES has many features that support collaborative research between organizations hosting caTIES nodes. For example, researchers from different institutions can be a part of the same study protocol, and thus they may create queries and order sets that can be viewed and edited by other researchers on the protocol who may reside at different institutions.
caTIES uses a series of security enforcement layers to lock out unauthorized resource access. Security enforcement layers include:
Security architecture showing authentication, authorization and access layer.
Physical security of data is supported by the complete separation of de-identified and identified data (which reside on different machines in the typical configuration).
At the network layer, caTIES uses the security model of the Globus Toolkit and OGSA-DAI. The Globus Toolkit uses Grid Security Infrastructure (GSI)24 for enabling secure authentication and communication over an open network. GSI provides a number of useful services for Grids, including mutual authentication and single sign-on. GSI is based on public key encryption, X.509 certificates, and the Secure Sockets Layer communication protocol. Extensions to these standards have been added for single sign-on and delegation. The Globus Toolkit's implementation of the GSI adheres to the Generic Security Service application programming interface,25 which is a standard interface for security systems promoted by the Internet Engineering Task Force.
All caTIES Grid Services are configured as secure grid services. caTIES secure grid services authenticate and authorize based on a local Dorian installation. Dorian26 provides the caGrid Identity Provider and Identity Federation Service interfaces. Authentication depends on a valid Security Assertion Markup Language assertion and generates a proxy certificate or grid identity. Entries for each authorized user are also stored in the grid maps at each caTIES node, providing an additional level of security.
At the application layer, caTIES maintains security mechanisms for restricting access based on the user's authorization attributes. The caTIES CTRM client application dictates authorization using information in its CTRM datastore and embedded business logic. Users are granted restricted access based on their authorized resource set.
At the database layer, caTIES restricts access to data using RDBMS standard mechanisms. RDBMS roles and their table access privileges mirror the high-level authorization roles of caTIES: administrator, honest broker, researcher, and preliminary user.
Sharing of data between organizations requires agreements, policies and standard operating procedures among participants with regard to the adequacy of de-identification, provisioning of credentials, requirements for IRB review, and auditing of data and protocols. The caTIES project has developed a set of human processes and policies to support the functioning of a caTIES network, which are publicly available.27 The security policy was derived from an in-depth interview-based study which used problem scenarios to elicit security and privacy requirements.23
Deployment and installation
The caTIES installer provides a common front end for installing and configuring all caTIES services and datastores. The caTIES website provides access to currently supported releases of caTIES, installation, administration and user manuals, demonstration videos, and other information to assist new users. The software is available on SourceForge,28 which also hosts the caTIES user forums.29 caTIES is released under the caBIG open source license.