Proprietary approaches for representing annotations and image markup are serious barriers to sharing image data and knowledge among researchers. The Annotation and Image Markup (AIM) project is developing a standards-based information model for image annotation and markup in health care and clinical trial environments. The complex hierarchical structures of the AIM data model pose new challenges for managing such data in terms of performance and support for complex queries. In this paper, we present our work on managing AIM data through a native XML approach, and on supporting complex image and annotation queries through native extension of the XQuery language. Through integration with xService, AIM databases can now be conveniently shared through caGrid.
Image annotations, including observational or computational descriptions of image features, are essential for medical interpretation. While DICOM is the standard for medical images, there is no standard for image annotation and markup. The Annotation and Image Markup (AIM) project1 develops a standard information model for representing and exchanging image annotation and markup. The AIM standard provides not only syntactic but also semantic interoperability with caBIG,2 DICOM and HL7 in health care and clinical trial environments. The model itself is designed using UML and includes dozens of classes such as patient, observer, equipment, image, anatomic entities, image observations and image observation characteristics, geometric shapes, text annotations, calculations, and so on. The model can be implemented in XML, where data elements are deeply hierarchical. For example, the codeMeaning/codeValue elements reside in elements at the sixth level of the XML hierarchy.
XML is fast becoming the standard information exchange language for web-based applications, and is ubiquitously used for data sharing and semantic interoperability in healthcare, life sciences and many other domains.3 As a result, commercial database vendors and research institutions are researching and developing the data persistence of XML documents by building XML databases either through extension of traditional databases or new database architectures. XML databases provide significant advantages as they support standard data definition languages based on XML standards such as XML Schema, and standard XML query languages such as XPath4 and XQuery.5 Furthermore, the XML-in and XML-out approach greatly simplifies the translation of data models and query languages. XML database technology is becoming mature, and XML database products are proliferating, such as Oracle Berkeley DB XML,6 Oracle XML DB,7 IBM DB2 pureXML,8 eXist,9 Tamino,10 etc.
One immediate question is how we can effectively manage and share AIM XML documents to support the objectives of the AIM standard. Our goals include: i) managing complex AIM documents; ii) providing complex query support such as metadata queries, spatial queries and semantically enabled queries; iii) efficient query support; and iv) standards-based sharing of AIM documents. While it is possible to map XML-based annotations to relational tables and manage them through a relational database management system, the process incurs considerable overhead for translating data models and queries, and requires expensive joins on multiple internal association tables to retrieve the data. Storing a whole XML document as a large object (LOB) in a relational database is not desirable either, because a LOB lacks the indexing structures needed to support efficient queries.
In this paper, we explore the latest XML data management technologies and develop a generic XML data management system that can effectively manage AIM XML documents in terms of requirements, query support and efficiency. The AIM data can then be shared through caGrid infrastructure through xService.
The paper is organized as follows. We will give a brief overview of AIM model in Section 2, then we discuss how we manage AIM data in XML databases in Section 3. Complex query support is demonstrated in Section 4, and performance study is presented in Section 5. Section 6 shows how we can share AIM data through xService we developed, followed by Related Work and Conclusion.
AIM is a caBIG In Vivo Imaging Workspace project developed by Northwestern University and Stanford University. The goal of AIM is to provide standardization for image annotation and markup, especially for clinical trials.
The AIM model describes the meaning of image pixels. Annotations become semantic information that can be used for queries and data mining, and markups can be used to outline regions of interest. The model defines a base class Annotation, from which two subclasses can be instantiated: ImageAnnotation (for annotations made on images) and AnnotationOfAnnotation (for annotations derived from other annotations). Each Annotation is associated with a type attribute that defines the type of annotation, such as “RECIST baseline target lesion”, “RECIST follow-up target lesion”, etc.
Annotation class has associations with User (the author), Equipment (the equipment used to create the AIM document), AnatomicEntity (the specific target in human body from which the images were generated), ImagingObservation (interpretation of an image or images, including visual features, morphologic or physiologic processes, and diseases), and Calculation (the quantitative result from mathematical or computational calculations).
ImageAnnotation class has associations with GeometricShape (with subclasses Point, MultiPoint, Polyline, Circle, and Ellipse), TextAnnotation (text displayed on images), and Segmentation (DICOM segmentation objects). ImageAnnotation class also has association with ImageReference, which could be a DICOMImageReference or a WebImageReference.
AnatomicEntity, ImagingObservation, and ImagingObservationCharacteristic classes represent essential features for an annotation, and the populated values come from certain controlled vocabulary, such as RadLex. Figure 1 shows an overview of the major classes of AIM model.
XML database systems are developed either through extension of traditional relational databases or as new database systems, where an XML document is the logical data object for manipulation. For example, major relational database engines such as Oracle and DB2 extend their systems with a new native data type, XML or XMLType, with its own backend storage and indexing support. Many new XML database systems have also been developed, such as eXist and Tamino. Berkeley DB XML is an XML database system based on the Berkeley DB engine. Currently, we support Berkeley DB XML, DB2, and eXist. XML databases conveniently support XML-based data definition, and provide standard XML query languages such as XPath4 and XQuery.5 Next we discuss how we can take advantage of XML databases to manage AIM XML documents.
AIM data model is first designed as a UML model. Since one essential goal for AIM is to share and exchange standardized annotation and markup documents, the XML standard is a natural choice for the intermediate representation as XML is a well established standard for data interchange and is the standard data representation for web services. The AIM UML model is therefore mapped to an XML schema, and the AIM document data are represented as XML documents, which are constructed, transmitted, manipulated, and consumed by users and applications.
A sample AIM Schema (version 1) is shown in Figure 2.
We can classify XML databases into two major approaches: the pure XML approach and the hybrid relational/XML approach. In the pure XML approach, only XML documents can be managed, and only XML-based manipulations and APIs are available; examples are Oracle Berkeley DB XML and the eXist XML database. In the hybrid approach, both relational data and XML data can be managed, and a query can be a combination of SQL and XML queries; an example is IBM DB2 pureXML. In the pure XML approach, an AIM document is stored in an XML collection, and in the hybrid approach, an AIM document is stored in a column of XML data type. For example, for Berkeley DB XML, we can create an XML collection xmldb under folder /apps/databases/bdbxml as follows:
While in DB2, we create the following table to store AIM documents, where AIM documents are stored in the column xmlcolumn of table xmltable:
CREATE TABLE xmldb.xmltable ( docid VARCHAR(64) NOT NULL, xmlcolumn XML, metadata XML, PRIMARY KEY (docid) );
Occasionally there are additional metadata that need to be captured about an AIM document, such as the timestamp at which a document is inserted, the original filename of the AIM document, the collection the AIM document belongs to, etc. Such metadata are not part of the AIM model. We develop three different solutions for the three databases. Berkeley DB XML provides its own proprietary metadata management through its XmlMetaData class. Each XML document can be associated with XmlMetaData items, and each XmlMetaData item has its uri, name, and value. For DB2 and eXist, since there is no internal metadata support, we associate an XML document with a metadata XML document defined in the following structure:
<metadata> <meta> <uri>…</uri> <name>…</name> <value>…</value> </meta> … </metadata>
In eXist, a metadata XML document is managed as a regular XML document and we use its master document id as its prefix for the association. In DB2, a metadata XML column is defined for the XML table.
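As a minimal sketch of the metadata structure above, the following Python snippet builds and reads such a document with the standard library. The helper names (`build_metadata`, `lookup`), the example URI, and the item names and values are hypothetical and chosen only for illustration; they are not part of AIM or of any of the databases discussed.

```python
import xml.etree.ElementTree as ET


def build_metadata(items):
    """Build a <metadata> element from (uri, name, value) triples."""
    root = ET.Element("metadata")
    for uri, name, value in items:
        meta = ET.SubElement(root, "meta")
        ET.SubElement(meta, "uri").text = uri
        ET.SubElement(meta, "name").text = name
        ET.SubElement(meta, "value").text = value
    return root


def lookup(root, name):
    """Return the values of all metadata items with the given name."""
    return [m.findtext("value")
            for m in root.findall("meta")
            if m.findtext("name") == name]


# Hypothetical metadata items for one AIM document.
doc = build_metadata([
    ("http://example.org/aim-meta", "originalFilename", "annotation-001.xml"),
    ("http://example.org/aim-meta", "segmentationPath", "/data/seg/seg-001.dcm"),
])
xml_text = ET.tostring(doc, encoding="unicode")
```

Keeping each item as a flat uri/name/value triple means the same document can be stored unchanged as a regular document in eXist or in an XML column in DB2.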
In the AIM model, there is a Segmentation class that represents DICOM segmentation objects such as binary, fractional probability, fractional occupancy and surface. The object will not be encoded into the AIM document due to its unstructured nature and large size. Instead, external objects will be linked in AIM documents.
We manage segmentation objects as files in the operating system, and provide mapping between the files and IDs in AIM documents that refer to the files through the metadata discussed above. The metadata are used to map a file’s name in the XML document to its physical storage path.
Due to the diversity of XML databases, to make it flexible for users to choose a database, we develop a set of unified interfaces for uploading, updating, deleting and querying XML data across different types of XML databases. Users can also quickly switch to a different database backend without change to their applications. The APIs include:
XML provides two standards-based query languages, XPath and XQuery. XQuery depends on XPath expressions to navigate data inside an XML tree, and uses the FLWOR (for, let, where, order by, and return) clauses for its major expressions. XQuery5 is Turing-complete13 and natively extensible through user-defined functions.
Next we summarize a list of common AIM query scenarios and demonstrate how to express them in XML queries.
While most queries can be specified with XPath, queries in the last categories have to be specified in XQuery, as they require joins, which are only possible through XQuery. Next we provide several example queries and show how they can be supported in XPath/XQuery. Note that different XML databases have variations in query syntax. For example, to specify an XML collection, Berkeley DB XML uses collection("xmlcollection"), while DB2 uses db2-fn:xmlcolumn("XMLDS.XMLTABLE.XMLCOLUMN"). Here we only show the queries expressed in standard syntax.
Common queries use metadata in AIM documents as predicates to find related information, and they can normally be expressed as XPath-based queries, as shown in the following examples.
declare namespace ns1="gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"; /ns1:ImageAnnotation [ns1:imageReferenceCollection/ns1:ImageReference/ ns1:study/ns1:Study/ns1:series/ns1:Series/@instanceUID="188.8.131.52.4.1.93184.108.40.206047"]/ ns1:imagingObservationCollection/ns1:ImagingObservation/ ns1:imagingObservationCharacteristicCollection/ns1:ImagingObservationCharacteristic
declare namespace ns1="gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"; /ns1:ImageAnnotation [ns1:imagingObservationCollection/ns1:ImagingObservation/ ns1:imagingObservationCharacteristicCollection/ns1:ImagingObservationCharacteristic [@codeValue="2568622" and ns1:rating/ns1:Rating/@value="1.0" and ns1:rating/ns1:Rating/@name="Soft Tissue"]]/ns1:imageReferenceCollection/ ns1:ImageReference/ns1:study/ns1:Study/@instanceUID
declare namespace ns1="gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"; /ns1:ImageAnnotation[ns1:imageReferenceCollection/ns1:ImageReference/ ns1:study/ns1:Study/ns1:series/ns1:Series/ns1:imageCollection/ns1:Image/ @sopInstanceUID="220.127.116.11.4.1.9318.104.22.168487"]/ns1:geometricShapeCollection/ ns1:GeometricShape/ns1:spatialCoordinateCollection/ns1:SpatialCoordinate
declare namespace ns1="gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"; doc("aim.xml")/ns1:ImageAnnotation[ns1:imageReferenceCollection/ns1:ImageReference/ ns1:study/ns1:Study/ns1:series/ns1:Series/@instanceUID="22.214.171.124.4.1.93126.96.36.19904"]
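The pattern shared by these queries (filter annotations by a deeply nested attribute, then project another branch of the same document) can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the sample document is synthetic, its UIDs and codes are made up, and `characteristics_for_series` is a hypothetical helper. Because ElementTree supports only a subset of XPath, the predicate is evaluated in Python rather than inside the path expression.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the queries above.
NS = {"ns1": "gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"}

# Synthetic AIM-like document; all identifiers are invented for illustration.
SAMPLE = """<ImageAnnotation
    xmlns="gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"
    uniqueIdentifier="demo-1">
  <imageReferenceCollection>
    <ImageReference>
      <study>
        <Study instanceUID="1.2.3">
          <series>
            <Series instanceUID="1.2.3.4"/>
          </series>
        </Study>
      </study>
    </ImageReference>
  </imageReferenceCollection>
  <imagingObservationCollection>
    <ImagingObservation codeMeaning="nodule">
      <imagingObservationCharacteristicCollection>
        <ImagingObservationCharacteristic codeValue="demo-code"
                                          codeMeaning="solid"/>
      </imagingObservationCharacteristicCollection>
    </ImagingObservation>
  </imagingObservationCollection>
</ImageAnnotation>"""

SERIES_PATH = ("ns1:imageReferenceCollection/ns1:ImageReference/"
               "ns1:study/ns1:Study/ns1:series/ns1:Series")
CHARACTERISTIC_PATH = ("ns1:imagingObservationCollection/"
                       "ns1:ImagingObservation/"
                       "ns1:imagingObservationCharacteristicCollection/"
                       "ns1:ImagingObservationCharacteristic")


def characteristics_for_series(annotation, series_uid):
    """Return observation characteristics if the annotation references the series."""
    series = annotation.findall(SERIES_PATH, NS)
    if not any(s.get("instanceUID") == series_uid for s in series):
        return []
    return annotation.findall(CHARACTERISTIC_PATH, NS)


root = ET.fromstring(SAMPLE)
hits = characteristics_for_series(root, "1.2.3.4")
```

An XML database evaluates the same predicate and projection inside the engine, where indexes on the path can avoid scanning every document.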
As markups are essentially spatial objects, it is important to support spatial queries. XQuery by itself does not support spatial operations. However, XQuery is Turing-complete and natively extensible. Thus many additional constructs needed for spatial queries can be defined in XQuery itself through user defined functions (UDF).
declare namespace ns1="gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"; for $a in collection() /ns1:ImageAnnotation[@uniqueIdentifier= "188.8.131.52.4.1.93184.108.40.206369-noduleID-13571"] /ns1:ImageAnnotation let $g := area( $a/ns1:geometricShapeCollection/ns1:GeometricShape/ ns1:spatialCoordinateCollection) return sum( $g )
declare namespace ns1="gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"; for $baseline in collection() /ns1:ImageAnnotation[@uniqueIdentifier= "1.2.288.3.2205383238.1072.1207947057.22"] /ns1:ImageAnnotation for $followup in collection() /ns1:ImageAnnotation[@uniqueIdentifier= "1.2.288.3.2205383238.1072.1207948422.99"] /ns1:ImageAnnotation let $areabaseline := longestDiameter($baseline/ ns1:geometricShapeCollection/ns1:GeometricShape/ns1:spatialCoordinateCollection) let $areafollowup := longestDiameter($followup/ ns1:geometricShapeCollection/ns1:GeometricShape/ns1:spatialCoordinateCollection) return (sum($areafollowup) - sum($areabaseline)) div sum($areabaseline)
To support spatial queries, we define a set of user-defined functions (UDFs) in XQuery. Such UDFs can be implemented in a system-specific way; for example, DB2 uses its table UDFs to define these functions. The functions include: i) spatial relationship functions, such as contains(), within(), adjacent(), overlaps(), etc.; and ii) spatial property functions, such as area(), longestDiameter(), etc.
The AIM model is semantically enabled in two respects. First, the AIM data model is caBIG silver compliant: its classes and attributes are annotated with concept IDs and other information and registered at caDSR.14 Second, AIM provides a framework in which content can be authored through references to ontologies or controlled vocabularies. For example, AIM defines the ImagingObservation, ImagingObservationCharacteristic, and AnatomicEntity classes with the following attributes: i) codingSchemeDesignator, the ontology or controlled vocabulary on which the data element depends, such as RadLex, caDSR, or EVS; ii) codeMeaning, the meaning of the concept code; and iii) codeValue, the unique identification code for the concept.
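The spatial property functions can be illustrated outside the database. The sketch below is one plausible reading of area() and longestDiameter() for a shape given as a list of (x, y) coordinates: the shoelace formula for a simple polygon and the largest pairwise point distance. These definitions, the Python function names, and the `relative_change` helper are assumptions made for illustration, not the UDF implementations used in the system described here.

```python
import math
from itertools import combinations


def area(points):
    """Area of a simple polygon [(x, y), ...] via the shoelace formula."""
    n = len(points)
    twice_area = sum(
        points[i][0] * points[(i + 1) % n][1]
        - points[(i + 1) % n][0] * points[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def longest_diameter(points):
    """Largest Euclidean distance between any two points of the shape."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)),
               default=0.0)


def relative_change(baseline, followup):
    """Relative change of a measurement between baseline and follow-up."""
    return (followup - baseline) / baseline


# Hypothetical baseline and follow-up markups (axis-aligned rectangles).
baseline_shape = [(0, 0), (4, 0), (4, 3), (0, 3)]
followup_shape = [(0, 0), (8, 0), (8, 6), (0, 6)]
change = relative_change(longest_diameter(baseline_shape),
                         longest_diameter(followup_shape))
```

The brute-force pairwise search is quadratic in the number of points, which is acceptable for hand-drawn markups with a modest number of coordinates.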
The semantic enabled data model provides two opportunities to support semantic enabled queries: semantic interoperable queries across different data sources through caDSR, or semantic enabled queries through extending XML queries in AIM databases. We will consider the latter, and show how these are achieved through considering synonyms, hypernyms, and hyponyms.
declare namespace ns1="gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"; for $a in collection()/ns1:ImageAnnotation for $syn in getSynonyms("parenchyma of kidney") where $a/ns1:anatomicEntityCollection/ns1:AnatomicEntity/@codeMeaning = $syn return $a
The way such functions query an ontology database or server varies, as different ontology systems provide different access methods and APIs. RadLex, for example, provides RESTful services for queries. The user-defined function is normally XML database dependent.
XML databases provide different access methods, through value-based indexing, edge-based indexing, or internal association tables. We demonstrate that such approaches significantly boost query performance. For example, in DB2, we can create indexes by generating keys from XPath patterns. For double-slash ("//") based queries, performance deteriorates as more index nodes need to be searched, especially for Berkeley DB XML.
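The synonym-expansion idea can be sketched independently of any particular ontology service. In the Python snippet below, the in-memory `SYNONYMS` table stands in for a call to an ontology server; its entries, the sample annotations, and the helper names are hypothetical and are not taken from RadLex or from the system described here.

```python
import xml.etree.ElementTree as ET

AIM_NS = "gme://caCORE.caCORE/3.2/edu.northwestern.radiology.AIM"
NS = {"ns1": AIM_NS}

# Hypothetical synonym table standing in for an ontology lookup service.
SYNONYMS = {
    "parenchyma of kidney": {"parenchyma of kidney", "renal parenchyma"},
}


def get_synonyms(term):
    """Return the term together with its known synonyms (lower-cased)."""
    key = term.lower()
    return SYNONYMS.get(key, {key})


def make_annotation(uid, anatomic_meaning):
    """Build a tiny synthetic ImageAnnotation with one AnatomicEntity."""
    root = ET.Element(f"{{{AIM_NS}}}ImageAnnotation", uniqueIdentifier=uid)
    coll = ET.SubElement(root, f"{{{AIM_NS}}}anatomicEntityCollection")
    ET.SubElement(coll, f"{{{AIM_NS}}}AnatomicEntity",
                  codeMeaning=anatomic_meaning)
    return root


def match_annotations(annotations, term):
    """Keep annotations whose AnatomicEntity codeMeaning matches any synonym."""
    terms = get_synonyms(term)
    path = "ns1:anatomicEntityCollection/ns1:AnatomicEntity"
    return [
        a for a in annotations
        if any((e.get("codeMeaning") or "").lower() in terms
               for e in a.findall(path, NS))
    ]


annotations = [
    make_annotation("a1", "parenchyma of kidney"),
    make_annotation("a2", "Renal parenchyma"),
    make_annotation("a3", "lung"),
]
matched = [a.get("uniqueIdentifier")
           for a in match_annotations(annotations, "parenchyma of kidney")]
```

Hypernym and hyponym expansion would follow the same shape, with the lookup returning broader or narrower terms instead of synonyms.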
Additional optimization of query performance is possible through fast disks such as RAID or solid state disks. The latter provides much better performance on random reads. The latest partitioning support for XML data in DB2 can also distribute data across multiple CPUs and storage of parallel or cluster machines to boost the performance.
To test the performance of XML-based data management, we take the AIM dataset converted from LIDC,15,16 which includes 17,927 documents with a total size of 155 MB. The machine we use for the test is a Lenovo W500 with a Core 2 Duo at 2.8 GHz, 4 GB of RAM, and a SATA 7200 rpm hard drive, running Windows XP Professional. We have DB2 Enterprise Edition V9.5 and Oracle Berkeley DB XML V2.4.16 installed. Note that eXist is not chosen for the study, as it takes a DOM-based storage model and frequently runs out of memory.
Figure 3 shows the result of several test cases. The result shows that the query support is very efficient, and DB2 performs better than Berkeley DB XML in most queries. For DB2, most queries take less than one second except the third query where two conditions are joined together. Batch uploading is very efficient even though the databases need to build the storage structure for the documents during the uploading process.
caGrid is a Grid middleware infrastructure to support collaborative biomedical research studies. With a service-oriented architecture, caGrid allows researchers to share both their data and analytical resources as grid services, and provides federated queries across distributed databases. caGrid takes a model-driven architecture, where data models are defined in UML, and an abstract data access and query layer for the data sources is provided. On the client side, users can query caGrid data sources with the Common/caGrid Query Language (CQL),17 an object-oriented query language for querying data defined through UML models. For relational databases, CQL queries are translated into SQL queries through Hibernate mappings created using the caCORE SDK, and executed on the databases of the data sources. Hibernate provides the object-relational mapping and the serialization and deserialization of the query results into objects, which are subsequently encoded as XML documents for transmission to the caGrid client.
However, the current caGrid infrastructure only supports data access of relational data stored in relational databases. Existing tools and architectures will not support querying or uploading of XML data managed in XML databases such as AIM databases, which provide XML Schema based schema definition language and XPath/XQuery based query languages.
To support caGrid-based data services that support existing XML schemas, existing XML data, and/or existing XML databases, we develop the caGrid XML Data Service Framework (xService).18 xService is an extension of caGrid for querying and retrieving XML documents managed in XML databases. xService provides automated data model mapping from XML schema (defined in the XML Schema language) to the caGrid Domain Model (represented in XMI format), query mapping from CQL to an XML query language (XPath), and generic XML database interfaces for uploading, updating, querying and retrieving XML data from the diverse XML data sources discussed in this paper (Figure 4). xService also provides an extension to the Introduce Toolkit for users to flexibly and rapidly create their own caGrid data services based on a predefined XML Schema. Figure 5 shows a screenshot from Introduce where a user can select which database to use for the backend XML storage.
Through xService, caGrid clients will be able to query and upload AIM data through caGrid infrastructure. Operations provided by the AIM Data Service include: i) query, where a CQL query is taken and a CQL query result object is returned, such as count, attributes or AIM objects; ii) enumeration query, where an enumerable resource is returned for iterative next operation; iii) queryByTransfer, where query result is transported through caGrid transfer service. Transfer service uses HTTP protocol for efficient data transportation instead of SOAP protocol; iv) submit, where XML string is uploaded to the database; and v) submitWithTransfer, a caGrid transfer version of submit.
AIME19 is a collection of caGrid data services at Emory University for managing AIM documents used for multiple research projects, including TCGA Enterprise Use Case from caBIG In Vivo Imaging Workspace.
DICOM SR has been used to model and store image annotations and markups in DICOM,20 but DICOM SR provides neither an approach for querying the data nor a semantic approach for representing annotations. Recently, a Semantic Web based approach has been proposed to manage ontologies and semantic metadata for medical image annotations.21
XML-based approaches for managing biomedical data are becoming increasingly popular. One way is to provide XML-based interfaces with a relational backend. For example, XNAT22 is an XML-based platform for managing neuroimaging and related data, and represents data in XML at the schema level. XNAT uses a relational database engine as the backend storage, and provides data and query mapping between XML and the RDBMS. There is also work on providing a unified XQuery-based frontend for relational data sources.23 Project Mobius24 provides generic mapping of XML Schema elements to relational tables, although XML query support is limited.
XML databases are also becoming popular in biomedical applications. For example, an XML database has been used to manage biological pathway datasets.25 SciPort26, 27, 28 provides an XML-based approach for modeling, managing and integrating scientific experiments and data. XML is also used to integrate heterogeneous bio-molecular data.29
AIM is becoming the standard for representing image annotation and markup for translational research and clinical trials. In this paper, we discuss our work on using XML database based approach to manage AIM XML documents, metadata and attached segmentation objects. The XML approach shows significant benefit on supporting complex AIM queries, such as spatial queries and semantic enabled queries. Our performance study shows that the approach is efficient. Through xService, AIM data can be conveniently shared and queried through caGrid.
The project described was supported, in part, by Grant Number R24HL085343 from the National Heart, Lung, and Blood Institute. The project has been funded in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. HHSN261200800001E.