Annotation of an entire document
The simplest example of annotation is depicted in Figure , where an entity from the PRotein Ontology (PRO) [2] has been linked. The predicate ao:hasTopic has been used for this purpose. By design, ao:hasTopic allows either classes or individuals as its range. This supports the use, as annotations, of both "classes-only" ontologies such as PRO of the OBO Foundry and SKOS-like terminologies whose terms are instances. We are aware that this approach could lead to an OWL Full representation, but we decided not to transform the predicate into an Annotation Property because we do not want to preclude OWL DL reasoning and querying for those who are interested. Instead, when referring to classes, we recommend that users express the relationship as a restriction, which is a valid DL statement.
The dashed ovals in the diagram represent instances while the remaining ovals represent classes. In this case the annotation – an AlzSWAN claim (http://tinyurl.com/ykjn87p) concerning the protein ‘Beta-Secretase 1’ – is linked to a whole document. The annotation is of type aot:Qualifier, which states that the document is related to the linked entity. Annotation types are discussed extensively in a subsequent section. Note that the same annotation can be applied to multiple documents simply by linking more documents to the annotation instance.
Annotation provenance
AO uses FOAF [62] as the preferred way of representing agents and documents. However, the integration between the two ontologies is performed through a module that is external to the core of the annotation ontology. This keeps AO independent of FOAF and leaves AO users free to adopt alternatives. When FOAF is used, as in the rest of this paper, the AO annotation can be created by an agent (foaf:Agent): a person (foaf:Person), a software agent (swan-agent:Software), a group (foaf:Group) or an organization (foaf:Organization) annotating a resource. Annotation provenance is represented using the PAV (Provenance Authoring and Versioning, http://code.google.com/p/pav-ontology/) module of the SWAN ontology [52, 63]. Resources that are the subject of an annotation are mainly documents (foaf:Document), but they can be any online resource. Some provenance can also be provided through Dublin Core or Dublin Core Terms (http://purl.org/dc/elements/1.1/). We recommend using them as an add-on rather than as an alternative: PAV remains the first choice for AO because of its wide range of curation features, which are particularly important for scientific communities. Curation is discussed later in this paper.
Source document provenance
In Figure we see that the Document (foaf:Document) is linked twice to the annotation: once directly through the Annotea relationship aof:annotatesDocument and once through a pav:SourceDocument class defined in version 2.0 of the PAV ontology [64]. FOAF defines – or, better, loosely defines – a foaf:Document class as:
"The Document class represents those things, which are, broadly conceived, 'documents'."
In AO, the user needs to point to the document being annotated, and typically this happens by means of a URI. Unfortunately, the document often changes over time while the URI stays the same. A property such as pav:accessedOn is of little use if the URI is always the same: we end up with one URI and multiple access dates, with no way to determine which date is associated with each annotation. The solution we propose, which at first glance seems to introduce some redundancy, is to have a stable URI for the webpage and a different URI for each detected version of the document. To keep compatibility with FOAF, Figure would translate into the following RDF:
Through this approach it is now possible for an annotation to point both to the URI of the webpage (foaf:Document) and to the URI of the specific version of the document (pav:SourceDocument).
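The idea of pairing one stable URI with one URI per detected version can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the record types, their field names and the example.org URIs are hypothetical and are not AO or PAV terms, and the sketch does not reproduce the RDF of the figure.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceDocument:
    """One detected version of a web page (hypothetical record, not the PAV class)."""
    version_uri: str   # URI minted for this specific version
    stable_uri: str    # URI of the web page, unchanged across versions
    accessed_on: date  # date on which this version was retrieved


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record pointing at both URIs."""
    uri: str
    document_uri: str        # stable page URI (the foaf:Document)
    source_version_uri: str  # version that was actually annotated


def version_for(annotation, versions):
    """Return the document version on which the annotation was made."""
    by_uri = {v.version_uri: v for v in versions}
    return by_uri[annotation.source_version_uri]


# Two versions of the same page share the stable URI but have distinct version URIs.
versions = [
    SourceDocument("http://example.org/doc/1/v1", "http://example.org/doc/1", date(2010, 3, 1)),
    SourceDocument("http://example.org/doc/1/v2", "http://example.org/doc/1", date(2010, 6, 15)),
]
a1 = Annotation("http://example.org/ann/1", "http://example.org/doc/1", "http://example.org/doc/1/v1")
a2 = Annotation("http://example.org/ann/2", "http://example.org/doc/1", "http://example.org/doc/1/v2")
```

Because each annotation carries the version URI, the access date of the version it refers to is unambiguous even though both annotations share the same page URI.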
Annotation core and selectors
In this section, we describe how the core and selectors of AO are used to attach annotations to various types of documents. In the context of online scientific communities, resources targeted by an annotation can be HTML pages, documents, images, videos, databases and fragments or sections of them. This variety of possible targets – not all of them with a definition of how to construct a fragment URI – and their mutability motivate the introduction of the class Selector.
A Selector identifies a portion of a resource, and may work differently for different types of documents and content types. Selectors are meant to be stable URIs. It is also common to provide different selector models for the same resource type. For instance, to select a chunk of text in an XHTML document we can use a mechanism based on XPointer, one based on an offset and range, or a more robust mechanism such as a combination of the two. For immutable content, selectors of the XPointer or offset-and-range type might be easier to deal with. In general, though, it is well known that not all HTML pages are immutable and that some sections of these pages may vary with time (news and advertisements are often embedded in the document), which requires more reliable and customized fragment identification mechanisms. In AO we use the Annotea property ann:context to attach an XPointer to an annotation instance as shown below:
In AO, we also provide an aos:XPointerSelector that allows the use of XPointers and has the additional benefits of a Selector, including the ability to define the document provenance through an instance of pav:SourceDocument.
Figure provides a similar example to Figure , but instead of annotating the whole document, only a portion of it has been annotated. The fragment is detected through a type of aos:TextSelector that identifies the exact piece of text by specifying its prefix and its postfix. Also, while the annotation in Figure was created manually by a foaf:Person, the annotation depicted here has been created by a text mining service. The annotation type is aot:ExactQualifier, as the chunk of text represents the protein beta-secretase 1.
Selector has several sub-classes that can be chosen based on the nature of the source document (images, videos, records in databases and text). Figure illustrates the annotation of a portion of a text document using aos:PrefixPostfixSelector, a subclass of aos:TextSelector, which is in turn a subclass of Selector. Such a selector works through three properties: aos:exact is the exact string or sequence of characters being annotated, while aos:prefix and aos:postfix are the sequences of characters preceding and following the match. The three properties are defined ignoring any HTML/XML markup and normalizing white space. The concatenation of the values of aos:prefix, aos:exact and aos:postfix is used to identify the section of text.
The above selector allows us to identify a portion of the document text, and link it with a term from a formally defined vocabulary. This specific selector works particularly well when the document is not immutable. In fact, even if other sections of the document change, it is possible to still detect the context if the annotated content is still present.
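The matching behaviour described for the prefix/postfix selector can be sketched in Python as follows. This is a minimal sketch, not the AO implementation: the function names, the regular-expression approach to dropping markup and the sample pages are assumptions made for illustration only.

```python
import re


def normalize(text):
    """Drop HTML/XML tags and collapse runs of white space into single spaces."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def locate(document, prefix, exact, postfix):
    """Return (start, end) offsets of `exact` in the normalized document text,
    or None when the concatenation prefix + exact + postfix does not occur."""
    text = normalize(document)
    start = text.find(prefix + exact + postfix)
    if start == -1:
        return None
    begin = start + len(prefix)
    return begin, begin + len(exact)


# Hypothetical selector values, already normalized.
prefix, exact, postfix = "Cleavage of APP by ", "BACE1", " is the first"

page_v1 = "<p>Cleavage of APP by <b>BACE1</b>   is the first step.</p>"
# A later version of the page with unrelated content added before the passage.
page_v2 = "<div>Breaking news!</div> <p>Cleavage of APP by <b>BACE1</b> is the first step.</p>"

span_v1 = locate(page_v1, prefix, exact, postfix)
span_v2 = locate(page_v2, prefix, exact, postfix)
```

The offsets differ between the two versions, but in both cases the span still covers the annotated string, which is why this kind of selector tolerates changes elsewhere in the page. If the annotated passage itself is removed, `locate` returns `None`.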
The RDF generated is shown below:
Several selectors can be defined to cover different use cases.
Figure depicts an example annotation where a rectangular section of an MRI image depicting a brain tumor is selected. The selector identifies the portion of the image by specifying the coordinates of a rectangular box. In this particular case the reference to the specific version of the document through pav:SourceDocument has been omitted. This has been done to simplify the picture, and it would normally be acceptable only if the image were immutable.
AO users have the flexibility to extend the selector class based on their particular use cases (instructions are available on the wiki page). To allow the community around AO to grow coherently and to enable interoperability, we recommend contributing the new selectors back to the AO project.
It is important to notice that multiple instances of the class Selector can be attached to each annotation item. This allows performing annotation of multiple targets located in the same document. The same mechanism allows the annotation of multiple targets located in different documents.
Annotation types
In Annotea, users can create additional sub-types of annotations by using sub-classes of the Annotation class. Through this mechanism it is possible, for instance, to introduce a ‘note’ whose purpose is not to attach a term but to attach an explanatory text to a portion of a document. The list of possible sub-types is virtually unlimited, and in Annotea users can define types of annotations on the fly.
In AO we maintain the same mechanism, but we take a more conservative approach in which a predefined set of annotation types is recommended. These are shown in Table . We also allow the implementation of types through a second mechanism called ‘composition’, which is based on multiple inheritance. We introduced it to foster reuse of the already available ontologies, including SWAN [52, 63]; it does not imply connecting classes but rather works by creating instances of both the Annotation class and the one we reuse, for instance, swan:ResearchStatement for a claim or hypothesis.
Table 2. AO Annotation types defined as sub-classes of the Annotea Annotation class.
aot:Qualifier is one of the key annotation types in AO and allows the definition of a parallelism between AO and the Simple Knowledge Organization System (SKOS) model [32]. An aot:Qualifier (see Figure ) defines a generic connection between the annotated online resource (HTML pages, digital images, audio files, etc.) or resource fragments and the URI of a term in an ontology or terminology. This mimics the relationship skos:relatedMatch. Qualifier has four SKOS-compatible subclasses: aot:ExactQualifier (Figure ), aot:CloseQualifier, aot:NarrowQualifier and aot:BroadQualifier; these correspond respectively to the SKOS properties skos:exactMatch, skos:closeMatch, skos:broadMatch and skos:narrowMatch (see Table ).
Table 3. Additional annotation types can be used for creating SKOS-like annotation.
In Figure we depict the annotation of a portion of an MRI image representing a linear skull fracture, using aot:BroadQualifier, a subclass of aot:Qualifier.
The annotator declared the image to express a ‘linear skull fracture’ through a textual label. As a textual label has limited classification value, and as the annotator could not find this specific term in the available ontologies/terminologies, she/he declared the image fragment to be represented by the term ‘skull fracture’ coming from a specific ontology and identified by a URI. As the annotator considers the term ‘skull fracture’ to have a broader meaning than what the image really expresses, the qualifier is declared to be broader than the ideal one. The parallelism with SKOS allows exploring automatic ontology building and improving the analysis of clouds of tags and annotations. In general, it is possible to relate either ‘a skull fracture’ or ‘the skull fracture of patient X’ to the portion of the image. The choice is left to the users and their specific needs. Also, when referring to raster images, the accuracy of the selection is not as good as it can be with, for instance, vector images. That is why the property ao:hasTopic may have a different precision in different contexts.
In the same example it would be trivial to add another Qualifier defining as context the very same Selector. For instance, we could state that the portion of the image also exactly represents an instance of the entity ‘Hematoma’; in this case we can use an aot:ExactQualifier in a similar way to that depicted in Figure .
In summary, we state that the area identified by the selector has a more precise meaning than the term “Skull Fracture” and has the exact meaning of the term “Hematoma”, where both of these terms are entities specified in some other ontology/vocabulary. For images and fragments of images, it is not as easy to see the advantage of applying the SKOS approach as it is for text. For instance, if we focus on the example presented in Figure , where a chunk of text “BACE1” has been classified exactly as “Beta Secretase 1” from the PRO ontology, we can easily add, reusing the same selector, that the same chunk of text is classified as a narrower concept than a Protein in BIRNLex – the Biomedical Informatics Research Network (BIRN) project lexicon – which is identified by the URI http://bioontology.org/projects/ontologies/birnlex#birnlex_23. As a result, we could derive that the entity “Beta Secretase 1” in the PRotein Ontology (PRO) has a narrower meaning than the entity “Protein” in the BIRNLex vocabulary. As more annotations are attached to documents, we can infer cross-ontology relationships.
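The derivation just described can be sketched in a few lines of Python. This is a hedged illustration rather than a reasoner that AO provides: annotations are reduced to hypothetical (selector, qualifier type, term) tuples, the `ex:` and `pro:` identifiers are placeholders, and only the combination of an exact qualifier with a broad qualifier on the same selector is handled.

```python
def infer_narrower_than(annotations):
    """Pair every exact term with every broader term attached to the same selector.

    `annotations` is an iterable of (selector, qualifier_type, term) tuples.
    Returns a set of (narrower_term, broader_term) pairs.
    """
    exact, broad = {}, {}
    for selector, qualifier, term in annotations:
        if qualifier == "aot:ExactQualifier":
            exact.setdefault(selector, set()).add(term)
        elif qualifier == "aot:BroadQualifier":
            broad.setdefault(selector, set()).add(term)
    return {
        (exact_term, broad_term)
        for selector, terms in exact.items()
        for exact_term in terms
        for broad_term in broad.get(selector, ())
    }


annotations = [
    # The chunk "BACE1" is exactly Beta Secretase 1 (placeholder identifier).
    ("ex:selector1", "aot:ExactQualifier", "pro:BetaSecretase1"),
    # The same chunk is narrower than Protein in BIRNLex.
    ("ex:selector1", "aot:BroadQualifier", "birnlex:birnlex_23"),
    # A different selector with no exact term yields no pair.
    ("ex:selector2", "aot:BroadQualifier", "ex:SkullFracture"),
]

pairs = infer_narrower_than(annotations)
```

Here `pairs` contains a single candidate relationship, stating that the placeholder PRO term is narrower than the BIRNLex term. In practice such candidates would presumably be reviewed before being asserted.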
Integration with SWAN and other existing ontologies
As mentioned above, AO, like Annotea, allows the creation of additional sub-types of annotations by sub-classing the Annotation class. In addition, in AO it is possible to define annotations by composition or multiple inheritance. An example is the SWAN Ontology, which can be integrated with AO by defining an instance of both ao:Annotation and swan:Claim as depicted in the following RDF.
This second mechanism works well for integrating existing entities of other ontologies. The SWAN Ontology is just an example; there are others that we are already considering integrating, such as CiTO [65] and BIBO [66], [65] which would be used for annotating citations.
Curation
Curation is a crucial aspect of scientific publication and therefore an important aspect of our annotation ontology. In order to enable the complete cycle of activities that we define here as a RECS (Run, Encode, Curate, Share) process, AO supports a curation process that includes both manual annotation and text mining services. Figure showed an annotation generated by a text mining service that had not yet been evaluated and accepted by a human curator through what AO calls a ‘Curation Token’. In Figure we display the same example with the curation token; some details, such as those of the selector, are omitted to keep the picture fairly simple. In the case depicted in Figure , a foaf:Person accepts the software-generated annotation.
In general, every annotation can undergo a multi-step curation process that can involve one or more users, each generating one or more curation tokens. In Figure , we show a typical example of a semi-automatic annotation workflow that can be summarized as follows: an annotation is created by a text mining service, and first a user expresses a judgment on the validity of the automatic annotation, for instance "rejected" (curation #1). Later on, a second curator might want to discuss the reason(s) for rejection (curation #2 with status: discussed). Finally, a decision is taken and the annotation is either rejected or accepted (curation #3 with status: accepted). We assume that the curation process is a linear story where the timeline can be determined through the curation dates. Alternatively, it is possible to compile explicitly an ordered list of curation items using, for instance, the SWAN Ontology module for collections [67].
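The assumption that curation is a linear story ordered by date can be illustrated with the following Python sketch. The tuple layout, the curator labels and the dates are hypothetical; only the three statuses and their order come from the workflow described above.

```python
from datetime import date


def timeline(tokens):
    """Order curation tokens by date to reconstruct the linear story."""
    return sorted(tokens, key=lambda token: token[0])


def current_status(tokens):
    """Return the status of the most recent curation token, or None if uncurated."""
    if not tokens:
        return None
    return max(tokens, key=lambda token: token[0])[2]


# (curated_on, curator, status) tuples, deliberately stored out of order.
tokens = [
    (date(2010, 5, 3), "curator-B", "discussed"),
    (date(2010, 5, 1), "curator-A", "rejected"),
    (date(2010, 5, 7), "curator-A", "accepted"),
]

story = [status for _, _, status in timeline(tokens)]
final = current_status(tokens)
```

Sorting by date recovers the sequence rejected, discussed, accepted, and the latest token gives the current state of the annotation. An explicit ordered list would be needed only when dates are missing or ambiguous.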
Annotation sets
In our experience, it is often useful to be able to group annotations by a specific criterion. Examples of criteria are: the collection of all the annotation items related to proteins, the collection of all the annotation items representing scientific discourse elements, and the collection of all the annotation items that have been published by a scientific community as officially curated. Sets can also be used to collect all the results produced by a specific text mining service; in this case the criterion would be the sharing of the same provenance. For grouping annotation items we introduced the concept of annotation set. The ao:AnnotationSet is a container of annotations.
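Grouping by criterion can be pictured with a short Python sketch. The dictionary keys, the `ex:` identifiers and the `build_sets` helper are hypothetical and stand in for whatever properties an application actually uses to decide set membership.

```python
from collections import defaultdict


def build_sets(annotations, criterion):
    """Group annotation items into sets keyed by the value returned by `criterion`."""
    sets = defaultdict(list)
    for item in annotations:
        sets[criterion(item)].append(item["uri"])
    return dict(sets)


annotations = [
    {"uri": "ex:ann/1", "created_by": "ex:textMiner", "topic_kind": "protein"},
    {"uri": "ex:ann/2", "created_by": "ex:alice", "topic_kind": "protein"},
    {"uri": "ex:ann/3", "created_by": "ex:textMiner", "topic_kind": "claim"},
]

# One criterion groups by shared provenance, another by the kind of topic.
by_provenance = build_sets(annotations, lambda item: item["created_by"])
by_topic = build_sets(annotations, lambda item: item["topic_kind"])
```

The same item can belong to several sets at once, since each set reflects a different criterion.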
Versioning and evolution of the annotation
A careful look at Figure reveals the presence of an object property pav:previousVersion and a datatype property pav:versionNumber. AO has been designed to be as monotonic as possible. Therefore, once an annotation item has been created in a set, it is good practice not to delete it by removing it from the set; instead, the item can be rejected through curation. Note that in Figure , the ao:AnnotationSet was created by a software agent and includes an automatically generated item.
Every new annotation item is created with a corresponding URI. Curation can be applied to the annotation. Edits of the annotation item, as well as new curation tokens, may (or may not) be defined as a new version of the annotation according to the requirements of the specific application. If a new version of the item is encoded, it gets a new URI and a pointer to the previous version of the same item. If curation is performed multiple times, the new item version will have a longer curation chain, as shown in Figure . An instance of an ao:AnnotationSet can be versioned every time the set of annotation items changes (for example, when a new item is added) or even every time any item of the set changes. A new version of an ao:AnnotationSet results in a new item, with a new URI, pointing to the previous version of the same set.
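The monotonic versioning scheme can be sketched in Python as follows. The in-memory `store`, the dictionary keys and the `ex:` and `pro:` identifiers are assumptions made for illustration; the sketch only mirrors the idea of a new URI per version with a pointer to the previous one.

```python
def new_version(store, previous_uri, new_uri, changes):
    """Add a revised copy of an item under a new URI; the old item is kept untouched."""
    previous = store[previous_uri]
    revised = {
        **previous,
        **changes,
        "version": previous["version"] + 1,
        "previous_version": previous_uri,
    }
    store[new_uri] = revised
    return revised


def history(store, uri):
    """Follow previous-version pointers from `uri` back to the first version."""
    chain = []
    while uri is not None:
        chain.append(uri)
        uri = store[uri]["previous_version"]
    return chain


# The first version is a free-text tag without a semantic topic.
store = {
    "ex:ann/1": {"body": "BACE1", "topic": None, "version": 1, "previous_version": None},
}
# The second version adds a topic (placeholder identifier) under a new URI.
new_version(store, "ex:ann/1", "ex:ann/2", {"topic": "pro:BetaSecretase1"})
chain = history(store, "ex:ann/2")
```

Nothing is deleted: the first version stays in the store, and the chain from the newest URI back to the oldest can always be rebuilt.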
It is also possible to derive one ao:AnnotationSet from another. This is common when a publicly available set is imported by an application and branched. In AO the second set is connected to the first one through the relationship pav:derivedFrom. This ensures the continuity of the annotation and makes it possible to establish the correct attribution of contributions. Branching one set into another establishes the evolution of the annotation.
Another example of annotation evolution occurs when an annotation initially defined as a tag is later associated with a semantic entity (Figure ). To keep the representation monotonic, we define a second annotation with the ao:hasTopic property. The second annotation supersedes the first, which remains in the knowledge base.
Alignment with the SIOC ontology
In Figure , besides the annotation set and its provenance and versioning, it is also possible to see the mappings of the annotation ontology to the SIOC ontology [55, 68]. SWAN and SIOC have been the object of an alignment process in the context of the Scientific Discourse Task Force (http://esw.w3.org/HCLSIG/SWANSIOC), one of the subgroups of the W3C Health Care and Life Sciences Working Group [69]. As creators of the SWAN ontology we confirm our commitment to keeping the two efforts aligned. The classes ao:Annotation and ao:AnnotationSet are declared sub-classes of sioc:Item and sioc:AnnotationSet, respectively.
Comparison to existing tagging ontologies
The scope of AO is much wider than tagging. However, tagging represents an important component of current online applications. We dedicated a fair amount of time to clarifying the similarities and differences between AO and existing ontologies such as Newmann’s Tagging Ontology [44] and MOAT [56].
The MOAT (Meaning Of A Tag) project aims to solve the problem of ambiguous or redundant tags by providing a way for users to define the meaning(s) of their tag(s) using URIs of Semantic Web resources. Even though AO was born to provide semantically defined data, we recognized the possibility of having a free text tag; we recommend doing this natively by leveraging the Annotea property ‘body’. Alternatively, it is possible to fully reuse the existing MOAT content in a way that increases its expressiveness and, at the same time, allows AO-style curation. The latter approach is depicted in Figure , where the free text tag “Linear skull fracture” is modeled using the moat:Tag class and attached to the annotation through the relationship ao:hasTagging. We have not reused the relationship moat:taggedWith because, as depicted in the same picture, it is supposed to link the annotated resource to the instance of the moat:Tag class.
It is also possible to notice that a meaning of the tag has been expressed through an instance of the class moat:Meaning. As the connected meaning is more general than the actual tag, the annotation instance refers to the meaning instance through the property ao:hasNarrowerMeaningThan. From the perspective of the annotation, we can therefore define something more precise than moat:hasMeaning. AO allows us to define different levels of meaning through a set of properties that can be mapped directly into the Simple Knowledge Organization System (SKOS) model [32]: ao:hasRelatedMeaning, ao:hasExactMeaning, ao:hasCloseMeaning, ao:hasNarrowerMeaningThan and ao:hasBroaderMeaningThan. As was done for the Qualifier annotation types, these properties can be mapped respectively to skos:relatedMatch, skos:exactMatch, skos:closeMatch, skos:narrowMatch and skos:broadMatch.
The model in Figure can be translated or expressed – if the annotation does not derive from preexisting MOAT content - into pure AO annotation using qualifiers and the previously introduced SKOS-compatible subclasses: ExactQualifier, CloseQualifier, NarrowQualifier and BroadQualifier. This is the case of Figure where this example is expressed in pure AO.
Besides the possibility of reusing and improving the existing MOAT content in AO, it is also worth mentioning how the annotation type Qualifier, that in RDF can look like:
can be transposed in MOAT and Newmann’s Tagging Ontology as a tag:RestrictedTagging, which represents the meaning of a tag in a specific context:
Once again, the scope of AO is much wider than that of MOAT, but we wanted to provide integration with MOAT to be able to reuse already existing content. It is also possible to apply MOAT tags to any other annotation type, for instance a Note, even if we generally recommend creating multiple annotations pointing to the same document or selector. Doing so makes it possible to perform curation for each specific piece of information.