In the life sciences, the total number of items described by social tagging systems is currently tiny in comparison to the number of resources described by institutions. To illustrate, the MEDLINE bibliographic database contains over 16 million citations [19] while, as of November 9, 2008, CiteULike, the largest of the academic social tagging services, contained references to only about 203,314 of these documents. Figure 2 plots estimates of the number of new citations (with distinct PubMed identifiers) added to both PubMed and CiteULike per month over the past several years. The chart visualizes both the current difference in scale and the rates of growth of the two systems. It shows that both systems are indexing more items every month, that CiteULike appears to be growing slightly faster than PubMed, and that CiteULike is approaching 10,000 unique PubMed citations added per month while MEDLINE is approaching 60,000.
Figure 2 The number of distinct new biomedical citations indexed per month by CiteULike and by MEDLINE. The figure illustrates the increasing rates of growth, per month, of new citations with PubMed identifiers to be indexed by MEDLINE (upper points in pink) and […]
These data suggest that, despite the very large numbers of registered users of academically-focused social tagging services - on November 10, 2008, Connotea reported more than 60,000 (Ian Mulvaney, personal communication) - the actual volume of metadata generated by these systems remains quite low. While the sheer number of users of these systems makes it possible that this volume could increase dramatically, that possibility remains to be demonstrated.
Density refers simply to the number of metadata terms associated with each resource described. Though providing no direct evidence of the quality of the metadata, it helps to form a descriptive picture of the contents of metadata repositories that can serve as a starting point for exploratory comparative analyses. To gain insight into the relative density of tags used to describe citations in academic social tagging services, we conducted a comparison of the number of distinct tags per PubMed citation for a set of 19,118 unique citations described by both Connotea and CiteULike. This set represents the complete intersection of 203,314 PubMed citations identified in the CiteULike data and 106,828 PubMed citations found in Connotea.
The table below provides an assessment of the density of distinct tags used to describe these citations, both by individual users and by the aggregate of all users of each system. These numbers are contrasted with the average number of MeSH subject descriptors (both major and minor subject headings were included) used to index the same set of documents. Only the MeSH descriptors are reported, ignoring large amounts of additional subject-related metadata such as descriptor modifiers, supplementary concept records, and links to other databases such as NCBI Gene [20].
Tag density in Connotea, CiteULike and MEDLINE on PubMed citations
In terms of tags per post, the users of CiteULike and Connotea were very similar. As the table indicates, the mean number of tags added per biomedical document by individual users was 3.02 for Connotea and 2.51 for CiteULike, with a median of 2 tags/document for both systems. These figures are consistent with tagging behaviour observed throughout both systems and with earlier findings on a smaller sample from CiteULike, which indicated that users typically employ 1-3 tags per resource [21]. On independent samples of 500,000 posts (tagging events) from each of CiteULike and Connotea, covering a wide variety of subjects, the medians for both systems were again 2 tags/document and the means were 2.39 tags/document for CiteULike and 3.36 for Connotea. The difference in means is driven, to some extent, by the fact that CiteULike allows users to post bookmarks to their collections without adding any tags while Connotea requires a minimum of one tag per post. Observed differences may also reflect the fact that neither the user populations of the two systems nor the interfaces used to author the tags are identical. In fact, given the many potential differences, the observed similarity in tagging behaviour across the two systems is striking.
As more individuals tag a given document, more distinct tags are assigned to it. After aggregating the tags added to each citation in the sample by all of the users who tagged it, the mean number of distinct tags/citation was 4.15 for Connotea and 5.10 for CiteULike. This difference reflects the larger number of posts describing these citations in the CiteULike service: in total, 45,525 CiteULike tagging events produced tags for the citations under consideration, compared with just 28,236 Connotea tagging events.
Overall, the subject descriptors from MEDLINE exhibited a much higher density, at a mean of 11.58 and a median of 11 descriptors per citation, than the social tagging systems, as well as a lower coefficient of variation across citations. Figures 3, 4, and 5 plot the distribution of tag densities for Connotea, CiteULike, and MEDLINE respectively. These figures show that, even after aggregating the tags produced by all users, most of the citations in the social tagging systems are described with only a few distinct tags. Note that the first bar in each chart shows the fraction of citations with zero tags (none for Connotea).
Figure 3 The number of distinct tags assigned per PubMed citation by the aggregate of Connotea users. The figure provides a probability density histogram of the number of distinct tags per PubMed citation within Connotea. For each citation, the non-redundant set […]
Figure 4 The number of distinct tags assigned per PubMed citation by the aggregate of CiteULike users. The figure provides a probability density histogram of the number of distinct tags per PubMed citation within CiteULike. For each citation, the non-redundant […]
Figure 5 The number of MeSH subject descriptors assigned per PubMed citation by MEDLINE. The figure provides a probability density histogram of the number of MeSH subject descriptors assigned per PubMed citation. The peak (just under a density of 0.1) is at 10 […]
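The per-post and aggregate density measures reported above can be reproduced from raw posting data with a few lines of code. The sketch below uses a small, invented set of posts (the tuple layout, identifiers, and tag values are all hypothetical, not real Connotea or CiteULike records) to show how tags/post, distinct tags/citation, and the coefficient of variation are derived:

```python
from statistics import mean, median, pstdev

# Hypothetical posts: (user, pmid, tags) triples. Real data would come
# from the tagging services' exports; values here are illustrative only.
posts = [
    ("u1", 101, {"astrocytes", "vision"}),
    ("u2", 101, {"astrocytes", "fmri", "methods"}),
    ("u3", 101, {"vision"}),
    ("u1", 102, {"vegf"}),
    ("u2", 103, set()),  # CiteULike permits untagged posts
]

# Per-post density: number of tags added in each tagging event.
per_post = [len(tags) for _, _, tags in posts]

# Aggregate density: distinct tags per citation across all users.
distinct = {}
for _, pmid, tags in posts:
    distinct.setdefault(pmid, set()).update(tags)
per_citation = [len(t) for t in distinct.values()]

print(mean(per_post), median(per_post))          # tags/post
print(mean(per_citation), median(per_citation))  # distinct tags/citation
cv = pstdev(per_citation) / mean(per_citation)   # coefficient of variation
print(cv)
```

The same aggregation, applied to the full 19,118-citation intersection, yields the figures contrasted with MeSH density in the table.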
One of the reasons for the low number of tags/citation, even in the aggregate sets, is that most citations are tagged by just one person, though a few are tagged by very many. To illustrate, Figures 6, 7, 8, and 9 plot the number of citations versus the number of users posting each citation in the Connotea-CiteULike-MEDLINE intersection. Figures 6 and 7 show the data from Connotea on both a linear (Figure 6) and a logarithmic scale (Figure 7), and Figures 8 and 9 show the equivalent data from CiteULike. The plots clearly indicate exponential relationships between the number of resources and the number of times each resource is tagged, consistent with previous studies of the structure of collaborative tagging systems [14].
Figure 6 Relationship between number of PubMed citations and number of Connotea posts per citation. The X coordinates of each point on the plot correspond to the number of different people to post a particular citation. The Y coordinates are counts of the number […]
Figure 7 Relationship between number of PubMed citations and number of Connotea posts per citation plotted on a log-log scale. The X coordinates of each point on the plot correspond to the Log of the number of different people to post a particular citation. […]
Figure 8 Relationship between number of PubMed citations and number of CiteULike posts per citation. The X coordinates of each point on the plot correspond to the number of different people to post a particular citation. The Y coordinates are counts of the number […]
Figure 9 Relationship between number of PubMed citations and number of CiteULike posts per citation plotted on a log-log scale. The X coordinates of each point on the plot correspond to the Log of the number of different people to post a particular citation. […]
Current levels of tag density are indicative, but the rates of change provide more important insights into the potential of these young systems. Figures 10 and 11 plot the increase in distinct tags/citation as more Connotea (Figure 10) and CiteULike (Figure 11) users tag PubMed citations. These figures suggest that, to reach the same density of distinct tags per resource as MEDLINE produces in MeSH descriptors per resource (median 11), roughly 5 to 7 social taggers would need to tag each citation. Since, at any given time, the vast majority of citations appear to be described by just one person, as indicated in Figures 6, 7, 8, and 9, the data suggest that the density of distinct socially generated tags used to describe academic documents in the life sciences will remain substantially lower than the density of institutionally created subject descriptors. This prediction is, of course, dependent on the current parameters of academic social tagging implementations. As interfaces for adding tags change, the density of tags per post, as well as the level of agreement between different taggers regarding tag assignments, may change.
Figure 10 Increase in tag density per PubMed citation with increase in number of Connotea posts per citation. Each vertical box and whisker plot describes the distribution of the number of distinct Connotea tags associated with PubMed citations tagged by the number […]
Figure 11 Increase in tag density per PubMed citation with increase in number of CiteULike posts per citation. Each vertical box and whisker plot describes the distribution of the number of distinct CiteULike tags associated with PubMed citations tagged by the […]
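Why distinct tags accumulate sublinearly with the number of taggers can be illustrated with a toy simulation (this is not the paper's analysis; the pool size and tags-per-post values below are arbitrary assumptions): each new tagger draws a few tags from a shared pool of plausible descriptors, so later taggers increasingly repeat tags already seen.

```python
import random

random.seed(0)

# Arbitrary pool of 30 plausible descriptors for one document.
pool = [f"concept_{i}" for i in range(30)]

def distinct_after(n_taggers, tags_per_post=2):
    """Distinct tags accumulated after n_taggers each post a random sample."""
    seen = set()
    for _ in range(n_taggers):
        seen.update(random.sample(pool, tags_per_post))
    return len(seen)

for n in (1, 3, 5, 7, 10):
    print(n, distinct_after(n))
```

Under this model, reaching a target density like MEDLINE's median of 11 descriptors requires several times more taggers than the one tagger most citations actually receive.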
Measures of inter-annotator agreement quantify the level of consensus among annotations created by multiple annotators. Where consensus is assumed to indicate correctness, it is used as a measure of quality: the higher the agreement between multiple annotators, the higher the perceived confidence in the annotations.
In a social tagging scenario, agreement regarding the tags assigned to particular resources can serve as a rough estimate of the quality of those tags, in the sense of their likelihood of being useful to people other than their authors. When the same tag is used by multiple people to describe the same thing, it is more likely to pertain directly to the important characteristics of the item tagged (e.g. 'VEGF' or 'solid organ transplantation') than to be of a personal or erroneous nature (e.g. 'BIOLS_101', 'todo', or '**'). Rates of inter-annotator agreement can thus be used as an approximation of the quality of tag assignments from the community perspective. Note that, as [23] discusses, there may be interesting community-level uses for other kinds of tags, such as those bearing emotional content. For example, tags like 'cool' or 'important' may be useful as implicit positive ratings of content in recommendation systems. However, the focus of the present study is on the detection and assessment of tags from the perspective of subject-based indexing. Note also that the small number of tags per document in the systems under consideration here brings into question the relationship between consensus and quality.
To gauge levels of inter-annotator agreement, we calculate the average level of positive specific agreement (PSA) regarding tag assignments between different users [24]. PSA is a measure of the degree of overlap between two sets - for example, the sets of tags used to describe the same document by two different people. It ranges from 0, indicating no overlap, to 1, indicating complete overlap. (See the Methods section for a complete description.) For this study, we measured PSA for tag assignments at five levels of granularity: string, standardized string, UMLS concept, UMLS semantic type, and UMLS semantic group. At the first level, PSA measures the average likelihood that two people will tag a document with exactly the same string of characters. At the next level, we measure the likelihood that two people will tag the same resource with strings of characters that, after syntactic standardization (described in the Methods section), are again exactly the same. Moving up to the level of concepts, we assess the chances that pairs of people will use tags that a) can be mapped automatically to concept definitions in the UMLS and b) map to the same concepts. (Note that not all of the tags in the sample were successfully mapped to UMLS concepts; only tagging events where at least one UMLS concept was identified were considered for the concept, type, and group level comparisons.) At the level of semantic types, we measure the degree to which pairs of taggers use the same basic kinds of concepts, where these kinds are the 135 semantic types that compose the nodes of the UMLS semantic network [25]. At the uppermost level, we again measure agreement regarding the kinds of tags used, but here the kinds are drawn from just 15 top-level semantic groups designed to provide a coarse-grained division of all of the concepts in the UMLS [27]. The table below provides examples from each of these levels.
Examples of different levels of granularity
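For two taggers' annotation sets A and B, PSA reduces to 2|A ∩ B| / (|A| + |B|). A minimal sketch of the computation (the tag values below are invented for illustration):

```python
def positive_specific_agreement(a, b):
    """Positive specific agreement between two annotation sets.

    Equivalent to 2|A ∩ B| / (|A| + |B|): 0 means no overlap,
    1 means the two annotators used exactly the same set.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # no positive annotations to agree on
    return 2 * len(a & b) / (len(a) + len(b))

# Two hypothetical taggers describing the same citation.
t1 = {"astrocytes", "vision", "methods"}
t2 = {"astrocytes", "v1"}
print(positive_specific_agreement(t1, t2))  # 2*1/(3+2) = 0.4
```

The same function applies at every level of granularity; only the contents of the sets change (raw strings, standardized strings, or UMLS concepts, types, or groups).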
The reason for including multiple levels of granularity in the measures of agreement is to provide a thorough comparison of the meanings of the tags. Since the tags are created dynamically by users entering simple strings of text, we expect large amounts of variation in the representations of the same concepts due to the presence of synonyms, spelling errors, differences in punctuation, differences in plural versus singular forms, etc. The mapping to UMLS concepts should help to reduce the possibility of such non-semantic variations masking real conceptual agreements. Furthermore, by including analyses at the levels of semantic types and semantic groups, we can detect potential conceptual similarities that exact concept matching would not reveal. (While the present study is focused on measures of agreement, in future work this data could be used to pose questions regarding the semantic content of different collections of tags - for example, it would be possible to see if a particular semantic group like 'concepts and ideas' was over-represented in one group versus another.)
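The paper's exact standardization procedure is given in its Methods section; the sketch below is only an illustrative stand-in (lowercasing, punctuation stripping, and a crude plural rule are assumptions, not the published pipeline) showing how syntactic variants of the same term can be collapsed before comparison:

```python
import re

def standardize(tag):
    """Illustrative syntactic standardization: lowercase, replace
    punctuation with spaces, collapse whitespace, and strip a trailing
    plural 's'. Not a substitute for UMLS concept mapping."""
    t = tag.lower()
    t = re.sub(r"[^a-z0-9\s]", " ", t)   # punctuation/hyphens -> spaces
    t = re.sub(r"\s+", " ", t).strip()
    if t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]                        # crude plural stripping
    return t

variants = ["Astrocytes", "astrocytes", "astrocyte", "ASTROCYTE"]
print({standardize(v) for v in variants})  # {'astrocyte'}
```

Even simple rules like these remove much of the non-semantic variation; the remaining synonymy (e.g. abbreviations) is what the UMLS concept mapping addresses.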
The table below captures the average levels of PSA observed for CiteULike and Connotea users on taggings of PubMed citations. It shows that average PSA among CiteULike taggers ranged from a minimum of 0.11 at the level of the String to a maximum of 0.52 at the level of the Semantic Group, with Connotea users following a very similar trajectory. The table also illustrates the low number of tags per post in the social tagging data and the even lower number of UMLS Concepts that could be confidently associated with the tags. The majority of the posts from both social tagging services contained no tags that could be linked to UMLS concepts. For the posts in which at least one Concept was identified, means of just 1.39 UMLS Concepts per post were identified in CiteULike and 1.86 in Connotea.
Positive Specific Agreement among pairs of social taggers on PubMed citations
One interpretation of the low levels of agreement is that some users are providing incorrect descriptions of the citations. Another interpretation is that there are many concepts that could be used to correctly describe each citation and that different users identified different, yet equally valid, concepts. Given the complex nature of scientific documents and the low number of concepts identified per post, the second interpretation is tempting. Perhaps the different social taggers provide different, but generally valid views on the concepts of importance for the description of these documents. If that is the case, then, for items tagged by many different people, the aggregation of the many different views would provide a conceptually multi-faceted, generally correct description of each tagged item. Furthermore, in cases where conceptual overlap does occur, strength is added to the assertion of the correctness of the overlapping concepts.
To test these interpretations, some way of measuring 'correctness' of tag assignments is required. In the next several sections, we offer comparisons between socially generated tags and the MeSH subject descriptors used to describe the same documents. Where MeSH annotation is considered correct, the reported levels of agreement can be taken as estimates of tag quality; however, as will be shown in the anecdote that concludes the Results section and discussed further in the Discussion section, MeSH indexing is neither exhaustive in identifying relevant concepts nor perfect in assigning descriptors within the limits of its controlled vocabulary. There are likely many tags that are relevant to the subject matter of the documents they are linked to yet do not appear in the MeSH indexing; agreement with MeSH indexing cannot be taken as an absolute measure of quality - it is merely one of many potential indicators.
Agreement with MeSH indexing
As both another approach to quality assessment and a means to gauge precisely the relationship between socially generated and professionally generated metadata in this context, we compared the tags added to PubMed citations to the MeSH descriptors assigned to the same documents. For these comparisons we again used PSA, but in addition we report the precision and the recall of the tags generated by the social tagging services with respect to the MeSH descriptors. (For readers familiar with machine learning or information retrieval: in cases such as this, where one set is considered to contain true positives while the other contains predicted positives, PSA is equivalent to the F measure - the harmonic mean of precision and recall.)
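The equivalence between PSA and the F measure can be checked directly. In the sketch below, the social tags are treated as predictions and the MeSH-derived concepts as true positives; both concept sets are invented for illustration:

```python
def precision_recall_psa(predicted, truth):
    """Precision, recall, F1, and PSA for one citation, treating the
    social tags as predictions and the MeSH concepts as true positives."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    psa = 2 * tp / (len(predicted) + len(truth))  # same value as f1
    return precision, recall, f1, psa

# Hypothetical concept sets for one citation.
tags = {"astrocytes", "vision"}
mesh = {"astrocytes", "visual cortex", "hemodynamics", "blood volume"}
p, r, f, psa = precision_recall_psa(tags, mesh)
print(p, r, f, psa)  # precision 0.5, recall 0.25, F1 = PSA
```

The example also shows why recall runs well below precision here: the MeSH set is several times larger than the tag-derived concept set for most citations.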
For each of the PubMed citations present in both CiteULike and Connotea, we assessed a) the PSA, b) the precision, and c) the recall of tag assignments in comparison to MeSH terms at the same five semantic levels used for measuring inter-annotator agreement. For each PubMed citation investigated, we compared the aggregate of all distinct tags added by users of the social tagging service in question with its MeSH descriptors. The table below provides the results for both systems at each level, showing how the degree of agreement with MeSH indexing increases as the semantic granularity of the comparison widens. As should be expected given the much lower number of UMLS Concepts associated with the social tagging events, recall is much lower than precision at every level.
Average agreement between social tagging aggregates and MeSH indexing.
Focusing specifically on precision, approximately 80% of the concepts identified in both social tagging data sets fell into UMLS Semantic Groups also represented by the UMLS Concepts linked to the MeSH descriptors for the same resources. At the level of Semantic Types, 59% and 56% of the kinds of concepts identified in the Connotea and CiteULike tags, respectively, were found in the MeSH annotations. Finally, at the level of UMLS Concepts, just 30% and 20% of the concepts identified in the Connotea and CiteULike tags matched Concepts from the MeSH annotations.
Improving agreement with MeSH through voting
The data in the table above represent the conceptual relationships between MeSH indexing and the complete, unfiltered collection of tagging events in CiteULike and Connotea. In certain applications it may be beneficial to identify the tag assignments most likely to resemble a standard like this - for example, to filter out spam or to rank search results. One method for generating such information when many different opinions are present is voting. Assuming that tag assignments tend to agree with the standard more often than they disagree, then - where multiple tag assignments for a particular document are present - the more times a tag is used to describe that document, the more likely it is to match the standard.
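A vote-threshold filter of this kind can be sketched in a few lines. The data model below (a mapping from citation to one concept set per tagging event) and all values in it are hypothetical:

```python
from collections import Counter

def filter_by_votes(posts, min_votes):
    """Keep, for each citation, only the concepts assigned by at least
    `min_votes` distinct tagging events."""
    kept = {}
    for pmid, concept_sets in posts.items():
        votes = Counter()
        for cs in concept_sets:
            votes.update(set(cs))  # one vote per concept per post
        kept[pmid] = {c for c, n in votes.items() if n >= min_votes}
    return kept

posts = {
    101: [{"astrocytes", "vision"}, {"astrocytes"}, {"astrocytes", "fmri"}],
    102: [{"vegf"}],
}
print(filter_by_votes(posts, 2))  # {101: {'astrocytes'}, 102: set()}
```

Note how citation 102 retains no concepts at all under a two-vote threshold: this is the coverage collapse described below, since most citations are posted by only one person.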
To test this assumption in this context, we investigated the effect of voting on the precision, with respect to MeSH indexing, of the concepts linked to tags in the CiteULike system. (Once again, Connotea was very similar to CiteULike.) Figure 12 illustrates the improvements in precision gained by requiring a minimum of 1 through 5 'votes' for each Concept, Semantic Type, or Semantic Group assignment. As the minimum number of required votes increases from 1 to 4, precision increases in each category. At a minimum of 5 votes, the precision of semantic types and semantic groups continues to increase, but the precision of individual concepts drops slightly, from 0.335 to 0.332. We did not measure beyond five votes because, as the minimum number of required votes per tag increases, the number of documents with any tags drops precipitously, and for documents with no tags no measurements of agreement can be made. Figure 13 illustrates the decrease in citation coverage associated with increasing the minimum number of votes per tag assignment. Requiring just two votes per tag eliminates nearly 80% of the citations in the CiteULike collection; by 5 votes, only 1.7% of the citations in the dataset can be considered. This reiterates the phenomenon illustrated in Figures 6, 7, 8, and 9 - at present, most PubMed citations within academic social tagging systems are tagged by only one or a few people.
Figure 12 Precision increase and coverage decrease with voting in CiteULike. The X axis indicates the minimum number of times a given UMLS Concept (in green), Semantic Type (in pink), or Semantic Group (in dark blue), would need to be associated with a PubMed citation […]
Figure 13 Precision increase and coverage decrease with voting in CiteULike. The X axis indicates the minimum number of times a given UMLS Concept would need to be associated with a PubMed citation (through the assignment of a tag by a CiteULike user that could […]
An anecdotal example where many tags are present
Though the bulk of the socially generated metadata investigated above is sparse - most items receive just a few tags from a few people - it is illuminating to investigate the properties of this kind of metadata when larger amounts are available, both because larger samples make it easier to visualize the complex nature of the data and because they suggest potential future applications. Aside from enabling voting processes that may increase confidence in certain tag assignments, larger numbers of tags also provide additional views on documents that may be used in many other ways. Here we present a demonstrative, though anecdotal, example in which several different users tagged a particular document, and use it to illustrate some important aspects of socially generated metadata, particularly in contrast to other forms of indexing.
Figure 14 illustrates the tags generated by users of Connotea and CiteULike to describe an article that appeared in Science in June of 2008 [28]. In the figure, the different tags are sized by frequency and divided into three differently coloured classes: 'personal', 'non-MeSH', and 'MeSH Overlap'. The MeSH descriptors for the document are also provided. The figure shows a number of important characteristics of social tagging under current implementations. There are personal tags like 'kristina' and 'bob', but the majority of the tags are topical, like 'neuro-computation'. There are spelling errors and simple phrasing differences in the tags; for example, 'astroctyes', 'astrocytes', 'Astrocytes', and 'astrocyte' are all present (highlighting some of the difficulties in mapping tag strings to concepts). The more frequently used tags ('astrocytes', 'vision', 'methods') are all of some relevance to the article (entitled "Tuned responses of astrocytes and their influence on hemodynamic signals in the visual cortex"). There is some overlap with the MeSH indexing, but many of the tags that do not match MeSH descriptors directly - such as 'receptive-field', 'V1', and 'neurovascular-coupling' - also appear to be relevant to the article.
Figure 14 Tags for a popular PubMed citation from Connotea and CiteULike. The tag cloud or "Wordle" at the top of the figure shows the tags from both CiteULike and Connotea for the Science article "Tuned responses of astrocytes and their influence on hemodynamic […]
In some cases, the tags added by the users of the social tagging systems are more precise than the terms used by the MeSH indexers. For example, the main experimental method used in the article was two-photon microscopy - a tag used by two different social taggers (with the strings 'two-photon' and 'twophoton'). The MeSH term used to describe the method in the manuscript is 'Microscopy, Confocal'.
Within the MeSH hierarchy, two-photon microscopy is most precisely described by the MeSH heading 'Microscopy, Fluorescence, Multiphoton' which is narrower than 'Microscopy, Fluorescence' and not directly linked to 'Microscopy, Confocal'; hence it appears that the social taggers exposed a minor error in the MeSH annotation. In other cases, the social taggers chose more general categories - for example, 'hemodynamics' in place of the more specific 'blood volume'.
The tags in Figure 14 show two important aspects of socially generated metadata: diversity and emergent consensus formation. As increasing numbers of tags are generated for a particular item, some tags are used repeatedly, and these tend to be topically relevant; for this article, 'astrocytes' and 'vision' emerge as dominant descriptors. In addition to this emergent consensus formation (which might be encouraged through interface design choices), other tags representing diverse user backgrounds and objectives also arise, such as 'hemodynamic', 'neuroplasticity', 'two-photon', and 'WOW'. In considering applications of such metadata, both phenomena have important consequences. Precision of search might be enhanced by focusing query algorithms on high-consensus tag assignments or by enabling Boolean combinations of many different tags. Recall might be increased by incorporating the tags with lower levels of consensus.
While we assert that this anecdote is demonstrative, a sample of one is obviously not authoritative. It is offered simply to expose common traits observed in the data where many tags have been posted for a particular resource.