Why a data publishing framework?
The foregoing discussion emphasises the need for a 'Data Publishing Framework' that evolves metrics and indicators providing due incentives to the multiple actors involved, from data creation and collection through to publication. Both data originators and information systems/networks have emphasised the need for data usage metrics and indicators, so that the overall utility and impact of their data management and publishing activities can be objectively documented, leading to recognition of these activities as scientific work on a par with the recognition received for scholarly publications.
Such metrics should capture the quantitative and qualitative impacts of data management and publishing efforts. The collection, analysis and interpretation of impact metrics and indicators should form an integral part of the data management and publishing cycle.
Without a system of recognition and reward for the collectors, managers, and publishers of primary biodiversity data, we shall continue to rely on the goodwill or spare time of researchers to mobilise data into the public domain. Furthermore, measures of scientists' productivity would benefit from data publishing, which requires a cultural change in the recognition of scientific output [55]. Such an incentive mechanism would achieve both increased data mobilisation and increased accreditation, both desirable to scientists.
The Data Publishing Framework is not only essential for increased and expedited discovery and mobilisation of primary biodiversity data. In the long run it could also narrow the gap in the uneven distribution of biodiversity data worldwide [54], as data originators from developed and developing parts of the globe alike would be equally encouraged to publish biodiversity data.
The data publishing framework: components
The 'Data Publishing Framework' comprises five elements: (A) socio-cultural, (B) technical-infrastructural, (C) policy-political, and (D) legal environments, and (E) economic investments supporting the various activities of the data publishing cycle (Figure ). Three major technical-infrastructural components, without which meaningful implementation would remain incomplete, are: (i) Persistent Identifiers for data publishers, datasets, individual data records, data versioning, and data citations; (ii) a Data Usage Index (DUI) at every access point; and (iii) an effective Data Citation mechanism (Figure ).
a - Five inter-dependent and complementary elements of the 'Data Publishing Framework'. b - Three core technical-infrastructural components of the 'Data Publishing Framework'.
These elements, as well as the core technical-infrastructural components, are not only complementary but also inter-dependent. For instance, efficient implementation of a 'Data Usage Index (DUI)' requires that data publishers, datasets, and data records, including their versions, be assigned and resolved through 'Persistent Identifiers'. Similarly, a Data Citation mechanism requires that each instance of data use and its citation be assigned and resolved through Persistent Identifiers. Thus, these three components must be treated as integral and inseparable aspects of the 'Data Publishing Framework'. No strict sequence of implementation is required; the components should ideally be implemented concurrently, though a certain degree of flexibility in their ordering is possible.
In the subsequent sections we discuss possible choices and implementation approaches for the three technical-infrastructural components of the 'Data Publishing Framework' in the context of discovery and mobilisation of primary biodiversity data. As a hypothetical scenario, we consider the implementation of this framework for data discovered and mobilised through GBIF, simply because the existing GBIF network provides a complex, dynamic, yet functional platform for distributed and decentralised data discovery and mobilisation. However, the framework is not restricted to GBIF alone: it could be implemented within any other domain-specific network irrespective of its size and magnitude of operations. Implementation as described in the subsequent sections should therefore not be construed as 'GBIF-centric'; the framework could serve any free, open, inclusive, community-driven infrastructure for primary scientific data discovery and access, at local to global scale.
Persistent identifiers
A persistent or unique global identifier is a short name or character string guaranteed to be unique [52]. It permanently identifies a dataset independent of location. The persistent identification of digital resources can play a vital role in enabling their accessibility and re-usability over time [56]. Thus, persistent identifiers form the first and foremost essential component of the proposed 'Data Publishing Framework'.
Several kinds of persistent or unique global identifiers, such as Handles, Digital Object Identifiers (DOIs), Archival Resource Keys (ARKs), Persistent Uniform Resource Locators (PURLs), Uniform Resource Names (URNs), and Life Science Identifiers (LSIDs), are in use [57]. There is a lack of agreement on which is optimal. Furthermore, progress in defining the nature and functional requirements of identifier systems is hindered by a lack of agreement on what identifiers should actually do. Commitment to deploy and reuse globally unique shared identifiers, and the implementation of services that link those identifiers, is the key to rich integration of distributed datasets [58].
For instance, LSIDs [59] were developed to provide globally unique identifiers for objects in biological databases [60]. LSIDs are the persistent identifiers recommended by Biodiversity Information Standards (TDWG) [61]. However, uptake of LSIDs to date has been limited, with only the Universal Biological Indexer and Organiser (uBio) [62], the Catalogue of Life [63], the International Plant Names Index (IPNI) [64], and Index Fungorum [65] implementing them.
For the biodiversity informatics community the attractions of LSIDs include the distributed nature of the identifier, the low cost, and the convention that resolving an LSID returns metadata in the Resource Description Framework (RDF) model [66]. The latter facilitates integrating information from multiple sources using tools being developed for the Semantic Web [67], although the mechanism for resolving LSIDs is not supported by existing Semantic Web tools. By using the existing DNS infrastructure, LSIDs avoid the need to set up a new central naming authority [67].
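To make the structure of such identifiers concrete, the following is a minimal sketch of decomposing an LSID into its components, following the published LSID URN syntax (`urn:lsid:<authority>:<namespace>:<object>[:<revision>]`). The example identifier imitates the IPNI style and is used purely for illustration.

```python
# Sketch: splitting an LSID into its constituent parts, per the LSID
# URN syntax urn:lsid:<authority>:<namespace>:<object>[:<revision>].

def parse_lsid(lsid: str) -> dict:
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not a valid LSID: {lsid}")
    return {
        "authority": parts[2],   # DNS name of the issuing authority
        "namespace": parts[3],   # authority-local namespace, e.g. 'names'
        "object": parts[4],      # the identified object
        "revision": parts[5] if len(parts) > 5 else None,  # optional version
    }

print(parse_lsid("urn:lsid:ipni.org:names:30000959-2"))
```

Because the authority component is a DNS name, resolution can be delegated to the issuing institution without a central registry, which is the property the text above highlights.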
In the context of the proposed 'Data Publishing Framework', unique global identifiers should be assigned not only to datasets, but also to their publishers, every individual datum and its author(s), data versions, and data citations. Further, simplified mechanisms are needed that make it easy for individuals to assign these identifiers to their data [55]. Given the options available, the choice of a suitable unique global identifier should reside with data publishers.
Institutions must recognise that the application and maintenance of unique global identifiers form just one part of an overall digital preservation and publishing strategy. Without adequate institutional commitment and clearly defined roles and responsibilities, unique global identifiers cannot offer any guarantee of persistence, locability, or actionability in the short or long term [58].
Why a Data Usage Index (DUI)?
The DUI is intended to demonstrate to data publishers that their efforts in creating primary biodiversity datasets do have impact, by being accessed and viewed or downloaded by fellow scientists. Dataset providers and publishers, such as the individual scientists striving to generate and structure single records or sequences of records into high-quality primary biodiversity datasets, and their host institutions, require incentives to continue their efforts and recognition of the usage of their data. In a scientific digital library and open access environment, such as that developed for bibliographic information in astronomy, usage is measured in a two-dimensional way. The straightforward way is to apply common bibliometric indicators with respect to citation patterns and impact. However, this track is not yet feasible in the case of biodiversity datasets: no standards exist for dataset citations in scientific papers, and quantitative analyses of citations to biodiversity datasets would provide unreliable results. A second avenue is to define usage metrics based on requests, viewing and downloading of research publications in the form of metadata, abstracts or full text via digital library client logs [68].
Thus, the proposed DUI for biodiversity datasets is initially intended to follow the second avenue, based on GBIF log data, as pointed out above. Because neither a standard data citation nor a persistent identifier mechanism exists for biodiversity datasets, the isolation of actual references to datasets in the scientific literature is extremely difficult at present. Hence, traditional citation analyses are not yet feasible. However, a Data Usage Index consisting of a range of usage indicators extracted from the usage logs of GBIF and other biodiversity dataset portals is definitively within reach. The proposed DUI is thus intended to make (GBIF) dataset usage visible, providing deserved recognition to dataset creators and encouraging biodiversity dataset publishers, providers and users to:
• Increase the volume of high quality data discovery, mobilisation and dataset digitisation;
• Further use biodiversity data and information in scientific work;
• Improve formal citation behaviour regarding datasets in research; and
• Develop standardisation of dataset information.
The implementation of the proposed DUI is intended to be carried out according to a number of phases, as outlined below, starting with the data extraction from the GBIF main Web portal usage, web services and data dump logs covering 2008.
What is the DUI?
The proposed DUI consists of usage indicators concerned with:
• Unique Visits;
• Loyal Visits (repeated visits by same IP address);
• Viewing of dataset records;
• Downloading of datasets & dataset records; and
• Volume and (rank) distributions of datasets & dataset records.
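A minimal sketch of how indicators of this kind could be computed from access-log records. The log schema (`ip`, `action`, `dataset`) is an assumption for illustration only, not the actual GBIF log format, and 'loyal visits' is approximated here as IP addresses appearing more than once.

```python
# Sketch: basic DUI usage indicators from hypothetical access-log records.
from collections import Counter

def dui_indicators(log):
    ips = Counter(rec["ip"] for rec in log)
    views = Counter(r["dataset"] for r in log if r["action"] == "view")
    downloads = Counter(r["dataset"] for r in log if r["action"] == "download")
    return {
        "unique_visits": len(ips),                              # distinct IPs
        "loyal_visits": sum(1 for c in ips.values() if c > 1),  # repeat IPs
        "record_views": sum(views.values()),
        "record_downloads": sum(downloads.values()),
        # rank distribution of datasets by download volume
        "download_ranking": downloads.most_common(),
    }

log = [
    {"ip": "10.0.0.1", "action": "view", "dataset": "A"},
    {"ip": "10.0.0.1", "action": "download", "dataset": "A"},
    {"ip": "10.0.0.2", "action": "view", "dataset": "B"},
]
print(dui_indicators(log))
```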
Since biodiversity datasets are stored, located and used via the Web, a combination of common bibliometric/scientometric analyses [69] and webometric analyses [70] can be applied. In terms of the former, rank distributions can be computed over produced datasets and dataset records by provider (institutions, regions, countries) or theme (species, taxa, geo-locations of habitat, etc.). Such distributions are similar to scientific publication analyses, and deal primarily with the volume of datasets or the number of dataset records generated over a specific time period. Clearly, time series of such distributions are feasible and may uncover patterns of dataset generation behaviour.
Most dataset usage indicators are associated with scientometric and webometric analyses, except that linking behaviour has so far not developed with respect to biodiversity dataset use. On the other hand, as on the Web, dynamics such as versioning and additions to already stored datasets are possible. Usage indicators commonly measure interest in, recognition of, or impact of the objects analysed, via visits, viewing and downloading activities (Nielsen Media Ratings). By visiting (searching or retrieving) and viewing dataset records one may assume interest in the dataset, whilst the downloading volume may demonstrate usage. Logging and analysing these activities is common in Web search engine log analysis, as are the issues of isolating search sessions performed by the same 'user' or 'visitor' [72]. In the DUI case we initially deal with 'visits', defined by IP address and search activity patterns over specific time windows, not with individual visitors.
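The isolation of 'visits' from raw request logs might be sketched as follows, grouping requests by IP address and opening a new visit whenever the gap between consecutive requests exceeds an inactivity window. The 30-minute threshold is an illustrative assumption, not a value fixed by the DUI proposal.

```python
# Sketch: counting 'visits' per IP address using a time-window heuristic.
from collections import defaultdict

def count_visits(requests, window_seconds=1800):
    """requests: iterable of (ip, unix_timestamp) pairs."""
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)
    visits = 0
    for ts_list in by_ip.values():
        ts_list.sort()
        visits += 1  # the first request always opens a visit
        for prev, cur in zip(ts_list, ts_list[1:]):
            if cur - prev > window_seconds:
                visits += 1  # gap exceeds the window: a new visit begins
    return visits

reqs = [("10.0.0.1", 0), ("10.0.0.1", 600), ("10.0.0.1", 10_000), ("10.0.0.2", 50)]
print(count_visits(reqs))  # 3 visits: two by the first IP, one by the second
```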
The DUI indicators may take the form of absolute measures; metrics normalised according to the stored volume of records; measures relative to a reference point, e.g. the world average of dataset usage measured by the average download volume of dataset records across all datasets or selected thematic datasets; or measures weighted according to the specific dataset profiles of the institutions or countries to be compared. The latter (weighted) indicators lead to dataset Usage Crown Indicators, in line with Crown Indicators for scientific publications and citations [73]. Table demonstrates a number of basic absolute measures and a few selected normalised ones. At a later phase, after a Citation Mechanism and a Persistent Identifier scheme have been designed and implemented, common scientific publication citation analysis and impact metrics can be devised to complement the usage indicators as part of a Universal DUI (Figure ). A range of comprehensive usage indicators is currently under definition and development by the authors and will be discussed in forthcoming articles.
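A 'relative' indicator of the kind described above might be sketched as each dataset's download volume divided by the world average across all datasets, so that a score of 1.0 equals the average. The dataset names and counts are fabricated for illustration.

```python
# Sketch: a relative DUI indicator -- downloads per dataset compared with
# the world average download volume across all datasets.

def relative_downloads(downloads: dict) -> dict:
    world_avg = sum(downloads.values()) / len(downloads)
    return {name: count / world_avg for name, count in downloads.items()}

counts = {"dataset_A": 300, "dataset_B": 100, "dataset_C": 200}
print(relative_downloads(counts))  # dataset_A scores 1.5: 50% above average
```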
(Non-)normalised Data Usage Index (DUI) indicators
Figure 2 Phased implementation of the 'Data Usage Index' from global to local data discovery and access points. Abbreviations used: GDUI: Global Data Usage Index; RDUI: Regional Data Usage Index; TDUI: Thematic Data Usage Index; NDUI: National Data Usage Index.
The DUI: implementation
Currently available 'primary biodiversity data' has multiple access points. Considering the trend of data publishing activities and the involvement of multiple actors in this arena, it is safe to state that upcoming datasets will also have multiple access points. This makes implementation of the DUI complex and challenging. As depicted in Figure , data flow not only from contributors to local, regional, thematic, national and global access points, but equally in reverse and lateral directions. For instance, data published through the GBIF global data portal are often also accessible through thematic or regional access points, contributed by 'data publishers' other than the one operating or maintaining that access point.
We propose a three-phase implementation of the DUI (Figure ). In the first phase, a Global Data Usage Index (GDUI) is published by computing 'data usage logs' at global access points such as the 'GBIF Global Data Portal' at http://data.gbif.org and its mirrors. In a second phase, Regional DUIs (RDUIs), Thematic DUIs (TDUIs) and National DUIs (NDUIs) would be computed in addition to the GDUI, using the data usage logs of regional, thematic and/or national access points. The third phase would include computation of DUIs at all levels (GDUI, RDUI, TDUI and NDUI) using the data usage logs of all access points. Normalisation of all these DUIs together with Local DUIs (LDUIs) would result in a Universal DUI (UDUI), which would be used as a normalised index to compute the 'Data Usage Index' of each participating publisher.
We propose that the DUI be computed on an annual basis, beginning with the GDUI during the first year, followed by RDUIs, TDUIs and NDUIs during the second year, and the inclusion of LDUIs during the third year, leading to a Universal DUI. The implementation of such multi-level DUIs can be a complex operation. An obvious question is whether such an exercise should happen in a centralised or decentralised manner. We suggest that web services or RESTful services [74] be implemented so that a coordinating agency can harvest the 'data usage logs' of participating publishers. Coordination by such an agency would provide much-needed neutrality, credibility and acceptability of the DUI to all involved in the data management and publishing life cycle, ranging from donors and publishers to users. At the same time, the Citation Mechanism and the persistent identifiers for biodiversity datasets are planned to be devised and launched. This implementation should enable the start of common dataset citation impact analyses on a global scale.
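The harvesting step might look like the following sketch, in which a coordinating agency fetches each publisher's annual usage-log summary over a RESTful interface. The endpoint path `/usage-logs/<year>` and the JSON response shape are hypothetical; only the Python standard library is used.

```python
# Sketch: a coordinating agency harvesting 'data usage logs' from
# participating publishers via hypothetical REST resources.
import json
from urllib.request import urlopen

def usage_log_url(base_url: str, year: int) -> str:
    """Build the (hypothetical) REST resource for a publisher's annual logs."""
    return f"{base_url.rstrip('/')}/usage-logs/{year}"

def harvest_usage_logs(publisher_base_urls, year):
    """Fetch each publisher's usage-log summary, keyed by base URL."""
    logs = {}
    for base in publisher_base_urls:
        with urlopen(usage_log_url(base, year)) as resp:
            logs[base] = json.load(resp)  # e.g. per-dataset view/download counts
    return logs
```

Pull-based harvesting of this kind keeps the publishers' infrastructure simple, with the aggregation logic centralised at the coordinating agency.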
The DUI: improving relevance
Several factors would influence the relevance and acceptance of the proposed DUI. Four major factors determining the robustness and relevance of the DUI are: (i) the implementation phases of the DUI; (ii) the temporal richness of data usage logs; (iii) indicator robustness; and (iv) an improved data management and publishing cycle. Both the relevance and robustness of the DUI would be directly proportional to the progress of DUI implementation, the temporal richness of data usage logs, indicator robustness, data citation practices and improved Data Life Cycle management (Figure ).
Essential mechanisms to improve the 'relevance' of the 'Data Usage Index'. Abbreviations used: GDUL: Global Data Usage Logs; TDUL: Thematic Data Usage Logs; NDUL: National Data Usage Logs; LDUL: Local Data Usage Logs; DUI: Data Usage Index.
For instance, as the management of and access to data improve, data usage would increase both in diversity and in volume. This would result in more hits at the multiple access points of the same data, which in turn would result in an increased number of downloads and of citations, both in scholarly publications and in e-publishing. Similarly, as one implementation phase advances to the next, the number of publishers participating in the DUI exercise would increase, meaning that the normalised index would become increasingly stable, credible and representative. The same holds for a temporal increase in the span of data usage logs across multiple years.
Data citation mechanism
Without an effective Data Citation Mechanism the implementation of the 'Data Publishing Framework' would remain incomplete. Thus, universal standards for citing datasets are essential. As mentioned above, we currently lack consistency in data citations; a consistent citation mechanism would provide much-needed high visibility to data. With the existing citation metrics system it is difficult or impossible to identify who originally created, or added value to, a datum [55].
For data to be citeable it is necessary that they can be referred to in a consistent way [51]. Thus, a data citation standard/mechanism should retain the advantages of print citations, be distinguishable from them, add other components made possible by (and needed because of) the digital form and systematic nature of datasets, and be consistent with most existing approaches. Further, citation formats need to ensure clear credit/acknowledgement to the originator(s) and the ability to link to datasets.
Mechanisms or standards for 'deep citation', i.e. references to subsets of datasets, are essential to appropriately acknowledge the creators/collectors of the data records that constitute the dataset(s) used. Data citation standards need to be flexible enough to accommodate deep citations and versioning, as well as any amount of additional information of interest to archivers, producers, distributors, publishers, or others, without losing functionality [52]. The issue of citing versions of the same dataset is critical and needs to be resolved in such a way that links between prior and new versions are functional and consistent. Such a citation scheme should enable forward referencing from a dataset to subsequent citations or versions, and even a direct search for all citations of any dataset.
Altman and King [52] proposed a standard for citing quantitative data with six components: author(s), date of dataset publication, dataset title, a persistent identifier, a universal numeric fingerprint, and a bridge service. One might add that such a standard must also include an identifier at the start of the reference entry denoting that the entry concerns a dataset, not a scholarly publication or other information type. This goes beyond the technologies available for printed matter and responds to issues of confidentiality, verification, authentication, access, technology change, existing subfield-specific practices, and possible future extensions. With such a citation standard, the various components can be permuted to suit different journal styles without loss of functionality.
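A citation built from those six components, with a leading '[dataset]' marker of the kind suggested above, might be rendered as in the following sketch. The author names, identifier, fingerprint and bridge-service URL are fabricated for illustration and do not resolve to a real dataset.

```python
# Sketch: rendering a dataset citation from the six components of the
# Altman & King proposal, prefixed with a '[dataset]' type marker.

def format_data_citation(authors, year, title, identifier, unf, bridge):
    return (f"[dataset] {'; '.join(authors)} ({year}). {title}. "
            f"{identifier}; {unf}; via {bridge}")

print(format_data_citation(
    authors=["Doe, J.", "Roe, R."],
    year=2008,
    title="Example occurrence dataset",
    identifier="hdl:1902.1/00000",     # persistent identifier (hypothetical)
    unf="UNF:3:examplefingerprint==",  # universal numeric fingerprint (hypothetical)
    bridge="http://bridge.example.org",  # bridge/resolver service (hypothetical)
))
```

Because each component occupies its own slot, the same data can be re-ordered to match different journal styles without losing any of the six elements.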
Though the standard for citing quantitative data proposed by Altman and King [52] addresses most of the existing challenges, other options need further review and evaluation. Enriched metadata for datasets is essential for deriving appropriate citations to either an entire dataset or a part of it. The persistence of the connection between a data citation and the actual data depends on some form of institutional commitment.