If we are to consider a web-native approach to capturing the scientific record we need first to consider the laboratory notebook. The lab notebook is, at its core, a journal of events, an episodic record containing dates, times, bits and pieces of often disparate material, cut and pasted into a paper notebook. There are strong analogies between this view of the lab notebook as a journal and the functionality of Web logs or "Blogs". Blogs contain posts which are dated, usually linked to a single author, and may contain embedded digital objects such as images or videos, or indeed graphs and charts generated from online datasets as well as free or structured text. Each post has an address on the web, given by a post number or title (or both). Thus a Blog provides much of the functionality of a laboratory notebook: it feels like a journal and it can contain and present both free text and structured text such as tables. While it may not have page or volume numbers each post will have its own address on the web, a URL that points uniquely at that one piece of the record, and can be passed to collaborators to share specific objects or be used to index specific protocols, experiments, samples, or pieces of data.
A "semantic web ready" laboratory record
The creation of individually addressable objects is crucial because it enables these objects, whether they are datasets, protocols, or pointers to physical objects such as samples, to play a part in the semantic web [10]. The root concept of the semantic web is that the relationships between objects can be encoded and described. For this to be possible those objects must be uniquely addressable resources on the web. By creating individual posts or pages the researcher is creating these individual resources; and again these can represent physical objects, processes, or data. It is possible to describe the relationships between these resources via sophisticated semantic tools such as the Resource Description Framework (RDF) or locally via statements within the posts. However, it is not necessary to take these approaches, as it is also possible to express relationships in a simple way that directly leverages the existing toolset on the web: by linking posts together.
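As a rough illustration of how plain links between posts can stand in for relationships, the sketch below records each link as a subject, relationship label, and object, and shows how the same link could later be written out as an RDF-style statement. The post URLs, the relationship labels, and the vocabulary address are all hypothetical and chosen only for the example.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    subject: str    # URL of the post that contains the link
    predicate: str  # informal relationship label (hypothetical)
    obj: str        # URL of the post being pointed at


# Hypothetical post URLs, for illustration only.
SAMPLE = "http://example.org/notebook/sample-42"
PROTOCOL = "http://example.org/notebook/protocol-7"
DATA = "http://example.org/notebook/data-108"

links = [
    Link(SAMPLE, "prepared-using", PROTOCOL),
    Link(DATA, "measured-from", SAMPLE),
]


def to_ntriples(link: Link, vocab: str = "http://example.org/vocab#") -> str:
    """Write one link as an N-Triples-style line (the vocabulary URL is assumed)."""
    return f"<{link.subject}> <{vocab}{link.predicate}> <{link.obj}> ."


def linked_to(url: str, all_links: List[Link]) -> List[str]:
    """Return the posts that point at the given URL."""
    return [link.subject for link in all_links if link.obj == url]
```

The point of the sketch is that a simple hyperlink already carries the subject and object; a formal vocabulary only needs to be added when, and if, the relationship has to be described precisely.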
Feeds change the lab notebook from a personal record to a collaborative document
The other key functionality of the web to focus on is that of the "feed". Feeds, whether they are RSS or Atom, are regularly updated XML documents that provide a stream of "events" which can then be consumed by various readers, Google Reader being one of the most popular. Along with the idea of hyperlinks between objects, the feed provides the crucial difference between the paper-based and the web-native lab notebook. A paper notebook (whether it is a physical object or "electronic paper") is a personal record. The web-native lab notebook is a collaborative notification tool that announces when something has happened, when a sample has been created, or a piece of data analysed.
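To make the idea of a feed as a stream of events concrete, the following sketch parses a small Atom document with the Python standard library and lists its entries, newest first. The feed content, titles, and URLs are invented for the example; only the Atom namespace and element names come from the Atom format itself.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A hypothetical notebook feed with two entries.
FEED_XML = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Lab notebook</title>
  <id>http://example.org/notebook/</id>
  <updated>2009-05-02T10:30:00Z</updated>
  <entry>
    <title>Sample 42 prepared</title>
    <id>http://example.org/notebook/sample-42</id>
    <updated>2009-05-01T09:00:00Z</updated>
    <link href="http://example.org/notebook/sample-42"/>
  </entry>
  <entry>
    <title>Data 108 analysed</title>
    <id>http://example.org/notebook/data-108</id>
    <updated>2009-05-02T10:30:00Z</updated>
    <link href="http://example.org/notebook/data-108"/>
  </entry>
</feed>
"""


def read_events(feed_xml):
    """Return (updated, title, href) tuples for each entry, newest first."""
    root = ET.fromstring(feed_xml.strip())
    events = []
    for entry in root.findall(f"{ATOM}entry"):
        events.append((
            entry.findtext(f"{ATOM}updated"),
            entry.findtext(f"{ATOM}title"),
            entry.find(f"{ATOM}link").get("href"),
        ))
    # Timestamps in one uniform ISO 8601 format sort correctly as strings.
    return sorted(events, reverse=True)
```

A feed reader does essentially this on a schedule, which is what turns the record into a notification stream.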
Despite the historical tendency towards isolated research groups discussed above, these independent groups are banding together as research funders demand larger coordinated projects. Tasks are divided up by expertise and in many cases also divided geographically between groups that have in the past probably not even had good internal communication systems. Rapid and effective communication between groups on the details of ongoing projects is becoming more and more important and is increasingly a serious deficiency in the management of these collaborations. In addition, reporting back to sponsors via formal reports is an increasing burden. The notification systems enabled via the generation of feeds go a significant way towards providing a means of dealing with these issues. Within a group the use of feeds and feed readers can provide an extremely effective means of pushing information to those who need to either track or interact with it. In addition, the idea of selectively pushing specific elements of the record to a specific group could also be adopted to push either raw productivity data or a much smaller subset of items, including summaries, of interest to funding agencies (Figure 1). The web-native lab notebook should bring the collaborative authoring and discussion tools provided by the read-write web to bear on the problem of communicating research results.
Figure 1. Using feeds and feed readers to aggregate and push laboratory records. A) A screenshot of Google Reader showing an aggregated feed of laboratory notebook entries from http://biolab.isis.rl.ac.uk. Two buttons are highlighted which enable "sharing" to anybody ...
Integrating tools and services
With the general concept of the record as a blog in place, enabling us to create a set of individually addressable objects and link them together, as well as providing feeds describing the creation of these objects, we can consider what tools and services we need to author and to interact with these objects. Again blogs provide a good model here, as many widely used authoring tools can be used directly to create documents and publish them to blog systems. Tools based on the Atom Publishing Protocol [11] can hide the complications of publishing documents to the web from the user and make it easy to develop sophisticated services that can push content from one place to another online. Recent versions of Microsoft Office include the option of publishing documents to online services, and a wide range of web services now make it easy to push content from word processors, mobile phones, email, or any connected source to virtually any other.
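The publishing step that such tools hide can be sketched briefly. Under the Atom Publishing Protocol a client creates a new post by sending an Atom entry in an HTTP POST to a collection address. The example below only builds the request and does not send it; the collection URL and the entry content are hypothetical, and a real server may require further elements (such as an id and an updated timestamp) as well as authentication.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Serialise Atom elements without a prefix.
ET.register_namespace("", ATOM_NS)


def build_entry(title, content, author):
    """Build a minimal Atom entry document as UTF-8 bytes."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    content_el = ET.SubElement(entry, f"{{{ATOM_NS}}}content", {"type": "text"})
    content_el.text = content
    return ET.tostring(entry, encoding="utf-8")


def build_post_request(collection_url, entry_xml):
    """Prepare (but do not send) the POST that would create a new post."""
    return urllib.request.Request(
        collection_url,
        data=entry_xml,
        method="POST",
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )


# Hypothetical collection address for a notebook blog.
request = build_post_request(
    "http://example.org/notebook/collection",
    build_entry("Sample 42 prepared", "Prepared from stock solution.", "A. Researcher"),
)
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual publication against a server that supports the protocol.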
A wide variety of web-based tools and plugins are available to make the creation and linking of blog posts easy. Particularly noteworthy are tools such as Zemanta, a plugin which automatically suggests appropriate links for concepts within a post [13]. Zemanta scans the text of a post and identifies company names, concepts that are described in Wikipedia and other online information sources, using an online database that is built up from the links created by other users of the plugin. The service suggests possible links and tags to the users, and then exploits the response of the user to those suggestions to refine the model for future suggestions.
Sophisticated semantic authoring tools such as the Integrated Content Environment (ICE) developed at the University of Southern Queensland [14] provide a means of directly authoring semantic documents that can then be published to the web. ICE can also be configured to incorporate domain-specific semantic objects that generate rich media representations such as three-dimensional molecular models. These tools are rapidly becoming very powerful and highly usable, and will play an important role in the future by making rich document authoring straightforward.
Where do we put the data?
With the authoring of documents in hand we can consider the appropriate way of handling data files. At first sight it may seem simplest to upload data files and embed them directly in blog posts. However, the model of the blog points us in a different direction here again. On a blog, images and video are not generally uploaded directly; they are hosted on an appropriate, specialised, external service and then embedded on the blog page. Issues about managing the content and providing a highly user-friendly viewer are handled by the external data service. Hosting services are optimized for handling specific types of content: Flickr for photos, YouTube (or Viddler or Bioscreencast) for video, Slideshare for presentations, Scribd for documents (Figure 2).
Figure 2. Distributing research objects to online services and re-aggregating them via feeds. Researchers, instruments, or computers may create digital objects such as data, workflows, descriptions, and presentations as well as references to physical objects such ...
While it might be argued that the development of these specialist services and embedding capabilities grew out of deficiencies in generic hosting platforms, it is also true that these specialist platforms have exploited the economies of scale that arise from handling similar content together. Scientific infrastructure is resource limited and there is a strong argument that, rather than building specialist publishing platforms, it is more effective to use generic platforms for publishing. Specialist data-handling services can then grow up around specific data types and benefit from the economies of scale that arise from aggregating similar data types together. In an ideal world there would be a trustworthy data hosting service, optimized for your specific type of data, that would provide cut-and-paste embed codes giving the appropriate visualizations, in the same way that videos from YouTube can easily be embedded.
Some elements of these services exist for research data. Trusted repositories exist for structural data, for gene and protein sequences, and for chemical information. Large-scale projects are often required to put a specific repository infrastructure in place to make the data they generate available. In most cases it is possible to provide a stable URL which points at a specific data item or dataset. It is therefore possible in many cases to provide a link directly to a data service that places a specific dataset in context, can be relied on to have some level of curation or quality control, and provides additional functionality appropriate to the data type. Currently many of these URLs encode database queries rather than providing a direct link. To be most effective and reliable such URLs need to be "Cool" [16]. That is, they should be stable, human readable, and direct addresses rather than queries. Query engines may change, and database schemas may be modified, but the address of the underlying objects needs to stay constant for the linked data web to be stable enough to form.
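The distinction between a direct address and an encoded query can be checked mechanically, at least roughly. The sketch below flags URLs that carry a query string or expose a script name in their path. Both example URLs are hypothetical, and the list of script suffixes is an assumption made for illustration rather than part of any definition of a "Cool" URL; stability over time cannot be judged from the URL alone.

```python
from urllib.parse import urlsplit

# Assumed, non-exhaustive list of suffixes that suggest a script endpoint.
SCRIPT_SUFFIXES = (".cgi", ".php", ".asp", ".jsp")


def url_warnings(url):
    """Return reasons why a URL looks like a query rather than a direct address."""
    parts = urlsplit(url)
    warnings = []
    if parts.query:
        warnings.append("contains a query string")
    if parts.path.lower().endswith(SCRIPT_SUFFIXES):
        warnings.append("exposes a script name")
    return warnings


# Hypothetical examples of the two styles of address.
query_style = "http://example.org/db/search.cgi?id=1234&format=html"
direct_style = "http://example.org/dataset/1234"
```

Here `url_warnings(query_style)` reports both problems, while `url_warnings(direct_style)` returns an empty list.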
What is less prevalent is the type of embedding functionality provided by many consumer data repository services. ChemSpider (http://www.chemspider.com) is one example of a service that does enable the embedding of both molecules and spectra into external web pages. This is still clearly an area for development, and there are discussions to be had about both the behind-the-scenes implementation of these services and the user experience, but it is clear that this kind of functionality could play a useful role in helping researchers to connect up information on the web. If multiple researchers use the ChemSpider molecule embedding service to reference a specific molecule then all of those separate documents can be unambiguously identified as describing the same molecule. This linking up of individual objects through shared identifiers is precisely what gives the semantic web its potential power.
A more general question is the extent to which such repositories can or will be provided and supported for less common data types. The long-term funding of such data repositories is at best uncertain and at worst non-existent. Institutional repositories are starting to play a role in data archiving and some research funders are showing an interest. However, there is currently little or no coordinated response to the problem of how to deal with archiving data in general. Piecemeal solutions and local archiving are likely to play a significant role. This does not necessarily make the vision of linked data impossible; all that is required is that the data be placed somewhere where it can be referenced via a URL. However, to enable rich functionality to manipulate and visualize that data it will be necessary to find funding sources and business models that can support and enable the development of high quality data repositories. In our model of the Blog as a lab notebook a piece of data can be uploaded directly to a post within the blog. This provides the URL for the data, but will not in and of itself enable visualization or manipulation. Nonetheless the data will remain accessible and addressable in this form. We can take a step forward by simply putting it on the web, but to enable other researchers to use those objects most effectively it will be important to provide rich functionality. This will be best supported via provision in centralised services where economies of scale can be found.
A key benefit of this way of thinking about the laboratory record is that items can be distributed in many places depending on what is appropriate. It also means that the search mechanisms we use to find objects and information on the web can be used to index and search our own laboratory material. Web search relies primarily on mechanisms like PageRank that prioritise specific addresses according to how they are linked in to the wider web. By linking our record into that wider web we enable Google and other search engines to identify our most important datasets, based on how they are connected to the rest of our research record as well as to the wider research effort.
This is a lightweight way of starting to build up a web of data. It doesn't provide the full semantic power of the linked data web as envisioned by Tim Berners-Lee and others but it also doesn't hold the same challenges and fears for users. If we can get data up on the web and identify relationships between them it doesn't matter so much to start with whether these relationships are fully described as long as there is enough contextual data to make it useful. Tagging or key-value pairs using the tools that are already available and more widely adopted by the general user community would enable us to make a good start on improving data availability and discoverability while the tools to provide more detailed semantic markup are developed.
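A minimal sketch of what tagging with key-value pairs might look like in practice is given below. Each post URL carries a small set of key-value pairs, and an index over those pairs lets related objects be found without any formal vocabulary. The URLs, keys, and values are hypothetical and serve only to illustrate the approach.

```python
from collections import defaultdict

# Hypothetical posts, each described by informal key-value pairs.
posts = {
    "http://example.org/notebook/sample-42": {"type": "sample", "project": "lysozyme"},
    "http://example.org/notebook/data-108": {"type": "data", "project": "lysozyme"},
    "http://example.org/notebook/sample-43": {"type": "sample", "project": "insulin"},
}


def build_index(all_posts):
    """Map each (key, value) pair to the set of post URLs that carry it."""
    index = defaultdict(set)
    for url, metadata in all_posts.items():
        for key, value in metadata.items():
            index[(key, value)].add(url)
    return index


def find(index, **criteria):
    """Return the URLs of posts matching every key-value pair given."""
    matches = [index.get((key, value), set()) for key, value in criteria.items()]
    return set.intersection(*matches) if matches else set()


index = build_index(posts)
lysozyme_samples = find(index, type="sample", project="lysozyme")
```

Because the pairs are informal, different researchers may choose different keys for the same idea; that looseness is the price of the low barrier to entry described above.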
However, while distribution has benefits, it also poses significant risks. Services can fail, links can and do break, and interoperability is made more complex and can easily be compromised by developments of one service that are not mirrored on another. It would also seem at first sight to be opposed to integrative approaches that aggregate related objects together. However, such approaches, inspired by the "Datument" concept of Rzepa and Murray-Rust [17], can be more properly seen as providing the opportunity to aggregate, contain, and represent knowledge once it has been generated from the raw material. Our aim in distributing the elements of the record, the raw data, is therefore to provide the contextual information, either through links or through metadata, to make it straightforward to aggregate those elements into datuments for the presentation, publishing, and archival of knowledge.
Distributed sample logging systems
The same logic of distributing data according to where it is most appropriate to store it can also be applied to the recording of samples. In many cases, tools such as a Laboratory Information Management System (LIMS) or sample databases will already be in place. In most cases these are likely to be applied to a specific subset of the physical objects being handled: a LIMS for analytical samples, a spreadsheet for oligonucleotides, and a local database, often derived from a card index, for lab chemicals. As long as it is possible to point to the record for each physical object independently, with the required precision, these systems can be used directly. Although a local spreadsheet may not be addressable at the level of individual rows, Google Spreadsheets can be addressed in this way. Individual cells can be addressed via a URL for each cell and there is a powerful API that makes it possible to build services to make the creation of links easy. Web interfaces can provide the means of addressing databases via URL through any web browser or HTTP-capable tool.
Samples and chemicals can also be represented by a post within a Blog. This provides the key functionality that we desired: a URL endpoint that represents that object. This can provide a flexible approach which may be more suited to small laboratories than heavyweight, database-backed systems designed for industry. When samples involve a wide variety of different materials put to different uses, the flexibility of using an open system of posts rather than a database with a defined schema can be helpful.
In many cases it may be appropriate to use multiple different systems: a database for recording oligonucleotides, a spreadsheet for tracking environmental samples, and a full-blown LIMS to enable barcoding and monitoring of samples through preparation for sequencing. As in the data case, it is best to use a system that is designed for, or best suited to, creating a record for the specific set of samples. These systems are better developed than those for data, but many of the existing systems don't allow a good way of pointing at the record for specific samples from an external document, and very few make it possible to do this via a simple and cool URI.
Full distribution of materials, data, and process: The lab notebook as a feed of relationships
At this point it may seem that the core remaining component of the lab notebook is the description of the actions that link material objects and data files: the record of process. However, even these records could be passed to external services that might be better suited to the job. Procedures are also just documents. Maybe they are text documents, but perhaps they are better expressed as spreadsheets or workflows (or rather the record of running a workflow). These may well be better handled by external services, be they word processors, spreadsheets, or specialist services. They just need to be somewhere where, once again, it is possible to unambiguously point at them.
What we are left with are the links that describe the relationship between materials, data, and process, arranged along a timeline. The laboratory record, the web-native laboratory notebook, is reduced to a feed that describes these relationships and that notifies users when a new relationship is created or captured (Figure 3). This could be a simple feed containing plain hyperlinks or it might be a sophisticated and rich feed that uses one or more formal vocabularies to describe the semantic relationship between items. In principle it is possible to mix both, gaining the benefit of detailed formal information where it is available but linking in relationships that are less clearly described where possible. That is, this approach can provide a way of building up a linked web of data and objects piece by piece, even when the details of vocabularies are not yet agreed or in place.
Figure 3. The lab notebook as a time dependent feed of relationships. A) A series of research objects collected as part of an experiment. A raw material is converted into a sample that is then subjected to analysis. A photo is taken of the sample. The analysis ...
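The notion of the notebook as a feed of relationships can be sketched as a small data structure. In the example below each event records when a relationship between two addressable objects was captured. Where a formal relationship term is available it is recorded, and otherwise the event is treated as a plain hyperlink, so that both kinds can sit in the same timeline. All URLs, timestamps, and relationship terms are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class RelationshipEvent:
    when: datetime
    source: str                     # URL of the object making the link
    target: str                     # URL of the object linked to
    relation: Optional[str] = None  # formal term if known; None means a plain link


def timeline(events: List[RelationshipEvent]) -> List[RelationshipEvent]:
    """Order the captured relationships along the timeline, oldest first."""
    return sorted(events, key=lambda event: event.when)


def describe(event: RelationshipEvent) -> str:
    """Render one event as a line of text, falling back to a plain link."""
    verb = event.relation or "links to"
    return f"{event.when.isoformat()} {event.source} {verb} {event.target}"


# Hypothetical objects loosely following the experiment outlined in Figure 3.
MATERIAL = "http://example.org/materials/raw-1"
SAMPLE = "http://example.org/samples/42"
PHOTO = "http://example.org/photos/42"

events = [
    RelationshipEvent(datetime(2009, 5, 2, 14, 0, tzinfo=timezone.utc), PHOTO, SAMPLE),
    RelationshipEvent(datetime(2009, 5, 1, 9, 0, tzinfo=timezone.utc), SAMPLE, MATERIAL,
                      relation="derived-from"),
]

feed_lines = [describe(event) for event in timeline(events)]
```

The untyped photo link and the typed "derived-from" link coexist in one feed, which mirrors the piece-by-piece build-up of a linked web described above.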