Recent interest has arisen in neuroscience, and particularly in neuroimaging, in identifying or creating standards to facilitate software tool interoperability. The NIMH Neuroimaging Informatics Technology Initiative (NIfTI; http://nifti.nimh.nih.gov) was formed to aid in the development and enhancement of informatics tools for neuroimaging. Though best known for the Data Format Working Group (DFWG) that defined the NIfTI image file format standard, this effort has recently turned its attention to how provenance metadata might be standardized. The Biomedical Informatics Research Network (BIRN; http://nbirn.net) is another high-profile effort working to develop standards among its consortia membership, including the development of study data provenance.
Descriptions of data provenance have been used successfully in other fields. For example, the Dublin Core Metadata Initiative (DCMI; http://dublincore.org) is an organization dedicated to promoting the widespread adoption of interoperable metadata standards and to developing specialized metadata vocabularies for describing resources, enabling more intelligent information discovery systems. Its scope includes metadata related to workflow provenance. Similarly, the “Minimum Information About a Microarray Experiment” (MIAME) standard (http://www.mged.org/Workgroups/MIAME/miame_2.0.html) describes the minimum information needed to interpret the microarray results of an experiment unambiguously and, potentially, to reproduce the experiment; it includes essential experimental and data processing protocols. Efforts such as these have sought to capture data and workflow information sufficient to reproduce reported study findings and to enable cross-study comparison. Specific workflow description frameworks also exist in other fields that help to sequence data processing steps and that can be used to populate provenance descriptions. These frameworks are highly sophisticated tools that require substantial investment to learn and deploy. They are generally divided into data-oriented and process-oriented approaches.
The Collaboratory for Multi-scale Chemical Science (CMCS; http://cmcs.org) project is a data-oriented informatics toolkit for collaboration and data management in multi-scale chemistry (Myers et al., 2005). CMCS collects pedigree information about individual data objects by defining input and output data and capturing “pedigree chains” that describe the processing the data has undergone. The provenance data is explicitly defined in these associations, placing the burden of documentation on the user.
Other data workflow systems, such as the Virtual Data System (formerly known as Chimera and incorporating Pegasus) (Zhao et al., 2006), use process-oriented provenance models. The Virtual Data System (VDS) provides middleware for the GriPhyN project (http://www.griphyn.org) for expressing, executing, and tracking the results of workflows. Provenance is used for the regeneration of derived data, comparison of data, and auditing of data derivations. Users construct workflows in a standard virtual data language (VDL) describing “transformations” (executable programs), which are executed by a VDL interpreter to produce “derivations” (the executions of transformations). “Data objects” are entities consumed or produced by a derivation. In the VDS model, provenance is inferred by inverting the processing record to associate output data with input data. This approach places very little burden on the user to document data provenance.
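The inversion idea behind process-oriented provenance can be sketched as follows. This is an illustrative model only; the class and function names are our own and do not correspond to actual VDL constructs.

```python
# Hypothetical sketch of process-oriented provenance inversion, in the
# spirit of VDS. Names and file identifiers are illustrative.
from dataclasses import dataclass


@dataclass
class Derivation:
    transformation: str   # executable program that was run
    inputs: list          # data objects consumed
    outputs: list         # data objects produced


def invert(derivations):
    """Invert the processing log: map each output back to its derivation."""
    return {out: d for d in derivations for out in d.outputs}


def provenance(data_object, produced_by):
    """Recursively collect the chain of derivations behind a data object."""
    chain = []
    d = produced_by.get(data_object)
    if d is not None:
        chain.append(d)
        for inp in d.inputs:
            chain.extend(provenance(inp, produced_by))
    return chain


# Example log: raw scan -> skull-stripped image -> registered image
log = [
    Derivation("skull_strip", ["raw.img"], ["stripped.img"]),
    Derivation("register", ["stripped.img", "atlas.img"], ["registered.img"]),
]
index = invert(log)
for step in provenance("registered.img", index):
    print(step.transformation)   # register, then skull_strip
```

The user only records what was executed; the full lineage of any output is recovered automatically, which is why this style of system imposes so little documentation burden.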
The myGrid project (Oinn et al., 2004) provides middleware in support of computational experiments in the biological sciences, modeled as workflows in a grid environment. Users construct workflows written in the XScufl language using the Taverna engine. The LogBook plugin for Taverna (http://www.mygrid.org.uk/wiki/Mygrid/LogBook) allows users to log their experiments in a MySQL database and to browse, rerun, and maintain previously run workflows. This provenance log records the executables invoked, the parameters used, and the data used and derived, and is produced automatically when the workflow executes. As in VDS, this process-oriented log is inverted to infer the provenance of the intermediate and final sets of data.
Within the neuroimaging community, the XCEDE (XML-based Clinical Experiment Data Exchange) schema (Keator et al., 2006) also provides for the storage of data provenance information. The provenance information, captured manually, includes the hardware, compilation details and linked libraries, operating system and software versions, and parameters used to generate and document results. XCEDE is a data-oriented system in which the provenance metadata is associated with the actual data files.
However, the tools described above do not provide a simple mechanism for capturing provenance metadata from multiple packages, the capacity to represent complex, non-sequential analyses, or a level of detail sufficient to allow a derived set of data to be reproduced on a new platform. Hence the need for a provenance framework that can easily be applied to complex neuroimaging analyses.
The eXtensible Neuroimaging Archive Toolkit (XNAT) (Marcus et al., 2007) is a data management system for collecting data from multiple sources, storing it in a repository, and facilitating access to it by authorized users. It is a software platform comprising a data archive, a user interface, and the middleware to support both. As such, XNAT is not a provenance system but a data archiving system that uses an ArcBuild script to preprocess data and write a provenance record to the database. Data must be processed with ArcBuild inside the XNAT system for provenance metadata to be recorded and captured. The provenance schema we have described could potentially be incorporated into the XNAT archiving system, permitting users to process or submit previously processed data from outside the ArcBuild/XNAT system.
By maintaining software provenance, it will be possible to build intelligent systems that guide users in the design of analytic strategies. These tools can capture the expertise of algorithm developers, as well as the experience of experts at local institutions who have spent significant periods of time learning how best to apply specific tools to the analysis needs of the laboratory. The tools will inform the users of missing processing stages, suggest available and verified processing modules, and warn of incompatible data types.
The provenance framework described here will remain an open discussion within the neuroimaging community as it continues to evolve (http://provenance.loni.ucla.edu). All software tools and documentation will be available for distribution, discussion, and modification through our website. We encourage the community to become involved in the development not only of the schema, but of software tools that can take advantage of it.
Future directions include the adaptation of existing processing tools to make use of the provenance framework. The LONI Pipeline was developed to facilitate neuroimaging analyses. It is currently being extended to incorporate executable provenance into its modules and to intelligently propagate provenance files with its imaging results. The LONI Pipeline can accommodate almost any form of workflow; its underlying architecture is module-agnostic and does not limit the kinds of applications that can be run within it. LONI Pipeline workflows can therefore serve to document workflow provenance in almost any field of endeavor.
Workflows can also be enrolled in a database, creating a combined processing and provenance repository. A readily searchable database of commonly (and rarely) used workflows would greatly aid investigators in recreating the conditions of a particular analysis, reproducing previous results, and rerunning analyses with small modifications.
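A minimal sketch of such a searchable store is shown below, using an in-memory SQLite database. The table layout, workflow names, and software entries are hypothetical examples, not a proposed schema.

```python
# Minimal sketch of a searchable workflow-provenance store.
# The schema, workflow names, and rows are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE workflows (
    name TEXT, software TEXT, version TEXT, parameters TEXT)""")
rows = [
    ("cortical_thickness", "toolA", "3.0", "--surface --smooth 5"),
    ("tissue_segmentation", "toolB", "1.2", "--classes 3"),
]
db.executemany("INSERT INTO workflows VALUES (?, ?, ?, ?)", rows)

# Find every enrolled workflow that used a given software package,
# e.g. to rerun it with a small parameter modification.
hits = db.execute(
    "SELECT name, version FROM workflows WHERE software = ?",
    ("toolB",)).fetchall()
print(hits)  # [('tissue_segmentation', '1.2')]
```

Even a store this simple makes it possible to ask which analyses used a particular tool or version, which is the query an investigator needs when a software update calls prior results into question.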
In an era where digital information underlies much of the scientific enterprise and the manipulation of that data has become increasingly complex, the recording of data and methods provenance takes on greater importance. In this article, we describe an XML-based neuroimaging provenance description that can be implemented in any workflow environment. We envision the LONI Pipeline as fulfilling a role for neuroimaging similar to other frameworks in chemistry or high energy physics. We believe that data and workflow provenance form a major element of the program that promotes data processing methods description, data sharing, and study replication.