In biological or clinical research, the creation of knowledge, here defined as "the realisation and understanding of patterns and their implications existing in information", relies on data mining. This in turn requires the collection and integration of a diverse set of up-to-date data and the associated context, i.e. information. These sets include unstructured information from the literature, specifically extracted information from the multitude of available databases, experimental data from "-omics" platforms, as well as phenotype information and clinical data. Although a large amount of information is stored in numerous different databases (the 2010 NAR database issue lists more than 1200 [1]), even more is still embedded in unstructured free text. Over the last 15 years, a large number of methods and software tools have been developed to integrate aspects of biological knowledge, such as signalling pathways or functional annotation, with experimental data. However, it has proven extremely difficult to couple true semantic integration (i.e. the mapping of equivalent meaning and objects) across all information types relevant in a life science project with a flexible and extendible data model, robustness against structural changes in services and data, transparent usage, and low set-up and maintenance requirements (see [2] for an excellent recent review). In principle, this difficulty arises from the high complexity of life science data, which is partly an artefact of the fragmented landscape of data sources but also stems from reasons integral to the life sciences. The ever-extending "parts-list of life" itself already offers an astounding number of object classes, from the molecular to the organism level, even if common names/identifiers and definitions could be agreed upon. In addition, experimental data can only be interpreted in the context of the exact identity of the experimental sample, the sample's environment, the sample's processing, and the processing and quality of the generated data. Even more than the occasional extension of the "parts-list" through our growing knowledge, technical development continually generates new data types, processing methods and experimental conditions. While life science projects in general will (hopefully) share some concepts, almost every one will require some individual adjustment to integrate and view the relevant information. Therefore, an optimal data integration approach will ensure that the data model can be based on existing concepts (ideally ontological, i.e. a controlled, structured vocabulary) yet remains flexible and extendible by the advanced user. In this respect, today's most successful (i.e. most widely used) data integration approaches, such as SRS [3
] or Entrez [4
] show only weak, cross-reference-based data integration without semantic mapping to a common concept (categorised as link/index integration by Köhler [5
] and Stein [6
]). They depend on pairwise mappings between individual database entries provided by the data source, e.g. from a protein sequence entry to the corresponding transcript; these mappings lack semantic meaning, i.e. the notion that a protein is expressed from a gene cannot be stated or queried (see the sketch following this paragraph). Additional processing and data mapping is required to answer even simple questions such as "which molecular mechanisms are known to be involved in the pathology of chronic obstructive pulmonary disease?". Currently, custom-developed data warehouses such as Atlas [7
], BIOZON [8
] or BioGateway [9
] are the most common technical concept to achieve full semantic integration (in public and industry projects). While these are ideally suited to answer complex queries, their inflexible, pre-determined data model and the necessary, often difficult, data synchronisation result in high set-up and maintenance costs. Further, adapting the structure of such a data warehouse to an ever-changing environment or to new requirements is difficult at best [6
]. Fortunately, as more data sources start to adopt semantic web representations such as OWL [10
] and RDF [11
], the maintenance of semantic mappings becomes less of an issue: along with adopting a common language to transport semantics, many data sources also standardise the semantics they provide, for example by using common entity references and ontologies.
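To make the difference between link-based cross-references and semantic mappings concrete, the following minimal sketch (in Python, using the rdflib library; all URIs and relation names are hypothetical placeholders rather than identifiers from the resources cited above) shows how a typed RDF statement, unlike a bare cross-reference, can be queried for its meaning:

    from rdflib import Graph, Namespace

    # Hypothetical namespace; real resources would use their own URIs and ontologies.
    EX = Namespace("http://example.org/lifescience/")
    g = Graph()

    # Link/index integration: an untyped cross-reference between two entries.
    g.add((EX.protein_P123, EX.xref, EX.transcript_T456))

    # Semantic integration: the relation itself carries meaning.
    g.add((EX.protein_P123, EX.is_expressed_from, EX.gene_G789))
    g.add((EX.gene_G789, EX.involved_in, EX.copd_pathology))

    # With typed relations, a question such as "which genes are involved in
    # COPD pathology?" becomes a direct query.
    results = g.query("""
        PREFIX ex: <http://example.org/lifescience/>
        SELECT ?gene WHERE { ?gene ex:involved_in ex:copd_pathology . }
    """)
    for row in results:
        print(row.gene)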
An optimisation, at least regarding data synchronisation, has been to present a semantically fully integrated view of the data while the underlying data is assembled on-the-fly from distributed sources using a coherent data model and semantic mappings [12
] (categorised as federation/view integration by Köhler and Stein [5
]). Details of this approach vary widely (a minimal example of such on-the-fly querying of a distributed resource follows this paragraph). The ad-hoc data assembly process can be provided by home-made scripts or, more recently, by workflow engines such as Taverna [14
]. The data model can be programmed with a specific language as in Kleisli [15
] or may make use of standard ontologies as with TAMBIS [16
]. Semantic mapping to the common concept can be produced by a view-providing environment, such as BioMediator [17
] and the Bio2RDF project [18
], or can come from individual integrated data sources. In the latter case, the data sources either provide such mappings voluntarily, working for the common good of the "semantic web" [19
], or are forced to do so by a closed application environment such as caBIG [20
], Gen2Phen/PaGE-OM [12
] or GMOD [22
]. While conceptually elegant, these approaches have some disadvantages: the start-up costs are quite high (e.g. [13
]), the performance is determined by the slowest and least stable of the integrated resources, complex queries result in large joins that are hard to optimise, and data models are often hard to extend. Ad-hoc desktop data integration and visualisation tools such as Cytoscape [24
], Osprey [25
] or ONDEX [26
], on the other hand, combine excellent flexibility with good performance due to local data storage; however, they do not allow large-scale knowledge bases to be collaboratively generated, managed and shared.
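As an illustration of such on-the-fly assembly from a distributed, semantically annotated source, the following sketch retrieves a few statements about a gene from a public Bio2RDF SPARQL endpoint using the Python SPARQLWrapper library (the endpoint URL, the example gene identifier and the query pattern are assumptions for illustration and may not match the current Bio2RDF deployment):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Assumed public endpoint; Bio2RDF endpoints and schemas change over time.
    endpoint = SPARQLWrapper("https://bio2rdf.org/sparql")

    # Illustrative query: list a handful of statements about one gene record.
    endpoint.setQuery("""
        SELECT ?predicate ?object WHERE {
            <http://bio2rdf.org/geneid:1080> ?predicate ?object .
        } LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    # The data is fetched and joined at query time rather than replicated locally,
    # which is why performance depends on the slowest remote resource.
    results = endpoint.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["predicate"]["value"], binding["object"]["value"])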
Another issue, which is only partially addressed by current data integration solutions, is the need to organise not only public information but also project-specific knowledge and data, to keep it private or partially private for some time, to store and connect experimental results and the corresponding meta-information about materials and methods, and, if eventually verified, to merge it into the pool of common knowledge. This may, for example, take the form of an existing signal transduction pathway which is privately extended with new members or connections. The extension is then published and discussed within a specific project until it is accepted as common knowledge. While data resources such as GEO provide the option to keep submitted data private for some time, they generally do not allow existing knowledge to be extended as described above, nor do they allow existing data to be annotated with private or public comments.
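One possible way to represent such a staged life cycle of project knowledge (purely illustrative; the class, relation and status names below are hypothetical and not taken from any of the systems cited above) is to attach a visibility state to every assertion and to promote it only once it has been verified:

    from dataclasses import dataclass
    from enum import Enum

    class Visibility(Enum):
        PRIVATE = "private"   # visible to the owner only
        PROJECT = "project"   # shared and discussed within one project
        PUBLIC = "public"     # accepted as common knowledge

    @dataclass
    class Assertion:
        subject: str          # e.g. a pathway member
        predicate: str        # e.g. "activates"
        obj: str
        visibility: Visibility = Visibility.PRIVATE

        def share_with_project(self) -> None:
            self.visibility = Visibility.PROJECT

        def accept_as_common_knowledge(self) -> None:
            self.visibility = Visibility.PUBLIC

    # A privately proposed extension of an existing pathway ...
    new_edge = Assertion("ReceptorX", "activates", "KinaseY")
    # ... is first shared and discussed within the project ...
    new_edge.share_with_project()
    # ... and, once verified, merged into the pool of common knowledge.
    new_edge.accept_as_common_knowledge()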
Our challenge was to develop a knowledge management environment that achieves several goals: focus on the management of project-specific knowledge; ease data model generation and extension; provide completely flexible data integration and reporting methods combined with intuitive visual navigation and query generation; and address the issues of set-up and maintenance cost.
To do so, we chose to apply different aspects of the approaches described above. In the next sections we describe the creation of a knowledge base for chronic diseases based on the BioXM software platform, which efficiently models complex research environments with a flexible management, query and reporting interface that automatically adapts to the conceptualisation of the modelled information.