In this post-genomics era, data of increasing volume and complexity are being deposited into databases around the world. Biologists need to run complex queries against these data to test and drive their research hypotheses. Typically, each data source provides an advanced query interface on its website to satisfy this requirement. However, each site has its own solution, so the user faces a learning curve before they can start interacting with the data. A further problem is that researchers often need to query more than one data source, which means mastering more than one interface and cutting and pasting results between sites. If the analysis involves high-throughput data, this approach usually does not scale. To overcome this problem, many groups rely on bioinformaticians who can write scripts to interact with the varying programmatic interfaces of the different data sources. These bioinformaticians, in turn, often have to learn a different web service or application programming interface (API) for each resource. A preferable solution would be generic software that a biologist can use on top of any data source. BioMart [1] is such a solution.
BioMart is an open source data management system that comes with a range of query interfaces that allow users to group and refine data based upon many different criteria. In addition, the software features a built-in query optimiser for fast data retrieval. A BioMart installation can provide domain-specific querying of a single data source or function as a one-stop shop (web portal) to a wide range of BioMarts, as our central portal [2] does. All BioMart websites have the same look and feel (varying only in colour scheme and branding), which has obvious advantages for users moving between different resources. However, the real power of the system comes from integrated querying of the different BioMarts. If datasets share common identifiers (such as Ensembl gene IDs or UniProt IDs) or even mappings to a common genome assembly, these can be used to link BioMarts together in integrated queries. Additionally, these datasets do not have to be located on the same server or even at the same geographical location. This distributed solution has many advantages, not least that each site can use its own domain expertise to deploy its BioMart.
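The linking principle can be illustrated with a minimal sketch, assuming two small in-memory tables that stand in for datasets held by separate BioMarts. The records, the field names and the `link_datasets` helper are hypothetical and are not part of the BioMart software; only the idea of joining datasets on a shared identifier comes from the text.

```python
from collections import defaultdict


def link_datasets(left, right, key):
    """Join two lists of dict records on a shared identifier field."""
    # Index the right-hand records by the shared identifier.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    # Emit one merged record per matching (left, right) pair.
    linked = []
    for row in left:
        for match in index.get(row[key], []):
            merged = dict(row)
            merged.update(match)
            linked.append(merged)
    return linked


# Made-up records for illustration only; not real BioMart data.
genes = [
    {"gene_id": "GENE_A", "chromosome": "1"},
    {"gene_id": "GENE_B", "chromosome": "2"},
]
pathways = [
    {"gene_id": "GENE_A", "pathway": "pathway_x"},
    {"gene_id": "GENE_A", "pathway": "pathway_y"},
]

linked_rows = link_datasets(genes, pathways, "gene_id")
```

In this sketch only records whose identifier appears in both tables survive the join, which mirrors the requirement that linked datasets share a common identifier. A distributed deployment would fetch each side from a different server before joining, a step omitted here.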
BioMart also has the advantage of being integrated with external software packages such as BioConductor [3], the Distributed Annotation System (DAS) [4], Galaxy [5], Cytoscape [6] and Taverna [7]. This enables users to perform integrated queries with non-BioMart data sources as well as detailed analysis of the results. BioMart is also part of the GMOD (Generic Model Organism Database) [8] suite of tools for building a model organism site.
Originally developed for the Ensembl genome browser [9] as the EnsMart data warehouse [10], BioMart has now become a fully generic data integration solution. Although applicable to any type of data, BioMart is particularly suited to advanced searching of the complex descriptive data typically found in biological datasets. Numerous BioMarts have now been installed by external groups, in large part because of the software's automated deployment tools and cross-platform compatibility. These include model organism databases such as Gramene [11], Dictybase [12], Wormbase [13] and RGD (Rat Genome Database) [14], as well as the HapMap variation [15], Pancreatic Expression Database [16], Reactome pathways [17] and PRIDE proteomic [18] databases (see Table 1 for the full list). A wide variety of analyses and tasks are possible with the publicly available BioMarts, ranging from SNP (single nucleotide polymorphism) selection for candidate gene screening, microarray annotation and cross-species analysis through to recovery of disease links, sequence variations and expression patterns.
Table 1. Description of all publicly accessible BioMarts to date
The range of interfaces is designed with both biologists and bioinformaticians in mind. The simplest way of querying BioMart is via the web interface called MartView (either on our central portal [2] or by following the links on our main page [1] to the individual sites). Programmatic access is available via a Perl API or BioMart's web services (MartServices). An important and novel feature of BioMart is that it offers "scripting at the click of a button": a user can generate an API or MartServices script by building up a query on the MartView website and then clicking a button. All the interfaces allow the user to build up biological queries by first selecting the dataset(s) of interest, then the data to view and/or save (attributes), some optional restrictions on the query (filters) and finally the format for the data.
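The four-part query structure described above (dataset, attributes, filters, format) can be sketched as a small XML document of the kind a web-service client might submit. The following is a minimal illustration, not the definitive MartServices schema: the element and attribute names (`Query`, `Dataset`, `Filter`, `Attribute`, `formatter`, `virtualSchemaName`) follow a commonly seen BioMart query layout but should be treated as assumptions here, and the dataset, attribute and filter names are examples only. The `build_query` helper is hypothetical, and nothing is sent to a server.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, attributes, filters=None, formatter="TSV"):
    """Assemble a query document from dataset, attributes, filters and format."""
    # Top-level element carries the output format (assumed attribute names).
    query = ET.Element(
        "Query",
        {"virtualSchemaName": "default", "formatter": formatter, "header": "0"},
    )
    # Step 1: the dataset of interest.
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Step 3: optional restrictions on the query.
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Step 2: the data to view and/or save.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Example names below are illustrative assumptions, not taken from the text.
query_xml = build_query(
    dataset="hsapiens_gene_ensembl",
    attributes=["ensembl_gene_id", "external_gene_name"],
    filters={"chromosome_name": "1"},
)
print(query_xml)
```

Keeping the four choices as separate arguments mirrors the order in which the interfaces ask for them, so the same call shape would apply whichever dataset is queried.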