The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources; and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
The current 18th Database Issue of Nucleic Acids Research features descriptions of 96 new and 83 updated online databases covering various areas of molecular biology. It includes two editorials, one that discusses COMBREX, a new exciting project aimed at figuring out the functions of the ‘conserved hypothetical’ proteins, and one concerning BioDBcore, a proposed description of the ‘minimal information about a biological database’. Papers from the members of the International Nucleotide Sequence Database collaboration (INSDC) describe each of the participating databases, DDBJ, ENA and GenBank, principles of data exchange within the collaboration, and the recently established Sequence Read Archive. A testament to the longevity of databases, this issue includes updates on the RNA modification database, Definition of Secondary Structure of Proteins (DSSP) and Homology-derived Secondary Structure of Proteins (HSSP) databases, which have not been featured here in >12 years. There is also a block of papers describing recent progress in protein structure databases, such as Protein DataBank (PDB), PDB in Europe (PDBe), CATH, SUPERFAMILY and others, as well as databases on protein structure modeling, protein–protein interactions and the organization of inter-protein contact sites. Other highlights include updates of the popular gene expression databases, GEO and ArrayExpress, several cancer gene databases and a detailed description of the UK PubMed Central project. The Nucleic Acids Research online Database Collection, available at: http://www.oxfordjournals.org/nar/database/a/, now lists 1330 carefully selected molecular biology databases. The full content of the Database Issue is freely available online at the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).
The recent explosion of biological data and the concomitant proliferation of distributed databases make it challenging for biologists and bioinformaticians to discover the best data resources for their needs, and the most efficient way to access and use them. Despite a rapid acceleration in uptake of syntactic and semantic standards for interoperability, it is still difficult for users to find which databases support the standards and interfaces that they need. To solve these problems, several groups are developing registries of databases that capture key metadata describing the biological scope, utility, accessibility, ease-of-use and existence of web services allowing interoperability between resources. Here, we describe some of these initiatives including a novel formalism, the Database Description Framework, for describing database operations and functionality and encouraging good database practise. We expect such approaches will result in improved discovery, uptake and utilization of data resources.
Database URL: http://www.casimir.org.uk/casimir_ddf
The Unified Medical Language System (UMLS) is being designed to provide a uniform user interface to heterogeneous machine-readable bio-medical information resources, such as bibliographic databases, genetic databases, expert systems and patient records.1 Such an interface will have to recognize different ways of saying the same thing, and provide links to ways of saying related things. One way to represent the necessary associations is via a domain thesaurus. As no such thesaurus exists, and because, once built, it will be both sizable and in need of continuous maintenance, its design should include a methodology for building and maintaining it. We propose a methodology, utilizing lexically expanded schema inversion, and a design, called T. Lex, which together form one approach to the problem of defining and building a bio-medical thesaurus. We argue that the semantic locality implicit in such a thesaurus will support model-based reasoning in bio-medicine.2
Motivation: The world-wide community of life scientists has access to a large number of public bioinformatics databases and tools, which are developed and deployed using diverse technologies and designs. More and more of the resources offer programmatic web-service interface. However, efficient use of the resources is hampered by the lack of widely used, standard data-exchange formats for the basic, everyday bioinformatics data types.
Results: BioXSD has been developed as a candidate for standard, canonical exchange format for basic bioinformatics data. BioXSD is represented by a dedicated XML Schema and defines syntax for biological sequences, sequence annotations, alignments and references to resources. We have adapted a set of web services to use BioXSD as the input and output format, and implemented a test-case workflow. This demonstrates that the approach is feasible and provides smooth interoperability. Semantics for BioXSD is provided by annotation with the EDAM ontology. We discuss in a separate section how BioXSD relates to other initiatives and approaches, including existing standards and the Semantic Web.
Availability: The BioXSD 1.0 XML Schema is freely available at http://www.bioxsd.org/BioXSD-1.0.xsd under the Creative Commons BY-ND 3.0 license. The http://bioxsd.org web page offers documentation, examples of data in BioXSD format, example workflows with source codes in common programming languages, an updated list of compatible web services and tools and a repository of feature requests from the community.
Contact: firstname.lastname@example.org; email@example.com; firstname.lastname@example.org
BioPAX (Biological Pathway Exchange) is a standard language to represent biological pathways at the molecular and cellular level. Its major use is to facilitate the exchange of pathway data (http://www.biopax.org). Pathway data captures our understanding of biological processes, but its rapid growth necessitates development of databases and computational tools to aid interpretation. However, the current fragmentation of pathway information across many databases with incompatible formats presents barriers to its effective use. BioPAX solves this problem by making pathway data substantially easier to collect, index, interpret and share. BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. BioPAX was created through a community process. Through BioPAX, millions of interactions organized into thousands of pathways across many organisms, from a growing number of sources, are available. Thus, large amounts of pathway data are available in a computable form to support visualization, analysis and biological discovery.
pathway data integration; pathway database; standard exchange format; ontology; information system
Robust, programmatically accessible biomedical information services that syntactically and semantically interoperate with other resources are challenging to construct. Such systems require the adoption of common information models, data representations and terminology standards as well as documented application programming interfaces (APIs). The National Cancer Institute (NCI) developed the cancer common ontologic representation environment (caCORE) to provide the infrastructure necessary to achieve interoperability across the systems it develops or sponsors. The caCORE Software Development Kit (SDK) was designed to provide developers both within and outside the NCI with the tools needed to construct such interoperable software systems.
The caCORE SDK requires a Unified Modeling Language (UML) tool to begin the development workflow with the construction of a domain information model in the form of a UML Class Diagram. Models are annotated with concepts and definitions from a description logic terminology source using the Semantic Connector component. The annotated model is registered in the Cancer Data Standards Repository (caDSR) using the UML Loader component. System software is automatically generated using the Codegen component, which produces middleware that runs on an application server. The caCORE SDK was initially tested and validated using a seven-class UML model, and has been used to generate the caCORE production system, which includes models with dozens of classes. The deployed system supports access through object-oriented APIs with consistent syntax for retrieval of any type of data object across all classes in the original UML model. The caCORE SDK is currently being used by several development teams, including by participants in the cancer biomedical informatics grid (caBIG) program, to create compatible data services. caBIG compatibility standards are based upon caCORE resources, and thus the caCORE SDK has emerged as a key enabling technology for caBIG.
The caCORE SDK substantially lowers the barrier to implementing systems that are syntactically and semantically interoperable by providing workflow and automation tools that standardize and expedite modeling, development, and deployment. It has gained acceptance among developers in the caBIG program, and is expected to provide a common mechanism for creating data service nodes on the data grid that is under development.
BioLit is a web server which provides metadata describing the semantic content of all open access, peer-reviewed articles which describe research from the major life sciences literature archive, PubMed Central. Specifically, these metadata include database identifiers and ontology terms found within the full text of the article. BioLit delivers these metadata in the form of XML-based article files and as a custom web-based article viewer that provides context-specific functionality to the metadata. This resource aims to integrate the traditional scientific publication directly into existing biological databases, thus obviating the need for a user to search in multiple locations for information relating to a specific item of interest, for example published experimental results associated with a particular biological database entry. As an example of a possible use of BioLit, we also present an instance of the Protein Data Bank fully integrated with BioLit data. We expect that the community of life scientists in general will be the primary end-users of the web-based viewer, while biocurators will make use of the metadata-containing XML files and the BioLit database of article data. BioLit is available at http://biolit.ucsd.edu.
BioHackathon 2010 was the third in a series of meetings hosted by the Database Center for Life Sciences (DBCLS) in Tokyo, Japan. The overall goal of the BioHackathon series is to improve the quality and accessibility of life science research data on the Web by bringing together representatives from public databases, analytical tool providers, and cyber-infrastructure researchers to jointly tackle important challenges in the area of in silico biological research.
The theme of BioHackathon 2010 was the 'Semantic Web', and all attendees gathered with the shared goal of producing Semantic Web data from their respective resources, and/or consuming or interacting those data using their tools and interfaces. We discussed on topics including guidelines for designing semantic data and interoperability of resources. We consequently developed tools and clients for analysis and visualization.
We provide a meeting report from BioHackathon 2010, in which we describe the discussions, decisions, and breakthroughs made as we moved towards compliance with Semantic Web technologies - from source provider, through middleware, to the end-consumer.
BioHackathon; Open source; Software; Semantic Web; Databases; Data integration; Data visualization; Web services; Interfaces
Pathway-oriented experimental and computational studies have led to a significant accumulation of biological knowledge concerning three major types of biological pathway events: molecular signaling events, gene regulation events, and metabolic reaction events. A pathway consists of a series of molecular pathway events that link molecular entities such as proteins, genes, and metabolites. There are approximately 300 biological pathway resources as of April 2009 according to the Pathguide database; however, these pathway databases generally have poor coverage or poor quality, and are difficult to integrate, due to syntactic-level and semantic-level data incompatibilities.
We developed the Human Pathway Database (HPD) by integrating heterogeneous human pathway data that are either curated at the NCI Pathway Interaction Database (PID), Reactome, BioCarta, KEGG or indexed from the Protein Lounge Web sites. Integration of pathway data at syntactic, semantic, and schematic levels was based on a unified pathway data model and data warehousing-based integration techniques. HPD provides a comprehensive online view that connects human proteins, genes, RNA transcripts, enzymes, signaling events, metabolic reaction events, and gene regulatory events. At the time of this writing HPD includes 999 human pathways and more than 59,341 human molecular entities. The HPD software provides both a user-friendly Web interface for online use and a robust relational database backend for advanced pathway querying. This pathway tool enables users to 1) search for human pathways from different resources by simply entering genes/proteins involved in pathways or words appearing in pathway names, 2) analyze pathway-protein association, 3) study pathway-pathway similarity, and 4) build integrated pathway networks. We demonstrated the usage and characteristics of the new HPD through three breast cancer case studies.
HPD http://bio.informatics.iupui.edu/HPD is a new resource for searching, managing, and studying human biological pathways. Users of HPD can search against large collections of human biological pathways, compare related pathways and their molecular entity compositions, and build high-quality, expanded-scope disease pathway models. The current HPD software can help users address a wide range of pathway-related questions in human disease biology studies.
SalmonDB is a new multiorganism database containing EST sequences from Salmo salar, Oncorhynchus mykiss and the whole genome sequence of Danio rerio, Gasterosteus aculeatus, Tetraodon nigroviridis, Oryzias latipes and Takifugu rubripes, built with core components from GMOD project, GOPArc system and the BioMart project. The information provided by this resource includes Gene Ontology terms, metabolic pathways, SNP prediction, CDS prediction, orthologs prediction, several precalculated BLAST searches and domains. It also provides a BLAST server for matching user-provided sequences to any of the databases and an advanced query tool (BioMart) that allows easy browsing of EST databases with user-defined criteria. These tools make SalmonDB database a valuable resource for researchers searching for transcripts and genomic information regarding S. salar and other salmonid species. The database is expected to grow in the near feature, particularly with the S. salar genome sequencing project.
Database URL: http://genomicasalmones.dim.uchile.cl/
DAS is a widely adopted protocol for providing syntactic interoperability among biological databases. The popularity of DAS is due to a simplified and elegant mechanism for data exchange that consists of sources exposing their RESTful interfaces for data access. As a growing number of DAS services are available for molecular biology resources, there is an incentive to explore this protocol in order to advance data discovery and integration among these resources.
We developed DASMiner, a Matlab toolkit for querying DAS data sources that enables creation of integrated biological models using the information available in DAS-compliant repositories. DASMiner is composed by a browser application and an API that work together to facilitate gathering of data from different DAS sources, which can be used for creating enriched datasets from multiple sources.
The browser is used to formulate queries and navigate data contained in DAS sources. Users can execute queries against these sources in an intuitive fashion, without the need of knowing the specific DAS syntax for the particular source. Using the source's metadata provided by the DAS Registry, the browser's layout adapts to expose only the set of commands and coordinate systems supported by the specific source. For this reason, the browser can interrogate any DAS source, independently of the type of data being served.
The API component of DASMiner may be used for programmatic access of DAS sources by programs in Matlab. Once the desired data is found during navigation, the query is exported in the format of an API call to be used within any Matlab application. We illustrate the use of DASMiner by creating integrative models of histone modification maps and protein-protein interaction networks. These enriched datasets were built by retrieving and integrating distributed genomic and proteomic DAS sources using the API.
The support of the DAS protocol allows that hundreds of molecular biology databases to be treated as a federated, online collection of resources. DASMiner enables full exploration of these resources, and can be used to deploy applications and create integrated views of biological systems using the information deposited in DAS repositories.
Asparagine N-linked glycosylation is one of the most important forms of protein post-translational modification in eukaryotes and is one of the first metabolic pathways described at a biochemical level. Here, we report a new annotation of this pathway for the Human species, published after passing a peer-review process in Reactome. The new annotation presented here offers a high level of detail and provides references and descriptions for each reaction, along with integration with GeneOntology and other databases. The open-source approach of Reactome toward annotation encourages feedback from its users, making it easier to keep the annotation of this pathway updated with future knowledge. Reactome's web interface allows easy navigation between steps involved in the pathway to compare it with other pathways and resources in other scientific databases and to export it to BioPax and SBML formats, making it accessible for computational studies. This new entry in Reactome expands and complements the annotations already published in databases for biological pathways and provides a common reference to researchers interested in studying this important pathway in the human species. Finally, we discuss the status of the annotation of this pathway and point out which steps are worth further investigation or need better experimental validation.
aspargine N-linked glycosylation; Reactome database
Bacteriocins are very diverse group of antimicrobial peptides produced by a wide range of bacteria and known for their inhibitory activity against various human and animal pathogens. Although many bacteriocins are now well characterized, much information is still missing or is unavailable to potential users. The assembly of such information in one central resource such as a database would therefore be of great benefit to the exploitation of these bioactive molecules in the present context of increasing antibiotic resistance and natural bio-preservation need.
In the present paper, we present the development of a new and original database BACTIBASE that contains calculated or predicted physicochemical properties of 123 bacteriocins produced by both Gram-positive and Gram-negative bacteria. The information in this database is very easy to extract and allows rapid prediction of relationships structure/function and target organisms of these peptides and therefore better exploitation of their biological activity in both the medical and food sectors.
The BACTIBASE database is freely available at , web-based platform enabling easy retrieval, via various filters, of sets of bacteriocins that will enable detailed analysis of a number of microbiological and physicochemical data.
The laboratory mouse has become the organism of choice for discovering gene function and unravelling pathogenetic mechanisms of human diseases through the application of various functional genomic approaches. The resulting deluge of data has led to the deployment of numerous online resources and the concomitant need for formalized experimental descriptions, data standardization, database interoperability and integration, a need that has yet to be met. We present here the Mouse Resource Browser (MRB), a database of mouse databases that indexes 217 publicly available mouse resources under 22 categories and uses a standardised database description framework (the CASIMIR DDF) to provide information on their controlled vocabularies (ontologies and minimum information standards), and technical information on programmatic access and data availability. Focusing on interoperability and integration, MRB offers automatic generation of downloadable and re-distributable SOAP application-programming interfaces for resources that provide direct database access. MRB aims to provide useful information to both bench scientists, who can easily navigate and find all mouse related resources in one place, and bioinformaticians, who will be provided with interoperable resources containing data which can be mined and integrated.
Database URL: http://bioit.fleming.gr/mrb
Objective: Generalizing the data models underlying two prototype neurophysiology databases, the authors describe and propose the Common Data Model (CDM) as a framework for federating a broad spectrum of disparate neuroscience information resources.
Design: Each component of the CDM derives from one of five superclasses—data, site, method, model, and reference—or from relations defined between them. A hierarchic attribute-value scheme for metadata enables interoperability with variable tree depth to serve specific intra- or broad inter-domain queries. To mediate data exchange between disparate systems, the authors propose a set of XML-derived schema for describing not only data sets but data models. These include biophysical description markup language (BDML), which mediates interoperability between data resources by providing a meta-description for the CDM.
Results: The set of superclasses potentially spans data needs of contemporary neuroscience. Data elements abstracted from neurophysiology time series and histogram data represent data sets that differ in dimension and concordance. Site elements transcend neurons to describe subcellular compartments, circuits, regions, or slices; non-neuroanatomic sites include sequences to patients. Methods and models are highly domain-dependent.
Conclusions: True federation of data resources requires explicit public description, in a metalanguage, of the contents, query methods, data formats, and data models of each data resource. Any data model that can be derived from the defined superclasses is potentially conformant and interoperability can be enabled by recognition of BDML-described compatibilities. Such metadescriptions can buffer technologic changes.
As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, previously, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability. Concomitantly, the BioSample database is being developed to capture descriptive information about the biological samples investigated in projects. BioProject and BioSample records link to corresponding data stored in archival repositories. Submissions are supported by a web-based Submission Portal that guides users through a series of forms for input of rich metadata describing their projects and samples. Together, these databases offer improved ways for users to query, locate, integrate and interpret the masses of data held in NCBI's archival repositories. The BioProject and BioSample databases are available at http://www.ncbi.nlm.nih.gov/bioproject and http://www.ncbi.nlm.nih.gov/biosample, respectively.
Quantitative models of biochemical and cellular systems are used to answer a variety of questions in the biological sciences. The number of published quantitative models is growing steadily thanks to increasing interest in the use of models as well as the development of improved software systems and the availability of better, cheaper computer hardware. To maximise the benefits of this growing body of models, the field needs centralised model repositories that will encourage, facilitate and promote model dissemination and reuse. Ideally, the models stored in these repositories should be extensively tested and encoded in community-supported and standardised formats. In addition, the models and their components should be cross-referenced with other resources in order to allow their unambiguous identification.
BioModels Database http://www.ebi.ac.uk/biomodels/ is aimed at addressing exactly these needs. It is a freely-accessible online resource for storing, viewing, retrieving, and analysing published, peer-reviewed quantitative models of biochemical and cellular systems. The structure and behaviour of each simulation model distributed by BioModels Database are thoroughly checked; in addition, model elements are annotated with terms from controlled vocabularies as well as linked to relevant data resources. Models can be examined online or downloaded in various formats. Reaction network diagrams generated from the models are also available in several formats. BioModels Database also provides features such as online simulation and the extraction of components from large scale models into smaller submodels. Finally, the system provides a range of web services that external software systems can use to access up-to-date data from the database.
BioModels Database has become a recognised reference resource for systems biology. It is being used by the community in a variety of ways; for example, it is used to benchmark different simulation systems, and to study the clustering of models based upon their annotations. Model deposition to the database today is advised by several publishers of scientific journals. The models in BioModels Database are freely distributed and reusable; the underlying software infrastructure is also available from SourceForge https://sourceforge.net/projects/biomodels/ under the GNU General Public License.
Toxicity is a complex phenomenon involving the potential adverse effect on a range of biological functions. Predicting toxicity involves using a combination of experimental data (endpoints) and computational methods to generate a set of predictive models. Such models rely strongly on being able to integrate information from many sources. The required integration of biological and chemical information sources requires, however, a common language to express our knowledge ontologically, and interoperating services to build reliable predictive toxicology applications.
This article describes progress in extending the integrative bio- and cheminformatics platform Bioclipse to interoperate with OpenTox, a semantic web framework which supports open data exchange and toxicology model building. The Bioclipse workbench environment enables functionality from OpenTox web services and easy access to OpenTox resources for evaluating toxicity properties of query molecules. Relevant cases and interfaces based on ten neurotoxins are described to demonstrate the capabilities provided to the user. The integration takes advantage of semantic web technologies, thereby providing an open and simplifying communication standard. Additionally, the use of ontologies ensures proper interoperation and reliable integration of toxicity information from both experimental and computational sources.
A novel computational toxicity assessment platform was generated from integration of two open science platforms related to toxicology: Bioclipse, that combines a rich scriptable and graphical workbench environment for integration of diverse sets of information sources, and OpenTox, a platform for interoperable toxicology data and computational services. The combination provides improved reliability and operability for handling large data sets by the use of the Open Standards from the OpenTox Application Programming Interface. This enables simultaneous access to a variety of distributed predictive toxicology databases, and algorithm and model resources, taking advantage of the Bioclipse workbench handling the technical layers.
BioMart Central Portal is a first of its kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, ontology information and more. Anybody can contribute an independently maintained resource to the Central Portal, allowing it to be exposed to and shared with the research community, and linking it with the other resources in the portal. Users can take advantage of the common interface to quickly utilize different sources without learning a new system for each. The system also simplifies cross-database searches that might otherwise require several complicated steps. Several integrated tools streamline common tasks, such as converting between ID formats and retrieving sequences. The combination of a wide variety of databases, an easy-to-use interface, robust programmatic access and the array of tools make Central Portal a one-stop shop for biological data querying. Here, we describe the structure of Central Portal and show example queries to demonstrate its capabilities.
Database URL: http://central.biomart.org.
The National BioResource Project (NBRP) is a Japanese project that aims to establish a system for collecting, preserving and providing bioresources for use as experimental materials for life science research. It is promoted by 27 core resource facilities, each concerned with a particular group of organisms, and by one information center. The NBRP database is a product of this project. Thirty databases and an integrated database-retrieval system (BioResource World: BRW) have been created and made available through the NBRP home page (http://www.nbrp.jp). The 30 independent databases have individual features which directly reflect the data maintained by each resource facility. The BRW is designed for users who need to search across several resources without moving from one database to another. BRW provides access to a collection of 4.5-million records on bioresources including wild species, inbred lines, mutants, genetically engineered lines, DNA clones and so on. BRW supports summary browsing, keyword searching, and searching by DNA sequences or gene ontology. The results of searches provide links to online requests for distribution of research materials. A circulation system allows users to submit details of papers published on research conducted using NBRP resources.
With the growth of biological data in volume and heterogeneity, web search engines become key tools for researchers. However, general-purpose search engines are not specialized for the search of biological data.
Here, we present an approach at developing a biological web search engine based on the Semantic Web technologies and demonstrate its implementation for retrieving gene- and protein-centered knowledge. The engine is available at http://www.integromedb.org.
The IntegromeDB search engine allows scanning data on gene regulation, gene expression, protein-protein interactions, pathways, metagenomics, mutations, diseases, and other gene- and protein-related data that are automatically retrieved from publicly available databases and web pages using biological ontologies. To perfect the resource design and usability, we welcome and encourage community feedback.
data integration; search engine; biological ontologies
Summary: PubChem is a public repository of chemical structures and associated biological activities. The PubChem BioAssay database contains assay descriptions, conditions and readouts and biological screening results that have been submitted by the biomedical research community. The PubChem web site and Power User Gateway (PUG) web service allow users to interact with the data and raw files are available via FTP.
These resources are helpful to many but there can also be great benefit by using a software API to manipulate the data. Here, we describe a Java API with entity objects mapped to the PubChem Schema and with wrapper functions for calling the NCBI eUtilities and PubChem PUG web services. PubChem BioAssays and associated chemical compounds can then be queried and manipulated in a local relational database. Features include chemical structure searching and generation and display of curve fits from stored dose–response experiments, something that is not yet available within PubChem itself. The aim is to provide researchers with a fast, consistent, queryable local resource from which to manipulate PubChem BioAssays in a database agnostic manner. It is not intended as an end user tool but to provide a platform for further automation and tools development.
BioMart is a freely available, open source, federated database system that provides a unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework. BioMart allows databases hosted on different servers to be presented seamlessly to users, facilitating collaborative projects between different research groups. BioMart contains several levels of query optimization to efficiently manage large data sets and offers a diverse selection of graphical user interfaces and application programming interfaces to ensure that queries can be performed in whatever manner is most convenient for the user. The software has now been adopted by a large number of different biological databases spanning a wide range of data types and providing a rich source of annotation available to bioinformaticians and biologists alike.
Database URL: http://www.biomart.org
The exponential growth of research in molecular biology has brought concomitant proliferation of databases for stocking its findings. A variety of protein sequence databases exist. While all of these strive for completeness, the range of user interests is often beyond their scope. Large databases covering a broad range of domains tend to offer less detailed information than smaller, more specialized resources, often creating a need to combine data from many sources in order to obtain a complete picture. Scientific researchers are continually developing new specific databases to enhance their understanding of biological processes.
In this article, we present the implementation of a new tool for protein data analysis. With its easy-to-use user interface, this software provides the opportunity to build more specialized protein databases from a universal protein sequence database such as Swiss-Prot. A family of proteins known as bacteriocins is analyzed as 'proof of concept'.
SciDBMaker is stand-alone software that allows the extraction of protein data from the Swiss-Prot database, sequence analysis comprising physicochemical profile calculations, homologous sequences search, multiple sequence alignments and the building of new and more specialized databases. It compiles information with relative ease, updates and compares various data relevant to a given protein family and could solve the problem of dispersed biological search results.