New cross-references have been added, linking InterPro entries to related enzyme and pathway information in the PRIAM (21), Reactome (22), KEGG (23), MetaCyc (24) and UniPathway (25) resources. An automatic procedure checks the type of proteins matched to an InterPro entry and, if a significant proportion (>80%) are found to belong to a particular enzyme family or pathway, a link is made to the appropriate resource. By adding this information, InterPro can now be used for pathway analysis; for example, to examine whether or not a complete genome contains the protein components predicted to be sufficient for a particular reaction or pathway.
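The thresholding step of the automatic procedure can be illustrated with a minimal sketch. It is not the InterPro production code: the input format (a mapping from protein accession to a set of enzyme or pathway identifiers), the function name and the example identifiers are all assumptions made for illustration; only the >80% cut-off comes from the text.

```python
from collections import Counter

# Cut-off stated in the text: more than 80% of matched proteins.
THRESHOLD = 0.80


def propose_links(protein_annotations, threshold=THRESHOLD):
    """Return annotations carried by more than `threshold` of the proteins.

    `protein_annotations` is a hypothetical input: a dict mapping each
    protein matched by one InterPro entry to the set of enzyme-family or
    pathway identifiers assigned to that protein.
    """
    total = len(protein_annotations)
    if total == 0:
        return []
    counts = Counter()
    for annotations in protein_annotations.values():
        # set() guards against an identifier being counted twice per protein
        counts.update(set(annotations))
    return sorted(a for a, n in counts.items() if n / total > threshold)


# Illustrative data: 9 of 10 proteins share one (made-up) identifier.
example = {f"P{i:05d}": {"ENZYME_A"} for i in range(9)}
example["P00009"] = {"PATHWAY_B"}
links = propose_links(example)
```

In this example only `ENZYME_A` passes the cut-off (9 of 10 proteins), so it is the only link that would be proposed for the entry.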
A new XML schema has been adopted by all InterPro Consortium members to promote data exchange with each other and with third parties.
The schema defines three data formats: signature annotation, protein matches and nucleotide sequence matches for all six reading frames. Currently the signature annotation XML format is used in the InterPro production process to import annotation from four Consortium members (PRINTS, PROSITE, Pfam and PIRSF), which has led to a reduction in import time and complexity. The intention is to roll this format out to other Consortium partners in the near future. The protein-match XML format is available from the beta version of InterProScan 5 (see below) to facilitate interoperability and integration with third-party pipelines and applications: this facility will be available for nucleotide sequences shortly.
A new user interface
With the aim of improving the InterPro user experience, a new Web-based interface has been developed. The interface has been publicly available at http://wwwdev.ebi.ac.uk/interpro as a beta release since January 2011. Several goals have been addressed in this development, including improvements in usability, the provision of additional functionality, and many improvements to the aesthetics of the interface. These goals have been driven by a user-centred design approach to improve usability and identify important functionality, coupled with a professional graphic design process. Findings gathered from user surveys, formal usability testing, user interviews and reviews of several years of support requests have allowed the InterPro team to focus interface development on real user needs.
Developing an interface to a conceptually complex system such as InterPro is challenging. The complexity of the underlying data model and the integrated nature of the InterPro resources make it difficult to avoid placing a high cognitive load on the user. A major emphasis of the new design has been to develop individual pages that are as clutter-free and intuitive as possible, freeing the user to focus on the biological problem that they are attempting to address, rather than forcing them to think about how to interact with the interface.
Concrete examples of these improvements include the division of the previously complex and confusing ‘Entry page’ into eight separate, cross-referenced pages. Each page is clearly named, so users can easily find the content they require, without having to wade through irrelevant detail. Graphical elements have been employed to provide contextual clues, including icons representing proteins, member database signatures and InterPro entries, with the latter having different icons to represent protein families, domains, sites and repeats. This simple change has had a demonstrably positive impact, allowing users to identify the entities presented on the interface with greater ease and speed. The entry ‘overview’ page is illustrated in Figure 1.
Figure 1. The ‘Overview’ page on the new set of InterPro entry pages, including the family hierarchy for this entry, an extensive description of the family and cross references to three GO terms that are associated with this family.
Users can now search InterPro directly with a protein sequence by pasting the sequence into the text area provided on the home page. InterPro then performs a fast look-up of proteins for which matches have already been calculated. If the sequence is available in InterPro, the user is taken to the new protein page directly. If the sequence is not present in InterPro, it is submitted automatically to the InterProScan service, which returns results once the analysis is complete. Tighter integration of these two search services is currently being developed to ensure that users are presented with results in the same way by both InterPro and InterProScan. This improvement will be included in the final released version of the new InterPro Website.
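The look-up-then-fall-back behaviour described above can be sketched as follows. This is an illustrative outline only: keying the pre-calculated matches on an MD5 digest of the normalised sequence is an assumption (the text does not say how the look-up is keyed), and `precalculated` and `submit_to_scan` are hypothetical stand-ins for the match store and the InterProScan submission service.

```python
import hashlib


def normalise(sequence):
    """Remove whitespace and upper-case a pasted protein sequence."""
    return "".join(sequence.split()).upper()


def sequence_key(sequence):
    """Assumed look-up key: MD5 digest of the normalised sequence."""
    return hashlib.md5(normalise(sequence).encode("ascii")).hexdigest()


def search(sequence, precalculated, submit_to_scan):
    """Return pre-calculated results if known, otherwise submit for analysis.

    `precalculated` (dict: key -> protein accession) and `submit_to_scan`
    (callable taking a sequence) are hypothetical placeholders.
    """
    key = sequence_key(sequence)
    if key in precalculated:
        # Sequence already in the store: go straight to the protein page.
        return ("protein_page", precalculated[key])
    # Unknown sequence: hand it over to the (stubbed) scanning service.
    return ("interproscan", submit_to_scan(normalise(sequence)))


# Toy usage with a made-up sequence and accession.
store = {sequence_key("MKTAYIAK"): "EXAMPLE_ACC"}
hit = search("mkta yiak\n", store, lambda seq: "job-for-" + seq)
miss = search("MSSS", store, lambda seq: "job-for-" + seq)
```

The first call is answered from the store despite the differences in case and whitespace; the second falls through to the stubbed submission function.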
Over the last 3 years, InterProScan has been completely re-written using the Java programming language. The new InterProScan is now available as a beta release (version 5beta2) for public evaluation and comment; details of how to obtain and install it can be found at http://code.google.com/p/interproscan/wiki/RunningStandaloneInterProScan5. The new version exploits modern, stable Java technologies. A major focus of development has been to improve both the reliability and the scalability of InterProScan to allow it to support large-scale, high-throughput sequence analysis. The final version will be easy to download and install on a variety of platforms.
New functionality has been incorporated into InterProScan version 5, including a fast pre-calculated match lookup Web-service. This has the advantage that users wishing to install InterProScan locally are not obliged to download the complete set of pre-calculated matches; however, it is possible to download and install this service locally, should users wish to make confidential use of InterProScan behind a firewall. The existing cross-references to InterPro entries and GO annotations are also provided, as in the current version of InterProScan. A mechanism to allow matches to be calculated against nucleotide sequence data will be available in the final version, using the EMBOSS getorf program. This new service allows the mapping of predicted features back to coordinates on the submitted nucleic acid sequence.
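Mapping a feature predicted on a translated open reading frame back to the submitted nucleic acid sequence is simple arithmetic, sketched below. The sketch does not reproduce InterProScan or getorf behaviour: it assumes 1-based, inclusive coordinates and assumes that `orf_start` is the forward-strand position of the first base of the first codon (for a reverse-strand ORF, this is the highest coordinate of the ORF). The function name and parameters are hypothetical.

```python
def feature_to_nucleotide(orf_start, strand, aa_start, aa_end):
    """Map protein feature coordinates to forward-strand nucleotide coordinates.

    All coordinates are 1-based and inclusive (an assumption of this sketch).
    Returns (low, high) positions on the submitted nucleotide sequence.
    """
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("invalid protein coordinates")
    if strand == "+":
        low = orf_start + 3 * (aa_start - 1)   # first base of first codon
        high = orf_start + 3 * aa_end - 1      # last base of last codon
        return (low, high)
    if strand == "-":
        # On the reverse strand, codons run towards lower coordinates.
        high = orf_start - 3 * (aa_start - 1)
        low = orf_start - 3 * aa_end + 1
        return (low, high)
    raise ValueError("strand must be '+' or '-'")


forward = feature_to_nucleotide(10, "+", 1, 2)   # (10, 15)
reverse = feature_to_nucleotide(120, "-", 1, 2)  # (115, 120)
```

For example, residues 1 to 2 of a forward-strand ORF starting at base 10 cover bases 10 to 15, whereas the same residues of a reverse-strand ORF whose first codon begins at base 120 cover bases 115 to 120.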
In July 2009, a BioMart was added to the InterPro suite of services. BioMart provides users with the ability to retrieve large sets of data, based on sophisticated queries that may incorporate multiple filters. Users are able to specify precisely which fields are included in the results returned. The InterPro BioMart has been described previously (26), including a detailed explanation of how to use the BioMart with several example queries.
The most important benefit provided by this feature is the ability to interrogate InterPro for multiple entries, proteins or member database signatures in a single query, which is a feature not available from the main InterPro Web interface. In addition, BioMart provides an easy-to-use REST Web service for programmatic access to InterPro data. The InterPro BioMart is linked from the InterPro home-page, and is also available directly from the BioMart Central Portal at http://www.biomart.org. The BioMart is exploited extensively throughout the main InterPro Web pages to allow users to download results in ‘tab-separated values’ (TSV) format. The BioMart user interface is illustrated in Figure 2.
Figure 2. The InterPro BioMart. This example illustrates the use of the BioMart to return a large set of data. In this case, a query has been built to return all proteins that are predicted to be members of the rhodopsin-like GPCRs (IPR000276) in Drosophila melanogaster.
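For programmatic access, BioMart services are generally queried by sending an XML document that names a dataset, its filters and the attributes (columns) to return. The sketch below only builds such a document for a query like the one in Figure 2; it does not contact any server. The dataset, filter and attribute names (`protein`, `entry_id`, `protein_tax_id`, `protein_accession`, `protein_name`) are hypothetical placeholders, not the actual InterPro BioMart configuration, which should be taken from the service itself.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string requesting TSV output.

    `dataset`, the filter names and the attribute names passed in are
    supplied by the caller; the values used below are illustrative only.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # tab-separated values, as in the text
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Hypothetical names; 7227 is the NCBI taxonomy ID of D. melanogaster.
xml_query = build_query(
    "protein",
    {"entry_id": "IPR000276", "protein_tax_id": "7227"},
    ["protein_accession", "protein_name"],
)
```

Keeping query construction separate from submission makes it easy to inspect the generated XML before it is sent to whichever BioMart endpoint is in use.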
InterPro DAS service
The Distributed Annotation System, DAS (27), is used extensively throughout bioinformatics to allow sharing of annotation on both nucleotide and protein sequences and protein structure. InterPro data were previously available as a single DAS data-source provided and maintained by the Ensembl team at the Wellcome Trust Sanger Institute.
In March 2010 InterPro DAS-service provision moved to the EBI, at the same time being extended to provide three DAS data-sources as described in .