InterPro (1) is an integrative database which was founded 10 years ago when the PROSITE (2), PRINTS (3), Pfam (4) and ProDom (5) databases formed a consortium to amalgamate the predictive signatures they individually produced into a single resource. Since then, six other member databases have also joined and their data has been integrated: SMART (6), TIGRFAMs (7), PIRSF (8), SUPERFAMILY (9), PANTHER (10) and Gene3D (11). The signatures of each member database are built using different but complementary methodologies.
When different signatures match the same set of proteins in the same region of the sequence, they are presumed to describe the same functional family, domain or site and are placed into a single InterPro entry by a curator. Grouping equivalent signatures from different sources in this way gives the signatures consistent names and annotation. It also highlights potentially erroneous signature hits. Remote homologues might be expected to match only a single signature from a multiple-signature entry, but such outliers could equally be false positives, so users should treat single-signature matches more cautiously.
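The grouping logic described above can be illustrated with a minimal sketch. It is not InterPro's curation procedure, which relies on curator judgement; the function names, the data layout and the 0.5 overlap threshold are assumptions made purely for illustration.

```python
from collections import defaultdict


def overlap_fraction(region_a, region_b):
    """Fraction of the shorter (start, end) region covered by the overlap.

    Coordinates are assumed to be 1-based and inclusive.
    """
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    overlap = max(0, end - start + 1)
    shorter = min(region_a[1] - region_a[0] + 1, region_b[1] - region_b[0] + 1)
    return overlap / shorter


def looks_equivalent(matches_a, matches_b, min_overlap=0.5):
    """True if two signatures hit the same proteins in overlapping regions.

    matches_a / matches_b map protein id -> (start, end).
    min_overlap is an illustrative threshold, not an InterPro parameter.
    """
    if matches_a.keys() != matches_b.keys():
        return False
    return all(
        overlap_fraction(matches_a[p], matches_b[p]) >= min_overlap
        for p in matches_a
    )


def single_signature_proteins(entry_signatures, matches):
    """Proteins hit by exactly one signature of a multi-signature entry.

    matches is an iterable of (protein, signature) pairs. The returned
    proteins may be remote homologues or false positives.
    """
    hits = defaultdict(set)
    for protein, signature in matches:
        if signature in entry_signatures:
            hits[protein].add(signature)
    return {p: next(iter(s)) for p, s in hits.items() if len(s) == 1}
```

For example, with two hypothetical signatures `sigA` and `sigB` grouped in one entry, a protein matched only by `sigA` would be returned by `single_signature_proteins` and could then be inspected more cautiously.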
Collectively considering the total set of signatures from the member databases also increases overall coverage of protein space. The coverage of various sequence databases by InterPro signatures is shown in Table 1. InterPro signature matches to the UniProt Knowledgebase [UniProtKB; (12)] are regularly calculated using the InterProScan software package (13), and this information is used to aid UniProtKB curators in their annotation of Swiss-Prot proteins, as well as forming the basis of the automatic systems that add annotation to UniProtKB/TrEMBL (12). The UniParc protein archive and UniMES meta-genomic sequence database (14) are also put through InterPro analysis pipelines, and many genomic sequencing projects continue to use InterPro and its software to functionally characterize whole genomes (15, 16).
| Table 1. Coverage of the major sequence databases UniProtKB, UniParc and UniMES by InterPro signatures |
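As a rough illustration of why pooling signatures raises coverage, the sketch below computes coverage as the fraction of sequences with at least one signature match, first per member database and then for their union. The function name, the database labels and the sequence identifiers are hypothetical, and the numbers bear no relation to the figures in Table 1.

```python
def coverage(all_sequences, matched_sequences):
    """Fraction of sequences in all_sequences that have at least one match."""
    universe = set(all_sequences)
    if not universe:
        return 0.0
    return len(universe & set(matched_sequences)) / len(universe)


# Toy data: four sequences and the subsets matched by two hypothetical
# member databases.
sequences = {"seq1", "seq2", "seq3", "seq4"}
matched_by_db = {
    "db_one": {"seq1", "seq2"},
    "db_two": {"seq2", "seq3"},
}

per_db = {name: coverage(sequences, hits) for name, hits in matched_by_db.items()}
combined = coverage(sequences, set().union(*matched_by_db.values()))
```

Here each database alone covers half of the toy sequence set, whereas the union covers three quarters, because the two sets of matches only partly overlap.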
If a signature matches only a subset of the proteins matched by another signature, it is likely to be more functionally or taxonomically specific than the other. In this case, the signatures are deemed to be related: the signature matching the subset is termed the child and the other signature its parent. These parent–child relationships are created by InterPro's curators during the integration process, and a hierarchy of how the integrated signatures relate to each other is thus constructed. In this way, InterPro also increases the depth of annotation of protein space.
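The subset criterion can be sketched as follows. In InterPro the relationships are assigned by curators, so this only shows how candidate parent and child pairs might be proposed from match sets; the function name and the example identifiers are hypothetical.

```python
def propose_parent_child(signature_proteins):
    """Return sorted (parent, child) pairs based on strict subset of matches.

    signature_proteins maps a signature id to the set of protein ids it
    matches. A signature with no matches is never proposed as a child.
    """
    pairs = []
    for child, child_set in signature_proteins.items():
        for parent, parent_set in signature_proteins.items():
            # `<` on sets tests for a strict (proper) subset.
            if child != parent and child_set and child_set < parent_set:
                pairs.append((parent, child))
    return sorted(pairs)


example = {
    "family": {"P1", "P2", "P3"},
    "subfamily": {"P1", "P2"},
    "unrelated": {"P9"},
}
candidates = propose_parent_child(example)
```

With this toy input, `candidates` contains the single pair `("family", "subfamily")`; the signature `unrelated` shares no proteins with the others and is left out of the hierarchy.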
Once an InterPro entry is created, curators add annotation, such as a descriptive abstract, name and cross-references to other resources, including Gene Ontology (GO) terms (17). Semi-automatic procedures create and maintain links to an array of other databases, including the protease resource MEROPS (18), the protein interaction database IntAct (19), the protein sequence clusters in CluSTr (20) and the 3D protein structure database PDB (21). Additionally, if a protein has a solved 3D structure in PDB or a structure modelled in either the MODBASE (22) or SWISS-MODEL (23) databases, this information is shown together with the member databases’ signature matches in the graphical display on the InterPro Web interface.
Users can access all pre-computed matches of signatures to UniProtKB via the web interface in a variety of graphical and text-based formats. They can change how these matches are shown, for example by sorting by UniProtKB identifier or name, or by electing to display matches based on their taxonomy, solved 3D structures or splice variants. They can also download XML-format files of matches to UniProtKB, the UniProt Archive (UniParc) and the UniMES meta-genomic sequence database.
InterProScan is made available via the web at http://www.ebi.ac.uk/Tools/InterProScan/, and the entire package can be downloaded from the FTP site ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/index.html. InterProScan allows users to submit their own sequences to the search algorithms and processing from InterPro and its member databases. They can receive results in various formats showing the signatures that match their sequence(s), the InterPro entry (if any) into which each signature is integrated and any GO terms associated with those entries. SOAP-based web services also exist (http://www.ebi.ac.uk/Tools/webservices/WSInterProScan.html) which allow users to submit their own nucleotide and protein sequences programmatically (24).