The genome sequencing centres are generating raw sequence data at an alarming rate, and the result is a need for automated sequence analysis methods. The automatic analysis of protein sequences is possible through the use of ‘protein signatures’, which are methods for diagnosing a domain or characteristic region of a protein family in a protein sequence. A number of protein signature databases have been developed, each using a variation on the handful of signature methods available, which include patterns, profiles and hidden Markov models (HMMs). These databases are most effective when used together, rather than in isolation. InterPro (1) integrates into one resource the major protein signature databases: PROSITE (2), which uses regular expressions and profiles, PRINTS (3), which uses position-specific scoring matrix-based (PSSM-based) fingerprints, ProDom (4), which uses automatic sequence clustering, and Pfam (5), SMART (6), TIGRFAMs (7), PIRSF (also known as PIR SuperFamily) (8) and SUPERFAMILY (9), all of which use HMMs.
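To make the simplest of these signature methods concrete, the following minimal sketch translates a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It is an illustration only, not the scanning software used by PROSITE or InterPro: the function name `prosite_to_regex` is hypothetical, the example sequences are synthetic, and the sketch assumes only the basic syntax elements (`x` wildcards, `[...]` allowed residues, `{...}` excluded residues, `(n)` or `(n,m)` repeats, and `<`/`>` terminal anchors at the ends of the pattern).

```python
import re

# One pattern element: a residue, 'x', [allowed] or {excluded},
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex (basic syntax only)."""
    body = pattern.strip().rstrip(".")
    # '<' and '>' anchor the pattern to the N- or C-terminus of the sequence.
    prefix = "^" if body.startswith("<") else ""
    suffix = "$" if body.endswith(">") else ""
    body = body.lstrip("<").rstrip(">")

    pieces = []
    for element in body.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except those listed
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(core)
    return prefix + "".join(pieces) + suffix


# Illustrative pattern in PROSITE syntax and a synthetic sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(regex, "AANGSAA")
print(regex, hit.start() if hit else None)
```

Patterns of this kind are fast to scan but strictly binary: a sequence either matches or it does not. This is one reason the profile-based and HMM-based methods listed above complement them.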
Signatures from the member databases are integrated manually as they are developed. A team of biologists has this responsibility, as well as that of annotating the new or existing entries. Each InterPro entry is described by one or more signatures, and corresponds to a biologically meaningful family, domain, repeat or site, e.g. post-translational modification (PTM). Not every entry will contain a signature from each member database; only those signatures that correspond to each other are united. Entries are assigned a type to describe what they represent, which may be family, domain, repeat, PTM, active site or binding site. The last two are new entry types, which were introduced to better describe the signatures in some of the entries. Entries may be related to each other through two different relationships: the parent/child and contains/found in relationships. Parent/child relationships are used to describe a common ancestry between entries, whereas the contains/found in relationship generally refers to the presence of genetically mobile domains. InterPro entries are annotated with a name, an abstract, mapping to Gene Ontology (GO) terms and links to specialized databases. InterPro groups all protein sequences matching related signatures into entries. All hits of the protein signatures in InterPro against a composite of the Swiss-Prot and TrEMBL components of UniProt (10) are precomputed. The matches are available for viewing in each InterPro entry in different formats.
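The entry organization described above can be sketched as a small data model. This is a hypothetical illustration, not the InterPro database schema: the class and field names, the placeholder accessions (`ENTRY_A`, `SIG_1` and so on) and the assumption that an entry records at most one parent are all introduced here for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set


class EntryType(Enum):
    """The six entry types named in the text."""
    FAMILY = "family"
    DOMAIN = "domain"
    REPEAT = "repeat"
    PTM = "PTM"
    ACTIVE_SITE = "active site"
    BINDING_SITE = "binding site"


@dataclass
class Entry:
    accession: str                       # placeholder identifier, not a real accession
    name: str
    entry_type: EntryType
    signatures: List[str] = field(default_factory=list)  # member database signatures
    parent: Optional[str] = None         # parent/child link (assumed single parent)
    contains: Set[str] = field(default_factory=set)      # contains/found in links


def children_of(entries: Dict[str, Entry], accession: str) -> List[str]:
    """Entries whose parent is the given entry (inverse of the parent link)."""
    return sorted(a for a, e in entries.items() if e.parent == accession)


def found_in(entries: Dict[str, Entry], accession: str) -> List[str]:
    """Entries that contain the given entry (inverse of the contains link)."""
    return sorted(a for a, e in entries.items() if accession in e.contains)


entries = {
    "ENTRY_A": Entry("ENTRY_A", "Example family", EntryType.FAMILY,
                     signatures=["SIG_1", "SIG_2"], contains={"ENTRY_C"}),
    "ENTRY_B": Entry("ENTRY_B", "Example subfamily", EntryType.FAMILY,
                     signatures=["SIG_3"], parent="ENTRY_A"),
    "ENTRY_C": Entry("ENTRY_C", "Example domain", EntryType.DOMAIN,
                     signatures=["SIG_4"]),
}

print(children_of(entries, "ENTRY_A"))  # subfamilies of the example family
print(found_in(entries, "ENTRY_C"))     # entries in which the example domain occurs
```

Storing each relationship in one direction only and deriving the inverse on demand keeps the two views (parent versus child, contains versus found in) from drifting out of step.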
The number of entries in InterPro and its coverage of protein space continue to grow. The beta release of InterPro in 1999 contained 2423 entries, while the latest release of the database contains 11 007 entries, a roughly 4.5-fold increase in 5 years. In its infancy, InterPro covered ~66% of all proteins in Swiss-Prot and TrEMBL, and this has increased to over 90% for Swiss-Prot, 76% for TrEMBL and 78% for UniProt (Swiss-Prot and TrEMBL). A number of new features have been added to the InterPro database since its publication in Nucleic Acids Research in 2003. These include additional protein match views, the InterPro Domain Architectures Viewer, taxonomic range information, additional database links and protein 3D structural information. The new member databases that have been integrated are the full-length sequence-based PIRSF database and the structure-based SUPERFAMILY. These are described in more detail below.