Here, we describe the use of the web browser-based version of InterProScan available from the EBI at http://www.ebi.ac.uk/InterProScan. This service is free to all academic and commercial organizations and offers interactive as well as email job submission. Direct email submissions should be sent to interproscan/at/ebi.ac.uk. Instructions and documentation can be obtained by sending an email containing the word ‘help’ in the message body to the same address. Users who require high-throughput use of the application, or who wish to carry out analyses using other databases, can download a standalone version from ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan. Users who require programmatic access to InterProScan can use a SOAP-based Web Service called WSInterProScan (5), which is described at http://www.ebi.ac.uk/Tools/webservices/WSInterProScan.html. All of these services use a centrally maintained core version of InterProScan (version 4.0).
The job input form
The first section of the input form asks for the user's email address and for how the results are to be displayed. The first decision a user makes on the InterProScan submission form is how the results should be returned, which is set in the RESULTS menu. Two options are available: ‘interactive’, which returns the results to the browser once the job is completed, and ‘email’, which sends the results to the address specified in the YOUR EMAIL text dialog.
The next section contains check boxes for the available methods, together with controls that select or clear all of them at once. Each method can be ticked on or off, according to the user's requirements. For example, users interested only in signal peptide cleavage sites or the transmembrane domains described in InterPro entries may choose the corresponding methods individually.
The third section of the submission form applies only when the sequence input is DNA. DNA sequences are translated to protein according to the translation rules specified in the TRANSLATION TABLE menu; the default is the standard code. Each translation generates peptide sequences in all six reading frames, and all of them are searched. The minimum length of an open reading frame produced by translation can be specified in the MIN. OPEN READING FRAME SIZE menu, so that only peptides above the selected length are searched by the methods chosen in the second section.
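The translation step described above can be illustrated with a minimal sketch. It is not the InterProScan implementation: it assumes the standard genetic code, treats an open reading frame as any stretch between stop codons, and follows the wording "above the selected value" by keeping only peptides strictly longer than the threshold. The function names and the handling of ambiguous codons (translated as ‘X’) are assumptions made for illustration.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_peptides(dna, min_orf_size):
    """Collect peptides longer than min_orf_size from all six reading frames.

    Assumption: an ORF is any stretch of residues between stop codons ('*').
    """
    dna = dna.upper()
    peptides = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            protein = translate(strand[offset:])
            for segment in protein.split("*"):
                if len(segment) > min_orf_size:
                    peptides.append(segment)
    return peptides
```

Calling `six_frame_peptides(sequence, 50)` would return only the peptides of more than 50 residues; each of these would then be passed to the selected methods.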
The fourth section of the input form consists of the sequence input panel. This panel includes a selection menu for the molecule type, which can be DNA or protein; the default is protein. Selecting DNA enables the TRANSLATION TABLE menu in the third section of the form. Help is available by clicking on the HELP image, which opens a new browser window containing comprehensive information about InterProScan. There is also an UPLOAD dialog that can be used instead of cutting and pasting sequences into the input window. Finally, there are the Submit and Reset buttons.
The sequence input text dialog accepts protein or DNA sequences in any of the standard sequence formats in current use, including EMBL, SWISS, GenBank, NBRF/PIR, CODATA, Fasta, GCG and RAW text. Primary or secondary identifiers (accession number or identifier) of a protein sequence in the databases can also be used. In this case, the user types a database name followed by a colon and the identifier, for example ‘UNIPROT:INSR_HUMAN’. No more than 10 protein sequences can be submitted at the same time, and each protein sequence must be at least five amino acids long. Only one nucleic acid sequence may be submitted at a time, and its length must be ≤5000 bases.
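The submission limits listed above can be checked before a job is sent. The following is a hedged sketch of such a check, not code from the service itself: the constant names, the `check_submission` and `parse_db_identifier` helpers and the exact pattern accepted for identifiers are all assumptions introduced for illustration.

```python
import re

# Limits as stated in the text.
MAX_PROTEIN_SEQUENCES = 10
MIN_PROTEIN_LENGTH = 5
MAX_NUCLEOTIDE_LENGTH = 5000

# Assumed shape of a "DATABASE:IDENTIFIER" entry such as UNIPROT:INSR_HUMAN.
DB_ID_PATTERN = re.compile(r"^(?P<db>[A-Za-z0-9_]+):(?P<id>[A-Za-z0-9_.]+)$")


def parse_db_identifier(text):
    """Return (database, identifier) if text looks like DB:ID, else None."""
    match = DB_ID_PATTERN.match(text.strip())
    return (match.group("db"), match.group("id")) if match else None


def check_submission(sequences, molecule_type="protein"):
    """Return a list of problems with a submission; an empty list means OK."""
    problems = []
    if molecule_type == "protein":
        if len(sequences) > MAX_PROTEIN_SEQUENCES:
            problems.append("more than 10 protein sequences")
        for index, sequence in enumerate(sequences, start=1):
            if len(sequence) < MIN_PROTEIN_LENGTH:
                problems.append(f"protein sequence {index} is shorter than 5 residues")
    elif molecule_type == "dna":
        if len(sequences) != 1:
            problems.append("exactly one nucleic acid sequence is allowed")
        for sequence in sequences:
            if len(sequence) > MAX_NUCLEOTIDE_LENGTH:
                problems.append("nucleic acid sequence is longer than 5000 bases")
    else:
        raise ValueError("molecule_type must be 'protein' or 'dna'")
    return problems
```

For example, `parse_db_identifier("UNIPROT:INSR_HUMAN")` yields the pair `("UNIPROT", "INSR_HUMAN")`, and `check_submission(["MKV"])` reports that the sequence is too short.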
InterProScan output
Before InterProScan launches each of the protein sequence analysis applications, it takes advantage of pre-computed results whenever possible. It calculates a checksum (CRC64) for the query sequence and compares it with the checksums of the protein sequences present in a database called IPRMATCHES, which lists all the entries from UniProt/Swiss-Prot and UniProt/TrEMBL that match one or more InterPro entries. If the checksum calculated for the query sequence matches a checksum in IPRMATCHES, the corresponding pre-computed entry is returned; otherwise, the protein sequence analysis applications are launched in parallel.
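The lookup logic can be sketched as follows. This is an illustration under stated assumptions rather than the InterProScan code: the checksum shown is a reflected CRC-64 with polynomial 0xD800000000000000, and it has not been verified here that it reproduces the checksums stored in IPRMATCHES. The `precomputed` mapping stands in for the IPRMATCHES database, and `run_applications` is a hypothetical callable that represents launching the analysis applications.

```python
# Assumed CRC-64 variant: reflected, polynomial 0xD800000000000000, initial value 0.
CRC64_POLY = 0xD800000000000000


def _build_table():
    """Precompute the 256-entry lookup table for the byte-wise CRC."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC64_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64(sequence):
    """Return the checksum of an ASCII sequence as 16 hexadecimal digits."""
    crc = 0
    for byte in sequence.encode("ascii"):
        crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
    return f"{crc:016X}"


def analyse(sequence, precomputed, run_applications):
    """Return pre-computed matches when the checksum is known, else run the search.

    precomputed: dict mapping checksum -> stored result (stand-in for IPRMATCHES).
    run_applications: hypothetical callable that performs the full analysis.
    """
    checksum = crc64(sequence)
    if checksum in precomputed:
        return precomputed[checksum]
    return run_applications(sequence)
```

Keying the store on the checksum rather than on the sequence itself keeps the lookup cheap, at the cost of relying on the checksum to distinguish sequences.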
Once a job is completed, the output of each of the applications is parsed individually to produce a merged results file in tab-delimited format. A converter is then called to generate, on the fly, an XML document, which is used to produce the HTML output. The HTML output offers two views. The picture (graphical) view displays a cartoon of the sequence with highlighted domains or functional sites corresponding to the matches in the InterPro databases. Each match contains hypertext links to the main InterPro web resource as well as to the websites of the individual member databases, where the matches are described further. The table view is available by clicking on the ‘table view’ button. It lists complete database names, hyperlinked match identifiers, the sequence coordinates (start–stop pairs) where each match occurs, E-values and the status of the match in InterPro (e.g. ‘T’ for true or ‘?’ for unknown). Parent–child relationships are displayed if they exist in an InterPro entry, and GO annotation is shown if available. Other options on the HTML results page include the raw output in tab-delimited format, the XML document and the sequences used as input (original sequences). The results for each job are stored at the EBI for at least 24 h.
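The conversion from tab-delimited results to XML can be pictured with a short sketch. The column order, the field names and the XML element names below are hypothetical and chosen only for illustration; they are not the actual InterProScan raw format or schema, and the example row is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical column layout for one match per line (not the real raw format).
FIELDS = ["database", "match_id", "start", "end", "evalue", "status"]


def tab_to_xml(tab_text):
    """Convert tab-delimited match lines into a small XML document (as a string)."""
    root = ET.Element("matches")
    for line in tab_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
        record = dict(zip(FIELDS, values))
        match = ET.SubElement(
            root,
            "match",
            database=record["database"],
            id=record["match_id"],
            status=record["status"],
        )
        ET.SubElement(
            match,
            "location",
            start=record["start"],
            end=record["end"],
            evalue=record["evalue"],
        )
    return ET.tostring(root, encoding="unicode")


# Made-up example row: database, identifier, start, end, E-value, status.
example = "DB_A\tID0001\t10\t250\t1e-50\tT\n"
xml_document = tab_to_xml(example)
```

An intermediate XML document of this kind can then be rendered into different HTML views without re-parsing the raw output of each application.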