Results 1-9 (9)
 

author:("marslen, John")
1.  InterProScan 5: genome-scale protein function classification 
Bioinformatics  2014;30(9):1236-1240.
Motivation: Robust large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterize many millions of sequences. Here, we describe a new Java-based architecture for the widely used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete reimplementation of the software framework, resulting in a flexible and stable system that is able to use multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBL-EBI FTP site and the open source code is hosted at Google Code.
Availability and implementation: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/.
Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk or mitchell@ebi.ac.uk
doi:10.1093/bioinformatics/btu031
PMCID: PMC3998142  PMID: 24451626
2.  EBI metagenomics—a new resource for the analysis and archiving of metagenomic data 
Nucleic Acids Research  2013;42(Database issue):D600-D606.
Metagenomics is a relatively recently established but rapidly expanding field that uses high-throughput next-generation sequencing technologies to characterize the microbial communities inhabiting different ecosystems (including oceans, lakes, soil, tundra, plants and body sites). Metagenomics brings with it a number of challenges, including the management, analysis, storage and sharing of data. In response to these challenges, we have developed a new metagenomics resource (http://www.ebi.ac.uk/metagenomics/) that allows users to easily submit raw nucleotide reads for functional and taxonomic analysis by a state-of-the-art pipeline, and have them automatically stored (together with descriptive, standards-compliant metadata) in the European Nucleotide Archive.
doi:10.1093/nar/gkt961
PMCID: PMC3965009  PMID: 24165880
4.  InterPro in 2011: new developments in the family and domain prediction database 
Nucleic Acids Research  2011;40(Database issue):D306-D312.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
doi:10.1093/nar/gkr948
PMCID: PMC3245097  PMID: 22096229
5.  InterPro: the integrative protein signature database 
Nucleic Acids Research  2008;37(Database issue):D211-D215.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
doi:10.1093/nar/gkn785
PMCID: PMC2686546  PMID: 18940856
6.  New developments in the InterPro database 
Nucleic Acids Research  2007;35(Database issue):D224-D228.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver, and for download by anonymous FTP. The InterProScan search tool is now also available via a web service.
doi:10.1093/nar/gkl841
PMCID: PMC1899100  PMID: 17202162
7.  An evaluation of GO annotation retrieval for BioCreAtIvE and GOA 
BMC Bioinformatics  2005;6(Suppl 1):S17.
Background
The Gene Ontology Annotation (GOA) database aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as the volume of both literature and proteins requiring characterization increases, the manual processing capability can become overloaded.
Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process.
To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge.
BioCreAtIvE task 2 was an experiment to test if automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase.
GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an assessment of our current electronic GO annotation strategies.
Results
The GOA database currently extracts GO annotation from the literature with 91 to 100% precision, and at least 72% recall. This creates a particularly high threshold for text mining systems, which in the initial results of BioCreAtIvE task 2 (GO annotation extraction and retrieval) precisely predicted GO terms only 10 to 20% of the time.
Conclusion
Improvements in the performance and accuracy of text mining for GO terms should be expected in the next BioCreAtIvE challenge. In the meantime the manual and electronic GO annotation strategies already employed by GOA will provide high quality annotations.
doi:10.1186/1471-2105-6-S1-S17
PMCID: PMC1869009  PMID: 15960829
8.  InterPro, progress and status in 2005 
Nucleic Acids Research  2004;33(Database issue):D201-D205.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created to integrate the major protein signature databases. Currently, it includes PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF and SUPERFAMILY. Signatures are manually integrated into InterPro entries that are curated to provide biological and functional information. Annotation is provided in an abstract, Gene Ontology mapping and links to specialized databases. New features of InterPro include extended protein match views, taxonomic range information and protein 3D structure data. One of the new match views is the InterPro Domain Architecture view, which shows the domain composition of protein matches. Two new entry types were introduced to better describe InterPro entries: these are active site and binding site. PIRSF and the structure-based SUPERFAMILY are the latest member databases to join InterPro, and CATH and PANTHER are soon to be integrated. InterPro release 8.0 contains 11 007 entries, representing 2573 domains, 8166 families, 201 repeats, 26 active sites, 21 binding sites and 20 post-translational modification sites. InterPro covers over 78% of all proteins in the Swiss-Prot and TrEMBL components of UniProt. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
doi:10.1093/nar/gki106
PMCID: PMC540060  PMID: 15608177
9.  The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology 
Nucleic Acids Research  2004;32(Database issue):D262-D266.
The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved by converting UniProt annotation into a recognized computational format. GOA provides annotated entries for nearly 60 000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. Furthermore, the GOA database fully endorses the Human Proteomics Initiative by prioritizing the annotation of proteins likely to benefit human health and disease. In addition to a non-redundant set of annotations to the human proteome (GOA-Human) and monthly releases of its GO annotation for all species (GOA-SPTr), a series of GO mapping files and specific cross-references in other databases are also regularly distributed. GOA can be queried through a simple user-friendly web interface or downloaded in a parsable format via the EBI and GO FTP websites. The GOA data set can be used to enhance the annotation of particular model organism or gene expression data sets, although increasingly it has been used to evaluate GO predictions generated from text mining or protein interaction experiments. In 2004, the GOA team will build on its success and will continue to supplement the functional annotation of UniProt and work towards enhancing the ability of scientists to access all available biological information. 
Researchers wishing to query or contribute to the GOA project are encouraged to email: goa@ebi.ac.uk.
doi:10.1093/nar/gkh021
PMCID: PMC308756  PMID: 14681408
