Results 1-12 (12)

1.  Analysis of the tryptic search space in UniProt databases 
Proteomics  2014;15(1):48-57.
In this article, we provide a comprehensive study of the content of the Universal Protein Resource (UniProt) protein data sets for human and mouse. The tryptic search spaces of the UniProtKB (UniProt knowledgebase) complete proteome sets were compared with other data sets from UniProtKB and with the corresponding International Protein Index, reference sequence, Ensembl, and UniRef100 (where UniRef is UniProt reference clusters) organism-specific data sets. All protein forms annotated in UniProtKB (both the canonical sequences and isoforms) were evaluated in this study. In addition, natural and disease-associated amino acid variants annotated in UniProtKB were included in the evaluation. The peptide unicity was also evaluated for each data set. Furthermore, the peptide information in the UniProtKB data sets was also compared against the available peptide-level identifications in the main MS-based proteomics repositories. Identifying the peptides observed in these repositories is an important resource of information for protein databases as they provide supporting evidence for the existence of otherwise predicted proteins. Likewise, the repositories could use the information available in UniProtKB to direct reprocessing efforts on specific sets of peptides/proteins of interest. In summary, we provide comprehensive information about the different organism-specific sequence data sets available from UniProt, together with the pros and cons for each, in terms of search space for MS-based bottom-up proteomics workflows. The aim of the analysis is to provide a clear view of the tryptic search space of UniProt and other protein databases to enable scientists to select those most appropriate for their purposes.
PMCID: PMC4298651  PMID: 25307260
Bioinformatics; Protein isoforms; Sequence redundancy; Trypsin digestion; Variation
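
The idea of a tryptic search space and of peptide unicity in entry 1 can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes the commonly used cleavage rule (cut after K or R unless the next residue is P), no missed cleavages, and a hypothetical minimum peptide length of 6. The function names and the toy sequences are illustrative only.

```python
import re
from collections import defaultdict


def tryptic_peptides(sequence, min_length=6):
    """In-silico tryptic digest of one protein sequence.

    Assumed rule: cleave C-terminal to K or R, except before P.
    No missed cleavages are modelled. Requires Python 3.7+,
    where re.split accepts zero-width matches.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # Drop empty fragments and peptides shorter than the (assumed) cutoff.
    return [p for p in pieces if len(p) >= min_length]


def unique_peptides(proteins, min_length=6):
    """Map each peptide found in exactly one protein to that protein.

    `proteins` is a dict of accession -> sequence (hypothetical input shape).
    """
    owners = defaultdict(set)
    for accession, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, min_length):
            owners[peptide].add(accession)
    return {pep: next(iter(accs)) for pep, accs in owners.items() if len(accs) == 1}


if __name__ == "__main__":
    # Toy sequences, not real UniProtKB entries.
    toy = {"P1": "AAAAAAKCCCCCCR", "P2": "AAAAAAKDDDDDDK"}
    for peptide, accession in sorted(unique_peptides(toy).items()):
        print(peptide, accession)
```

In this toy set the peptide shared by both sequences is excluded from the unique set, which is the kind of effect the article measures when isoforms and variants are added to a data set.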
2.  The GOA database: Gene Ontology annotation updates for 2015 
Nucleic Acids Research  2014;43(Database issue):D1057-D1063.
The Gene Ontology Annotation (GOA) resource provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480 000 taxonomic groups. The resource now provides annotations to five times the number of proteins it did 4 years ago. As a member of the GO Consortium, we adhere to the most up-to-date Consortium-agreed annotation guidelines via the use of quality control checks that ensure that the GOA resource supplies high-quality functional information to proteins from a wide range of species. Annotations from GOA are freely available and are accessible through a powerful web browser as well as a variety of annotation file formats.
PMCID: PMC4383930  PMID: 25378336
3.  Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt 
GigaScience  2014;3:4.
The Gene Ontology Consortium (GOC) is a major bioinformatics project that provides structured controlled vocabularies to classify gene product function and location. GOC members create annotations to gene products using the Gene Ontology (GO) vocabularies, thus providing an extensive, publicly available resource. The GO and its annotations to gene products are now an integral part of functional analysis, and statistical tests using GO data are becoming routine for researchers to include when publishing functional information. While many helpful articles about the GOC are available, there are certain updates to the ontology and annotation sets that sometimes go unobserved. Here we describe some of the ways in which GO can change that should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets. GO annotations for gene products change for many reasons, and while these changes generally improve the accuracy of the representation of the underlying biology, they do not necessarily imply that previous annotations were incorrect. We additionally describe the quality assurance mechanisms we employ to improve the accuracy of annotations, which necessarily changes the composition of the annotation sets we provide. We use the Universal Protein Resource (UniProt) for illustrative purposes of how the GO Consortium, as a whole, manages these changes.
PMCID: PMC3995153  PMID: 24641996
Gene Ontology; Annotation; Function prediction; Misinterpretation
4.  Molecular Basis for DNA Double-Strand Break Annealing and Primer Extension by an NHEJ DNA Polymerase 
Cell Reports  2013;5(4):1108-1120.
Nonhomologous end-joining (NHEJ) is one of the major DNA double-strand break (DSB) repair pathways. The mechanisms by which breaks are competently brought together and extended during NHEJ are poorly understood. As polymerases extend DNA in a 5′-3′ direction by nucleotide addition to a primer, it is unclear how NHEJ polymerases fill in break termini containing 3′ overhangs that lack a primer strand. Here, we describe, at the molecular level, how prokaryotic NHEJ polymerases configure a primer-template substrate by annealing the 3′ overhanging strands from opposing breaks, forming a gapped intermediate that can be extended in trans. We identify structural elements that facilitate docking of the 3′ ends in the active sites of adjacent polymerases and reveal how the termini act as primers for extension of the annealed break, thus explaining how such DSBs are extended in trans. This study clarifies how polymerases couple break-synapsis to catalysis, providing a molecular mechanism to explain how primer extension is achieved on DNA breaks.
• Structure of a NHEJ polymerase bound to an annealed DNA double-strand break
• Break synapsis is stabilized by microhomology and polymerase surface loops
• 3′ hydroxyl of the primer strand is positioned into active-site pocket in trans
• Templating base selection relies on loop 1 and conserved phenylalanine residues
In this article, Brissett and colleagues have elucidated the molecular basis for the extension of the termini of a synapsed DNA double-strand break by a mycobacterial NHEJ repair polymerase. This structure captures the moments after the break has been annealed back together, revealing how the broken ends are positioned into the active site of adjacent polymerases in readiness for filling in of the remaining gaps to, ultimately, facilitate rejoining of the break by DNA ligase.
PMCID: PMC3898472  PMID: 24239356
5.  The COMBREX Project: Design, Methodology, and Initial Results 
Anton, Brian P. | Chang, Yi-Chien | Brown, Peter | Choi, Han-Pil | Faller, Lina L. | Guleria, Jyotsna | Hu, Zhenjun | Klitgord, Niels | Levy-Moonshine, Ami | Maksad, Almaz | Mazumdar, Varun | McGettrick, Mark | Osmani, Lais | Pokrzywa, Revonda | Rachlin, John | Swaminathan, Rajeswari | Allen, Benjamin | Housman, Genevieve | Monahan, Caitlin | Rochussen, Krista | Tao, Kevin | Bhagwat, Ashok S. | Brenner, Steven E. | Columbus, Linda | de Crécy-Lagard, Valérie | Ferguson, Donald | Fomenkov, Alexey | Gadda, Giovanni | Morgan, Richard D. | Osterman, Andrei L. | Rodionov, Dmitry A. | Rodionova, Irina A. | Rudd, Kenneth E. | Söll, Dieter | Spain, James | Xu, Shuang-yong | Bateman, Alex | Blumenthal, Robert M. | Bollinger, J. Martin | Chang, Woo-Suk | Ferrer, Manuel | Friedberg, Iddo | Galperin, Michael Y. | Gobeill, Julien | Haft, Daniel | Hunt, John | Karp, Peter | Klimke, William | Krebs, Carsten | Macelis, Dana | Madupu, Ramana | Martin, Maria J. | Miller, Jeffrey H. | O'Donovan, Claire | Palsson, Bernhard | Ruch, Patrick | Setterdahl, Aaron | Sutton, Granger | Tate, John | Yakunin, Alexander | Tchigvintsev, Dmitri | Plata, Germán | Hu, Jie | Greiner, Russell | Horn, David | Sjölander, Kimmen | Salzberg, Steven L. | Vitkup, Dennis | Letovsky, Stanley | Segrè, Daniel | DeLisi, Charles | Roberts, Richard J. | Steffen, Martin | Kasif, Simon
PLoS Biology  2013;11(8):e1001638.
Experimental data exists for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
PMCID: PMC3754883  PMID: 24013487
6.  The Impact of Focused Gene Ontology Curation of Specific Mammalian Systems 
PLoS ONE  2011;6(12):e27541.
The Gene Ontology (GO) resource provides dynamic controlled vocabularies to provide an information-rich resource to aid in the consistent description of the functional attributes and subcellular locations of gene products from all taxonomic groups. System-focused projects, such as the Renal and Cardiovascular GO Annotation Initiatives, aim to provide detailed GO data for proteins implicated in specific organ development and function. Such projects support the rapid evaluation of new experimental data and aid in the generation of novel biological insights to help alleviate human disease. This paper describes the improvement of GO data for renal and cardiovascular research communities and demonstrates that the cardiovascular-focused GO annotations, created over the past three years, have led to an evident improvement of microarray interpretation. The reanalysis of cardiovascular microarray datasets confirms the need to continue to improve the annotation of the human proteome.
GO annotation data is freely available from:
PMCID: PMC3235096  PMID: 22174742
7.  The UniProt-GO Annotation database in 2011 
Nucleic Acids Research  2011;40(Database issue):D565-D570.
The GO annotation dataset provided by the UniProt Consortium (GOA) is a comprehensive set of evidence-based associations between terms from the Gene Ontology resource and UniProtKB proteins. Currently supplying over 100 million annotations to 11 million proteins in more than 360 000 taxa, this resource has increased 2-fold over the last 2 years and has benefited from a wealth of checks to improve annotation correctness and consistency as well as now supplying a greater information content enabled by GO Consortium annotation format developments. Detailed, manual GO annotations obtained from the curation of peer-reviewed papers are directly contributed by all UniProt curators and supplemented with manual and electronic annotations from 36 model organism and domain-focused scientific resources. The inclusion of high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterized, non-model organism species. UniProt GO annotations are freely available in a range of formats accessible by both file downloads and web-based views. In addition, the introduction of a new, normalized file format in 2010 has made for easier handling of the complete UniProt-GOA data set.
PMCID: PMC3245010  PMID: 22123736
8.  Infrastructure for the life sciences: design and implementation of the UniProt website 
BMC Bioinformatics  2009;10:136.
The UniProt consortium was formed in 2002 by groups from the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) at Georgetown University, and soon afterwards the website was set up as a central entry point to UniProt resources. Requests to this address were redirected to one of the three organisations' websites. While these sites shared a set of static pages with general information about UniProt, their pages for searching and viewing data were different. To provide users with a consistent view and to cut the cost of maintaining three separate sites, the consortium decided to develop a common website for UniProt. Following several years of intense development and a year of public beta testing, the domain was switched to the newly developed site described in this paper in July 2008.
The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The website is the primary access point to this data and to documentation and basic tools for the data. These tools include full text and field-based text search, similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as to machines for programmatic access.
The website is open for both academic and commercial use. The site was built with open source tools and libraries. Feedback is very welcome.
The new UniProt website makes accessing and understanding UniProt easier than ever. The two main lessons learned are that getting the basics right for such a data provider website has huge benefits, but is not trivial and easy to underestimate, and that there is no substitute for using empirical data throughout the development process to decide on what is and what is not working for your users.
PMCID: PMC2686714  PMID: 19426475
9.  Molecular Genetics Reveal That Silvatic Rhodnius prolixus Do Colonise Rural Houses 
Rhodnius prolixus is the main vector of Chagas disease in Venezuela. Here, domestic infestations of poor quality rural housing have persisted despite four decades of vector control. This is in contrast to the Southern Cone region of South America, where the main vector, Triatoma infestans, has been eliminated over large areas. The repeated colonisation of houses by silvatic populations of R. prolixus potentially explains the control difficulties. However, controversy surrounds the existence of silvatic R. prolixus: it has been suggested that all silvatic populations are in fact Rhodnius robustus, a related species of minor epidemiological importance. Here we investigate, by direct sequencing (mtcytb, D2) and by microsatellite analysis, 1) the identity of silvatic Rhodnius and 2) whether silvatic populations of Rhodnius are isolated from domestic populations.
Methods and Findings
Direct sequencing confirmed the presence of R. prolixus in palms and that silvatic bugs can colonise houses, with house and palm specimens sharing seven cytb haplotypes. Additionally, mitochondrial introgression was detected between R. robustus and R. prolixus, indicating a previous hybridisation event. The use of ten polymorphic microsatellite loci revealed a lack of genetic structure between silvatic and domestic ecotopes (non-significant FST values), which is indicative of unrestricted gene flow.
Our analyses demonstrate that silvatic R. prolixus presents an unquestionable threat to the control of Chagas disease in Venezuela. The design of improved control strategies is essential for successful long term control and could include modified spraying and surveillance practices, together with housing improvements.
Author Summary
Chagas disease is spread by blood-feeding insects (triatomine bugs) that colonise poor-quality houses. Disease control relies primarily on killing domestic bugs by spraying dwellings with residual insecticide. In Venezuela, sustained control has proved difficult despite four decades of campaigns. Considered the main vector in Venezuela, the bug Rhodnius prolixus may also infest palm trees and might repeatedly recolonise houses from palms. A complication is that a morphologically similar species, R. robustus, also infests palms but is of minor medical importance. Therefore, confusion exists as to the true identity of palm bugs and their importance in disease transmission.
We applied two molecular methods (sequencing DNA of the cytochrome b gene, and analysing microsatellites) to triatomines collected in Venezuela so that we could identify unequivocally the species of palm-dwelling Rhodnius and establish their role in maintaining house infestations. We demonstrated that R. prolixus is indeed present in palms, and that such silvatic populations can colonise houses and are a threat to the successful control of Chagas disease in Venezuela. This finding resolves a longstanding controversy of fundamental epidemiological importance. It is also an example of the application of molecular epidemiology to correct vector identification and successful disease control.
PMCID: PMC2270345  PMID: 18382605
10.  The Universal Protein Resource (UniProt): an expanding universe of protein information 
Nucleic Acids Research  2005;34(Database issue):D187-D191.
The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB), comprising the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section, is the preeminent storehouse of protein annotation. The extensive cross-references, functional and feature annotations and literature-based evidence attribution enable scientists to analyse proteins and query across databases. The UniProt Reference Clusters (UniRef) speed similarity searches via sequence space compression by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical. Finally, the UniProt Archive (UniParc) stores all publicly available protein sequences, containing the history of sequence data with links to the source databases. UniProt databases continue to grow in size and in availability of information. Recent and upcoming changes to database contents, formats, controlled vocabularies and services are described. New download availability includes all major releases of UniProtKB, sequence collections by taxonomic division and complete proteomes. A bibliography mapping service has been added, and an ID mapping service will be available soon. UniProt databases can be accessed online or downloaded.
PMCID: PMC1347523  PMID: 16381842
11.  UniProt: the Universal Protein knowledgebase 
Nucleic Acids Research  2004;32(Database issue):D115-D119.
To provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information, the Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt) consortium. Our mission is to provide a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces. The central database will have two sections, corresponding to the familiar Swiss-Prot (fully manually curated entries) and TrEMBL (enriched with automated classification, annotation and extensive cross-references). For convenient sequence searches, UniProt also provides several non-redundant sequence databases. The UniProt NREF (UniRef) databases provide representative subsets of the knowledgebase suitable for efficient searching. The comprehensive UniProt Archive (UniParc) is updated daily from many public source databases. The UniProt databases can be accessed online or downloaded in several formats. The scientific community is encouraged to submit data for inclusion in UniProt.
PMCID: PMC308865  PMID: 14681372
12.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 
Nucleic Acids Research  2003;31(1):365-370.
The SWISS-PROT protein knowledgebase connects amino acid sequences with the current knowledge in the Life Sciences. Each protein entry provides an interdisciplinary overview of relevant information by bringing together experimental results, computed features and sometimes even contradictory conclusions. Detailed expertise that goes beyond the scope of SWISS-PROT is made available via direct links to specialised databases. SWISS-PROT provides annotated entries for all species, but concentrates on the annotation of entries from human (the HPI project) and other model organisms to ensure the presence of high quality annotation for representative members of all protein families. Part of the annotation can be transferred to other family members, as is already done for microbes by the High-quality Automated and Manual Annotation of microbial Proteomes (HAMAP) project. Protein families and groups of proteins are regularly reviewed to keep up with current scientific findings. Complementarily, TrEMBL strives to comprise all protein sequences that are not yet represented in SWISS-PROT, by incorporating a perpetually increasing level of mostly automated annotation. Researchers are welcome to contribute their knowledge to the scientific community by submitting relevant findings to SWISS-PROT.
PMCID: PMC165542  PMID: 12520024
