The increasing number of completely sequenced microbial genomes represents an unparalleled opportunity to achieve a better understanding of prokaryotes, including their metabolic pathways, virulence factors, phylogeny, etc. However, the sequences themselves are not enough. It is of fundamental importance that these genomes be annotated with high quality and that the nomenclature be standardized.
Since the publication in 1995 of the complete Haemophilus influenzae genome (1), more than 700 bacterial and archaeal genomes have been entirely sequenced. New sequencing techniques, such as the parallel pyrosequencing of 454 Life Sciences (2) and the Solexa/Illumina Genome Analyzer sequencing-by-synthesis technology (3), have greatly increased the amount of sequence data generated, and they complement the classic Sanger DNA sequencing method (4). Public databases currently hold more than 100 Gb of sequence, and this amount will continue to increase exponentially, as sequencing centres will soon each have an annual throughput of several gigabases.
Most of the proteins coming from these sequencing projects will probably never be characterized, and the annotation at the DNA level is succinct. Sequencing centres have developed automated pipelines that combine methods such as sequence similarity, presence of domains and pathway prediction, among other commonly used sequence analysis methods (5), to annotate the proteome of a given microorganism. Although the prediction of coding sequences (CDSs) is usually very good, the quality of the functional annotation attached to them is highly variable.
Many methods have been developed to improve genome functional annotation, including the use of genomic context information (6), mapping of pathways in orthologous groups (7), or defining protein function based on protein–protein interactions (8). Genome annotation by the scientific community using Wiki software has lately been the focus of several initiatives (9–11), but one of the major hurdles is the establishment of common standards for the annotation provided by each expert. Since sequencing centres and users in general rely on large protein databases, and especially on UniProtKB/Swiss-Prot (12), to annotate new genomes and identify new proteins, we consider it to be an important mission of UniProtKB to provide as many annotated proteins as possible, with the highest possible quality.
To address this need, we have implemented HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes), a semi-automated pipeline system within UniProtKB/Swiss-Prot. It is dedicated to high-throughput, high-quality annotation of proteins from complete microbial proteomes, and it also provides complete proteome sets that are consistent and non-redundant. Its aim is to maximize the complementarity between manual and automated annotation. The HAMAP system is composed of two databases and an automatic annotation pipeline. It targets proteins from bacteria, archaea and plastids, the latter being included because of their bacterial origin.
On the HAMAP website (http://www.expasy.org/sprot/hamap), two databases are available. The first provides curated information on all bacterial, archaeal and plastid proteomes; only fully sequenced and assembled genomes that have been submitted to the public databases and whose CDSs have been annotated are taken into account. The second is a family database that contains all manually created protein families and annotation templates (also called ‘family rules’). There is also a tool that provides complete protein annotation (protein recommended name, gene name, function, subunit, membership of a protein family, sequence features, etc., as specified in the family annotation template) for user-submitted data. The input can be either a single protein sequence, if it belongs to one of the HAMAP families, or a complete genome, even before its submission to the public DNA databases. Since the system provides not only annotation, but also warnings regarding atypical N-termini, lack of conserved residues and many other features, we believe that this tool can help the scientific community in the annotation of whole microbial genomes or of any protein from bacteria, archaea and plastids.
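As a rough illustration of how an annotation template can yield both annotation and warnings, the following minimal Python sketch applies a toy rule to a toy sequence. It is not the HAMAP implementation: the `FamilyRule` class, the `annotate` function, the field names, the warning messages, and the example family, gene name and sequence are all hypothetical and were invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FamilyRule:
    """Hypothetical, simplified stand-in for a family annotation template."""
    family: str
    recommended_name: str
    gene_name: str
    expected_start: str = "M"  # assumed typical N-terminal residue
    conserved: dict = field(default_factory=dict)  # 1-based position -> residue


def annotate(sequence: str, rule: FamilyRule) -> dict:
    """Return the rule's annotation plus warnings for atypical features."""
    warnings = []
    # Warn when the sequence does not begin with the expected residue.
    if not sequence.startswith(rule.expected_start):
        warnings.append("atypical N-terminus")
    # Warn for each conserved residue that is absent or out of range.
    for pos, residue in sorted(rule.conserved.items()):
        if pos > len(sequence) or sequence[pos - 1] != residue:
            warnings.append(f"conserved residue {residue}{pos} not found")
    return {
        "recommended_name": rule.recommended_name,
        "gene_name": rule.gene_name,
        "family": rule.family,
        "warnings": warnings,
    }


# Toy rule and sequence, invented for illustration only.
rule = FamilyRule(
    family="Example family",
    recommended_name="Example protein",
    gene_name="exaA",
    conserved={4: "C", 7: "H"},
)
result = annotate("VKTCLLAGG", rule)
for warning in result["warnings"]:
    print(warning)
```

In this sketch the annotation is copied from the template unchanged, while the checks only add warnings; a real system would first have to decide whether the sequence belongs to the family at all, which is not modelled here.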