In spite of significant advances, gene annotation of newly sequenced genomes remains a challenging task. While manual curation is still essential to produce high-quality gene and transcript annotations (Guigó et al., 2006), automatic genome annotation pipelines produce increasingly accurate gene sets (Harrow et al., 2009), in particular for well-characterized protein-coding families and when other well-annotated, evolutionarily close genomes exist. Due to their peculiar recoding of the standard genetic code, selenoproteins constitute the most notable exception; even in well-annotated genomes, they are often mispredicted. Indeed, as we have shown through the analysis described here, most eukaryotic selenoproteins are misannotated in the available reference gene sets. Since misannotation invariably involves the deletion of the region of the protein sequence that includes the Sec-UGA, which is key to proper family assignment, misprediction of selenoproteins has the additional negative effect, beyond simple protein truncation, of impairing proper functional characterization.
Proper annotation of selenoprotein genes, even those belonging to well-characterized protein families, requires substantial human intervention. Indeed, due to the degeneracy of the sequence of the SECIS element, and to the complex evolutionary history of selenoprotein genes, with frequent gene duplications and family expansions, pseudogenizations, and the not yet completely understood evolutionary dynamics of Cys to Sec interconversion (Castellano et al., 2009), detection of sequence homology is, in general, not sufficient for correct selenoprotein identification. In fact, the correct annotation of the two dozen (at most) selenoprotein genes corresponding to known selenoprotein families that may be encoded in a newly sequenced eukaryotic genome takes, in our experience, 2–3 weeks of full-time work by an experienced scientist. He/she has to browse through a maze of multiple sequence alignments and SECIS predictions, often making
ad hoc decisions, which generally involve running additional, more sophisticated alignment programs and post-processing their output. In selenoprofiles, we have attempted to encapsulate the experience that we have accumulated over the years in the manual identification of selenoproteins. Selenoprofiles includes standard sequence similarity search and sequence alignment programs together with custom-made post-processing scripts and a number of rules that direct the overall flow of the process. The core of selenoprofiles is a set of very high-quality multiple sequence alignments for the different selenoprotein families and subfamilies. Given that we know a priori which positions in a profile alignment are allowed to bear a selenocysteine, selenoprofiles favors the alignment to UGA codons only if these are aligned to one such position. Therefore, an important feature of each profile alignment is the position or positions that contain Sec, and one of the major determinants of the efficiency of the selenoprofiles pipeline is the set of species and subfamilies represented in the profile. Selenoprofiles automatically selects the best sequence to be used as query from the profile. Consequently, if the profile contains at least one sequence that is very similar to the protein encoded by the gene being predicted, the prediction will be accurate. But if the most similar sequence in the profile differs from the real protein encoded in the investigated genome in the presence or absence of some domains, or if there is poor conservation between the two sequences in some regions (often at one or both ends), then the prediction may be inaccurate. Input profile alignments for selenoprofiles should, therefore, be as consistent, complete and representative as possible. In this regard, as new genomes are analyzed, we keep updating the selenoprofiles profiles, and we are working on a procedure to substantially automate this updating.
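The position rule described above (a UGA is read as Sec only when it aligns to a profile column known to bear Sec) can be illustrated with a minimal Python sketch. This is not the selenoprofiles implementation: the function names, the use of "U" for Sec and "-" for gaps in the profile rows, the per-column codon list and the toy data are all assumptions made for illustration.

```python
def sec_columns(profile_rows):
    """Return the alignment columns where at least one profile sequence has Sec ('U')."""
    cols = set()
    for row in profile_rows:
        for i, aa in enumerate(row):
            if aa == "U":
                cols.add(i)
    return cols


def classify_uga(profile_rows, target_codons):
    """Label each in-frame UGA/TGA of an aligned target as 'Sec' or 'stop'.

    target_codons holds one entry per alignment column: a codon string,
    or None where the target has a gap (hypothetical input layout).
    """
    allowed = sec_columns(profile_rows)
    labels = {}
    for col, codon in enumerate(target_codons):
        if codon is None:
            continue
        # Accept both RNA (UGA) and DNA (TGA) spellings of the codon.
        if codon.upper().replace("U", "T") == "TGA":
            labels[col] = "Sec" if col in allowed else "stop"
    return labels


# Toy profile: column 2 bears Sec in two of the three sequences.
profile = ["MCUGA", "MCUG-", "MCCGA"]
# Toy target with two in-frame TGA codons, at columns 2 and 4.
target = ["ATG", "TGT", "TGA", "GGT", "TGA"]
labels = classify_uga(profile, target)
```

In this toy case the TGA at column 2 is labeled as Sec and the one at column 4 as a stop. An actual pipeline would apply such a rule as a scoring preference during alignment rather than as a hard post hoc filter.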
While selenoprofiles does not completely eliminate the need for manual intervention, it dramatically reduces it. We estimate that, after running selenoprofiles on a (newly sequenced) genome, an experienced scientist will need, in general, only a few hours to produce a high-quality annotation of the selenoprotein genes corresponding to known families in the genome. Moreover, given its low false positive rate, even the default output of selenoprofiles will generally be a much better annotation of selenoprotein genes than that produced by automatic annotation pipelines, including the most sophisticated ones. In this regard, we believe that selenoprofiles would be a useful complement to such pipelines, and we are working on a method to automatically correct misannotated selenoproteins by taking the selenoprofiles output into account. Using this output directly may not be an option, since sophisticated annotation pipelines rely on transcript information (such as ESTs and cDNA sequences), as well as on genomic sequence conservation across species, and the overall gene structure delineated using this information is likely to be superior to the one delineated by selenoprofiles, with the exception of the region including the Sec-UGA. Therefore, a better strategy will be to reconcile the selenoprofiles prediction with the annotated gene, giving precedence to the selenoprofiles prediction in the region (exon) containing the Sec-UGA, and to the annotated prediction in the rest of the gene/transcript.
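The reconciliation strategy outlined above can be sketched as follows. This is only an illustration under stated assumptions, not the method under development: the function names, the representation of exons as half-open (start, end) genomic intervals on the same strand, and the example coordinates are all hypothetical, and the sketch ignores reading frame and splice-site consistency.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def reconcile(annotated_exons, profile_exons, sec_uga_pos):
    """Merge two exon lists predicted for the same gene and strand.

    The profile-based exon containing sec_uga_pos replaces every annotated
    exon it overlaps; all other annotated exons are kept unchanged.
    """
    sec_exon = next(
        (e for e in profile_exons if e[0] <= sec_uga_pos < e[1]), None
    )
    if sec_exon is None:
        # No profile-based exon covers the Sec-UGA: keep the annotation as is.
        return sorted(annotated_exons)
    kept = [e for e in annotated_exons if not overlaps(e, sec_exon)]
    return sorted(kept + [sec_exon])


# Hypothetical case: the annotated middle exon ends at the UGA (position 350),
# whereas the profile-based prediction reads through it.
annotated = [(100, 200), (300, 350), (500, 600)]
profile_based = [(120, 200), (300, 420), (500, 590)]
merged = reconcile(annotated, profile_based, sec_uga_pos=350)
```

Here only the exon containing the Sec-UGA is taken from the profile-based prediction, while the first and last exons keep their annotated boundaries. A practical method would also need to check that the merged structure preserves the reading frame.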
One limitation of selenoprofiles is that it predicts, with a few exceptions, only one transcript per gene. Nonetheless, if alternative splicing forms (Sec/non-Sec) exist for a gene, the pipeline is likely to pick the Sec-containing transcript, or one of them, due to the scoring scheme used. If selenoprofiles is run on transcribed sequences (such as ESTs, cDNAs, or RNA sequences) instead of genomic sequences, it could potentially produce predictions for multiple splicing isoforms of selenoprotein genes. While we have developed and tested selenoprofiles to annotate eukaryotic selenoproteomes, the strategy that we have employed can be easily ported to prokaryotic genomes as well. This requires building and curating the corresponding profiles, using the bacterial and archaeal SECIS patterns, and modifying some of the selenoprofiles rules.