Yeast species have many features that make them an attractive model system for eukaryotic comparative genomics. These features include a high level of synteny conservation and small genome sizes (9-21 Mb) due to a low content of repetitive DNA and few introns [1, 2]. We previously developed an online tool - the Yeast Gene Order Browser (YGOB) - for comparing local gene order relationships among species in genera such as Saccharomyces, Kluyveromyces and Lachancea [3]. YGOB now contains genomic data from 11 species (Figure ). Among these species, some form a clade of descendants from an ancestral whole-genome duplication (WGD) that changed the basal chromosome number from 8 to 16 [4], whereas others diverged before the WGD occurred. We are unsure what depth of evolutionary time is represented by the species in YGOB, but when measured in terms of average protein sequence divergence this group of yeasts is approximately as diverse as the whole phylum Chordata [5].
YGOB contains 'pillars' of homology assignments across the 11 species. Each pillar can contain up to one gene from each non-WGD species and up to two genes from each post-WGD species [3]. The genes in a pillar are therefore orthologs or (in the case of a post-WGD species retaining two genes) paralogs resulting from the WGD. The pillars have undergone several years of manual editing to make them as accurate as possible. YGOB also contains an 'Ancestral Genome', which is the inferred gene content and gene order of the extinct ancestor that existed immediately prior to WGD [6].
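The copy-number rule for pillars can be illustrated with a minimal Python sketch. This is not YGOB's actual data model: the `Pillar` class, its methods, and the species labels are hypothetical names introduced here only to show the constraint of at most one gene per non-WGD species and at most two per post-WGD species.

```python
from dataclasses import dataclass, field

# Placeholder species labels (assumptions for illustration only; they are
# not the names of the 11 species in YGOB).
POST_WGD = {"post_wgd_sp1", "post_wgd_sp2"}
NON_WGD = {"non_wgd_sp1", "non_wgd_sp2"}


@dataclass
class Pillar:
    """One homology column: species label -> list of gene identifiers."""
    genes: dict = field(default_factory=dict)

    def add(self, species: str, gene_id: str) -> None:
        """Add a gene, enforcing the per-species copy limit."""
        if species in POST_WGD:
            limit = 2  # both WGD paralogs may be retained
        elif species in NON_WGD:
            limit = 1  # a single ortholog at most
        else:
            raise KeyError(f"unknown species: {species}")
        current = self.genes.setdefault(species, [])
        if len(current) >= limit:
            raise ValueError(f"{species} already has {limit} gene(s) in this pillar")
        current.append(gene_id)

    def copy_number(self, species: str) -> int:
        """Number of genes this species contributes to the pillar."""
        return len(self.genes.get(species, []))
```

Under this representation, a species with copy number zero in a pillar corresponds to an empty cell in that column.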
The gene annotations in YGOB are derived from the original authors' annotations of the genome sequence of each species. In some cases we have 'switched off' genes in the original annotation that we believe to be spurious, but until now we have not added any genes to the original annotation sets (or to the current Saccharomyces Genome Database [7] annotation for S. cerevisiae). However, while using YGOB we noticed many instances in which a particular gene appears to be missing in a particular species, in a genomic region that otherwise shows conserved synteny among all the species. Such loci appear as gaps in the YGOB display. For the post-WGD species it is quite common for one of the two paralogs formed by WGD to have been deleted, but it is more surprising to find genes that are completely absent (zero copies) in either a non-WGD or a post-WGD species.
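As a rough illustration of what a 'gap' means here, the sketch below scans an ordered list of pillars for positions where a species has zero copies although the flanking pillars are occupied. It is a simplification under stated assumptions: pillars are represented as plain dictionaries (species label to list of gene names), they are assumed to be listed in a shared gene order, and the function name `zero_copy_gaps` and its `flank` parameter are hypothetical. It does not check that the flanking genes are actually adjacent in the species' own genome.

```python
def zero_copy_gaps(pillars, species, flank=1):
    """Return indices of pillars in which `species` has no gene even though
    the `flank` pillars on each side all contain a gene from that species."""
    gaps = []
    for i in range(flank, len(pillars) - flank):
        if pillars[i].get(species):
            continue  # species is present here, so this is not a gap
        neighbours = pillars[i - flank:i] + pillars[i + 1:i + 1 + flank]
        if all(p.get(species) for p in neighbours):
            gaps.append(i)
    return gaps


# Toy data with made-up gene and species labels.
example = [
    {"sp": ["g1"], "other": ["h1"]},
    {"other": ["h2"]},                 # "sp" has zero copies here
    {"sp": ["g3"], "other": ["h3"]},
    {"sp": ["g4"]},
]
candidate_gaps = zero_copy_gaps(example, "sp")
```

Each index returned marks a locus that would be worth re-examining at the DNA level.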
Upon further examination we found that many of these apparently zero-copy loci are artifacts. When we examine the relevant DNA region, we find bona fide genes that were not annotated or were mistakenly labeled as pseudogenes, even in the case of highly curated genomes. This is particularly a problem with short genes of less than 100 codons, highly diverged genes, and genes containing introns. In some cases, all genes <100 codons were excluded entirely from the original curators' annotations because of the difficulty of distinguishing them from spurious ORFs [8, 9]. However, current estimates according to the Saccharomyces Genome Database (SGD) [10] are that the S. cerevisiae nuclear genome contains 131 verified ORFs of <100 codons, of which 28 contain introns. Detecting these 'missing genes' is important for many reasons, but our particular interest in this topic is that it would allow the correct identification of genuine lineage-specific gene gains and losses that may have evolutionary significance.
The primary reason why short genes are difficult to annotate is that they do not generate sufficiently strong hits (that is, sufficiently low E-values) in BLAST searches [11]. For instance the amino acid sequence of ribosomal protein L41 is nearly identical among all the species in YGOB, but because this protein is only 25 residues long the BLASTP E-value between any two Rpl41 sequences is only of the order of e-07 to e-06. Many annotation pipelines would regard such a hit as insignificant, because E-values of this magnitude are often obtained purely by chance when longer query sequences are used. More generally, any gene whose predicted protein product cannot generate a sufficiently strong BLAST score against its orthologs will tend to remain unannotated. Weak BLAST scores can be caused by very rapid sequence divergence [12, 13], or a high content of repetitive sequence that is masked by sequence-filter software [14], as well as by short sequence length.
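The effect of sequence length on the best attainable E-value can be sketched numerically. The snippet below uses the standard approximation E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score, and ignores the edge-effect corrections that BLAST applies. The function names and the figure of 2 bits per identical residue are assumptions chosen only for illustration, not values taken from this study.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value, E = m * n * 2**(-S'), without edge-effect correction."""
    return query_len * db_len * 2.0 ** (-bit_score)


def best_case_expect(query_len: int, db_len: int, bits_per_residue: float = 2.0) -> float:
    """E-value of a full-length perfect match, assuming (hypothetically) that
    every aligned residue contributes `bits_per_residue` bits to the score."""
    return expect_value(bits_per_residue * query_len, query_len, db_len)


db_len = 10**7                                # hypothetical database size in residues
short_floor = best_case_expect(25, db_len)    # a 25-residue protein
long_floor = best_case_expect(300, db_len)    # a 300-residue protein
```

Because the attainable bit score grows with alignment length while the search space grows only linearly, a perfect match of a very short protein cannot reach the E-values that a longer protein achieves even with substantial divergence.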
In this manuscript we describe SearchDOGS, software that works in conjunction with BLAST [11] to identify unannotated genes. It is particularly designed to find genes that generate only weak BLAST hits, but whose syntenic context indicates that they are genuine orthologs of known genes. The major feature of SearchDOGS is that the genomes in the nucleotide sequence database used in the BLAST search have been pre-processed to subdivide them into sets of genomic regions that are already known to be orthologous. DOGS is an acronym for Database of Orthologous Genomic Segments. The BLAST results can then be post-processed to identify cases, even those with very high (weak) E-values, in which (i) the query protein hits genomic regions from multiple species in the database, and these regions are orthologous; or (ii) the syntenic context of the query protein is known, and it matches that of one or more of the database entries it hits.
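The two post-processing criteria can be expressed as a small filter over BLAST hits. The following is a minimal sketch rather than the SearchDOGS implementation: the `Hit` record, the `segment_id` field (standing for the set of orthologous genomic segments a hit falls in) and the function `syntenic_support` are hypothetical names, and it assumes that each hit has already been mapped to such a segment set.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    species: str
    segment_id: str   # hypothetical ID of the orthologous segment set containing the hit
    e_value: float


def syntenic_support(hits, query_segment: Optional[str] = None):
    """Group hits by orthologous segment set and report
    (i) segment sets hit in two or more species, and
    (ii) whether the query's own segment set (if known) is among those hit."""
    species_by_segment = defaultdict(set)
    for h in hits:
        species_by_segment[h.segment_id].add(h.species)
    multi = {seg for seg, sp in species_by_segment.items() if len(sp) >= 2}
    context = {query_segment} if query_segment in species_by_segment else set()
    return {"multi_species": multi, "context_match": context}


# Toy hits with made-up labels; note the deliberately weak E-values.
hits = [Hit("spA", "seg7", 5.0), Hit("spB", "seg7", 8.0), Hit("spC", "seg2", 0.5)]
support = syntenic_support(hits, query_segment="seg2")
```

In this toy case the hits in `seg7` satisfy criterion (i) despite their high E-values, and the hit in `seg2` satisfies criterion (ii).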
SearchDOGS was initially developed as a standalone tool for displaying the syntenic contexts of the genomic hits obtained in a TBLASTN search using a single protein query. We then adapted it to carry out an automated and systematic search for unannotated genes in the genomes of all 11 yeast species in YGOB. Because the detection of a small or highly diverged gene in one species may in turn make it possible to detect further orthologs of this gene in other species when the first gene is used as a query, we ran successive iterations of SearchDOGS on the yeast genomes until the program failed to find any more new genes.
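The iterative strategy amounts to repeating the search until a fixed point is reached. A minimal sketch of such a loop is given below; `iterate_until_stable`, the `search_round` callback and the `max_rounds` safety cap are hypothetical names and parameters introduced for illustration, and the toy search function merely stands in for a full SearchDOGS run.

```python
def iterate_until_stable(search_round, known_genes, max_rounds=50):
    """Repeat `search_round` on the growing gene set until a round finds
    nothing new. Returns the final gene set and the number of productive rounds."""
    known = set(known_genes)
    rounds = 0
    while rounds < max_rounds:
        new = set(search_round(known)) - known
        if not new:
            break  # no further genes found, so stop iterating
        known |= new
        rounds += 1
    return known, rounds


def toy_search(known):
    """Stand-in for one search round: each known gene n reveals gene n + 1, up to 3."""
    return {g + 1 for g in known if g < 3}


final_genes, productive_rounds = iterate_until_stable(toy_search, {0})
```

In the toy example each round uses the genes found in the previous round as new queries, which mirrors the way a newly detected ortholog can expose further orthologs in other species.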