Adjacent genes in prokaryotic chromosomes tend to be separated by a short intergenic distance or to overlap by a few base pairs in a preferred phase [
6,
12,
14,
15]. Particularly common are overlaps in which the stop codon of the upstream gene overlaps the start codon of the downstream gene (overlaps of 1 or 4 bps) [
6,
7,
11,
14,
15,
18]. Overlapping genes represented around 17% (173,663 overlapping pairs) of the total gene pairs contained in 338 microbial genomes (1,016,129 gene pairs). Although this is a lower percentage than some authors have reported before [
6], overlapping genes are a consistent feature of prokaryotic chromosomes and are worthy of study. Of these 173,663 overlaps, we selected for our study the 42,055 in which both genes were well-characterized. Among the prokaryotic overlaps, co-directional overlaps were clearly the most frequent, reflecting the fact that this is the most common orientation of two adjacent prokaryotic genes [
18]. Furthermore, genes in prokaryotic chromosomes tend to be grouped into operons of functionally related genes, and the genes of a given operon are usually on the same strand [
19-
24]. In fact, co-directional overlaps represented around 92% (38,563 overlaps) of the well-characterized overlaps considered here, while convergent overlaps represented 7% (3,035) and divergent overlaps 1% (457). Of these overlaps, we chose a set of 968 overlaps longer than 60 bps that had consistent coordinates in three different databases.
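The orientation classes and overlap lengths used above can be derived from gene coordinates alone. The following is a minimal sketch, not the pipeline used in this study; the `Gene` record and both function names are hypothetical, and coordinates are assumed to be 1-based and inclusive on the forward strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    start: int   # leftmost coordinate on the forward strand (1-based, assumed)
    end: int     # rightmost coordinate, inclusive
    strand: str  # "+" or "-"


def overlap_length(a: Gene, b: Gene) -> int:
    """Number of shared base pairs; 0 if the genes do not overlap."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def orientation(a: Gene, b: Gene) -> str:
    """Relative orientation of two adjacent genes."""
    left, right = sorted((a, b), key=lambda g: g.start)
    if left.strand == right.strand:
        return "co-directional"
    # Left gene on "+" and right gene on "-": the 3'-ends face each other.
    return "convergent" if left.strand == "+" else "divergent"


# Illustrative pair: the stop codon of the upstream gene overlaps the
# start codon of the downstream gene by 4 bps.
upstream = Gene(1, 303, "+")
downstream = Gene(300, 599, "+")
print(orientation(upstream, downstream), overlap_length(upstream, downstream))
```

For co-directional genes whose lengths are multiples of 3, `overlap_length(a, b) % 3 == 0` would mean that both genes are read in the same frame, which is one way to see why such overlaps are avoided.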
Types of misannotation
We looked for functional overlaps among the 968 overlaps longer than 60 bps. Every gene of each overlapping pair was compared with its orthologs. If a gene differs in length from its orthologs, the overlap is probably not real and is caused by a sequencing or annotation error in one of the genes of the pair. Alternatively, the difference in gene length could mean that the overlap is real but not conserved and therefore not functional. Although we cannot definitively distinguish between these two possibilities, categorizing the long overlaps manually reveals patterns that provide hints. For a list of all the overlaps manually analysed here see Additional file
1.
First, we manually analysed 715 co-directional overlaps longer than 60 bps. Surprisingly, all of them fell into the following categories (Figure ):
i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at the 5'-end of the gene. The upstream gene had the same length as its orthologs, while the downstream gene was longer than its orthologs at the 5'-end. Furthermore, in all 409 cases classified, the downstream gene had alternative start codons downstream of the predicted initial codon, which could yield a product similar or even equal in length to its orthologs. These cases represented around 57% of the co-directional overlaps longer than 60 bps analysed, which suggests that the most important cause of long overlaps is misprediction of the start codon of a gene.
ii) Fragmentation of a gene caused by a frameshift. In these cases the upstream gene was longer than its orthologs at the 3'-end and the downstream gene was clearly shorter than its orthologs. Furthermore, in these 163 cases both members of the overlapping pair could be mapped to a single gene in a closely related species, suggesting that a frameshift mutation or sequencing error fragmented one gene into an overlapping pair. These cases represented around 23% of the co-directional overlaps longer than 60 bps analysed, making this the second most important group of misannotations.
iii) 3'-end extension of the upstream gene due to either a frameshift at the 3'-end of the gene or a point mutation at the stop codon. The upstream gene was longer than its orthologs at the 3'-end, whereas the downstream gene had a similar length to its orthologs. Either event may cause the loss of the stop codon, thus extending the reading frame to the next in-frame stop codon. We found 68 cases (9.5% of the co-directional overlaps analysed) that showed this pattern.
iv) Redundant gene prediction, where the genes overlap entirely or almost entirely and are in the same reading frame. This case is rare: we found only 4 gene pairs (0.5%), most of them labelled as putative genes.
v) 5' & 3'-end extension, a combination of i) and iii). The upstream gene was longer than its orthologs at the 3'-end and the downstream gene was longer than its orthologs at the 5'-end. We classified 71 overlaps (10%) in this group.
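The five categories above amount to a small set of decision rules on how each gene of the pair differs in length from its orthologs. The sketch below restates those rules in code for illustration only; the classification in this study was done manually, and the function name, its parameters and the tolerance `tol` are assumptions rather than values taken from the analysis.

```python
def classify_codirectional(upstream_3p_extra: int,
                           downstream_5p_extra: int,
                           downstream_shortfall: int,
                           nested_same_frame: bool = False,
                           tol: int = 15) -> str:
    """Assign a long co-directional overlap to one of the categories i) to v).

    upstream_3p_extra:    bps by which the upstream gene exceeds its orthologs at the 3'-end
    downstream_5p_extra:  bps by which the downstream gene exceeds its orthologs at the 5'-end
    downstream_shortfall: bps by which the downstream gene is shorter than its orthologs
    nested_same_frame:    genes overlap (almost) entirely in the same reading frame
    tol:                  hypothetical tolerance (bps) below which lengths count as equal
    """
    if nested_same_frame:
        return "iv) redundant gene prediction"
    upstream_extended = upstream_3p_extra > tol
    downstream_extended = downstream_5p_extra > tol
    if upstream_extended and downstream_shortfall > tol:
        return "ii) fragmentation"
    if upstream_extended and downstream_extended:
        return "v) 5' & 3'-end extension"
    if downstream_extended:
        return "i) 5'-end extension"
    if upstream_extended:
        return "iii) 3'-end extension"
    return "unclassified"


# Upstream gene matches its orthologs; downstream gene is 90 bps too long at the 5'-end.
print(classify_codirectional(0, 90, 0))
```

A pair that returns "unclassified" under these rules would be a candidate for a real overlap and would still need manual inspection.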
Regarding overlap lengths, the mean overlap length of the 5', 3' and 5' & 3'-end extension groups was 104, 121 and 106 bps respectively. The mean overlap length of the fragmentation type, however, was 162 bps, so this type of misannotation appears to cause longer overlaps. In this comparison we did not take into account the overlaps caused by redundant gene prediction, because in those cases the gene pair overlaps entirely or almost entirely and this type of misannotation occurs very rarely.
Although we focused mainly on the co-directional orientation, we also examined the long overlaps in the other orientations, specifically 75 divergent overlaps and 178 convergent overlaps longer than 60 bps. All the divergent long overlaps belonged to group i), which means that all of them were misannotations due to a 5'-end extension of one or both genes of the divergent overlap. Among the convergent overlaps, however, we found putative true overlaps. As other authors have reported before [
14], conserved convergent overlaps are affected by annotation errors to a lesser extent because they are not subject to the high rate of misannotated start codons. Nevertheless, we could classify 124 convergent overlaps into group iii) as misannotations. Misannotations therefore also affect convergent overlaps, particularly those caused by a 3'-end extension in one or both genes of the pair. The other 54 convergent overlaps might be real, although most of them are conserved only in very closely related species.
Thus, we can now suggest ways to correct 914 gene pairs and clear the respective overlaps that result from misannotations. These overlaps caused by misannotations represent around 2% of the overlaps between well-characterized genes (42,055) and are therefore worth taking into account in annotation processes.
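The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Co-directional long overlaps by category, as reported in the text.
codirectional = {"i": 409, "ii": 163, "iii": 68, "iv": 4, "v": 71}
divergent_misannotated = 75
convergent_misannotated = 124
convergent_possibly_real = 54
well_characterized_overlaps = 42055

total_codirectional = sum(codirectional.values())
total_long = (total_codirectional + divergent_misannotated
              + convergent_misannotated + convergent_possibly_real)
total_misannotated = (total_codirectional + divergent_misannotated
                      + convergent_misannotated)

# Share of each category among the co-directional long overlaps.
for category, count in codirectional.items():
    print(f"{category}: {100 * count / total_codirectional:.1f}%")

print(total_codirectional, total_long, total_misannotated)
print(f"{100 * total_misannotated / well_characterized_overlaps:.1f}%")
```

The totals reproduce the 715 co-directional overlaps, the 968 long overlaps and the 914 correctable gene pairs, and the last line gives the share of about 2% quoted above.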
Misannotations in prokaryotic genomes
As expected, the number of overlaps decreases with increasing overlap length (Figure ). Equally expected is the avoidance of overlaps whose length is a multiple of 3 bps for adjacent co-directional genes [
6,
14,
15]. Although Figure shows convergent and divergent overlaps whose length is a multiple of 3 bps, no co-directional overlap with such a length was found. For co-directional overlaps, we also examined whether particular genomes stood out because of their annotation protocols. Indeed, in some genomes long overlaps are more abundant, with
Brucella melitensis 16M leading with 38 likely misannotated events. Interestingly, 25 of those pairs were due to fragmentations [see Additional file
2]. Second in the list is
Rhodopirellula baltica SH1, which has an unusual genome. It contains 28 misannotated overlaps, 26 of which are due to 5' or 5' & 3'-end extensions, and it is the genome with the most misannotated divergent overlaps. We also observed that Xanthomonas genomes accumulated a high number of misannotations. The initial mispredictions in the first Xanthomonas genomes sequenced were probably propagated within this taxon because of the high sequence similarity among their genomes. For a list of 27 genomes with a high number of overlaps see Additional file
3.
We tried to further identify reasons that might cause frameshifts and misannotations in the genome projects [see Additional file
3]. The genomes that accumulate a high number of errors are neither the largest nor those with the highest gene content. For instance, the
Brucella melitensis 16M chromosome has 3,294,931 nucleotides and 3,198 predicted genes and accumulated 38 misannotations, whereas the
Vibrio vulnificus YJ016 chromosome has 5,211,578 nucleotides and 5,098 predicted genes but accumulated only 12 annotation errors. A high AT content could be related to a high number of mispredicted start codons. However, no correlation between a high number of misannotations and a high percentage of AT was observed. We also did not observe any clear bias towards any sequencing or annotation method, although 6 of the 28 worst-annotated genomes were annotated with the Glimmer predictor [
25] exclusively. However, using a particular gene predictor, or a combination of different gene predictors, does not guarantee that the types of misannotations described here are avoided. The number of misannotations could also be related to the sequencing date. On the one hand, an early sequencing date could be related to a high number of misannotations because less mature technologies and tools were used. On the other hand, a recent sequencing date could be related to a high number of misannotations because of lower coverage and a higher degree of automation. However, no trend was observed in the number of misannotations with respect to the sequencing date.
Mispredicted start codons
5'-end extensions account for the highest number of misannotations. They arise from mispredicted start codons or upstream frameshifts, of which the former is clearly dominant (data not shown). The main problem in the annotation of real genes is therefore the misprediction of start codons. Most genes tend to start with AUG, while the alternatives GUG and UUG are used sparingly [
16]. AUG is a more potent initiator than GUG or UUG [
26], which are considered weak start codons. To quantify the observed effect on start codon usage, we compared the start codons of potentially misannotated genes with those of randomly chosen microbial genes. The genes with putative mispredicted start codons (the genes with a 5'-end extension from misannotation categories i) and v) and from the group of misannotated divergent overlaps) had alternative start codons (AUG, GUG or UUG) downstream in the sequence. This could indicate that a gene with a mispredicted start codon has the correct one nearby. Furthermore, we observed highly significant differences (P < 0.0001, chi-square analysis) in start codon usage between genes with a putative mispredicted start codon and a random set of genes. The weak start codons (GUG, UUG) appear to be overrepresented among the genes with putative mispredicted start codons [see Additional file
4]. Of the 579 genes that could potentially have a mispredicted start codon, 270 start with AUG, whereas 172 and 133 start with GUG and UUG respectively. In contrast, among the random sets of genes around 462 start with AUG, whereas only around 77 and 38 start with GUG and UUG respectively. Therefore, a long overlap in conjunction with the use of a weak start codon could be a sign that the 5'-end of an ORF has been mispredicted, and this should be taken into account by annotation algorithms. Some previous Shine-Dalgarno (SD) studies agree with this finding. Starmer
et al. explained genome annotation errors by a bias in start codon prediction towards the usage of GUG instead of AUG [
27], whereas a previous study performed by Ma
et al. [
16] found in
E. coli K12 a significant group of genes which start with GUG or UUG, lack an SD sequence and hence were erroneously annotated as putative or hypothetical proteins.
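The difference in start codon usage reported above can be illustrated with a chi-square test of independence on the stated counts. This is a minimal sketch under two assumptions: the approximate counts given for the random sets are treated as exact, and only the three listed codons are tabulated. It may therefore differ in detail from the analysis actually performed. The function name is illustrative.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Columns: AUG, GUG, UUG
start_codon_counts = [
    [270, 172, 133],  # genes with a putative mispredicted start codon
    [462, 77, 38],    # random set of genes (approximate counts from the text)
]

statistic, dof = chi_square_independence(start_codon_counts)
# With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-statistic / 2)
print(f"chi2 = {statistic:.1f}, dof = {dof}, P < 0.0001: {p_value < 0.0001}")
```

On these counts the P value falls far below 0.0001, in line with the significance level reported above.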
The longest real co-directional overlap
When studying co-directional overlaps below 60 bps, the longest real one we could identify was formed by two co-directional genes coding for the DNA polymerase psi subunit (
holD) and an alanine acetyltransferase (
rimI). Figure shows the alignment of the C-terminal end of the DNA polymerase psi subunit and the N-terminal end of the alanine acetyltransferase, as well as the arrangement of the overlapping regions and the amino acid conservation within the overlap among three representative Enterobacteria species. This figure highlights the high similarity among the Enterobacteria orthologs at the C-terminal end of the protein encoded by the
holD gene, at the N-terminal end of the protein encoded by the
rimI gene and within the overlapping region at the nucleotide sequence level. This overlap was previously reported to be 32 bps long in
Escherichia coli [
28], which would correspond to around 10 overlapping amino acids; however, orthologous gene pairs in the Yersinia and Salmonella genomes reached 56 bps, which would correspond to overlaps of about 18 amino acids. Although the exact gene length seems genus-specific, this particular overlap is well conserved among Enterobacteria and is therefore unlikely to be due to the kinds of misannotation reported here.