Although intense effort has gone into determining the correct functional annotation of proteins, primary gene structures are still imperfect. Proteomics provides a powerful experimental data type to assess and improve the quality of genome annotation. A key advantage is the direct correlation between protein annotation and a protein-based assay. In this study, analysis of 46 genomes spanning eight bacterial and archaeal phyla across the tree of life allowed us to develop a robust approach for proteogenomic annotation that is functional across genomes varying in %GC, gene content, proteomic sampling depth, phylogeny, and genome size. In proteogenomics, specificity proves more important than sensitivity, and leniency in the hope of greater genome coverage can dramatically increase the chance of false-positive novel protein identifications. We evaluate the quality of proposed proteogenomic corrections through the conflict report. While this by no means implies that overlapping proteins are not real or cannot be found by proteogenomics, the vast majority of novel proteins with significant overlap were low quality and were weeded out by stringent filters.
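The overlap check behind such a conflict report can be illustrated with a minimal sketch. It is not the pipeline used in this study: the `Locus` type, the function names, and the 0.2 overlap threshold are hypothetical choices made for illustration, and strand and reading frame are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """A genomic interval; coordinates are 1-based and inclusive (assumed convention)."""
    name: str
    start: int
    end: int


def overlap_fraction(novel: Locus, annotated: Locus) -> float:
    """Fraction of the novel ORF's length that is shared with an annotated gene."""
    shared = min(novel.end, annotated.end) - max(novel.start, annotated.start) + 1
    return max(shared, 0) / (novel.end - novel.start + 1)


def conflict_report(novel_orfs, annotated_genes, max_overlap=0.2):
    """List (novel, annotated, fraction) triples whose overlap exceeds the threshold.

    max_overlap is an illustrative value, not a threshold taken from the study.
    """
    report = []
    for orf in novel_orfs:
        for gene in annotated_genes:
            frac = overlap_fraction(orf, gene)
            if frac > max_overlap:
                report.append((orf.name, gene.name, round(frac, 3)))
    return report


# Toy coordinates, invented purely to exercise the functions.
novel = [Locus("novel_A", 100, 199), Locus("novel_B", 400, 499)]
genes = [Locus("gene_1", 150, 300), Locus("gene_2", 480, 600)]
conflicts = conflict_report(novel, genes)
```

In this toy input, `novel_A` shares half of its length with `gene_1` and is flagged, whereas `novel_B` sits exactly at the threshold and is not. A flagged pair would then be examined further rather than discarded outright, since genuine overlapping proteins do exist.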
Our effort to understand why genes are missed in the initial annotation revealed that the only consistent problem was the expected decline in sensitivity and specificity for short proteins. Given the diversity of the other errors, we suggest that all genome annotations leverage proteomics, either through concurrent proteomic and genomic sampling or by using the compendium of proteomically verified ORFs as part of their extrinsic evidence set (i.e., in addition to BLAST or HMM searches).
For pseudogenes, we showed three types of misannotation, each resulting from a different deficiency in the sequencing and annotation process. Resolving the annotation of these is difficult, partly because of the potential for genome sequence errors. More pointedly, there is no consensus on the meaning of ‘pseudogene’: whether ‘non-functional’ applies to the translated product's biochemical function or to the ability of a genomic locus to produce a viable transcript that is translated. While this discussion is outside the scope of this work, our perspective as proteomic scientists is that all translated products should be included in common database downloads.
We focused largely on false-negative annotations, where a region of DNA was not annotated as protein-coding but should have been. False-positive annotations are a more difficult class of misannotation; we find them as novel/dubious pairs in the data, and they are more apparent for some genomes. These dubious genes can have far-reaching effects, as they propagate through future genome annotations in what is known as “transitive disaster”.
As a final part of our methodology, we analyze the datasets to discover in vivo protein cleavage. Proteomic determination of cleavage sites offers several distinct advantages over strictly computational approaches that predict cleavage events directly from sequence. In addition to providing experimental validation of cleavage, proteomics yields a broad and unbiased sample of cleaved proteins. For example, in the Geobacter datasets over 150 proteins were identified as having a signal peptide, yet the overlap among the three members of this genus was only nine proteins. Thus, a large number of distinct proteins were identified. This diverse set could serve as a powerful training and testing set to improve computational tools.