To test the accuracy, we reassigned K numbers to selected organisms in the manually curated KEGG GENES database. We show the results for Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae and Escherichia coli, in which 25.2, 11.6, 32.1 and 63.3% of the genes, respectively, are currently annotated with K numbers. The two tables captioned below list the sensitivity, specificity, positive predictive value (PPV), and precision for the selected organisms with the BBH method. The whole set of KEGG GENES and the representative set excluding the query genome itself were used as the reference data for the first and the second table, respectively.
Accuracy of K number assignment by KAAS with the BBH method and the whole set of KEGG GENES
Accuracy of K number assignment by KAAS with the BBH method and the representative set
With the whole set of KEGG GENES as the reference, the PPV of human gene reassignment was more than 90%. When the test set was limited to genes with KO annotations, 98% of human genes were correctly annotated. For E. coli, the accuracy of the reassignment is higher than that for human, because the KEGG GENES database contains many organisms closely related to E. coli. The PPV for Arabidopsis is ~50%, because there are no plants in the KEGG GENES database and many Arabidopsis genes are left unannotated. Because the KO is not defined by sequence similarity alone, some KOs contain members with similar sequences. In such cases, KAAS may not assign the appropriate KOs to genes.
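The measures discussed above can be illustrated with a minimal Python sketch. It is not the evaluation code used for KAAS. The counting rules are assumptions made for illustration: a gene whose reassigned K number matches the curated one is a true positive, a gene given a K number that differs from the curated one (or given one when none is curated) is a false positive, a curated gene left unassigned is a false negative, and an uncurated gene left unassigned is a true negative. The function names and the toy gene identifiers are hypothetical.

```python
from typing import Dict, Optional


def confusion_counts(curated: Dict[str, Optional[str]],
                     predicted: Dict[str, str]) -> Dict[str, int]:
    """Count TP/FP/FN/TN under the assumed rules described in the lead-in.

    curated maps every gene to its curated K number, or None if unannotated.
    predicted maps only the genes that received a K number.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for gene, truth in curated.items():
        guess = predicted.get(gene)
        if guess is None:
            # No K number was assigned to this gene.
            counts["tn" if truth is None else "fn"] += 1
        elif guess == truth:
            counts["tp"] += 1
        else:
            # Wrong K number, or a K number for an uncurated gene (assumption).
            counts["fp"] += 1
    return counts


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def accuracy_measures(c: Dict[str, int]) -> Dict[str, float]:
    """Standard definitions of sensitivity, specificity and PPV."""
    return {
        "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),
        "specificity": ratio(c["tn"], c["tn"] + c["fp"]),
        "ppv": ratio(c["tp"], c["tp"] + c["fp"]),
    }


def correct_among_annotated(curated: Dict[str, Optional[str]],
                            predicted: Dict[str, str]) -> float:
    """Fraction of curated genes whose reassigned K number matches."""
    annotated = [g for g, k in curated.items() if k is not None]
    correct = sum(1 for g in annotated if predicted.get(g) == curated[g])
    return ratio(correct, len(annotated))


# Toy data (hypothetical gene identifiers and K numbers).
curated = {"g1": "K00001", "g2": "K00002", "g3": None,
           "g4": None, "g5": "K00003"}
predicted = {"g1": "K00001", "g2": "K00009", "g4": "K00004"}

counts = confusion_counts(curated, predicted)
measures = accuracy_measures(counts)
restricted = correct_among_annotated(curated, predicted)
```

Under these assumed rules, restricting the test set to genes with KO annotations removes the false positives that arise from uncurated genes, which is one way a restricted figure can exceed the PPV computed over all genes.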
With the representative set, the genes were annotated without a drastic loss of accuracy compared with the whole set. The computation time was about one-tenth of that with the whole set for E. coli and about one-seventh for the selected eukaryotes. For human and yeast, the accuracy of annotation was equal to or slightly better than that with the whole set of KEGG GENES. For Arabidopsis, the accuracy of annotation went down because the number of related organisms contained in the reference data was reduced. The sensitivity for E. coli went down because the representative set for prokaryotes excludes closely related organisms. KAAS is useful as a rapid, high-performance tool for annotating forthcoming genomes, because the KEGG GENES database now contains many taxa that can serve as closely related reference organisms. For plants, the accuracy of assignment should improve as more plant genome projects are processed.