Since the second release of the COG database in January 2000 (3),
nine new genomes have been added to the database. New members of
pre-existing COGs were identified using the COGNITOR program with
subsequent manual validation, and new COGs were constructed using
previously described procedures. The additions included
the first sequenced genome of a crenarchaeon (representative of
the second major division of the archaea),
Aeropyrum pernix; a fifth representative of the Euryarchaea,
Pyrococcus abyssi; and seven bacterial genomes, including those from
unusual organisms such as the extremely radioresistant
Deinococcus radiodurans (Table 1). The previously described
trend held for the new genomes: 60–80% of the
proteins from each of the prokaryotic genomes could be included
in COGs (Table 1).
| Table 1. Representation of genomes in the COGs |
The genome of the crenarchaeon A.pernix (4), which was of particular
interest because this major evolutionary lineage had not been previously
represented among completely sequenced genomes, was investigated in detail as
a benchmark for annotation of newly sequenced genomes using the
COG system (5). The COG analysis resulted in an ~50% increase
in confident functional predictions for A.pernix genes
compared to the original annotations. On the other hand, a significant fraction
of open reading frames (ORFs) originally annotated as genes did
not show detectable similarity to any proteins in current databases
but overlapped with genes encoding proteins included in the COGs, strongly
suggesting that these ORFs were not real genes (Table 2).
Thus the analysis of the genome of an organism that had no close
relatives among other organisms with sequenced genomes appears to
corroborate the effectiveness of the COG system as a genome annotation
tool.
| Table 2. Analysis of the predicted A.pernix proteins using the COG system |
Given the accumulation of multiple, complete genome sequences,
we were interested in the growth dynamics of the COG set with the
increased number of included genomes. The growth curve was constructed
by simulating COG formation for each of 10^6 random
orders of genome inclusion (Fig. ). For
each number of species, the maximum, the minimum and the average
number of COGs were determined. The minimal and the maximal curves
define the area containing all possible growth curves (Fig. ). The average
curve approximates the expected
dynamics of COG growth. Given that the number of completely
sequenced genomes is still relatively small and that some of them
are closely related, it remains uncertain whether or not the number
of COGs is starting to approach saturation and, if it is, what
the asymptotic value is.
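
The random-order procedure described above can be illustrated with a minimal sketch. It is not the code used for the analysis. It assumes a simplified model in which each COG is reduced to the set of genomes it is represented in, and a COG counts as formed once at least three of those genomes have been included. The function name `cog_growth_curves`, its parameters and the toy data are hypothetical, and the actual COG construction from genome-specific best hits, as well as the grouping of closely related genomes, is ignored.

```python
import random


def cog_growth_curves(cog_to_genomes, n_orders=1000, min_genomes=3, seed=0):
    """Return (minimum, average, maximum) COG counts per number of genomes.

    cog_to_genomes maps a COG identifier to the set of genomes in which it
    is represented (a simplified stand-in for real COG membership).
    """
    rng = random.Random(seed)
    genomes = sorted({g for members in cog_to_genomes.values() for g in members})

    # Invert the mapping once so each added genome updates only its own COGs.
    genome_to_cogs = {g: [] for g in genomes}
    for cog, members in cog_to_genomes.items():
        for g in members:
            genome_to_cogs[g].append(cog)

    n = len(genomes)
    minimum = [float("inf")] * n
    maximum = [0] * n
    total = [0] * n

    for _ in range(n_orders):
        order = genomes[:]
        rng.shuffle(order)  # one random order of genome inclusion
        seen = dict.fromkeys(cog_to_genomes, 0)
        formed = 0
        for k, genome in enumerate(order):
            for cog in genome_to_cogs[genome]:
                seen[cog] += 1
                if seen[cog] == min_genomes:
                    formed += 1  # this COG now spans enough genomes
            minimum[k] = min(minimum[k], formed)
            maximum[k] = max(maximum[k], formed)
            total[k] += formed

    average = [t / n_orders for t in total]
    return minimum, average, maximum


# Toy input: three hypothetical COGs over four hypothetical genomes.
toy = {
    "cog1": {"A", "B", "C"},
    "cog2": {"A", "B", "C", "D"},
    "cog3": {"B", "C", "D"},
}
lowest, mean, highest = cog_growth_curves(toy, n_orders=200)
```

In this sketch the minimal and maximal curves bound only the sampled orders, so they approach the envelope of all possible growth curves as the number of sampled orders grows.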