COGs have been identified on the basis of an all-against-all sequence comparison of the proteins encoded in complete genomes using the gapped BLAST program (9) after masking low-complexity and predicted coiled-coil regions (7). The COG construction procedure is based on the simple notion that any group of at least three proteins from distant genomes that are more similar to each other than they are to any other proteins from the same genomes are most likely to belong to an orthologous family. This prediction holds even if the absolute level of sequence similarity between the proteins in question is relatively low and thus the COG approach accommodates both slow-evolving and fast-evolving genes. Briefly, COG construction includes the following steps.
1. Perform the all-against-all protein sequence comparison.
2. Detect and collapse obvious paralogs, that is, proteins from the same genome that are more similar to each other than to any proteins from other species.
3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2.
4. Merge triangles with a common side to form COGs.
5. Analyze each COG case by case. This analysis serves to eliminate false positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments and steps 1–4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities.
6. Examine large COGs that include multiple members from all or several of the genomes, using phylogenetic trees, cluster analysis and visual inspection of alignments; as a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs.
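Steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the original COG software: the function names, the input layout (a table of genome-specific best hits keyed by query protein and target genome) and the toy protein identifiers are all hypothetical. For simplicity, the sketch also assumes that every side of a triangle must be a symmetric BeT, and it omits the paralog collapsing of step 2.

```python
from itertools import combinations

def find_triangles(genome_of, best_hit):
    """Return triangles of proteins from three distinct genomes.

    genome_of: protein -> genome (hypothetical input layout).
    best_hit:  (protein, target genome) -> best-hit protein in that genome.
    Assumption of this sketch: each side of a triangle is a symmetric BeT.
    """
    def is_bet(p, q):
        # q is the best hit of p in q's genome
        return best_hit.get((p, genome_of[q])) == q

    def symmetric(p, q):
        return is_bet(p, q) and is_bet(q, p)

    triangles = []
    for a, b, c in combinations(sorted(genome_of), 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue  # the three proteins must come from three genomes
        if symmetric(a, b) and symmetric(b, c) and symmetric(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into clusters."""
    parent = list(range(len(triangles)))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:  # common side
            parent[find(i)] = find(j)

    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(c) for c in clusters.values())

# Toy data: one protein in each of genomes A-D, all mutual best hits,
# plus a2, whose best hit in genome B is not reciprocated.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
best_hit = {}
for p, q in combinations(["a1", "b1", "c1", "d1"], 2):
    best_hit[(p, genome_of[q])] = q
    best_hit[(q, genome_of[p])] = p
best_hit[("a2", "B")] = "b1"

triangles = find_triangles(genome_of, best_hit)
clusters = merge_triangles(triangles)
print(clusters)
```

In this toy case the four triangles formed by a1, b1, c1 and d1 all share sides and merge into a single cluster, whereas a2 joins no triangle because its best hit is not reciprocated.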
By design, a minimal COG includes three genes from distinct phylogenetic lineages (protein sets from closely related species, such as Mycoplasma genitalium and Mycoplasma pneumoniae, were merged prior to COG construction). The approach used for the construction of COGs does not supplant a comprehensive phylogenetic analysis. Nevertheless, it provides a fast and convenient short-cut to delineate a large number of families that most likely consist of orthologs.
Once the COGs have been identified using the above procedure, new members can be added using the COGNITOR program, which is based on the same idea of consistency between genome-specific best hits. If a protein sequence, when compared to the COG database, gives two or more genome-specific best hits into a given COG, the protein in question is a candidate member of that COG.
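The two-best-hit rule can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not the COGNITOR program itself: the function name, the `min_hits` parameter, the input dictionaries and the identifiers in the example are hypothetical.

```python
from collections import Counter

def candidate_cogs(best_hits_by_genome, cog_of, min_hits=2):
    """Return COGs that receive at least `min_hits` genome-specific best hits.

    best_hits_by_genome: genome -> best-hit protein of the query in that genome.
    cog_of: protein -> COG identifier, for proteins already assigned to a COG.
    Both layouts are assumptions made for this sketch.
    """
    counts = Counter(
        cog_of[hit]
        for hit in best_hits_by_genome.values()
        if hit in cog_of  # best hits outside any COG are ignored
    )
    return sorted(cog for cog, n in counts.items() if n >= min_hits)

# Toy example: the query has best hits in four genomes, two of which
# fall into the same COG, so only that COG becomes a candidate.
cog_of = {"p1": "COG_A", "p2": "COG_A", "p3": "COG_B"}
query_hits = {"genome1": "p1", "genome2": "p2", "genome3": "p3", "genome4": "p4"}
candidates = candidate_cogs(query_hits, cog_of)
print(candidates)
```

A candidate returned by such a filter would still require the case-by-case examination of alignments before the protein is accepted into the COG.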
To create the current set of COGs, the COGNITOR program was used to fit the protein sets from 12 complete bacterial and archaeal genomes into the 860 previously delineated COGs. The candidate COG members identified using the two-best-hit approach were further evaluated by a case-by-case examination of sequence alignments to verify the significance of the relationships and the conservation of salient features of the proteins in the COGs, such as domain architecture and active centers of enzymes. Proteins from the 12 new genomes that could not be included in the pre-existing COGs were analyzed using the original procedure for COG construction. The newly formed COGs were combined with the pre-existing ones to form the updated COG collection.