Next-generation sequencing allows large sets of bacterial genomes from the same species to be generated for multiple strain comparisons. The observation that, for some species, strains can acquire and lose large portions of their protein repertoire led to the concept of the pan-genome (1). The most fundamental pan-genome analysis is to compare differences in protein content between strains. To determine these differences, a correspondence between equivalent proteins in different strains must be established. The most common meaning of equivalent protein is a protein’s ortholog. Orthologs are defined as homologous genes that are related through speciation from a single ancestral gene, not through gene duplication (3). Orthologs tend to serve the same role and have the same function, particularly when the organisms are closely related. Furthermore, for pan-genome analysis of closely related strains, ‘operational’ equivalence is more desirable than functional equivalence alone since, for example, two copies of a nearly identical protein are likely functionally equivalent, but could be under differential regulation. The copies under similar transcriptional regulation (i.e. in similar genomic neighborhoods) are likely to be the operationally equivalent ones; therefore, pan-genome analysis software should consider the genomic neighborhood of orthologous genes. When a gene is duplicated after speciation, or in species pan-genomes after strain differentiation, both copies of the gene are defined to be co-orthologs of the unduplicated gene in the other species or strains. For pan-genome analysis, we believe it is preferable to cluster only the co-orthologs with the same genomic context, but additional information should be reported indicating the co-ortholog relationship.
In general, determining orthologs is a hard problem (4–6) and has most often been investigated across species where evolutionary time has allowed for a great deal of protein sequence and genome context divergence. For greatly diverged species, genome context has been found to have little benefit for ortholog clustering (7). The key issue is distinguishing paralogs, homologous genes arising from gene duplications, from orthologs. Often, after gene duplication, paralogs diverge to take on different roles and functions. For diverged species, tree-based methods tend to perform best at ortholog clustering, albeit at the cost of being much less computationally efficient. The reason is that tree-based methods build multiple sequence alignments that can distinguish which amino acid residues are conserved within orthologs, but not between paralogs, even when the average pairwise alignment scores for orthologs versus paralogs may be indistinguishable. Graph-based methods rely only on pairwise alignment scores, which are much more computationally efficient to generate, and can suffer by comparison. For strains of the same species, the orthologous proteins tend to have little divergence and retain a conserved genome context. Paralogs that have diverged are easily distinguishable from the highly conserved orthologs by simple pairwise distances. Very recently duplicated paralogs are often indistinguishable even using tree-based methods, but are separable based on genome context. The pan-genome ortholog clustering tool (PanOCT) was designed to make use of this genome context or conserved gene neighborhood (CGN) information to better separate very recent paralogs.
There are a number of commonly used programs for determining orthologous gene clusters, but they were designed for clustering genes from distantly related eukaryotes, not closely related strains/species. These ortholog-finding programs fall into three conceptual classes: tree-based, graph-based and hybrid methods (4). Tree-based methods infer orthologs and paralogs by comparing trees made with homologous genes to species trees. Graph-based methods use pairwise alignments to determine homology/distance between proteins and use these values to weight the edges of a graph. Hybrid methods use a combination of tree- and graph-based methods. Mainly for computational efficiency, but also for availability, the graph-based InParanoid (8), OrthoMCL (9) and Sybil (10) ortholog clustering programs are often used for comparative genomic analysis (11–16).
PanOCT is a graph-based method, but differs from existing methods in its use of both the Basic Local Alignment Search Tool (BLAST) score ratio (BSR) (17) and CGN in a weighted scoring scheme to generate clusters containing single orthologous genes from each of multiple genomes, and in detecting and accounting for potential frame-shifts. The concept of using the context of neighboring genes that are themselves orthologous to identify orthologs is not new (7); however, coupling CGN with pairwise sequence identity and frame-shift detection to cluster orthologs in a single open-source application is novel. Algorithms have been developed that use both reciprocal best hit (RBH) and CGN, but they are either used only as the back-end of a static database (ATGC, (19)), used to score and visualize the genomic context of homology ‘pillars’ in a web browser (YGOB, (20)), or used to re-cluster pre-computed ortholog/paralog clusters using CGN (IONS, (21)). Direct comparison with ATGC was not possible since the application was unavailable. PanOCT was compared with three popular graph-based programs: InParanoid (8), OrthoMCL (9) and Sybil (10), alone and in combination with IONS (21). GOB, the back-end CGN-detection script of YGOB (20), was obtained from the author. Using only ortholog clusters that were the same for InParanoid, OrthoMCL, Sybil and PanOCT as the pillars to input to GOB, the output of GOB was also compared with PanOCT.
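To make the BSR mentioned above concrete, the following is a minimal sketch. It assumes the usual definition of the BSR: the bit score of a query aligned to a subject, divided by the bit score of the query aligned to itself, which typically gives a value between 0 and 1. The function name and the example scores are hypothetical and are not taken from the PanOCT source code.

```python
def blast_score_ratio(match_bits, self_bits):
    """Return match score / self score for one query protein.

    match_bits: bit score of the query aligned to a subject protein.
    self_bits:  bit score of the query aligned to itself (hypothetical input;
                in practice it would come from a self-versus-self BLASTP run).
    """
    if self_bits <= 0:
        raise ValueError("self score must be positive")
    return match_bits / self_bits


# Hypothetical scores: a near-identical ortholog and a diverged paralog.
ortholog_bsr = blast_score_ratio(match_bits=490.0, self_bits=500.0)
paralog_bsr = blast_score_ratio(match_bits=150.0, self_bits=500.0)
print(ortholog_bsr, paralog_bsr)
```

Normalizing by the self score makes matches comparable across proteins of different lengths, which raw bit scores are not.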
InParanoid (8) tries to distinguish out-paralogs (i.e. duplications occurring before a species split) from in-paralogs (i.e. recent duplications after a species split) using a combination of RBH, also known as bi-directional best hit, and a heuristic clustering method for resolving overlapping groups of paralogs. A pairwise BLASTP cutoff score of 50 bits and an overlap cutoff of 50% are required for further consideration of orthology.
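The RBH criterion with the two cutoffs above can be sketched as follows. This is an illustrative reduction, not InParanoid's implementation: it omits the heuristic in-paralog clustering, and it assumes that the cutoffs are inclusive ("at least"). The tuple layout `(query, subject, bit_score, overlap_fraction)` and all protein names are hypothetical.

```python
def best_hits(hits, min_bits=50.0, min_overlap=0.5):
    """Map each query to its highest-scoring subject among hits passing both cutoffs."""
    best = {}
    for query, subject, bits, overlap in hits:
        if bits < min_bits or overlap < min_overlap:
            continue  # fails the bit score or overlap cutoff
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_bits=50.0, min_overlap=0.5):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b, min_bits, min_overlap)
    best_ba = best_hits(b_vs_a, min_bits, min_overlap)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical BLASTP results between genome A and genome B.
a_vs_b = [("a1", "b1", 200.0, 0.90), ("a1", "b2", 120.0, 0.80),
          ("a2", "b2", 150.0, 0.95), ("a3", "b3", 45.0, 0.90),
          ("a4", "b4", 300.0, 0.30)]
b_vs_a = [("b1", "a1", 198.0, 0.90), ("b2", "a2", 149.0, 0.95),
          ("b2", "a1", 119.0, 0.80), ("b3", "a3", 46.0, 0.90),
          ("b4", "a4", 300.0, 0.30)]
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

In this toy input, the a3/b3 pair is dropped by the bit score cutoff and the a4/b4 pair by the overlap cutoff, even though each is the other's only hit.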
OrthoMCL (9) tries to distinguish in-paralogs from out-paralogs similarly to InParanoid. This program also uses RBH BLASTP matches to identify orthology, but uses a BLASTP P-value cutoff of 1 × 10⁻⁵ instead of the bit score cutoff and does not consider the length of the match. Potential orthologous and paralogous protein relationships are converted into a graph with weighted edges. The resulting graph is used as input to the Markov Cluster algorithm (22) to attempt to separate orthologs from paralogs.
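The Markov Cluster step can be illustrated with a small pure-Python sketch of the general algorithm: alternate expansion (squaring a column-stochastic matrix) and inflation (raising entries to a power and re-normalizing), then read clusters from the non-zero rows. This is not OrthoMCL's code; the edge weights, the self-loop rule, the inflation value of 2.0 and the fixed iteration count are all assumptions made for illustration.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (columns of zeros stay zero)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_cluster(nodes, edges, inflation=2.0, iterations=30):
    """Cluster an undirected weighted graph given as {(node_a, node_b): weight}."""
    n = len(nodes)
    index = {name: i for i, name in enumerate(nodes)}
    m = [[0.0] * n for _ in range(n)]
    for (a, b), weight in edges.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = weight
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0  # assumed self-loop weight
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: one step of matrix multiplication (m x m).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then column re-normalization.
        m = normalize_columns([[v ** inflation for v in row] for row in expanded])
    clusters = {frozenset(nodes[j] for j, v in enumerate(row) if v > 1e-6)
                for row in m}
    clusters.discard(frozenset())
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical similarity graph: one three-genome family and one two-genome family.
proteins = ["a1", "b1", "c1", "a2", "b2"]
similarities = {("a1", "b1"): 1.0, ("a1", "c1"): 1.0, ("b1", "c1"): 1.0,
                ("a2", "b2"): 1.0}
families = markov_cluster(proteins, similarities)
print(families)
```

Higher inflation values tend to produce smaller, tighter clusters, which is the main tuning parameter in this family of methods.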
Sybil (10) clusters are computed in a two-step process: Jaccard coefficient-based clustering of the proteins within a genome to determine paralogs, and RBH BLASTP match clustering of the resulting Jaccard clusters (JACs) between genomes to determine orthologs. The Jaccard clustering step computes a similarity coefficient from filtered intra-genome unidirectional pairwise BLASTP matches (E-value of at most 1 × 10⁻⁵ and a percent identity of at least 80%), resulting in clusters of in-paralogs called JACs. RBH matches of JACs from different genomes are then clustered to form Jaccard orthologous clusters. Similar to InParanoid and OrthoMCL, Sybil clusters in-paralogs with orthologs, but the JAC parameters can be set to effectively exclude Jaccard clustering results, creating ortholog-only clusters based solely on the RBH BLASTP matches.
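The intra-genome Jaccard step can be sketched as below. This is an assumption-laden illustration rather than Sybil's code: each protein's neighbor set is taken to be itself plus the subjects of its filtered unidirectional matches, two proteins are linked when the Jaccard coefficient of their neighbor sets reaches a hypothetical cutoff of 0.6, and linked proteins are merged by single linkage. The tuple layout `(query, subject, evalue, percent_identity)` is also hypothetical.

```python
from itertools import combinations


def jaccard_clusters(proteins, hits, max_evalue=1e-5, min_identity=80.0,
                     min_jaccard=0.6):
    """Group proteins of one genome into putative in-paralog clusters."""
    # Neighbor set: the protein itself plus subjects of its passing matches.
    neighbors = {p: {p} for p in proteins}
    for query, subject, evalue, identity in hits:
        if evalue <= max_evalue and identity >= min_identity:
            neighbors[query].add(subject)

    parent = {p: p for p in proteins}

    def find(p):
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(proteins, 2):
        shared = len(neighbors[a] & neighbors[b])
        total = len(neighbors[a] | neighbors[b])
        if shared / total >= min_jaccard:
            parent[find(a)] = find(b)  # single-linkage merge

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical intra-genome BLASTP matches.
genome_proteins = ["p1", "p2", "p3", "p4"]
genome_hits = [("p1", "p2", 1e-50, 95.0), ("p2", "p1", 1e-50, 95.0),
               ("p3", "p1", 1e-3, 90.0),   # fails the E-value filter
               ("p4", "p3", 1e-20, 60.0)]  # fails the identity filter
jacs = jaccard_clusters(genome_proteins, genome_hits)
print(jacs)
```

Raising `min_jaccard` above 1.0 in this sketch links nothing, which mirrors the idea of setting the JAC parameters so that Jaccard clustering is effectively excluded.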
PanOCT uses BLASTP matches and CGN to predict orthologous clusters for pan-genomes. CGN is defined as the conservation of gene order and orientation within the genomes of closely related species. PanOCT is specifically designed for pan-genome analysis of closely related species/strains where CGN can be effectively used to separate groups of paralogs into distinct clusters of orthologs (7); it will also work on more distantly related microbial species, although CGN will be of less benefit.
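The idea of conserved gene order and orientation can be illustrated with a simple neighborhood comparison. The sketch below is not PanOCT's scoring scheme, whose details are not given in this section. It assumes linear (non-circular) contigs, genes represented as hypothetical `(family, strand)` tuples with strand +1 or -1, and a score equal to the number of flanking genes within a window that share family, side and relative orientation.

```python
def neighborhood(genome, index, window):
    """Describe flanking genes relative to the focal gene's own orientation."""
    _, focal_strand = genome[index]
    flank = set()
    for offset in range(-window, window + 1):
        position = index + offset
        if offset == 0 or not 0 <= position < len(genome):
            continue
        family, strand = genome[position]
        # Multiplying by the focal strand makes the description identical
        # whether the region is read on the forward or the reverse strand.
        side = 1 if offset * focal_strand > 0 else -1
        flank.add((side, family, strand * focal_strand))
    return flank


def cgn_score(genome_a, i, genome_b, j, window=2):
    """Count flanking genes conserved in family, side and orientation."""
    return len(neighborhood(genome_a, i, window) & neighborhood(genome_b, j, window))


# Hypothetical strains: strain A carries a recent duplicate of family "dup".
strain_a = [("f1", 1), ("f2", 1), ("dup", 1), ("f3", -1), ("f4", 1),
            ("f9", 1), ("dup", 1), ("f8", 1)]
strain_b = [("f1", 1), ("f2", 1), ("dup", 1), ("f3", -1), ("f4", 1)]
# The same region as strain B, but annotated on the opposite strand.
strain_b_reversed = [("f4", -1), ("f3", 1), ("dup", -1), ("f2", -1), ("f1", -1)]

print(cgn_score(strain_a, 2, strain_b, 2))  # copy in the conserved context
print(cgn_score(strain_a, 6, strain_b, 2))  # duplicated copy elsewhere
```

Here the two "dup" copies in strain A could have identical sequences, yet only the copy at index 2 shares its neighborhood with the strain B gene, which is the kind of evidence CGN contributes for very recent paralogs.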