Currently, there are about a thousand sequenced prokaryotic genomes in GenBank, and several thousand more are in various stages of completion. For many bacterial species, sequenced genomes from several different strains are available, opening the possibility to study pan-genomes or supra-genomes. The pan-genome of a species or genus, as opposed to the genome of a single strain, is defined as the union of all gene families found at least once in a genome within that species or genus [1
]. Studying the diversity within pan-genomes is of interest for the characterization of the species or genus. Low pan-genome diversity could be reflective of a stable environment, while bacterial species with substantial abilities to adapt to various environments would be expected to have high pan-genome diversity. Visualizing the relations between genomes within pan-genomes could also be helpful in establishing a picture of the degree of horizontal gene transfer (HGT), as well as aid in the understanding of phenotypic differences.
Diversity between genomes is often displayed in the form of trees. Over the past decade several procedures have been proposed for constructing trees from more or less whole-genome data [3
]. Many strategies have been employed, and two major approaches are sequence-based and gene-content based trees. Sequence based trees include super-trees and phylogenomic trees, and their construction is based more or less directly on sequence alignments and evolutionary distances known from classical phylogenetics [5
]. The gene content trees use as data the presence/absence of genes in the various genomes, and compute distance between genomes from such data [8
]. The pan-genome tree described here would naturally be categorized amongst the gene-content trees.
It should be noted that the vast majority of genome-trees are constructed with the ultimate goal of reconstructing evolution. As for the gene-content trees, this has the effect that a separation between orthologs and paralogs is crucial, and HGT is considered to be noise that ideally should have no impact on calculation of distances between genomes (in the case of distance based trees). There are, however, other reasons for building trees. In applied sciences like medicine or agricultural sciences, a functional relation is as important as evolutionary distance. Admittedly, a good reconstruction of evolution can be very helpful to unravel the functional relations, but discarding HGT as noise in order to present a clean view of history is clearly a mistake in this context. The pan-genome tree we describe here is intended to display, in a hierarchical tree-like structure, the functional relationship between a “snapshot” set of sequenced genomes.