The analysis was performed on 14 bacteria belonging to the Vibrionaceae family (Table ). The redundant list of Vibrionaceae ORFs was clustered to reduce the number of proteins to analyze, and the phylogenetic profile of each cluster was calculated as described in the Methods section.
List of organisms used in the analysis.
Many authors have proposed and successfully applied different measures to calculate phylogenetic profile values. Pellegrini et al. [ ] first proposed a phylogenetic profile described as a string of bits, each bit representing the absence or presence of a homologous gene in a given genome. This method lacks a weighting procedure: it gives the same weight (value 1) to all the sequences that are considered homologous under a given similarity threshold. Enault and colleagues proposed an improved phylogenetic profile based on a normalized Blastp bit score [4]. Compared with the approach implemented by Pellegrini, this method weights each point of the profile proportionally to the length and the quality of the alignment. Jingchun and colleagues optimized the phylogenetic profile method by integrating phylogenetic relationships among reference organisms with sequence homology information, based on the E-value score, to improve prediction accuracy [5].
The measure index I proposed in this work is similar to those described above: it takes into account both the quality and the length of the alignment, using a substitution matrix. In addition, our approach considers the total length of the sequences, penalizing good alignments between ORFs of different lengths, on the assumption that ORFs may differ mainly in the presence of functional domains.
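The difference between an unweighted and a weighted profile can be illustrated with a minimal sketch. It is not the index I defined in the Methods section: the function names, the threshold value, and the choice of normalizing each bit score by the query's self-alignment score are assumptions made only for illustration.

```python
def binary_profile(bit_scores, threshold):
    """Presence/absence profile: 1 if the best-hit bit score reaches the threshold."""
    return [1 if score >= threshold else 0 for score in bit_scores]


def normalized_profile(bit_scores, self_score):
    """Weighted profile: best-hit bit score divided by the query's
    self-alignment bit score (assumed normalization), capped at 1."""
    return [min(score / self_score, 1.0) for score in bit_scores]


# Hypothetical best-hit bit scores of one ORF against four genomes.
scores = [0.0, 45.0, 180.0, 360.0]

print(binary_profile(scores, threshold=50.0))        # [0, 0, 1, 1]
print(normalized_profile(scores, self_score=360.0))  # [0.0, 0.125, 0.5, 1.0]
```

In the binary profile the third and fourth genomes are indistinguishable, whereas the weighted profile keeps the information that the fourth hit is twice as strong.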
The final phylogenetic profile for each cluster, named the "meta-profile", was defined as the median of all the profiles belonging to the cluster; it describes the profile of conserved ORFs belonging to an entire family.
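As a minimal sketch of the meta-profile, the example below takes the median genome by genome over the profiles of the ORFs in one cluster. Reading "median of all the profiles" as a position-wise median is an assumption, and the function name and the values are hypothetical.

```python
from statistics import median


def meta_profile(profiles):
    """Position-wise median of the profiles of the ORFs in one cluster.

    `profiles` is a list of equally long lists, one per ORF,
    with one similarity value per reference genome.
    """
    return [median(column) for column in zip(*profiles)]


# Three hypothetical ORF profiles over three genomes.
cluster = [
    [0.9, 0.0, 0.4],
    [0.7, 0.0, 0.6],
    [0.8, 0.3, 0.5],
]

print(meta_profile(cluster))  # [0.8, 0.0, 0.5]
```

The median is less sensitive than the mean to a single ORF with an atypical hit, such as the isolated 0.3 in the second genome.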
Hierarchical cluster analysis
A hierarchical cluster analysis was performed on the entire phylogenetic profile matrix, and bootstrap-based statistical support was calculated for the nodes of the column tree (Fig ). The colors of the tree branches represent the bootstrap percentage support. This constitutes a phylogenetic tree based on gene content, using Vibrionaceae ORFs as a reference. Genomes belonging to the same taxonomic group tend to cluster together, and the Vibrionaceae species are closely related. As expected, the Vibrionaceae branch lengths show that variability within this group is higher than in the other groups. The dataset used to calculate the phylogenetic matrix is indeed composed of Vibrionaceae ORFs. This implies that the similarity measures between these ORFs and the corresponding orthologues will be nearly zero in most of the other species and significantly higher in the Vibrionaceae family, increasing the variability within this group. Moreover, the average percentage of clusters shared by the Vibrionaceae members is only 47.5% (average number of shared clusters divided by the total number of clusters), which again indicates high variability within this family. It is also interesting to note that organisms belonging to the same or closely related taxa split into different subgroups. This highlights the existence of high variability among lineages, due to genetic and evolutionary processes such as lateral gene transfer, concerted evolution and gene duplication [6]. In terms of gene content, the organisms most closely related to the Vibrionaceae belong to the gamma and beta proteobacteria. In particular, Alteromonadales are closely related to Vibrionaceae and share the highest number of gene clusters (average percentage of 20%). As expected, Archaea are the most distant group, sharing just 3.8% of clusters.
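The percentages of shared clusters quoted above (average number of shared clusters divided by the total number of clusters) can be sketched as follows. The function name and the toy matrix are hypothetical, and treating any similarity value greater than zero as "shared" is an assumption.

```python
def average_shared_percentage(matrix, columns):
    """Average percentage of clusters found in a group of genomes.

    `matrix` has one row per cluster and one column per genome;
    `columns` lists the column indices of the genomes in the group.
    A cluster counts as shared with a genome if its value is > 0.
    """
    total_clusters = len(matrix)
    shared_counts = [sum(1 for row in matrix if row[c] > 0) for c in columns]
    mean_shared = sum(shared_counts) / len(shared_counts)
    return 100.0 * mean_shared / total_clusters


# Four hypothetical clusters (rows) in three genomes (columns).
matrix = [
    [0.9, 0.0, 0.1],
    [0.8, 0.0, 0.0],
    [0.7, 0.5, 0.0],
    [0.0, 0.0, 0.0],
]

print(average_shared_percentage(matrix, columns=[0, 1]))  # 50.0
```

Here the first genome holds three of the four clusters and the second holds one, so the group average is two of four clusters.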
Figure 1 Hierarchical cluster analysis with the bootstrap resampling method, performed on the complete set of organisms (columns of the phylogenetic profile matrix). The number of genes identified in each organism (with a similarity measure greater than zero) is ...
The distribution of clusters and genes, shown in Fig , reveals that the number of clusters and genes shared by the organisms decreases as the number of organisms considered increases. The analysis was performed by counting, for each cluster profile, the number of organisms in which the cluster (and its genes) is found. The majority of gene clusters group no more than 21 species out of a total of 320. The highest blue spike corresponds to the largest number of genes, shared by 105 groups of 14 organisms. Among these groups, as expected, Vibrionaceae are highly represented. Other species represented are Colwellia psychrerythraea 34H and Shewanella oneidensis, which belong to the order Alteromonadales.
The blue line represents the number of genes, while the red line reports the number of gene clusters shared by an increasing number of genomes.
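The two curves can be thought of as histograms over the number of genomes in which each cluster is found. The sketch below is one way to compute them; the function name, the toy values and the "greater than zero" presence criterion are assumptions.

```python
from collections import Counter


def sharing_distribution(matrix, genes_per_cluster):
    """Count clusters and genes by the number of genomes sharing them.

    `matrix` has one row per cluster and one column per genome;
    `genes_per_cluster` gives the number of ORFs in each cluster.
    """
    clusters = Counter()
    genes = Counter()
    for row, n_genes in zip(matrix, genes_per_cluster):
        n_genomes = sum(1 for value in row if value > 0)
        clusters[n_genomes] += 1
        genes[n_genomes] += n_genes
    return clusters, genes


# Four hypothetical clusters in three genomes.
matrix = [
    [0.9, 0.8, 0.0],
    [0.5, 0.0, 0.0],
    [0.7, 0.6, 0.0],
    [0.4, 0.3, 0.2],
]
clusters, genes = sharing_distribution(matrix, genes_per_cluster=[3, 1, 2, 5])

print(dict(clusters))  # {2: 2, 1: 1, 3: 1}
print(dict(genes))     # {2: 5, 1: 1, 3: 5}
```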
The cluster analysis performed on genes is shown in Fig. . From now on, to avoid confusion between the clusters derived from the cluster analysis and the clusters derived from the ORF clustering, we will use the term "gene" in place of "cluster of ORFs".
Figure 3 Two-way hierarchical cluster analysis of the entire phylogenetic profile matrix (panel A). Panel B: dendrogram selection zoom of highly conserved genes shared among all the organisms; panel C: genes conserved mostly among gamma proteobacteria; panel D: ...
The color gradient, from bright to dark red, represents decreasing similarity values. The cluster analysis allows the detection of three main groups of genes. The first (Fig , panel B) contains the most conserved and established genes, shared by almost all the organisms. These core genes can be defined as the set of all genes shared as orthologues by all members of an evolutionarily coherent group. In our analysis we identified four clusters, for a total of 145 genes, shared by all 320 organisms.
The ORFs belonging to these clusters are predicted to encode the ATP-binding subunit of ABC transporters (annotated as ABC-type polar amino acid transport system, ABC-type antimicrobial peptide transport system, ABC-type histidine transport system and ABC-type transport system involved in lysophospholipase L1 biosynthesis). This finding is surprising, since this is the first report in which these ORFs are assigned to the core genes. Nevertheless, two explanations can be offered. First, ABC transporters are known to be an essential transport system in prokaryotes, and their ATP-binding subunits are apparently overrepresented compared to the other two subunits (ligand-binding and permease subunits) in all genomes sequenced thus far [7]. Second, one organism, Buchnera aphidicola, presents these genes with a similarity just below the cut-off used for the analysis; they were nevertheless considered, since it is well known that in this mutualistic endosymbiont accelerated evolution and AT bias affect all genes, including the 16S rRNA [8].
The dataset used for the analysis includes draft-quality genomes (Vibrio cholerae O395, Vibrio cholerae MO10, Vibrio cholerae RC385, Vibrio cholerae V51, Vibrio cholerae V52, Photobacterium profundum MED222, Vibrio splendidus 12B01). Incorrect ORF prediction or genes missing because of incomplete genome sequences can explain the low number of core genes identified. To avoid such problems we repeated the analysis excluding the draft genomes, thus considering 312 genomes. The results, reported in Table , show an increased number of core genes, in particular ribosomal proteins and tRNA synthetases, as reported by Charlebois and Doolittle [10]. This could be considered a sort of "minimal genome" containing the group of genes that are necessary to maintain a free-living organism.
Core genes shared considering different number of genomes. The table shows the progressive number of genes shared with increasing number of genomes.
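The effect of excluding genomes on the size of the core set can be sketched as below. The function name, the genome labels and the presence criterion (similarity greater than zero) are assumptions for illustration; the actual analysis used the cut-off described in the Methods section.

```python
def core_clusters(matrix, genome_names, exclude=()):
    """Indices of the clusters present in every genome that is not excluded.

    `matrix` has one row per cluster and one column per genome,
    in the same order as `genome_names`.
    """
    kept = [i for i, name in enumerate(genome_names) if name not in exclude]
    return [
        r for r, row in enumerate(matrix)
        if all(row[i] > 0 for i in kept)
    ]


# Hypothetical matrix: the second cluster is missing only from a draft genome.
genomes = ["genome_A", "genome_B", "draft_C"]
matrix = [
    [0.9, 0.8, 0.7],
    [0.6, 0.5, 0.0],
    [0.0, 0.4, 0.3],
]

print(core_clusters(matrix, genomes))                       # [0]
print(core_clusters(matrix, genomes, exclude={"draft_C"}))  # [0, 1]
```

A gene missing from a single incomplete assembly drops out of the core set, which is why removing draft genomes can only keep the core the same size or enlarge it.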
The low number of genes shared by all the organisms may be due to several factors. First, we used the Vibrionaceae ORFs as a reference, limiting the number of genes we were able to identify. It has also been demonstrated that the size of the core gene set decreases as more genome sequences are analyzed [10]. Genes that are considered to belong to the core set when closely related organisms are compared are classified as flexible genes when distantly related genomes are analyzed [6]. Finally, genes within core genomes might be transferred or replaced, introducing new versions of existing genes into genomes. Such transfers can replace even highly conserved genes with non-homologous counterparts, but the advantages provided are difficult to explain. It should also be taken into account that many symbiotic and parasitic bacteria undergo genome reduction, losing many genes required by free-living cells.
The second group (Fig , panel C) comprises genes shared mainly among Vibrionaceae and other gamma proteobacteria (such as Alteromonadales, Burkholderiales and Enterobacteriales).
Finally, the third group (Fig. , panel D) is composed of genes that are mainly specific to the Vibrionaceae.
k-means cluster analysis and cluster enrichment
We performed a k-means cluster analysis, setting k to 14. As shown in Fig. , clusters 3, 4, 11, 13 and 14 contain the highest percentage of genes, accounting for more than 50% of the total, while clusters 9 and 10 contain the lowest number of ORFs (3% of genes). The variance within each k-means cluster is very low (Fig. ), meaning that the clusters contain genes with compact and similar profiles. As described in Fig. , the majority of the clusters (1, 2, 3, 4, 5, 6, 7, 9, 12, 13) contain genes with a similar profile, with average values (red line) near zero except for some spikes corresponding to increased similarity with a few isolated organisms. As shown in Table , clusters 1, 3, 4, 5 and 9 contain genes that have a high similarity in a small subset of organisms. The majority of these ORFs are annotated as hypothetical proteins or phage-related proteins. Clusters 8, 10 and 14 contain genes shared among almost all the organisms. In particular, cluster 10 is composed of the core genes described above, with high similarity values across all the organisms; cluster 8 contains genes shared mainly by gamma proteobacteria; and cluster 14 is composed of genes in common between Vibrionaceae and Enterobacteriaceae.
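For readers unfamiliar with the procedure, the following is a minimal sketch of k-means (Lloyd's algorithm) on gene profiles, together with the within-cluster variance used to judge how compact each cluster is. The software, distance measure and initialization used in the analysis are not stated here, so the Euclidean distance, the fixed initial centroids and the toy data (k = 2 rather than 14) are all assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, initial_centroids, max_iter=100):
    """Lloyd's algorithm with user-supplied starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in profiles
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def within_cluster_variance(profiles, labels, centroids):
    """Mean squared distance of the members to their centroid, per cluster."""
    result = []
    for j, centroid in enumerate(centroids):
        members = [p for p, lab in zip(profiles, labels) if lab == j]
        total = sum(squared_distance(p, centroid) for p in members)
        result.append(total / len(members) if members else 0.0)
    return result


# Two conserved-looking and two near-zero hypothetical profiles.
profiles = [[0.9, 0.8, 0.0], [1.0, 0.9, 0.1], [0.0, 0.1, 0.0], [0.1, 0.0, 0.0]]
labels, centroids = kmeans(profiles, initial_centroids=[profiles[0], profiles[2]])

print(labels)  # [0, 0, 1, 1]
print(within_cluster_variance(profiles, labels, centroids))
```

A low variance for a cluster means that its member profiles lie close to the average profile.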
Figure 4 Phylogenetic profiles of all the 14 clusters identified by k-means analysis. Gray lines represent profile patterns (arrays of similarity measure) of genes in the clusters. Red line identifies cluster average profiles. Cluster name and number of genes ...
Clusters of genes mainly shared by the Vibrionaceae family. The second column reports the organisms representative of each cluster (column 1), with the median and standard deviation of their similarity profiles.
A functional annotation was performed on each gene cluster using the COG (Clusters of Orthologous Groups), KEGG pathway map and GO databases. For each k-means cluster, the enrichment probability with respect to the total number of clusters was obtained from the hypergeometric distribution.
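A hypergeometric enrichment test of this kind can be sketched as follows. The use of the upper tail, P(X >= k), the parameter names and the example numbers are assumptions; no multiple-testing correction is shown.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, drawn, observed):
    """Upper-tail hypergeometric probability P(X >= observed).

    population: total number of genes considered
    annotated:  genes in the population carrying the category
    drawn:      genes in the k-means cluster being tested
    observed:   genes in that cluster carrying the category
    """
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, drawn)


# Hypothetical case: 10 genes in total, 4 in the category;
# a cluster of 3 genes contains 2 of them.
p = hypergeom_enrichment_p(population=10, annotated=4, drawn=3, observed=2)
print(round(p, 4))  # 0.3333
```

In this toy case the probability is (6 x 6 + 4 x 1) / 120 = 1/3, so two annotated genes out of three would not be regarded as enriched.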
Fig. shows COG enrichment results for each cluster. As expected, the clusters represented by conserved genes (clusters 8, 10 and 14) have the highest number of enriched COG codes, while clusters specific to a few organisms are characterized by a small number of enriched COGs. The majority of clusters show COG code enrichment for the S (function unknown), R (poorly characterized) and – (absence of COG code) categories. This is due to the large abundance of unknown and hypothetical proteins present in the Vibrionaceae proteomes.
COG, KEGG and GO categories enrichment across the 14 k-means clusters. Panel A: Presence (coded 1) or absence (coded 0) of enriched COG categories for each cluster. Panel B: numbers of KEGG and GO enriched classes for each cluster.
It is worth noting that cluster 3, mainly represented by Photobacterium profundum SS9 ORFs, is enriched only in C (Energy production and conversion), L (DNA replication, recombination and repair) and M (Cell envelope biogenesis, outer membrane). The overrepresentation of the L class is probably determined by the high number of transposons present in the SS9 genome [11]. The role played by these transposable elements in the survival of this deep-sea bacterium is still a matter of debate [12].
In addition, V. vulnificus YJ016 and V. vulnificus CMCP6 (cluster 13) seem to share genes belonging to the enriched COG classes K (Transcription), L and T (Signal transduction mechanisms). An enrichment in genes belonging to the transcription class was previously reported in the genome of V. vulnificus with respect to V. cholerae [ ]. This class is clearly related to the T class and seems to indicate that this organism is able to receive specific environmental signals and translate them into a transcriptional response. Despite this, the large majority of the genes in clusters 3 and 13 lack COG annotation.
Cluster 7, as shown in Table , includes organisms with a large genome size (see Table ). This can explain the fact that this cluster is enriched in almost all the COG classes, and suggests a more complex and flexible lifestyle of these organisms compared to the other Vibrionaceae members.
KEGG annotation is limited to metabolic or structural complex networks, so only a reduced number of genes have a KEGG entry. As a result, some clusters have no enriched map (clusters 2–5, 7 and 13; see Fig ). Here too, the clusters with the highest number of significant KEGG maps are those containing the conserved genes. The most enriched clusters are 14, 10 and 8, which account for the majority of the metabolic pathways. Cluster 1 is enriched for map3080 (type IV secretion system). Indeed, the V. fischeri genome contains 10 separate pilus gene clusters, including eight type-IV pilus loci. The presence of multiple pilus gene clusters suggests that different pili may be expressed in different environments or during multiple stages of its development as a symbiont [14].
Cluster 11 is enriched for map3090 (type II secretion system). The type II pathway is conserved among Gram-negative bacteria, including many pathogens, and secretes a variety of virulence factors and degradative enzymes [15].
Cluster 9 is enriched for map00860 (Porphyrin and chlorophyll metabolism). These genes are involved in the cobalamin (coenzyme B12) biosynthetic pathway [16]. Some organisms, such as Salmonella typhimurium and Klebsiella pneumoniae, can synthesize cobalamin de novo [ ], while E. coli and a large part of the Vibrionaceae perform cobalamin biosynthesis only when provided with cobinamide. It is interesting to observe that the genes belonging to the de novo pathway are shared only by Archaea, some other organisms like Salmonella
Finally, cluster 6 is enriched for map2010 (ABC transporter), map2020 (two-component system), map2030 and map2031 (bacterial chemotaxis), map2040 (Flagellar assembly) and map3090 (type II secretion system). This cluster contains genes shared with high similarity by all Vibrio species and with lower similarity by the Photobacterium profundum strains. Among the Vibrio species, the organisms showing the highest similarity (Tab. ) are the V. cholerae strains.
Vibrionaceae specific genes
We identified 1940 clusters specific to the Vibrionaceae. All the Vibrionaceae considered in the analysis share 108 clusters. Among these genes we identified the ToxR and ToxS genes. The ToxR gene encodes a transmembrane regulatory protein first identified in V. cholerae, in which it coordinates many virulence factors in response to several environmental parameters [18]. V. cholerae ToxR activity is enhanced by a second transmembrane protein, ToxS, encoded downstream of tox [ ]. This family of proteins is involved in the response to temperature, pH and osmolarity and, in the piezophilic bacterium Photobacterium profundum SS9, to hydrostatic pressure [20]. The widespread presence of these genes among the Vibrionaceae suggests their importance in regulatory mechanisms.
We identified two other noteworthy groups, composed of 257 and 160 genes respectively, each shared by just two strains and mainly annotated as "hypothetical protein". The first group is shared between Photobacterium profundum SS9 and Photobacterium profundum 3TCK, while the second is shared between V. vulnificus CMCP6 and YJ016. These strains are closely related, which explains the high number of shared genes; across the Vibrionaceae family as a whole, by contrast, the number of specific shared genes decreases sharply, showing high inter-species variability (Fig. ).
Figure 6 Number of gene clusters identified only in the Vibrionaceae family; the number of Vibrionaceae genomes is reported on the x axis and the number of shared genes on the y axis. In the first histogram, for example, there are 11 groups of Vibrios ...
Prophages and transposases
Prophages play different biological roles, both as quantitatively important genetic elements of the bacterial chromosome and as vectors of lateral gene transfer among bacteria, owing to their nature as mobile DNA elements. Indeed, numerous virulence factors from bacterial pathogens are phage encoded. It has been postulated that this role of prophages is not limited to pathogenic bacteria, and that some adaptations of nonpathogenic strains to their ecological niche might also be mediated by prophage acquisition [21].
To better understand the importance of mobile elements within the Vibrionaceae family, we performed a hierarchical cluster analysis using the profiles of genes annotated as "phage protein" and "transposase", for a total of 172 clusters of genes (Fig ). We found high inter-strain genetic variability: some phages and transposases are shared by almost all Vibrionaceae, while others are specific to just a few organisms. We identified five major clusters of mobile elements that are specific to a single organism. A group composed of 26 clusters containing both transposase and phage proteins seems to be unique to V. splendidus 12B01 (Fig ). Another, composed of 16 clusters, is specific to V. vulnificus CMCP6 (Fig ), while V. parahaemolyticus has a cluster of 11 genes (Fig ). Moreover, there is another group of transposase and phage genes shared mainly by V. cholerae O395, Shewanella oneidensis and V. cholerae V51 (Fig ). Finally, a large cluster of almost 30 genes, all predicted to encode transposases, was found in the P. profundum SS9 genome (Fig. ). The abundance of transposases in this bacterium seems to correlate with its deep-sea habitat, a feature presumably shared with other deep-sea microorganisms [12]. As shown in Fig , many of the clusters that are well conserved in one organism are partially shared, with low similarity, by other organisms. This agrees with the idea that prophages are not maintained in the genome over long periods of time and that part of their genes may be deleted from the chromosome. Moreover, microarray analysis and PCR scanning have demonstrated that prophages are frequently strain specific within a given bacterial species [22]. According to the modular theory of phage evolution, phage genomes are mosaics of modules, groups of functionally related genes that are free to recombine in genetic exchanges between distinct phages infecting the same cell [21]. This can result in different parts of a phage occurring in distantly related genomes. The phylogenetic profiles of some transposases are similar to those of phages, suggesting a possible phage-mediated transfer mechanism for such mobile elements.
Two-way hierarchical cluster analysis performed on prophage and transposase proteins. The blue bars highlight the most interesting clusters of genes, such as the Vibrio cholerae CTX prophage.