An Overview of HPKs across Bacteria
The fraction of HPKs coded by a given genome is known to scale roughly with the size of the genome, as shown in (an even better correlation is seen if all signaling proteins in a genome are considered [2]). We wanted to identify and investigate genomes with particularly high numbers of HPKs to see if we could determine the origin of those HPKs. Did these genomes duplicate existing HPKs, acquire large numbers through HGT, or simply lose fewer “old” HPKs than other genomes? Several different types of genomes were chosen as examples for more in-depth study throughout this manuscript (details for all are available in Dataset S1). First, we chose organisms in which more than 1.5% of the genome codes for HPKs (red squares in ). We also targeted two genomes that had the largest numbers of genes acquired by HGT, Ralstonia solanacearum and Pseudomonas syringae (blue triangles in ), and one genome in which nearly every new HPK gene was acquired through LSE, Streptomyces coelicolor (pink diamond in ). We chose Bradyrhizobium japonicum (turquoise diamond in ) because it includes large numbers of new HPKs acquired through both HGT and LSE, so we could compare these two processes in a single genome. Finally, we included the model organisms Bacillus subtilis and Escherichia coli, in which HPKs are the best studied experimentally (green circles in ).
Different Species Rely on Different Mechanisms for Acquiring New HPKs
summarizes the quantitative results of our phylogenetic analysis across all bacteria (A) and individually for each of our targeted genomes (B). New HPKs are common across the bacteria; however, different genomes encode different numbers of new genes. Bacteria in the δ- and
-groups of the proteobacteria contain particularly high numbers of recently acquired HPKs.
The number of new HPKs arising through HGT or LSE is quite variable across different phylogenetic groups, as shown in . In some genomes, such as E. coli and R. solanacearum, recent gene duplications are rare, and HGT accounts for nearly all of the recently acquired HPKs. In others, such as D. vulgaris and Geobacter sulfurreducens, LSE accounts for the majority of recently acquired HPKs. Streptomyces spp. are known for their propensity for gene duplication [11], and their new HPKs result almost exclusively from LSE. The mechanism of gene duplication in S. coelicolor is qualitatively unlike that of other genomes in this study; this point is discussed in greater detail in the following sections.
It is not obvious why different genomes favor HGT or LSE as a means of acquiring new signaling proteins, but we did find that genomes with unusually large numbers of HPKs relative to their genome size tend to have accumulated those HPKs via LSE. The fraction of HPKs in a genome involved in recent LSE correlates strongly with the total number of HPKs in that genome (ordinary least squares linear regression: r = 0.74, p < 10^-15), while the fraction involved in a recent HGT event does not (r < 0.1, p = 0.93). In fact, all of the genomes that devote at least 1.5% of their genes to encoding HPKs (Nostoc sp. PCC 7120, Geobacter spp., Desulfovibrio spp., Wolinella, and Dechloromonas), which are highlighted in red in , have major LSEs.
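The kind of correlation reported above can be outlined in a few lines of Python. This is a minimal sketch rather than the analysis code used in this study: the function names and the toy input values are hypothetical, and no p-value is computed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def ols_fit(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy values for illustration only (NOT data from this study):
# total HPK count per genome and the fraction of those HPKs assigned to LSE.
total_hpks = [12, 30, 45, 80, 130]
lse_fraction = [0.05, 0.10, 0.22, 0.35, 0.50]

r = pearson_r(total_hpks, lse_fraction)
slope, intercept = ols_fit(total_hpks, lse_fraction)
```

The same two functions would be applied to the HGT fraction to obtain the second comparison.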
In addition, while all of these genomes (excluding Nostoc) are dissimilatory sulfur- or sulfate-reducing bacteria, and many are more closely related δ-proteobacteria, they do not necessarily contain the same expansions. For example, the Geobacter lineage contains a large expansion in “type 3” HPKs (using the standard nomenclature defined in [12]), while the Desulfovibrio lineage contains an expansion in type 4 HPKs. A further expansion of the “hybrid” type 1b family (including both histidine kinase and RR domains in the same protein) is seen only in D. vulgaris, and not in its close relative D. alaskensis G20 (also known as D. desulfuricans G20). Thus, while the propensity for gene duplication may be an inherited trait among these broadly related δ-proteobacteria, the major expansions in each organism are not necessarily shared.
LSE Disrupts HPK–RR Operon Structure
Compared to new HPKs acquired through HGT, HPKs resulting from LSE are less likely to have coevolved with their cognate RRs in a single duplication event. shows the distance of new HPKs to response regulators. The data shown do not include “hybrid” HPKs (HPK and RR domains in the same polypeptide chain), which can bias the analysis both because of their apparent propensity for LSE and because, by definition, an RR is already present in the same gene. Averaged over all genomes or taken individually for particular genomes, the trend is clear: LSE genes are much more likely to be present as “orphans,” separated from their cognate RRs in the genome. S. coelicolor is an unusual exception to this trend, as it has high numbers of RRs in the immediate proximity of duplicated HPKs. To confirm that operons were the most likely explanation for this genomic proximity between HPKs and RRs, we also compared these different classes of HPKs to operon predictions that have been validated across a wide range of species [13], and observed the same trend: 77% of HGT HPKs had a co-operonic RR, compared to 69% for “old” HPKs, and only 42% for LSEs.
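To make the notion of an “orphan” concrete, the following Python sketch labels each HPK by its distance to the nearest RR. It rests on several assumptions that are not taken from this study: genes are represented by their ordinal index along a single replicon, the chromosome is treated as linear, and the `max_gap` threshold of two genes is a hypothetical choice. It is a proximity heuristic only and does not reproduce the operon predictions of [13].

```python
def nearest_rr_distance(hpk_index, rr_indices):
    """Distance, in genes, from one HPK to the closest RR (None if no RRs)."""
    if not rr_indices:
        return None
    return min(abs(hpk_index - rr) for rr in rr_indices)


def classify_hpks(hpk_indices, rr_indices, max_gap=2):
    """Label each HPK 'paired' or 'orphan' by proximity to an RR.

    max_gap is a hypothetical threshold: an HPK counts as paired when an RR
    lies within max_gap genes of it on the same (linear) replicon.
    """
    labels = {}
    for hpk in hpk_indices:
        distance = nearest_rr_distance(hpk, rr_indices)
        is_paired = distance is not None and distance <= max_gap
        labels[hpk] = "paired" if is_paired else "orphan"
    return labels


def orphan_fraction(labels):
    """Fraction of classified HPKs that were labeled as orphans."""
    if not labels:
        raise ValueError("no HPKs to summarize")
    return sum(1 for v in labels.values() if v == "orphan") / len(labels)


# Hypothetical gene indices, for illustration only.
example = classify_hpks(hpk_indices=[5, 40, 300], rr_indices=[6, 100])
```

Computing `orphan_fraction` separately for HGT, LSE, and “old” HPKs would give per-class numbers comparable in spirit to the percentages quoted above.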
Proximity of Different Classes of HPKs to RRs
This separation between HPK and RR evolutionary events suggests that these novel LSEs may be more likely to engage in crosstalk. This is certainly the case for the sole LSE in B. subtilis,
which is made up of the kin
regulators of sporulation. KinA–E
are thought to integrate signals into a common downstream target based on their approximately equal affinity for the regulator Spo0F
]. By contrast, the sole recent duplication in E. coli,
resulting in the NarQ/NarX
genes, avoids crosstalk as each HPK ties into a distinct regulator (NarP
]. A recent study by Laub and coworkers in Caulobacter crescentus
also found little evidence for physiologically relevant crosstalk among HPKs [18
]. If crosstalk does not play a large role in general, we would expect to see that the number of “orphan” RRs (not in proximity to a HPK) would generally correlate with the number of “orphan” HPKs (not in proximity to a RR). shows that this trend largely holds across the species examined, though many species show large deviations. We suspect that while some crosstalk may indeed occur, the results from Laub and coworkers are likely to apply to some extent even across species with large numbers of duplications. Experimental work in these species will be necessary to answer this important question.
Coevolution of Orphan HPKs and RRs
In some cases, we observed that one or a small number of HPKs in an expansion are positioned in operons with RRs. Although beyond the scope of this study, an interesting hypothesis is that these HPKs may be the progenitors of the expansions. For example, NarQ
is co-operonic with NarP,
while its duplicate NarX
is transcribed separately from its cognate regulator, NarL.
We also observed that the small numbers of HGT genes in genomes with large LSEs are likely to have cognate RRs nearby. This may not only reflect the fact that HPKs are likely to transfer into a genome with their cognate regulators, but also that those HPKs near their cognate regulators make better candidates for transfer out of a genome than their paralogous copies. Indeed, we recently reported a relationship between operons and HGT [5
]. We found that nearly 50% of new HGT genes in E. coli
were acquired with another gene as part of a horizontally transferred operon.
Domain Shuffling Often Accompanies LSE
HPKs generated by LSE also display more novel variation in their (usually N-terminal) sensory domains than those acquired horizontally. Across all genomes, 47.4% of horizontally transferred HPKs retain a set of upstream signaling domains identical (in both domain type and linear order) to their inferred HGT partner, whereas only 29.1% of recent duplications retain the same domain structure as their closest paralog. In fact, for expansions that include five or more proteins, only 19.9% of closest paralogs had an identical set of upstream domains. shows results for individual genomes. The fraction of HGT genes with conserved upstream domains is shown for those genomes rich in HGT events, and the fraction of LSE genes with conserved upstream domains is shown for those genomes rich in LSE. For B. japonicum, which contains a mixture of both types of new genes, both numbers are shown. As a control, we also considered a more stringent definition of HGT requiring genes to be absent from three consecutive outgroups. Using this more stringent definition, 47.3% of horizontally transferred HPKs were found to retain an identical series of upstream signaling domains, which is nearly identical to the 47.4% obtained from the less stringent definition. In addition, we considered the possibility that horizontally transferred HPKs might tend not to include any additional signaling domains, and might therefore be trivially identical. We found that only ten of our 420 HGT genes lacked any signaling domains, supporting our original conclusions.
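The comparison of upstream domains (identical in both domain type and linear order) can be expressed as a simple tuple equality. The sketch below is illustrative only: it assumes each protein is given as a hyphen-separated architecture string ending in an `HPK` domain, and the helper names are hypothetical. The two example strings are architectures mentioned elsewhere in this manuscript.

```python
def parse_architecture(text):
    """Split a hyphen-separated architecture such as 'TM-PAS-HPK' into a list."""
    return text.split("-")


def upstream_domains(architecture, kinase="HPK"):
    """Domains preceding the first kinase domain, in linear order."""
    if kinase not in architecture:
        raise ValueError("architecture has no kinase domain")
    return tuple(architecture[: architecture.index(kinase)])


def fraction_conserved(pairs):
    """Fraction of (gene, partner) pairs with identical upstream domains.

    Identity requires the same domain types in the same order, so
    TM-PAS and PAS-TM are counted as different.
    """
    if not pairs:
        raise ValueError("no pairs to compare")
    same = sum(1 for a, b in pairs if upstream_domains(a) == upstream_domains(b))
    return same / len(pairs)


arch_1 = parse_architecture("TM-TM-HAMP-PAS-HPK")
arch_2 = parse_architecture("TM-PAS-PAS-PAS-HPK")

# One conserved pair and one shuffled pair (toy pairing, not real data).
toy_fraction = fraction_conserved([(arch_1, arch_1), (arch_1, arch_2)])
```

For HGT genes the partner would be the inferred transfer partner, and for LSE genes the closest paralog, as described above.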
Extent of Domain Shuffling in Different Classes of HPKs
These results are particularly striking since the horizontally transferred HPK domains are on average less similar (lower BLASTp sequence identity) than paralogous domains. These results are also encouraging because our evolutionary inference methods are based only on the similarity of the histidine kinase domain of each HPK, and the high rate of similarity of these upstream signaling domains between putative HGT partners supports the accuracy of our approach. In these results, we considered genes derived from an HGT event followed by a duplication event in the totals for duplicates, but not when computing the totals for HGT, as it is not possible using our method to determine which of the resulting paralogs is more likely to have retained the ancestral state of signaling domains.
A notable outlier in is worth mentioning: S. coelicolor contains the largest fraction of new genes acquired by LSE of all the genomes we studied, yet a large fraction of these genes contain an identical set of upstream signaling domains. In addition, as reported in a previous section, LSEs in this species tend to involve duplications that preserve HPK–RR pairings. These qualitative differences may reflect an enhanced capability of this genome to duplicate regions of its linear chromosome, a process that has been proposed previously based on genome sequence analysis [11].
Taken together, the results presented in this section suggest different roles for HGT and LSE in HPK evolution. While both mechanisms contribute to the diversity of signaling systems, LSE is accompanied by rearrangements in domain structure as well as by independent evolution of HPK and RR genes. By contrast, organisms such as B. subtilis, E. coli, and P. syringae appear to acquire new HPKs via horizontal transfer of intact two-component systems. These consumers of preexisting genetic diversity are less likely to contain completely novel domain structure, and are more likely to include HPKs in proximity to their cognate RR. Individual genomes appear to have very different preferences for HGT or LSE. LSE is the dominant force in species that are the most highly regulated (those with the highest proportion of genes coding for HPKs), whereas HGT appears to be dominant, for example, in the well-studied model systems E. coli and B. subtilis.
Anatomy of an LSE
To better understand the structure of an LSE, we investigated a single expansion in the two sequenced Desulfovibrio species. Several striking features are present in the expansion depicted in . First, the diversity in the upstream signaling domains is obvious in the nonorthologous pairs of HPKs (some likely orthologs between D. vulgaris and D. alaskensis are shown, and have a similar set of upstream domains). Second, there was likely a HGT event between Desulfovibrio and Pseudomonas (probable orthologs from three Pseudomonas species are shown in the tree), which conserved an upstream domain structure (TM-TM-HAMP-PAS-HPK). This domain structure is identical between the Pseudomonas species and one of the members of the Desulfovibrio expansion, which we postulate served as the donor or acceptor. Third, many of the upstream signaling regions contain repeated domains, but only some of these are noticeably more similar in sequence than other pairs. Thus, rearrangements involve domains that are acquired from distant sources or domains that have been subject to more rapid evolution than HPK domains. Finally, there appears to be a mixture of proteins with and without predicted transmembrane domains, implying that the same basic architecture can support both kinds of signaling mechanisms. Further domain shuffling may also be happening at the level of the extracellular sensory regions not detected by our sequence profiles.
Domain Shuffling in a Desulfovibrio spp. Expansion
A close inspection of reveals a pattern in the signaling domain architecture of this expansion: every gene observed has a PAS domain immediately upstream of the HPK domain. We further found that this domain is not only conserved in its placement relative to the HPK domain, but is also highly conserved at the sequence level in most of the genes in this family, with clear sequence homology detectable even in the Pseudomonas species. This implies that domain architecture is not completely plastic. Instead, there appear to be “rules” for constructing new functional paralogs, and certain domains may be necessary to preserve optimal activity. Similarly, several expansions in Nostoc consist of a conserved set of core domains preceded by a variable upstream (N-terminal) region. Some other expansions we studied did not display such obvious patterns of domain architecture. The roles of these conserved and nonconserved domains and their mechanisms of interaction remain key open questions.
New Functional Roles for Recently Duplicated Paralogs
LSEs contain a diversity of upstream signaling domains, suggesting that they might respond to different environmental signals. To test this hypothesis, we analyzed microarray data collected for D. vulgaris under a variety of stress-response conditions to determine whether paralogs had similar expression patterns. Surprisingly, reveals no detectable similarity in gene expression patterns among close paralogs, nor overall similarity within the two Desulfovibrio-specific clusters of HPKs. The correlations of gene expression profiles for closest paralogs are not significantly different from those observed between random pairs of genes, as measured by the Student's t or Kolmogorov-Smirnov tests (as implemented in the R statistical computing package; http://www.r-project.org). As a control, HPKs and their cognate RRs (predicted based on genomic proximity) are strongly correlated within this same dataset (see Figure S1).
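The comparison between paralog pairs and random gene pairs can be sketched as follows. This is an illustrative Python outline, not the R analysis used here: the gene identifiers and expression values are hypothetical, the random-pair sampler and its seed are assumptions, and only the two-sample Kolmogorov-Smirnov statistic is computed (no p-value and no t test).

```python
import math
import random


def profile_correlation(xs, ys):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pair_correlations(profiles, pairs):
    """Correlation of expression profiles for each (gene_a, gene_b) pair."""
    return [profile_correlation(profiles[a], profiles[b]) for a, b in pairs]


def random_pairs(gene_ids, n_pairs, seed=0):
    """Draw n_pairs random pairs of distinct genes (seeded for repeatability)."""
    rng = random.Random(seed)
    genes = sorted(gene_ids)
    return [tuple(rng.sample(genes, 2)) for _ in range(n_pairs)]


def ks_two_sample_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Hypothetical log-ratio profiles across four conditions (toy values).
profiles = {
    "hpkA": [1.0, 2.0, 3.0, 4.0],
    "hpkB": [2.0, 4.0, 6.0, 8.0],
    "hpkC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [0.5, 0.1, 0.9, 0.3],
}
paralog_r = pair_correlations(profiles, [("hpkA", "hpkB"), ("hpkA", "hpkC")])
background_r = pair_correlations(profiles, random_pairs(profiles, 5))
d_stat = ks_two_sample_statistic(paralog_r, background_r)
```

A small value of `d_stat` would indicate that paralog correlations are distributed much like those of random pairs; assessing significance requires a proper test such as `ks.test` in R.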
Gene Expression of D. vulgaris Expansions
The difference in gene expression patterns and the domain shuffling both support the idea that these new paralogs have adopted new functional roles within the cell. It is not within the scope of this work to determine the environmental stimuli to which each HPK responds, yet some idea of the variety of possible direct or indirect signals can be inferred from . For example, in cluster 1, a paralog with domain structure TM-PAS-PAS-PAS-HPK responds strongly to heat shock, and (to a lesser extent) nitrite stress, while a close paralog with domain structure TM-TM-HAMP-PAS-HPK responds most strongly to salt stress.
It is important to note that gene expression is an imperfect measure of function. Moreover, many HPKs may be expressed constitutively and regulated mainly at the level of phosphorylation. Nonetheless, we observe some clear cases in which expression is either upregulated or downregulated, and those trends are not generally conserved within these phylogenetic clusters. In some sense, signaling genes that are expressed under different sets of conditions could be considered to have different functions even if they regulated overlapping sets of genes. We feel that the general lack of coexpression, combined with the diversity of newly evolved signaling domain architectures, makes a strong case for new functional roles.
Genomic Distribution of HPK Families
We looked at the distribution of paralogs within each genome to see if we could infer any information regarding the mechanism of gene family expansion. In B. subtilis, for example, all five of the kin genes are contained within a small region of the chromosome, with four of them very tightly spaced (the LSE or purple-colored genes in ). In general, however, we observed very little clustering of genes within genomes. To be more rigorous, we constructed a simple statistical test to measure clustering of new HPKs in a genome. We computed the distribution of nearest-neighbor distances between HPKs arising from LSE, and compared this with the distribution expected by chance (approximated by an exponential distribution with mean = [number of genes in genome] / [number of recent LSEs]). We then used the Kolmogorov-Smirnov test to determine whether the two distributions were significantly different. Of the genomes we classified as having large numbers of LSEs, only Nostoc showed significant clustering. When we examined this result further, we identified the source of the clustering: sets of two adjacent HPKs, which likely work together to relay signals. The first gene in each of these pairs contains a wide variety of largely shuffled signaling domains, while the downstream gene contains a conserved HPK domain followed by a CheY-type regulator domain. After correcting for this by counting each adjacent pair as a single duplication, we observed no clustering among LSE genes in Nostoc. shows an overview of genomic positions of HPKs and RRs across several species, none of which (apart from the kin locus of B. subtilis) appears to have significant clustering. Thus, the duplication of HPKs appears qualitatively different from the duplication of signaling domains within the N-terminal region of individual HPKs, as the latter often occur in long tandem stretches.
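The clustering test described above can be outlined in Python. This sketch makes assumptions that go beyond the text: gene positions are ordinal gene indices, the chromosome is treated as linear rather than circular, and only the one-sample Kolmogorov-Smirnov statistic is returned (no p-value). The function names and example positions are hypothetical.

```python
import math


def nearest_neighbor_distances(positions):
    """For each gene, the distance (in genes) to its closest neighbor in the set."""
    pos = sorted(positions)
    if len(pos) < 2:
        raise ValueError("need at least two positions")
    distances = []
    for i, p in enumerate(pos):
        left = p - pos[i - 1] if i > 0 else math.inf
        right = pos[i + 1] - p if i < len(pos) - 1 else math.inf
        distances.append(min(left, right))
    return distances


def ks_statistic_vs_exponential(sample, mean):
    """One-sample KS statistic against an exponential distribution with this mean."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-x / mean)
        # Compare the model CDF with the empirical CDF just before and at x.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d


def clustering_statistic(lse_positions, n_genes):
    """KS statistic for LSE nearest-neighbor distances vs. the chance expectation.

    The expected mean follows the text: genes in genome / number of recent LSEs.
    """
    expected_mean = n_genes / len(lse_positions)
    distances = nearest_neighbor_distances(lse_positions)
    return ks_statistic_vs_exponential(distances, expected_mean)


# Hypothetical positions of LSE-derived HPKs in a 4,000-gene genome.
d_clustered = clustering_statistic([100, 101, 102, 103, 104], n_genes=4000)
d_spread = clustering_statistic([200, 1100, 1900, 2700, 3600], n_genes=4000)
```

Counting each adjacent HPK pair as a single duplication, as was done for Nostoc, corresponds to collapsing such pairs to one position before calling `clustering_statistic`.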
Genomic Distribution of HPKs and RRs
Timing of Evolutionary Events
Because our phylogenetic inference procedure identifies LSE and HGT events associated with a particular outgroup, we can trace the influx of HPKs into each lineage as a function of time. shows the number of HPKs predicted to have entered several lineages as a function of time (distance to divergence of outgroup). While different species here show different overall trends (some such as P. syringae gradually accumulated HPKs, while some such as Nostoc acquired most of their HPKs very recently), the species-averaged plot shows a steady influx of HPKs at a nearly constant rate back until about our phylogenetic cutoff distance of 1.0 (where HGT tends to saturate since it requires absence from at least two outgroups predating the transfer). Moreover, both HGT and LSE seem to be contributing at similar levels to the total number of HPKs, and both accumulate at about the same rate. It is important to note that the resolution of these figures depends directly on the number of sequenced bacterial groups at different levels of divergence from each genome, and caution should be used when comparing our evolutionary distances across distant taxa as differences in evolutionary rate were not rigorously modeled in this analysis. As more genome sequences become available, it will be possible to resolve the timing of these events with higher resolution, and even to measure turnover rates for HPKs.
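The accumulation curves described above amount to counting, for each distance cutoff, how many acquisition events fall at or below it. A minimal Python sketch follows; the event distances and cutoffs are hypothetical toy values, and the function name is not taken from this study.

```python
from bisect import bisect_right


def cumulative_influx(event_distances, cutoffs):
    """Number of acquisition events at or below each distance cutoff."""
    ordered = sorted(event_distances)
    return [bisect_right(ordered, cutoff) for cutoff in cutoffs]


# Hypothetical outgroup-divergence distances for HGT and LSE events
# in one lineage (toy values, for illustration only).
hgt_events = [0.1, 0.4, 0.4, 0.9]
lse_events = [0.05, 0.2, 0.6]
cutoffs = [0.0, 0.25, 0.5, 0.75, 1.0]

hgt_curve = cumulative_influx(hgt_events, cutoffs)
lse_curve = cumulative_influx(lse_events, cutoffs)
```

Plotting `hgt_curve` and `lse_curve` against `cutoffs` gives one curve per mechanism; averaging such curves over genomes would give a species-averaged view.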