Comparative analysis of the complete genome sequences from a variety of poorly studied organisms aims at predicting ecological and behavioral properties of these organisms and helping to characterize their habitats. This task requires finding appropriate descriptors that could be correlated with the core traits of each system and would allow meaningful comparisons. Using relatively simple bacterial models, first attempts have been made to introduce suitable metrics to describe the complexity of an organism’s signaling machinery, including the “bacterial IQ” score. Here, we use an updated census of prokaryotic signal transduction systems to improve this parameter and evaluate its consistency within selected bacterial phyla. We also introduce a more elaborate descriptor, a set of profiles of relative abundance of members of each family of signal transduction proteins encoded in each genome. We show that these family profiles are well conserved within each genus and are often consistent within families of bacteria. Thus, they reflect evolutionary relationships between organisms as well as individual adaptations of each organism to its specific ecological niche.
All organisms adapt to their environment by perceiving environmental signals and accordingly modifying their behavior and/or metabolism. The recent progress in genome and metagenome sequencing revealed that organismal (primarily microbial) diversity in natural habitats far exceeds most previous estimates. The influx of a large number of complete genomes from environmental microorganisms inhabiting a variety of ecological niches requires new approaches to deal with this previously hidden world1. Several years ago, we conducted a census of signal transduction systems encoded in genomes of 167 prokaryotic species that provided sufficient background data to compare microorganisms in terms of their ability to adjust to environmental changes2. It showed that most proteins involved in signal transduction – sensor histidine kinases, methyl-accepting chemotaxis proteins, adenylate cyclases, diguanylate cyclases, c-di-GMP-specific phosphodiesterases, and Ser/Thr protein kinases – are unevenly distributed even in closely related microorganisms. While the total number of signaling proteins encoded in any given genome correlated well with the genome size, the contribution of each individual class of signaling proteins varied greatly between organisms belonging to different phylogenetic lineages. These data were consistent with the observation that different microorganisms use different signal transduction pathways to respond to the same environmental challenges3 and supported the idea that these pathways were somewhat interchangeable4,5. As an example, environmental modulation of the cyclic diguanylate (c-di-GMP) synthesis can occur through a dedicated environmental sensor with a GGDEF output domain (e.g. AdrA6), through a sensor histidine kinase with a GGDEF-containing response regulator (e.g. PleD7), or through a methyl-accepting chemotaxis sensor protein, from which the signal is transmitted to the chemotaxis-specific histidine kinase CheA and then to a GGDEF-containing response regulator (e.g. WspR8). This, in turn, meant that the sheer complexity of the signal transduction system, manifested in the total number of the interacting components, was an important parameter reflecting the ability of the host organism to adapt to the environmental challenges.
To account for the complexity of signal transduction even in the most primitive bacterial cells, two new parameters have been introduced. The signal adaptation index, commonly referred to as “bacterial IQ”, reflects the abundance of signal transduction components encoded in a given organism as compared to others of a similar genome size2. The “degree of extrovertness” reflects the fraction of the outward-looking environmental sensors among all signal transduction proteins encoded in a given genome2. While these parameters offered a new way of describing environmental organisms, they were defined in a largely ad hoc manner and were not supported by any statistical analysis. As part of our attempts to introduce statistically sound metrics to define bacterial genomes and proteomes1, we present here an improved way to calculate IQ scores and introduce family profiles of signal transduction proteins, a new metric that reflects the number of members of each particular family of signal transduction protein encoded in the given genome.
Previous analyses of bacterial signal transduction2,9,10 revealed a strong correlation between the total number of signaling proteins encoded in any given genome and the genome size. Both values correlated well with the total number of proteins encoded in the corresponding genome2. Therefore, the number of signaling proteins encoded in a given genome was judged to be a poor measure of its signaling capabilities, as it would introduce a bias towards the organisms with larger genome sizes. The fraction of the total proteome dedicated to signal transduction also grew with genome size, indicating that this value is also biased towards the organisms with larger genome sizes. To account for these observations, the bacterial IQ metric was devised to be proportional to the square root of the total number of genes encoding signal transduction proteins per 1 Mbp of a given genome2. In this work, we made several adjustments aimed at improving the robustness of the IQ parameter while preserving its independence from the influence of the genome size.
During the four years that have elapsed since the completion of the previous signal transduction census2, there was a dramatic increase in the number of completely sequenced prokaryotic genomes. Although a fair share of this increase was due to the sequencing of multiple isolates of selected pathogenic bacteria (Bacillus anthracis, Escherichia coli, Salmonella enterica, Staphylococcus aureus, Streptococcus pneumoniae, Streptococcus pyogenes), joint efforts of several genome sequencing centers resulted in a rapidly improving coverage of microbial diversity11. By the end of 2008, the size of the non-redundant genome set, covering all individual bacterial and archaeal species with completely sequenced genomes, increased to 555 genomes. This set included the first genome sequences from several poorly studied bacterial phyla, as well as a number of relatively large genomes that demonstrate great variety in their protein content (Table 1). This prompted an update of the census of prokaryotic signal transduction proteins, http://www.ncbi.nlm.nih.gov/Complete_Genomes/SignalCensus.html, and a re-evaluation of the best way to calculate bacterial IQ scores.
In addition, the set of signal transduction proteins included in the total count was further expanded to include two-component response regulators9,12 and Ser/Thr protein phosphatases of the PP2C superfamily13,14. These additions were not expected to affect the general trends, as, according to the previous observations, the number of response regulators encoded in any prokaryotic genome is close to the number of their cognate histidine kinases9 (see Supplementary Figure S1), whereas the number of Ser/Thr protein phosphatases is typically comparable to the number of Ser/Thr kinases encoded in the same genome14,15. Indeed, despite these changes, the total number of signal transduction proteins continued to show a strong correlation with the genome size (Supplementary Figure S2). In accordance with the earlier observations, the relationship was close to quadratic (the slope of the regression line was 1.91). This correlation also held for the histidine kinases and response regulators and, albeit less strongly, for the total sum of other signal transduction proteins (Supplementary Figure S3). A strong correlation with a nearly quadratic relationship was also observed between the total number of signal transduction proteins and the total number of proteins encoded in that genome (Figure 1). Based on these data, we have modeled the relationship between the number of signaling proteins and the genome size, aiming to put the IQ parameter on a solid statistical footing (see Supplementary Methods). The IQ scores were normalized to make their distribution roughly symmetric and scaled so that the minimum IQ score was equal to zero and the mean was close to 100. This resulted in the following approximate formula for calculating the IQ score:
where NSP is the number of signaling proteins and NT is the total number of proteins in an organism. This formula was then used to compare the signaling capacities of diverse microorganisms (Table 2).
The availability of complete genomes of at least 7 different species from 11 bacterial and two archaeal phyla (Table 2) provided a sufficient data set to calculate the average IQ scores and evaluate the consistency of this parameter across various prokaryotic taxa. Despite the significant variance in the mean IQ scores within most bacterial and archaeal phyla (Table 2), there were statistically significant differences between certain phyla and between different classes within Proteobacteria and Firmicutes. A key reason for this variance, identified in Table 2, was the presence in many phyla of both free-living and obligately pathogenic bacteria. In the latter, adaptation to the parasitic lifestyle typically involves genome compaction, coupled with the loss of metabolic and signaling genes. As a result, closely related organisms often differ in genome sizes and encode dramatically different gene sets. Another factor contributing to the variance in signal protein content is the ability of organisms to use oxygen and other terminal electron acceptors. A strictly anaerobic fermentative lifestyle often results in a decrease in the signaling capacity, whereas the ability to utilize distinct electron acceptors typically correlates with a high number of sensory and signal transduction proteins. Among anaerobes, the highest versatility in utilizing various terminal electron acceptors is seen among members of Thermotogae, certain clostridia, and delta-proteobacteria, which is reflected in the relatively high average IQ values in these groups of organisms (Table 2).
The groups showing consistently low IQ values include Tenericutes (mycoplasmas), which have undergone extreme genome compaction in the course of their evolution, and Crenarchaeota, which are represented mostly by extremophiles that occupy stable and unique ecological niches. It should be noted, however, that two recently sequenced genomes provided the first examples of several types of signal transduction proteins in both these lineages – and significantly increased data variance within these phyla. The genome of the mollicute Acholeplasma laidlawii showed the presence of several classes of proteins – histidine kinases, response regulators, diguanylate cyclases and c-di-GMP phosphodiesterases and even an adenylate cyclase – that have not been seen in any previously sequenced mollicute genome (Supplementary Figure S4). Likewise, the recently sequenced genome of the first non-extremophilic crenarchaeon, an ammonia-oxidizing, carbon-fixing marine archaeal group 1 member Nitrosopumilus maritimus16, encodes 11 histidine kinases and 10 response regulators, which were absent in all previously sequenced crenarchaeal genomes (Supplementary Figure S5). These examples show that we are still dealing with only a minor slice of microbial diversity and our genome analyses are still subject to sample bias.
The above examples show that an integral parameter like the IQ score needs to be supplemented with more detailed metrics that would allow more targeted comparisons of the signaling capabilities within closely related or diverse organisms. Following the example of the Clusters of Orthologous Groups of proteins (COG) functional profiles17, as discussed in Ref. 1, we introduce here family profiles of signal transduction proteins, which reflect the number of members of each particular family of signal transduction protein encoded in the given genome. For example, Figure 2 shows family profiles of 10 species from the genus Mycobacterium (phylum Actinobacteria). This genus unifies a wide variety of bacteria, including obligate parasites Mycobacterium leprae and M. tuberculosis (named after the diseases they cause), facultative pathogens, such as M. marinum and M. smegmatis, and free-living environmental isolates such as M. gilvum and M. vanbaalenii18,19. With the exception of the highly degraded genome of M. leprae20, all members of the genus have similar IQ values in the range of 80–100. Active pathogens, such as M. leprae, M. tuberculosis, M. bovis, and M. avium, have smaller genome sizes and encode fewer proteins than their free-living relatives. The family profiles, plotted in the order of increasing number of total proteins per genome, show that increased genome size is accompanied by an increase in the number of encoded two-component systems (histidine kinases and response regulators), Ser/Thr protein kinases and protein phosphatases (Figure 2). The complete absence of chemotaxis sensors (MCPs) is consistent with the observation that all mycobacteria are non-motile. Other systems show more complex distribution patterns with a remarkable increase in the number of adenylate cyclases and Ser/Thr protein kinases in the genome of M. marinum, a fast-growing aquatic organism that infects marine and freshwater fish.
Protein kinases have been shown to regulate metabolism in mycobacteria21 and could be important for the environmental adaptations of this versatile pathogen, which is capable of infecting fish, frogs, and snakes and occasionally causing skin lesions in humans. At the same time M. marinum lacks any enzymes for synthesis or hydrolysis of c-di-GMP, which has been reported to be required for long-term survival of M. smegmatis under nutritional stress22. Indeed, by far the largest number of c-di-GMP-metabolizing enzymes is seen in M. gilvum and M. vanbaalenii, two free-living bacteria that can degrade a wide range of environmentally toxic chemicals, including high-molecular-weight polycyclic aromatic hydrocarbons, such as benz[a]pyrene, and are of great interest for use in bioremediation18,23.
In most bacterial genera, the numbers of signal transduction proteins do not vary as much as in Mycobacterium, and their family profiles offer a convenient tool to visualize and compare organisms isolated from diverse ecological niches. Figure 3 shows the distribution of signal transduction proteins in 12 species of Shewanella, a very interesting genus of predominantly marine bacteria that are found in diverse environments and are capable of reducing a variety of terminal electron acceptors24,25. This genus and its best-studied representative, Shewanella oneidensis strain MR-1, are being systematically studied using a variety of high-throughput methods24. Figure 3 shows that these versatile organisms encode members of every kind of signal transduction protein, indicating their ability to respond to environmental challenges on several different levels – transcriptional (two-component systems, cAMP-dependent signaling), posttranslational (protein phosphorylation on Ser and Thr residues), whole-cell level (chemotaxis) and multicellular level (biofilm formation). Despite their different habitats and significant differences in respiratory metabolism, all Shewanella species have similar distributions of key families of signal transduction proteins.
As part of the flexible genomic “shell” (as opposed to the core genome26), signal transduction proteins are commonly acquired and lost in the course of evolution27. As noted previously, adaptation to extreme (or stable) conditions often results in a loss of metabolic flexibility, which leads to a diminished demand for cellular regulation and therefore affects environmental sensing2,4. A recent genome analysis of the moderately thermophilic bacterium Anoxybacillus flavithermus WK1, isolated from a super-saturated silicate solution, concluded that evolution of the family Bacillaceae from a common ancestor of all Firmicutes involved acquisition of numerous genes, which was followed by a substantial gene loss in the Anoxybacillus/Geobacillus branch in the course of the specific niche adaptation28. A signal transduction protein family profile of various Bacillaceae (Figure 4A) is consistent with that conclusion (which was reached without considering these proteins). This profile shows that A. flavithermus and two Geobacillus species encode all the same families of signal transduction proteins as Bacillus spp., albeit in a smaller number. In an interesting twist, Bacillus anthracis, a pathogen, encodes a far richer complement of environmental sensors than its free-living relatives (Figure 4A). Remarkably, an even smaller set of signal transduction proteins is seen in an extremely halotolerant and alkaliphilic species Oceanobacillus iheyensis, which is otherwise closely related to Bacillus subtilis and B. halodurans29. The moderately halotolerant alkaliphile B. halodurans and O. iheyensis occupy similar habitats, which cannot account for the differences in signal protein content. It seems likely therefore that this difference is due to the distinct evolutionary histories of these organisms. The phylogenetic position of O. iheyensis with respect to other bacilli remains unresolved28,30, but it seems likely that it had diverged from the main bacillar branch before that branch experienced the massive gene influx. Thus, while A. flavithermus and Geobacillus appear to have lost some of their signal transduction genes, O. iheyensis probably never had them in the first place.
Expanding the phylogenetic coverage of Figure 4A to include other members of the order Bacillales adds family profiles both on the lower [genus Listeria (family Listeriaceae), genus Staphylococcus (family Staphylococcaceae)] and on the upper [genus Lysinibacillus (family Planococcaceae)] part of the spectrum (Figure 4B). This is consistent with the IQ scores of these organisms and their lifestyle as facultative pathogens, capable of saprophytic growth outside the host. Unlike Bacillaceae members, listeriae and staphylococci lack the ability to sporulate and the complex signal transduction machinery that regulates sporulation and germination. Moving further down the evolutionary tree to the class Bacilli adds the order Lactobacillales with extensively sequenced genera Lactobacillus (family Lactobacillaceae) and Streptococcus (family Streptococcaceae). Representatives of these genera demonstrate further loss of signaling capacity (Figure 4C), most likely as a result of adaptation to the nutrient-rich media, milk for the former group and host tissues for the latter one. Further integration of signal transduction protein family profiles is possible through the use of heat maps (Figure 4D, Supplementary Figure S6). In such maps, grouping of organisms will not necessarily be defined by their phylogenetic proximity. Instead, it will reflect similarities in the quantity and distribution of their signal transduction systems which are more likely to be determined by their respective habitats.
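The family profiles described above can be treated as simple count vectors, one per genome, which makes both the size-ordered plots and the similarity-based grouping used in heat maps easy to prototype. The following is a minimal sketch under stated assumptions: the family labels, the Euclidean distance on relative abundances, and all counts are hypothetical choices made for illustration and are not taken from the census or from the methods used to draw Figures 2–4.

```python
import math

# Hypothetical family labels used only for this illustration.
FAMILIES = ["HK", "RR", "MCP", "STPK", "PP2C", "AC", "GGDEF", "EAL"]


def family_profile(counts):
    """Return protein counts in a fixed family order; absent families count as 0."""
    return [counts.get(family, 0) for family in FAMILIES]


def relative_profile(counts):
    """Return the share of each family among all counted signaling proteins."""
    profile = family_profile(counts)
    total = sum(profile)
    return [c / total if total else 0.0 for c in profile]


def profile_distance(counts_a, counts_b):
    """Euclidean distance between two relative profiles (an assumed metric)."""
    a = relative_profile(counts_a)
    b = relative_profile(counts_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def order_by_proteome(genomes):
    """Sort genome names by total protein count, as in the size-ordered plots.

    genomes: dict mapping a name to (total_proteins, family_counts).
    """
    return sorted(genomes, key=lambda name: genomes[name][0])


# Invented numbers for two imaginary genomes; not data from the census.
toy_genomes = {
    "genome_large": (4000, {"HK": 20, "RR": 20, "STPK": 10}),
    "genome_small": (1500, {"HK": 4, "RR": 4, "STPK": 2}),
}
ordered = order_by_proteome(toy_genomes)
distance = profile_distance(toy_genomes["genome_large"][1],
                            toy_genomes["genome_small"][1])
```

In this toy case the two genomes differ fivefold in absolute counts but have identical relative profiles, so their distance is zero. This is the kind of distinction between quantity and distribution that a heat map built on such profiles would need to handle explicitly.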
The availability of numerous genome sequences presents a challenge of using the sequence data to characterize metabolic and signaling pathways and ecological properties of the respective organisms and to extract useful information about their evolutionary origin, role in the biosphere, potential use in biotechnology, and other traits. So far, genomic data has been a mixed blessing: having dramatically expanded our knowledge of the molecular biology and biochemistry, it also illuminated major deficiencies in our understanding of the functioning of the live cell. The problem goes beyond the famous ‘70% hurdle’, the fact that about a third of all proteins encoded in any bacterial cell have unknown functions31; for most eukaryotic genome sequences, including the human genome, the fraction of characterized genes is far lower than that. Even for proteins with known biochemical functions, the exact substrate specificity often remains obscure24,32. In signal transduction, recognition of a given protein as a sensor histidine kinase or a response regulator rarely provides any clues as to what stimuli are being sensed and which cellular responses are being triggered by them4,5. Major unresolved questions about the functioning of the signal transduction machinery remain even for such model organisms as E. coli and B. subtilis33, which are commonly used as reference points for describing metabolic and signaling pathways of lesser studied organisms.
We have recently pointed out that comparative genomics needs a new language that would combine (i) a standard set of parameters that describe statistical properties of each sequenced genome, as well as physiology and the natural environment of the host organism, and (ii) a set of new (still to be devised) metrics that would allow placing the newly sequenced genome in a proper evolutionary and ecological context1. The first part of this proposal gained instant acceptance in the community and contributed to the formulation of the minimum information about a genome sequence (MIGS) specification by the Genomic Standards Consortium34. The second part proved to be more difficult, as development of new descriptors is not a straightforward task. Even utilization of the promising examples that we have cited – incorporation of a genome in the three-dimensional Tree of Life35 and genome profiling by COG functional categories36 – requires a certain effort. The ‘bacterial IQ’ score, which is not difficult to calculate, did not fare any better. Here, we sought to develop a better formula for the IQ score, putting it on a solid statistical footing, and to use this new formula for comparing different microbial phyla. However, the fluidity of the signal transduction systems often shows up at smaller phylogenetic distances, at the family and genus levels. The signal transduction protein family profiles, introduced in this work (Figures 2–4 and S4–S6), aim at capturing the differences in signaling capacity between closely related organisms. We hope that these new tools will help in comprehending the information stored in the genomic data and will eventually lead to a better understanding of how every genome is shaped by its phylogeny and its own adaptations to the environment, i.e. by interplay of its heritage and habitat.
The family profiles of signal transduction proteins were designed to capture both phylogenetic and ecological properties of diverse bacteria. Figures 2–4 and S4–S6 show that these profiles largely accomplish these goals and provide an easy way to compare any given organism with its relatives and/or neighbors. Nevertheless, we envisage several ways in which these profiles could be improved.
Firstly, the list of the signal transduction systems included in these profiles could be expanded by incorporating (i) additional families of Ser/Thr and Tyr protein phosphatases15; (ii) the components of the phosphoenolpyruvate-dependent sugar:phosphotransferase system (PTS) that are involved in chemotaxis and regulation of the activity of the enterobacterial adenylate cyclase and various membrane permeases37; (iii) the anti-sigma and anti-anti-sigma components that modulate the activity of RNA polymerase sigma subunits38; (iv) systems that produce and respond to alternative nucleotide messengers such as ppGpp and the recently described cyclic diadenylate39; and (v) additional signaling components that still remain to be characterized.
Secondly, the specificity of the profiles could be increased if, instead of treating all response regulators as a single group, they were separated into families, e.g. based on their domain architectures9. Preliminary data show that family profiles of various response regulators are consistent within most taxonomic groups of bacteria (MYG, manuscript in preparation).
Thirdly, it might make sense to include so-called “one-component” transcriptional regulators10, although their variety might make these profiles too complex and complicate their comparison. Finally, these profiles could be further modified to reflect the fraction of environmental (membrane-bound) and intracellular (cytoplasmic) receptors in each sensor protein family, e.g. by shading or hatching. This way, family profiles could be also used to compare the “extrovertness” of different genomes.
There is little doubt that almost all features of an organism, at least a simple unicellular one, could be deduced from its complete genome sequence. A key goal of genome analysis is using the sequence data to uncover salient features of the host organism and put them into phylogenetic and ecological context. That is why a significant part of comparative-genomics studies is devoted to finding the genomic correlates of such features as thermophily, psychrophily, halophily, alkaliphily, resistance to radiation and the ability to survive stress, and using these parameters as metrics for genome description. This work continues the previous study that introduced the “bacterial IQ” score for comparing the sets of various receptor proteins encoded in bacterial and archaeal genomes. Using a larger set of genomes and an expanded list of signal transduction proteins, we compare the IQ scores within various groups of bacteria. We also present a new way of comparing signal transduction machineries of various bacteria by plotting the abundance of members of each signal transduction protein family in each analyzed genome, either as a 3D histogram or as a heat map. We show that these family profiles are well conserved within each genus and are often consistent within more distant groups of bacteria. At the same time, these profiles are affected by the lifestyle of each particular organism and the set of environmental signals it responds to. Thus, these family profiles reflect evolutionary relationships between organisms (their heritage) as well as individual adaptations of each organism to its specific ecological niche (its habitat).
The genome sequence data, genome sizes and the total numbers of encoded proteins were extracted from the Entrez Genome Project database at the US National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. The analyzed set contained only one representative genome per species, typically the first one to be released to the public, as listed on the above web site. Exceptions included two strains of E. coli, K-12 substrain MG1655 and O157:H7 strain Sakai, three strains of Salmonella enterica, and two strains of Orientia tsutsugamushi, Boryong and Ikeda. The numbers of signaling proteins of each type encoded in every given genome were calculated and manually curated essentially as described previously2,9 and reconciled with the data in the Pfam (http://pfam.sanger.ac.uk/)40 and MiST (http://genomics.ornl.gov/mist)41 databases, where available. The updated census of signal transduction proteins is available at the web site http://www.ncbi.nlm.nih.gov/Complete_Genomes/SignalCensus.html.
The previously observed trend that the number of signal transduction proteins (NSP) grows approximately as the square of the genome size2 was verified using the expanded genome set with two additional classes of proteins and replacing the genome size with the total number of proteins (NT) encoded in every given genome. The overall relationship between NSP and NT was estimated by fitting a linear regression of ln NSP on ln NT. The genomes with NSP < 2 were excluded from the analysis and assigned IQ = 0 (alternatively, an arbitrary constant of 0.5 proteins/genome could be added to accommodate cases of NSP = 0; this had only a minor effect on the overall results). On a log-log scale, NSP and NT showed a linear relationship with the slope approximately equal to 2 (Figure 1). The bacterial IQ score was then estimated as a measure of the deviation of NSP for the given genome from the regression line. To make the distribution of IQ scores more symmetric and bell-shaped than in the previous work2, the IQ scores were normalized by taking natural logarithms of the protein counts. Also, the IQ scores were shifted and scaled by constants so that the minimum value was equal to zero and the mean was approximately equal to 100 (see the Supplementary Methods for details). Additionally, inclusion in the set of genomes with NSP from 2 to 5 resulted in a much better coverage of small genomes.
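The procedure described above (log-log regression, deviation from the regression line, then shifting and scaling) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the calculation from the Supplementary Methods: the function names are hypothetical, the shift and scale are chosen here so that the minimum is exactly 0 and the mean over the fitted genomes is exactly 100, and the input counts are invented.

```python
import math


def fit_loglog(genomes):
    """Least-squares fit of ln(NSP) on ln(NT); returns (slope, intercept).

    genomes: list of (n_sp, n_t) pairs. Genomes with n_sp < 2 are skipped,
    mirroring the exclusion rule stated in the text.
    """
    pts = [(math.log(n_t), math.log(n_sp)) for n_sp, n_t in genomes if n_sp >= 2]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def iq_scores(genomes):
    """Residual-based scores with minimum 0 and mean 100 over the fitted genomes.

    The shift-and-scale step is an assumption standing in for the constants
    given in the Supplementary Methods.
    """
    slope, intercept = fit_loglog(genomes)
    residuals = {
        i: math.log(n_sp) - (intercept + slope * math.log(n_t))
        for i, (n_sp, n_t) in enumerate(genomes)
        if n_sp >= 2
    }
    lowest = min(residuals.values())
    mean_shifted = sum(r - lowest for r in residuals.values()) / len(residuals)
    scale = 100.0 / mean_shifted  # assumes the points do not all lie on the line
    # Genomes excluded from the fit (n_sp < 2) are assigned a score of 0.
    return [
        scale * (residuals[i] - lowest) if i in residuals else 0.0
        for i in range(len(genomes))
    ]


# Invented (NSP, NT) pairs for illustration only; not data from the census.
toy = [(10, 1000), (50, 2000), (150, 4000), (300, 5000), (1, 500)]
scores = iq_scores(toy)
```

Because the score is built from residuals, a genome scores high when it encodes more signaling proteins than the regression line predicts for its proteome size, which is what keeps the measure independent of genome size. For input that follows an exact quadratic law, `fit_loglog` returns a slope of 2.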
This work was supported by the NIH Intramural Research Program at the National Library of Medicine (MYG) and by grants to EK from the NIGMS (R01 GM076680-02), NIDDK (UO1 DK072473), and the NSF (DBI-0544757 and NSF-07140).