The availability of numerous genome sequences presents the challenge of using the sequence data to characterize the metabolic and signaling pathways and ecological properties of the respective organisms, and to extract useful information about their evolutionary origin, role in the biosphere, potential use in biotechnology, and other traits. So far, genomic data has been a mixed blessing: having dramatically expanded our knowledge of molecular biology and biochemistry, it has also exposed major deficiencies in our understanding of how the living cell functions. The problem goes beyond the famous ‘70% hurdle’, the fact that about a third of all proteins encoded in any bacterial cell have unknown functions31; for most eukaryotic genome sequences, including the human genome, the fraction of characterized genes is far lower than that. Even for proteins with known biochemical functions, the exact substrate specificity often remains obscure24,32. In signal transduction, recognition of a given protein as a sensor histidine kinase or a response regulator rarely provides any clues as to what stimuli are being sensed and which cellular responses they trigger4,5. Major unresolved questions about the functioning of the signal transduction machinery remain even for such model organisms as E. coli and B. subtilis33, which are commonly used as reference points for describing metabolic and signaling pathways of less-studied organisms.
We have recently pointed out that comparative genomics needs a new language that would combine (i) a standard set of parameters that describe the statistical properties of each sequenced genome, as well as the physiology and natural environment of the host organism, and (ii) a set of new (still to be devised) metrics that would allow placing a newly sequenced genome in a proper evolutionary and ecological context1. The first part of this proposal gained instant acceptance in the community and contributed to the formulation of the minimum information about a genome sequence (MIGS) specification by the Genomic Standards Consortium34. The second part proved to be more difficult, as developing new descriptors is not a straightforward task. Even using the promising examples that we have cited, namely incorporation of a genome into the three-dimensional Tree of Life35 and genome profiling by COG functional categories36, requires a certain effort. The ‘bacterial IQ’ score, which is not difficult to calculate, did not fare any better. Here, we sought to develop a better formula for the IQ score, putting it on a solid statistical footing, and to use this new formula for comparing different microbial phyla. However, the fluidity of the signal transduction systems often shows up at smaller phylogenetic distances, at the family and genus levels. The signal transduction protein family profiles introduced in this work (– and S4–S6) aim at capturing the differences in signaling capacity between closely related organisms. We hope that these new tools will help in comprehending the information stored in the genomic data and will eventually lead to a better understanding of how every genome is shaped by its phylogeny and its own adaptations to the environment, i.e. by the interplay of its heritage and habitat.
The family profiles of signal transduction proteins were designed to capture both phylogenetic and ecological properties of diverse bacteria. – and S4–S6 show that these profiles largely achieve these goals and provide an easy way to compare any given organism with its relatives and/or neighbors. Nevertheless, we envisage several ways in which these profiles could be improved.
Firstly, the list of signal transduction systems included in these profiles could be expanded by incorporating (i) additional families of Ser/Thr and Tyr protein phosphatases15; (ii) the components of the phosphoenolpyruvate-dependent sugar:phosphotransferase system (PTS) that are involved in chemotaxis and regulation of the activity of the enterobacterial adenylate cyclase and various membrane permeases37; (iii) the anti-sigma and anti-anti-sigma components that modulate the activity of RNA polymerase sigma subunits38; (iv) systems that produce and respond to alternative nucleotide messengers such as ppGpp and the recently described cyclic diadenylate39; and (v) additional signaling components that still remain to be characterized.
Secondly, the specificity of the profiles could be increased if, instead of treating all response regulators as a single group, they were separated into families, e.g. based on their domain architectures9. Preliminary data show that family profiles of various response regulators are consistent within most taxonomic groups of bacteria (MYG, manuscript in preparation).
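As a rough illustration of what separating response regulators by domain architecture could look like in practice, the following minimal Python sketch counts proteins per architecture. The function name `response_regulator_profile`, the input format (one tuple of domain names per protein) and the example entries are assumptions made here for illustration only; they are not the procedure or the data used in this work.

```python
from collections import Counter


def response_regulator_profile(regulators):
    """Count response regulators per domain architecture.

    `regulators` is an iterable with one tuple of domain names per
    protein, e.g. ("REC", "wHTH"). The domain assignments are assumed
    to come from an external annotation step that is not shown here.
    Returns a Counter keyed by an architecture string such as "REC-wHTH".
    """
    return Counter("-".join(domains) for domains in regulators)


# Hypothetical input, for illustration only (not data from any genome).
example_regulators = [
    ("REC",),          # stand-alone receiver domain
    ("REC", "wHTH"),   # receiver plus winged helix-turn-helix output
    ("REC", "wHTH"),
    ("REC", "HTH"),    # receiver plus helix-turn-helix output
]

rr_profile = response_regulator_profile(example_regulators)
for architecture, count in sorted(rr_profile.items()):
    print(architecture, count)
```

Profiles built this way for two genomes could then be compared architecture by architecture, in the same manner as the family profiles described above.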
Thirdly, it might make sense to include the so-called “one-component” transcriptional regulators10, although their variety might make these profiles too complex and complicate their comparison. Finally, these profiles could be further modified to reflect the fraction of environmental (membrane-bound) and intracellular (cytoplasmic) receptors in each sensor protein family, e.g. by shading or hatching. This way, family profiles could also be used to compare the “extrovertness” of different genomes.
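The last suggestion amounts to annotating each sensor family with the share of its members that are membrane-bound. A minimal Python sketch of that bookkeeping is given below. It assumes that each sensor protein has already been classified as membrane-bound or cytoplasmic by some external prediction step; the function name `membrane_fractions`, the family labels and the example entries are hypothetical and serve only to show the calculation.

```python
def membrane_fractions(sensors):
    """Summarize the membrane-bound share of each sensor family.

    `sensors` is an iterable of (family, is_membrane_bound) pairs; the
    boolean flag is assumed to come from an external localization
    prediction that is not shown here.
    Returns {family: (total_count, membrane_bound_fraction)}.
    """
    totals = {}
    membrane = {}
    for family, is_membrane_bound in sensors:
        totals[family] = totals.get(family, 0) + 1
        membrane[family] = membrane.get(family, 0) + (1 if is_membrane_bound else 0)
    return {family: (totals[family], membrane[family] / totals[family])
            for family in totals}


# Hypothetical input, for illustration only (not data from any genome).
example_sensors = [
    ("histidine kinase", True),
    ("histidine kinase", True),
    ("histidine kinase", False),
    ("chemoreceptor", True),
]

fractions = membrane_fractions(example_sensors)
for family, (total, fraction) in fractions.items():
    print(f"{family}: {total} proteins, {fraction:.2f} membrane-bound")
```

In a plotted profile, the fraction returned for each family could determine how much of the corresponding bar is shaded or hatched.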