A major fraction of bioinformatics research on sequence analysis has focused on the conserved regions in proteins, trying to hypothesize the role of the protein by identifying sequence motifs that have been shown experimentally to correlate with a specific function. Some work has gone into cataloging the groups of lineage-specific proteins that show no similarity to other proteins in GenBank (Galperin and Koonin 1999), but there the route to assigning function usually requires experimental approaches based on biochemistry or genetics or, more rarely, determination of the crystal structure of the gene product (Zhang et al. 2000). Unfortunately, current bioinformatics methods are only occasionally helpful in suggesting where to begin such studies.
In this paper we have initiated an effort to identify SVGs, which contain both well-conserved regions and highly variable regions. By looking carefully at a few specific examples where functional information is available from experimental data, we find that the variable region often seems to play a key role in mediating interactions with other molecules, both large and small. Sometimes the variable portions are involved in biological processes with a component of interaction between the cell and agents from the external environment. For instance, the DNA methyltransferases are part of a defense system that recognizes and clears invading foreign DNA; membrane-bound sensory HKs and mechanosensitive ion channels, etc., monitor changes of living conditions. Sometimes the variable portions are involved in intracellular processes that appear to have lineage-specific features. Thus, the variable regions inside DNA GyrB and several types of AARSs probably determine the specificity of substrate recognition. The detailed factors that introduce the molecular variability may go well beyond our explanations here and likely vary from case to case. Some variable regions may have diverged a long time ago and are now kept constant, while others may keep changing. In all of these cases, SVGs are exceptionally worthy targets of further experimental investigation, and such investigations can be greatly aided by the presence of the conserved regions that may suggest a preliminary function to be tested.
Why might certain genes contain these variable regions? Could they simply be relics left over from evolution that now serve no purpose? Are they just “pseudo-segments” with no function? Several lines of evidence support the hypothesis that when variable regions have been retained, they do serve a function. First, several studies have shown that deletions are, on average, more frequent than insertions (Halliday and Glickman 1991). As a result, if a region is evolving under weak functional constraints, it tends to get smaller over time (Lipman et al. 2002). Second, in a special case, one can imagine that when a variable region occurs at the C-terminus of a protein and is not being selected, it is likely to suffer random mutations, including nonsense mutations or insertions/deletions that cause a shift in reading frame. Thus, we searched GenBank release 136.0 for examples of genes that matched the conserved region of an SVG, but in which the C-terminus was missing or much shorter. The DNA sequences downstream of such hits were examined for similarity to the variable region in the query gene. Of the 83 SVGs with a C-terminal variable region in H. pylori, none had hits with a disrupting stop codon in the variable region; 20 had hits with genes showing insertions/deletions that cause frame shifts in the variable region. However, the real number is likely to be much lower, since, based on previous work, many of these may be the result of sequencing errors (Posfai and Roberts 1992).
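The check for disrupting stop codons described above can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function names and the toy sequence are hypothetical, and it assumes that the DNA downstream of a conserved-region match has already been extracted as a plain string and that the standard stop codons apply. A frame that closes early corresponds to a disrupting stop codon, whereas a variable region that continues only in a different frame is what a frame-shifting insertion/deletion would produce.

```python
# Illustrative sketch only; names and inputs are assumptions, not the authors' pipeline.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons


def first_in_frame_stop(dna, frame=0):
    """Return the 0-based codon index of the first stop codon in `frame`, or None."""
    seq = dna.upper()
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (i - frame) // 3
    return None


def frames_without_early_stop(dna, min_codons):
    """List the forward reading frames (0, 1, 2) that stay open for at least `min_codons` codons."""
    open_frames = []
    for frame in range(3):
        stop = first_in_frame_stop(dna, frame)
        n_codons = (len(dna) - frame) // 3
        open_len = n_codons if stop is None else stop
        if open_len >= min_codons:
            open_frames.append(frame)
    return open_frames


# Toy downstream sequence: frame 0 is interrupted, frame 2 stays open.
downstream = "ATGAAATAGCCC"
print(first_in_frame_stop(downstream, frame=0))
print(frames_without_early_stop(downstream, min_codons=3))
```

In practice the frame that remains open would then be translated and compared with the variable region of the query gene; that similarity step is not shown here.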
In other cases, we find that some proteins have lost the variable segment in a subset of genomes. For instance, in ProRSs, the variable segment is absent in archaea; in GyrB, the variable segment is absent in the Gram-positive bacteria. Clearly, in those cases the organisms can get by without the variable domain, although a compensating function may be provided by a different gene. But this again does not imply that the variable region has no function in those genes that have retained it.
SVGs are distinct from sequences with shuffled domains (Doolittle 1995) in that the variable region is bounded by the same sets of conserved portions, while domain shuffling usually manifests itself in a different sequential order of conserved domains. We also hypothesize that the variable regions in SVGs are not the result of multiple domain fusion events, each resulting in an insertion of a different sequence into the protein. This hypothesis is supported by the fact that the fused domains are often conserved across multiple organisms (Marcotte et al. 1999). Additionally, our procedure requires that the variable regions are of similar length within a family of proteins, which are also restricted to conserved length distributions. This filter suggests a mutational mechanism that originated from an ancient protein. Indeed, it is possible that originally the variable region was a result of a single or possibly relatively few ancient fusion events, but this paper does not focus on the evolutionary origin of SVGs.
Another prediction from our observations is that the variable regions are excellent candidates to bind substrates or partner macromolecules. They may be extremely helpful in discovering the networks of protein–protein or protein–nucleic acid interactions within a cell. Bioinformatics may even be able to help in this endeavor by finding genes that seem to have coevolving variable regions as a result of such interactions. Experimental data from techniques such as the yeast two-hybrid system or microarrays may provide evidence for interactions that can involve two variable regions.
Much additional bioinformatics work will be needed to explore fully the potential of this method in hypothesizing function. For instance, the size limits we have arbitrarily imposed on the variable region should be tested systematically. In the relatively simple formulation presented here, the length of the variable region and the number of proteins in the same family that do not have an alignment to the variable region are the primary factors in determining its statistical significance. Methods using other sequence analysis tools, such as multiple alignment and sequence profiles, may provide alternative ways to identify segmental patterns of variability. A fundamental problem is to differentiate random evolutionary drift from positive selection correlated with functional requirements. Although one might expect the N- and C-termini to be more variable than the regions in the middle, our data suggest that variable regions in SVGs are not preferentially located at either end (data not shown). We have also examined the amino acid composition, codon usage, and GC content in the variable regions and the conserved regions of the same SVG. While there is no significant difference in amino acid composition or GC content between the two regions in general, codon usage appears to be biased in the variable regions (data not shown).
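As an illustration of the kind of comparison mentioned above between the variable and conserved regions of one SVG, the following Python sketch computes GC content and amino acid composition for two segments and summarizes the compositional difference as a total variation distance. The function names, the distance measure, and the toy sequences are assumptions chosen for illustration; they are not the statistics used in this study, and codon usage is not covered.

```python
# Illustrative sketch only; the metric and inputs are assumptions, not the authors' analysis.
from collections import Counter


def gc_content(dna):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    seq = dna.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def aa_frequencies(protein):
    """Relative frequency of each residue in a protein string."""
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def composition_distance(p, q):
    """Total variation distance between two frequency dicts (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy segments standing in for the conserved and variable regions of one gene.
conserved_dna, variable_dna = "ATGGCGGCC", "ATGAAATTT"
conserved_aa, variable_aa = "MAA", "MKF"

print(gc_content(conserved_dna), gc_content(variable_dna))
print(composition_distance(aa_frequencies(conserved_aa), aa_frequencies(variable_aa)))
```

A distance near zero would indicate similar composition in the two regions; judging whether a given difference is significant would require a proper statistical test over many genes, which this sketch does not attempt.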
SVGs usually account for 10%–20% of the total genes in a microbial genome. Currently, we think of the class of lineage-specific genes as being the key factor that distinguishes one strain or species from another. The class of SVGs that we have defined in this paper must now be added to this collection of lineage-specific genes by virtue of the unique segments that constitute their variable regions. They also appear to provide functional elements that help to differentiate among strains and species. This point is well illustrated by considering the restriction-modification systems. Here, the DNA methyltransferases, which have a variable region responsible for DNA recognition, are members of the SVG class. With the help of their companion restriction endonucleases, which typically appear as lineage-specific genes, they serve to keep foreign, unmodified DNA sequences from entering the genome. In this case, the synergy of function provided by members of the two classes highlights the key role that both sets of genes must play in defining the individuality of a strain or species.
Our analysis to date is limited to bacteria and archaea, where SVGs are transcribed and translated as contiguous genomic segments. In eukaryotes, alternative RNA splicing introduces substantial additional complexity into the interpretation of gene structure and protein product, thereby rendering impossible the simple analysis we have applied here. It is tempting to consider alternative splicing as a highly evolved control mechanism to introduce the variability we find in the SVGs and thereby achieve the functional diversity necessary for cell survival under different conditions. In eukaryotes, alternatively spliced exons can be introduced in response to the functional demands of different cell types by merely juggling protein coding regions in the genome, thereby creating an SVG structure. If this view is correct, then it reinforces and highlights the importance of these SVGs to the workings of the cell.
In this paper we have provided an initial glimpse of SVGs, which appear to provide an important genetic layer in the adaptation of cells to novel environments and hazardous pathogens. We have focused attention on the biological significance of these genes, especially those that have highly diverged segments. We are currently trying to develop a more refined classification of these genes so as to explore the functional significance of the variability. We would like to know whether extreme variability is required for diverse function or whether more modest variation is sufficient. Such questions require that we can first distinguish positive selection acting on these variable regions from neutral evolution leading to gene decay and eventual loss. Since the variable regions we report are often not amenable to current tools available for alignment, we are exploring new methods that will help us to assess whether positive selection is driving the evolution of these genes.
In summary, we have identified an extremely useful way of classifying genes that leads to the identification of those with a high priority for both experimental and computational research.