Phylogenetic analysis revealed an unusually strong relationship between substrate specificity and the presence of a variable four-residue, Cys-containing motif in the five subgroups of GH4 hydrolases. Biochemical analysis of the E. rhapontici enzyme showed that it is a highly specific α-glucosidase, in striking contrast with the enzymes of the Thermotoga clade, which hydrolyze α-galactosides as effectively as they do α-glucosides. The E. rhapontici enzyme has the motif CHEI; changing it to CHEV has virtually no effect on specific activity. The most common active-site motif sequence for the α-galactosidases is CHSV. If substrate specificity were determined solely by the Cys motif, then further changing CHEV of E. rhapontici to CHSV would be expected to convert the enzyme from a strict α-glucosidase to a specific α-galactosidase. Contrary to expectation, changing the motif to CHSV resulted in a complete loss of catalytic activity. This finding makes it clear that substrate specificity and enzymatic activity, although clearly associated with a particular Cys motif, must also be determined by other factors, including molecular structure and the architectural environment of the active site. In other words, it is not only the motif but also the context of the Cys motif that determines the substrate specificity of a particular subgroup of enzymes. In the context of the remainder of the E. rhapontici enzyme sequence, the CHSV motif results in inactivity, whereas in the context of the α-galactosidase sequences, it results in high activity.
With these caveats in mind, we cannot assume that the Symbiobacterium thermophilum IAM 14863 ST enzyme is a 6-phospho-α-glucosidase simply because it contains the CDMP motif that it shares with all of the 6-phospho-α-glucosidases. Neither can we assume that the enzymes in the Thermotoga clade that carry the CHGH motif are all α-glucosidase/galactosidases. The “context” effect means that there is simply no substitute for direct, experimental evidence when assigning function and substrate specificity to an enzyme.
Although the Cys motif cannot be used by itself to estimate enzyme specificity, it can certainly serve as a red flag to direct our attention to likely unusual activities. The replacement of CNVP by SSSP in the A. laidlawii enzyme makes it very unlikely that the enzyme is a phospho-β-glucosidase; it certainly cannot utilize the normal GH4 mechanism to hydrolyze phospho-β-glucosides. Acholeplasma laidlawii lacks the phosphotransferase system enzymes II and cannot phosphorylate β-glucosides (Hoischen et al. 1993), making it even more unlikely that the enzyme is a phospho-β-glucosidase. It is possible that the enzyme is simply inactive and that it has accumulated substitutions in both of the metal-binding motifs, in what amounts to a pseudogene. Considering that several other motifs remain intact, it seems unlikely that “both” metal-binding motifs would have been mutated. It is also unlikely that the enzyme, whose length is typical for GH4 enzymes, would have escaped deletions, nonsense mutations, and frameshift mutations, along a branch that has an average of 2.2 substitutions per site unless it serves some function that is subject to purifying selection. Finally, A. laidlawii has an extremely small genome (~1.5 Mb), and it is unlikely that a genome of this size would long retain a functionless gene. A more likely alternative is that the A. laidlawii enzyme has acquired a novel activity during a period of positive selection. The observation that A. laidlawii possesses a β-glucosidase that is inducible by growth on cellobiose (Hoischen et al. 1993) raises the intriguing possibility that the A. laidlawii GH4 enzyme has evolved β-glucosidase activity. We have initiated a study of the A. laidlawii enzyme in the hope of determining whether it possesses glycosyl hydrolase activity and (if so) determining the catalytic mechanism of that activity.
Our results are quite consistent with a similar study of the large GH13 family (Stam et al. 2006). GH13 is an enormous family of over 2,500 enzymes that include 26 different functions that are indicated by different EC numbers. The Stam study used a variety of methods to cluster those enzymes into 35 subfamilies that were entirely consistent with subtrees of an unrooted NJ tree of GH13. Most of the subfamilies were monospecific, that is, included only one experimentally determined EC number, whereas a few included enzymes with two closely related functions. They concluded that “assignment to a subfamily is a considerable step toward improved functional prediction. However, because not all subfamilies have a biochemical(ly) characterized member and because a significant number of sequences are not included in subfamilies, errors or imprecision are still possible during unsupervised automated genomic annotation.”
Practical considerations make it clear that sequencing will continue to outpace our ability to obtain direct experimental evidence for enzyme activities. It therefore behooves us to consider how best to use sequence information to maximize our confidence in estimating enzyme function and specificity. This study makes it clear that simple grouping on the basis of similarity of sequence motifs is an insufficient criterion. Automatic annotation algorithms sometimes make incorrect assignments, sometimes make unjustified assignments, and sometimes fail to make assignments that can be justified. Likewise, active-site motif sequences (as we describe here), although perhaps indicative, are nevertheless insufficient to assign enzyme function and specificity. Phylogenetic analysis is, so far, a reliable guide to function.
When phylogenetic analysis is used conservatively (as we have done) to assign specificity, there are no cases in which experimental evidence conflicts with phylogenetic assignment. Although we cannot hope to determine the functions of all the 201 GH4 enzymes considered here, consideration of the active-site motif together with phylogenetic analysis can serve as a guide to choosing enzymes for biochemical analysis. The speed and ease of modern DNA-sequencing technology can easily lead to a “stamp-collecting” approach in which genomes are sequenced for the sake of enlarging a collection, and the annotations serve as a dry catalog of that collection. Genome sequences and their annotations should serve as a guide to interesting problems and to the design of experiments, which will augment our understanding of biological processes. To achieve these aims, automated annotation “pipelines” will need to develop along the lines of expert systems that will include the phylogenetic analysis (not just clustering) of genes and that will integrate such analyses with functional motifs and other bioinformatic data.
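As a rough illustration of the kind of rule such an expert system might apply, the following minimal Python sketch combines a motif hint with a phylogenetic clade assignment. It is only a sketch under stated assumptions, not the procedure used in this study: the `Enzyme` record, the `annotate` function, the status labels, and the example entries are all hypothetical, and the motif table simply restates the associations mentioned in the text above. A specificity is proposed only when the clade has an experimentally characterized function and the motif agrees with it; everything else is left unassigned or flagged for biochemical analysis.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Motif-to-specificity hints restated from the discussion above.
# Illustrative only: a motif alone is NOT sufficient to assign function.
MOTIF_HINTS = {
    "CHEI": "alpha-glucosidase",
    "CHSV": "alpha-galactosidase",
    "CHGH": "alpha-glucosidase/galactosidase",
    "CDMP": "6-phospho-alpha-glucosidase",
    "CNVP": "phospho-beta-glucosidase",
}


@dataclass
class Enzyme:
    """Hypothetical record for one GH4 sequence."""
    name: str
    motif: str                     # four-residue active-site motif
    clade_function: Optional[str]  # function shown experimentally for a
                                   # member of the same clade, else None


def annotate(enzyme: Enzyme) -> Tuple[str, str]:
    """Return (status, detail) using a deliberately conservative rule."""
    hint = MOTIF_HINTS.get(enzyme.motif)
    if hint is None:
        # An unrecognized motif is a red flag for a possible unusual activity.
        return ("flag", "unrecognized motif")
    if enzyme.clade_function is None:
        # No characterized relative: the motif alone does not justify a call.
        return ("unassigned", "no experimental evidence in clade")
    if hint == enzyme.clade_function:
        return ("putative", hint)
    # Motif and phylogeny disagree: a candidate for biochemical analysis.
    return ("flag", "motif and clade disagree")


if __name__ == "__main__":
    # Hypothetical example entries, not data from this study.
    examples = [
        Enzyme("example_A", "SSSP", None),
        Enzyme("example_B", "CDMP", None),
        Enzyme("example_C", "CHSV", "alpha-galactosidase"),
        Enzyme("example_D", "CHSV", "alpha-glucosidase"),
    ]
    for e in examples:
        print(e.name, annotate(e))
```

The rule is intentionally asymmetric: agreement between motif and clade yields only a putative assignment, whereas any disagreement or missing evidence produces a flag or no assignment, which mirrors the conservative use of phylogenetic analysis argued for here.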