Assignment of function for enzymes encoded in sequenced genomes is a challenging task. Predictions of enzyme function can be made using clues from superfamily assignment, structure, genome context, phylogenetic conservation, and virtual screening to identify potential ligands. Ultimately, confident assignment of function requires experimental verification as well as an understanding of the physiological role of an enzyme in the context of the metabolic network.
Genome sequences are now available for over 900 microbes and 6 multicellular organisms, providing genetic blueprints for organisms that differ enormously in morphology, physiology, and habitat. Unfortunately, our ability to interpret these blueprints is hampered by the lack of assigned function for one-third or more of the proteins in every organism. This commentary will focus specifically on the assignment of enzyme function. Automated assignment of enzyme function is notoriously difficult, as many enzymes with very low sequence identity catalyze the same reaction, and even enzymes that share 98% identity can have different substrate specificities.
Efforts to define the roles of enzymes of unknown function often begin with assignment to a superfamily based upon sequence analysis. Enzymes in a superfamily share a common ancestor. In some cases, the ancestral catalytic activity has been retained and divergence has resulted in different substrate specificities. In others, divergence has generated enzymes that catalyze mechanistically distinct reactions, although structural and mechanistic features of the ancestor are conserved.
Superfamily assignment provides clues to enzyme function by indicating the overall fold of the protein, the location of the active site, and the range of known functions found in superfamily members. Further clues can be provided by conserved sequence motifs. Superfamily members generally share conserved motifs that are important for structure or function or both. Families within a superfamily often have additional motifs and/or patterns of distinct residues within motifs that are involved in substrate specificity or family-specific catalytic functions (Figure 1) [2-6]. Our ability to capitalize on such clues is growing as structural and functional studies expand our knowledge of specific superfamilies. The enolase, amidohydrolase, and haloalkanoic acid dehalogenase superfamilies are the most thoroughly characterized at this point. However, even in these superfamilies, numerous enzymes fall into families for which there is no known function. Furthermore, some superfamilies do not have easily recognized signatures indicating family membership. The hotdog fold superfamily exhibits little or no conservation of catalytic residues and poorly defined substrate-binding pockets, hindering efforts to use sequence and structural information for the prediction of function.
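The two-level logic described above (a motif shared by the whole superfamily, plus additional family-specific motifs) can be illustrated with a minimal Python sketch. The motif patterns, family names, and test sequences below are invented purely for illustration and do not correspond to any real superfamily; real motif definitions would come from curated alignments.

```python
import re

# HYPOTHETICAL patterns, made up for this example only.
SUPERFAMILY_MOTIF = re.compile(r"K.K")       # assumed superfamily-wide signature
FAMILY_MOTIFS = {
    "family_1": re.compile(r"D.{2}E"),       # assumed family-specific signature
    "family_2": re.compile(r"H.{3}D"),       # assumed family-specific signature
}


def classify(sequence):
    """Return None if the superfamily motif is absent; otherwise return
    the list of families whose extra motif is also present.
    An empty list means 'superfamily member, family unrecognized'."""
    if not SUPERFAMILY_MOTIF.search(sequence):
        return None
    return [name for name, pattern in FAMILY_MOTIFS.items()
            if pattern.search(sequence)]


in_family = classify("MAKTKLLDAAEGG")    # has K.K and D..E
orphan = classify("MAKTKLLGGG")          # has K.K only
outsider = classify("MALLGG")            # has neither motif
```

An empty list is the interesting case here: it corresponds to an enzyme that clearly belongs to the superfamily but falls into no family of known function.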
Information about potential functions derived from superfamily affiliation can be exploited along with clues from genome context, phylogenetic conservation, and an understanding of microbial physiology to assign enzyme function. A few of many examples of the use of such information include the identification of function for o-succinylbenzoate synthase from Amycolatopsis sp., 2,6-dichlorohydroquinone dioxygenase from Sphingobium chlorophenolicum, N-formimino-l-glutamate deiminase from Pseudomonas aeruginosa, and d-galacturonate isomerase from Bacillus halodurans. However, in many cases, these clues are not enough. For example, protein Cg10062 from Corynebacterium glutamicum belongs to the tautomerase superfamily. The protein has six active site residues that are conserved in the superfamily and catalyzes three reactions typical of the superfamily at low rates, but its physiological role still cannot be identified.
Additional clues to enzyme function can be obtained by screening libraries of potential substrates for activity (e.g., [16-18]). An example is the identification of function of Bacillus cereus BC0371, which belongs to the muconate-lactonizing enzyme subgroup of the enolase superfamily. This enzyme clusters with the l-Ala-d/l-Glu epimerase family, but three residues typical of that family are missing, suggesting that BC0371 has a different function. The enzyme was incubated with a library of l,l-dipeptides, and epimerization was detected by incorporation of deuterium from the solvent into the substrate. Subsequent kinetic analysis of the dipeptides that proved to be substrates showed that values for kcat/KM were suspiciously low, at best 10³ M⁻¹ s⁻¹. Since N-acyl amino acid racemases are also found in the muconate-lactonizing enzyme subgroup, a second screening was carried out with a library of N-succinyl l-amino acids. N-succinyl-l-Arg was found to be the best substrate; kcat/KM was 1.4 × 10⁵ M⁻¹ s⁻¹, which is well within the range of values seen for physiologically relevant reactions.
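The gap between the two screening results is easier to appreciate as a ratio. The short Python sketch below compares the two kcat/KM values quoted above; the function and variable names are illustrative, and the dipeptide value is the upper bound reported for that library ("at best"), so the true difference may be larger.

```python
import math


def fold_difference(kcat_km_a, kcat_km_b):
    """Ratio of two catalytic efficiencies, both in M^-1 s^-1."""
    if kcat_km_a <= 0 or kcat_km_b <= 0:
        raise ValueError("catalytic efficiencies must be positive")
    return kcat_km_a / kcat_km_b


DIPEPTIDE_BEST = 1e3     # upper bound for the l,l-dipeptide hits
N_SUCCINYL_ARG = 1.4e5   # best substrate from the second library

fold = fold_difference(N_SUCCINYL_ARG, DIPEPTIDE_BEST)
orders_of_magnitude = math.log10(fold)
print(f"{fold:.0f}-fold, about {orders_of_magnitude:.1f} orders of magnitude")
```

N-succinyl-l-Arg is therefore turned over at least 140 times more efficiently than the best dipeptide, a difference of more than two orders of magnitude.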
Virtual screening has been widely used to identify potential enzyme inhibitors for drug discovery efforts by docking a set of ligands into an active site and predicting binding energies based upon van der Waals and electrostatic interactions and solvation effects. This approach has been adapted in recent years to predict substrates for enzymes of unknown function. The true substrates for enzymes are usually found among the high-scoring hits, though often not at the top of the list [20,21]. The correct substrates for 11 enolase superfamily members were found in the top 1% of 19,000 ligands. The substrates for Pseudomonas putida mandelate racemase ranked 77th and 140th when docked into a structure of the enzyme taken from an enzyme-inhibitor complex. Pinpointing the correct substrate is difficult for three reasons. First, docking algorithms predict binding affinity but not propensity for turnover, which requires correct positioning of the substrate with respect to catalytic groups. Second, the scoring function relies on approximations. Third, it can be difficult to account for conformational changes in the protein that must occur for ligand binding. The docking algorithm can be adapted to allow some flexibility in the protein, but this is not successful for large conformational changes. Thus, the primary value of virtual screening is in providing clues to structural characteristics of the substrate and thereby limiting the number of potential substrates that must be screened experimentally.
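To put these ranks in perspective, the following Python sketch converts a rank into a percentage of the library. It assumes, as the text implies but does not state outright, that the mandelate racemase ranks refer to the same 19,000-ligand library; the helper names are illustrative.

```python
def top_percent_cutoff(library_size, percent):
    """Number of ligands inside the top `percent` of the library
    (integer arithmetic avoids floating-point rounding surprises)."""
    return library_size * percent // 100


def rank_percentile(rank, library_size):
    """Rank expressed as a percentage of the library (smaller is better)."""
    if not 1 <= rank <= library_size:
        raise ValueError("rank must lie between 1 and library_size")
    return 100.0 * rank / library_size


LIBRARY_SIZE = 19_000                           # ligands docked
cutoff = top_percent_cutoff(LIBRARY_SIZE, 1)    # size of the top 1%

for rank in (77, 140):                          # mandelate racemase substrates
    pct = rank_percentile(rank, LIBRARY_SIZE)
    print(f"rank {rank}: top {pct:.2f}% (inside top 1%: {rank <= cutoff})")
```

Under that assumption, both substrates sit inside the top 1% (190 ligands), yet dozens of non-substrates still outrank them, which is why docking narrows the experimental search rather than replacing it.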
The results of virtual screening are often improved by use of a database containing high-energy intermediates, rather than ground-state substrates, because enzymes generally bind more tightly to transition states or high-energy intermediates than to substrates. To explore this approach, Hermann et al. generated 21,000 high-energy forms of 3770 potential substrates for amidohydrolase superfamily enzymes. The docking procedure evaluated up to 1 million poses for each molecule. High-energy intermediates corresponding to the known substrates for seven enzymes were found in the top 100 molecules. For five of the seven enzymes, screening using the ground-state metabolite dataset was considerably less successful. A subsequent study used this approach to predict the function of Thermotoga maritima Tm0936, which belongs to the amidohydrolase superfamily. High-energy intermediates based on 4207 potential substrates for amidohydrolase superfamily members were docked into the active site. Nine of the top ten hits were derivatives of adenine, strongly suggesting that the true substrate is some sort of adenine derivative. Only four adenine derivatives were tested experimentally. Two of these (5-methylthioadenosine and S-adenosylhomocysteine, which were ranked 5 and 6, respectively) were efficiently deaminated, with kcat/KM values greater than 10⁵ M⁻¹ s⁻¹.
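The bookkeeping behind such a screen can be sketched in a few lines of Python. This is not the published protocol: it assumes that each substrate is represented by its best-scoring high-energy form, that lower scores are better, and it uses invented ligand names, scores, and scaffold labels. It then tallies scaffolds among the top hits, mirroring the observation that nine of the top ten hits were adenine derivatives.

```python
from collections import Counter


def best_form_per_substrate(docked_forms):
    """Collapse (substrate, score) pairs, one per high-energy form,
    to the best (lowest) score per substrate, sorted best first."""
    best = {}
    for substrate, score in docked_forms:
        if substrate not in best or score < best[substrate]:
            best[substrate] = score
    return sorted(best.items(), key=lambda item: item[1])


def scaffold_counts(ranked, scaffold_of, top_n):
    """Count how often each scaffold appears among the top_n hits."""
    return Counter(scaffold_of[name] for name, _ in ranked[:top_n])


# HYPOTHETICAL data: names, scores, and scaffolds are made up.
docked_forms = [
    ("ligand_A", -41.2), ("ligand_A", -38.0),
    ("ligand_B", -39.5),
    ("ligand_C", -30.1), ("ligand_C", -44.0),
    ("ligand_D", -25.3),
]
scaffold_of = {"ligand_A": "adenine", "ligand_B": "adenine",
               "ligand_C": "adenine", "ligand_D": "other"}

ranked = best_form_per_substrate(docked_forms)
counts = scaffold_counts(ranked, scaffold_of, top_n=3)
```

A scaffold that dominates the top of the list, as adenine does in this toy data, suggests which small set of compounds is worth testing at the bench.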
A homology model can be used for virtual screening, although the docking algorithm should be modified to allow subtle rearrangements of side chains in the active site that may not be correctly predicted by the homology model. Song et al. generated a homology model for the Bacillus cereus protein BC0371, which had been discovered by experimental screening of potential substrates to be an N-succinyl-l-Arg racemase (discussed in the ‘Introduction and context’ section). When a set of 420 l,l-dipeptides and N-succinyl l-amino acids was docked, N-succinyl-l-Arg ranked 147th when the homology model was used but was the best hit when a flexible-receptor docking protocol was used. Thus, the flexible-receptor protocol allows the virtual screening approach to be extended to enzymes for which a structure is not available. A recent implementation of this strategy led to the prediction and experimental verification that T. maritima TM0006 is a dipeptide isomerase with an unusual substrate specificity, preferring dipeptides containing Phe, Tyr, or His in the C-terminal position, in contrast to the structurally characterized l-Ala-d/l-Glu epimerase from B. subtilis used as the template for the TM0006 homology model.
Identifying a substrate that is turned over efficiently in vitro is an important step toward understanding the function of an enzyme. Taking this understanding to the next level requires fitting an observed activity into the metabolism of the organism. This is straightforward when an enzyme serves a function in a core catabolic or anabolic pathway, particularly when deletion of the gene encoding the enzyme causes a noticeable phenotype. However, many enzymes for which functions are not known will not play such obvious roles. Some may be involved in unusual metabolic pathways that have not yet been recognized. Others may serve more subtle functions whose discovery requires substantial imagination and understanding of chemistry and metabolism. An interesting example is Escherichia coli YghZ, a member of the aldo-keto reductase superfamily. Overexpression of YghZ allows a strain lacking triose phosphate isomerase (TIM) to grow on lactose, although YghZ does not have TIM activity. A possible function suggested by the superfamily assignment and metabolic context was that YghZ might reduce l-glyceraldehyde 3-phosphate (l-GAP) produced by racemization of d-glyceraldehyde 3-phosphate (d-GAP) (Figure 2). Indeed, YghZ reduces l-GAP with a kcat/KM for l-GAP of 4.2 × 10⁵ M⁻¹ s⁻¹. The product, l-glycerol 3-phosphate, can be converted to dihydroxyacetone phosphate by glycerol 3-phosphate dehydrogenase. When TIM is present, YghZ may serve to detoxify l-GAP produced by non-enzymatic racemization of d-GAP.
A particular challenge is understanding the physiological roles of enzymes that have broad substrate specificity. For example, Haemophilus influenzae YciA hydrolyzes a wide range of CoA thioesters with high efficiency (values of kcat/KM for 14 substrates are greater than 10⁶ M⁻¹ s⁻¹). Similarly, E. coli NagD dephosphorylates UMP, GMP, CMP, AMP, and ribose 5′-phosphate with values of kcat/KM from 2 × 10³ to 3 × 10⁴ M⁻¹ s⁻¹. The relatively robust activities of these enzymes with multiple substrates suggest that broad specificity may be part of their function. Understanding the physiological roles of such broad-specificity enzymes will require a more sophisticated understanding of the conditions under which they are expressed and the dynamics and relative concentrations of potential substrates that occur as the metabolic network changes in response to environmental conditions.
Identification of function for enzymes of unknown function has necessarily been an endeavor focused upon individual enzymes, and this will continue to be the case since experimental verification is critical for correct annotation of function. However, genomic enzymologists should take advantage of clues emerging from new high-throughput proteomic methods. Techniques for immobilizing small molecules on microarrays will allow the identification of proteins in complex mixtures that bind to or transform specific molecules or both. Activity-based protein profiling methods identify proteins that react with affinity probes designed to modify active sites of specific classes of enzymes. Some of the hits obtained in such experiments will undoubtedly be enzymes of unknown function. Such information provides an additional source of clues for identification of function for enzymes for which superfamily analysis, genome context, and phylogenetic analysis have not been sufficient.
The electronic version of this article is the complete one and can be found at: http://F1000.com/Reports/Biology/content/1/91
The author declares that she has no competing interests.