A novel protein superfamily with over 600 members was discovered by iterative profile searches and analyzed with powerful bioinformatics and information visualization methods. Evidence exists that these proteins generate a radical species by reductive cleavage of S-adenosylmethionine (SAM) through an unusual Fe-S center. The superfamily (named here Radical SAM) provides evidence that radical-based catalysis is important in a number of previously well-studied but unresolved biochemical pathways and reflects an ancient conserved mechanistic approach to difficult chemistries. Radical SAM proteins catalyze diverse reactions, including unusual methylations, isomerization, sulfur insertion, ring formation, anaerobic oxidation and protein radical formation. They function in DNA precursor, vitamin, cofactor, antibiotic and herbicide biosynthesis and in biodegradation pathways. One eukaryotic member is interferon-inducible and is considered a candidate drug target for osteoporosis; another is observed to bind the neuronal Cdk5 activator protein. Five defining members not previously recognized as homologs are lysine 2,3-aminomutase, biotin synthase, lipoic acid synthase and the activating enzymes for pyruvate formate-lyase and anaerobic ribonucleotide reductase. Two functional predictions for unknown proteins are made based on integrating other data types such as motif, domain, operon and biochemical pathway into an organized view of similarity relationships.
Sophisticated iterative profile methods have dramatically extended the power of sequence homology searches (1–3). These tools are useful for creating a larger context for database search results. Whereas a strong match in a BLAST search can be used to infer similar function, the weaker similarity detectable by an iterative profile method illuminates a more distant relationship and is evidence of a conserved fold in the protein structure (2). An anonymous sequence without significant pairwise similarity can often be linked in this way with proteins that have been characterized experimentally (4).
Iterative profile searches are easy to perform but can be difficult to interpret because the data sets returned are large. A query is linked to numerous sequences, each with multiple links to other data sources, creating a large information landscape that can be hard to navigate. As a result, when performing iterative profile searches on the most interesting and novel sequences, a scientist is likely to be overwhelmed with data presented simply as long linear lists, a sharply limited view of information that is inherently multi-dimensional.
We have applied powerful bioinformatics and information visualization techniques to overcome these obstacles in the analysis of an important new protein superfamily that we discovered using iterative profile searching. We call this new superfamily Radical S-adenosylmethionine (SAM) after the defining characteristics of its best-studied members. Radical SAM is an ancient and diverged group with 645 unique sequences from 126 species found to date from all three domains of life. At least half the proteins are of unknown activity. We use exploratory statistical methods to analyze the sequence similarity relationships and integrate these results with other data types (motif, domain, operon structure, biochemical pathway and the biomedical literature) for discovery efforts on previously uncharacterized sequences. Our results are part of a larger effort to scale up biological knowledge production using four accelerating factors: (i) information visualization; (ii) large computational resources; (iii) new mathematical strategies; (iv) collaborative problem solving environment technology.
The reader can directly observe evidence for distant sequence similarity in the Radical SAM superfamily using the Web version of PSI-BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) at the National Center for Biotechnology Information (NCBI, Bethesda, MD). For example, enter any gi identifier from Figure 1 (such as 128228 for the Azotobacter vinelandii NifB protein) and iterate to convergence with the default threshold. In this work PSI-BLAST (2) searches were performed locally using command line searches against the non-redundant protein database downloaded from NCBI onto a Sun Ultra 60 workstation. Search results were analyzed and tested for the closure property with standard Unix tools. The 54 sequences directly tested include 30 proteins associated with at least some biochemical information as well as others chosen to represent the most distant members. False negatives for any individual search were measured against the set of unique and complete sequences already accepted as belonging to the superfamily from multiple other searches. False positives were measured against a list that also included redundant sequences and fragments. False positives ranged from 0 to 12%, with a median value of 0.2%. False negatives ranged from 0.7 to 16%, with a median value of 3%. There were seven sequences that required either an increased or decreased threshold [from the default Expect (E) value of 0.001] for convergence to occur. (Searches that fortuitously include significant numbers of unrelated sequences in a profile do not converge but rather ‘explode’ and can pull in the entire database.) With the 15 N-terminal/motif deletion sequences the median rate of false positives was unchanged, but false negatives increased to 5%. PROBE (3) is a powerful iterative profile search tool that was used in this case as a convenient method for extracting alignment blocks from the defined set of highly diverged Radical SAM proteins. These blocks were edited to show the strongest conserved regions.
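The false positive and false negative rates described above can be illustrated with a short Python sketch. It is not the authors' procedure, which used standard Unix tools. The function name, the argument names and the choice of denominators (hits for false positives, accepted unique sequences for false negatives) are assumptions made for illustration.

```python
def search_error_rates(hits, accepted_unique, accepted_all):
    """Return (false_positive_rate, false_negative_rate) for one converged search.

    hits            -- identifiers returned by the search
    accepted_unique -- unique, complete sequences already accepted as members
    accepted_all    -- accepted members plus redundant sequences and fragments
    """
    hits = set(hits)
    accepted_unique = set(accepted_unique)
    accepted_all = set(accepted_all)

    # A false positive is a hit that appears nowhere in the broad accepted list.
    false_positives = hits - accepted_all
    # A false negative is an accepted unique sequence that the search missed.
    false_negatives = accepted_unique - hits

    # Assumed denominators: see the lead-in paragraph.
    fp_rate = len(false_positives) / len(hits) if hits else 0.0
    fn_rate = len(false_negatives) / len(accepted_unique) if accepted_unique else 0.0
    return fp_rate, fn_rate
```

Running this for each of the tested query sequences and taking the median of each rate would give summary figures of the kind reported here.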
BLAST E scores are reported in the computer style of standard scientific notation (e.g. 3e-20 represents 3 × 10⁻²⁰).
Standard Unix tools, S-PLUS (MathSoft, Cambridge, MA), the OmniViz Pro software package (OmniViz, Richland, WA) and custom Perl programs were used for superfamily analysis. At the time the analysis was initiated there were 533 unique and complete Radical SAM proteins in the database. The conserved core domains (estimated at ~200 residues and starting at the conserved cysteine motif) were extracted from the Radical SAM sequences using an S-PLUS script. A Perl program was used to perform a complete BLAST comparison of the core domains to produce a matrix of BLAST E values with a high score cut-off of 1000 and then to transform the matrix (lowest score of 0 set to 1e-200, all missing scores to 10 000 and take logn). The transformed matrix was then imported into OmniViz Pro for hierarchical clustering (complete linkage with Euclidean distance). Data files produced by OmniViz Pro were imported into S-PLUS to produce a preliminary dendrogram representation. Cluster membership in the dendrogram was analyzed by making a Galaxy visualization in OmniViz Pro at each level of interest for the purpose of capturing the cluster membership list. These lists were examined individually and were also combined into a spreadsheet for a convenient view of the data. Inclusion of the location of the cysteine motif and the size of the proteins in the spreadsheet allowed patterns to be detected in the size of N- and C-terminal domains. The dendrogram visualization was created with an S-PLUS program that generated the colored blocks from the hierarchical clustering results and Adobe Illustrator (San Jose, CA).
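The matrix transformation described above can be sketched in Python as follows. The original work used a Perl program that is not reproduced here. The function name and the input layout (a dictionary keyed by query and subject identifiers) are hypothetical, and the use of the natural logarithm is an assumption about the base. The BLAST reporting cut-off of 1000 is not modeled; pairs without a reported hit are simply absent from the input.

```python
import math

E_FLOOR = 1e-200      # value substituted for a reported E value of 0
E_MISSING = 10_000.0  # value substituted when no hit was reported for a pair


def transform_e_matrix(ids, e_values):
    """Build a log-transformed all-against-all E value matrix.

    ids      -- ordered list of sequence identifiers
    e_values -- dict mapping (query_id, subject_id) to a BLAST E value
    Returns a list of rows, one per query, in the order given by ids.
    """
    matrix = []
    for query in ids:
        row = []
        for subject in ids:
            e = e_values.get((query, subject))
            if e is None:
                e = E_MISSING          # missing comparison
            elif e <= 0.0:
                e = E_FLOOR            # avoid log(0)
            row.append(math.log(e))    # natural log (assumed base)
        matrix.append(row)
    return matrix
```

Each row of the result is the feature vector for one core domain and can be passed to a clustering routine.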
The NCBI Web Entrez interface was used for access to MEDLINE and large sets of abstracts were downloaded using Network Entrez for analysis with the SPIRE technology (http://multimedia.pnl.gov:2080/infoviz/technologies.html), which provides an interactive topic map based on word frequency analysis.
A small collection of proteins with diverse functions has been noted to share an unusual Fe-S cluster associated with generation of a free radical by reductive cleavage of SAM. This group consists of lysine 2,3-aminomutase (LAM), biotin synthase (BioB), lipoic acid synthase (LipA) and the activating proteins for pyruvate formate-lyase (PflA) and anaerobic ribonucleotide reductase (NrdG). These ‘deoxyadenosyl radical’ enzymes have been the focus of detailed experimental work, including UV-Vis, EPR, Mössbauer, resonance Raman, variable temperature magnetic circular dichroism and mutagenesis experiments (5–12). SAM has been described as equivalent to a ‘poor man’s coenzyme B12’ in the reaction catalyzed by LAM (13). Very recently, K edge X-ray absorption spectroscopy experiments have provided important mechanistic evidence for the direct role of the unusual Fe-S cluster in LAM in the reductive cleavage of SAM (14–16).
Despite the attention they have received, the deoxyadenosyl radical proteins have not been previously recognized as homologous sequences, although a characteristic cysteine motif has been noted (7). We applied sensitive bioinformatics methods that detected distant sequence similarity between these five protein groups. This observation is evidence for a shared ancestor and supports the prediction of a common fold for the core domain. Our results also link these enzymes to a larger collection of known and unknown functions, a list that includes proteins found at unresolved steps in familiar biosynthetic pathways, such as thiamin, heme, heme d1, bacteriochlorophyll, molybdopterin, nitrogenase cofactor, pyrroloquinoline quinone, desosamine and others in secondary metabolism.
We detected distant sequence conservation between the Radical SAM proteins with PSI-BLAST iterative profile searching (2). We observed that these proteins form a closed set with the following property. Each sequence detects the same hit list within a small margin of error after iteration to convergence with a conservative threshold (for details see Materials and Methods). Proteins classified as belonging to the superfamily were either directly tested for this closure property (54 sequences) or shown to be strongly similar to one that was. All of the 645 unique and complete sequences collected in this manner were observed to contain an unusual conserved cysteine motif, most often near the N-terminus or in some longer sequences in the middle. These include 592 proteins with an exact match to the consensus CxxxCxxC and 53 variants with a small increased distance between the first two cysteine residues.
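A check for the conserved cysteine motif can be sketched with regular expressions in Python. The text does not state how much extra spacing the 53 variants contain, so the limit MAX_EXTRA below is a hypothetical parameter, as are the function and constant names.

```python
import re

# Exact consensus: C, any three residues, C, any two residues, C.
CONSENSUS = re.compile(r"C.{3}C.{2}C")

# Hypothetical limit on extra residues between the first two cysteines.
MAX_EXTRA = 3
VARIANT = re.compile(r"C.{4,%d}C.{2}C" % (3 + MAX_EXTRA))


def classify_motif(sequence):
    """Return (label, start) where label is 'consensus', 'variant' or 'none'.

    start is the 0-based position of the first cysteine, or -1 if no motif
    is found. The exact consensus is preferred when both patterns match.
    """
    match = CONSENSUS.search(sequence)
    if match:
        return "consensus", match.start()
    match = VARIANT.search(sequence)
    if match:
        return "variant", match.start()
    return "none", -1
```

The returned start position also gives the length of any N-terminal extension preceding the motif, which is useful when comparing domain sizes across the superfamily.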
We also tested 15 diverse Radical SAM proteins after removal of the N-terminus including the cysteine motif. This deletion had the effect of reducing the sensitivity of the searches, but not the specificity. Interestingly, the oxygen-sensing regulatory protein FNR has been described as containing an Fe-S cluster similar to those found in the deoxyadenosyl radical proteins both in the cysteine motif and in a reversible transition from [2Fe-2S]²⁺ to [4Fe-4S]²⁺ controlled by the presence of oxygen (16,17). However, FNR proteins were never detected in any of the Radical SAM searches. Therefore, the presence of the cysteine motif is neither necessary nor sufficient for inclusion in the superfamily by PSI-BLAST detection of distant sequence similarity: members are still detected after the motif is deleted, and FNR is not detected even though it carries a similar motif.
We used the PROBE (3) software against the diverged set of Radical SAM sequences to extract alignment blocks and show the strongest sections of these in Figure 1. The cysteine motif in the first block has the highest information density (in units of bits). A conserved aromatic residue (Y, F or W) adjacent to the third cysteine may function to lower the midpoint potential of the cluster by limiting solvent exposure (16). The second block contains a glycine-rich sequence resembling the SAM-binding site in methyltransferases (18) and could play a role in binding this molecule for reductive cleavage.
Protein sequences evolve more quickly than the corresponding three-dimensional structures and, as our results illustrate, proteins with a common fold may only show faint sequence conservation that approaches the limit of detection. However, these patterns can be extracted with sensitive bioinformatics approaches and the information they contain has quantitative value, as exemplified by recent successes in ‘threading’ protein fold prediction programs that include PSI-BLAST results as a term in the calculations (19).
The Radical SAM classification places 645 proteins into a single conceptual box but does not illuminate any details of how the members are organized. Although a phylogenetic tree is a useful way to analyze a sequence family, it is difficult to create a multiple alignment for this purpose with highly diverged proteins (20). We applied clustering, a well-known approach in exploratory statistics for extracting groups, to characterize the sequence similarity relationships between the Radical SAM core domains and generate a dendrogram (21). In this approach we used a tree representation not to represent phylogeny but rather to display sequence similarity relationships between superfamily homologs. We first generated a feature matrix based on complete BLAST comparisons between the conserved core domain in each Radical SAM member. We then used hierarchical clustering (complete linkage with Euclidean distance) on the BLAST E score feature matrix to organize the sequences and produce the dendrogram. This algorithm results in a hierarchical tree with the property that variance within each cluster is minimized. We found this preliminary dendrogram to be useful in many ways, for example in identifying misnamed sequences, classifying unknown sequences and in supporting the definition of unique features that characterize sub-groups. However, navigating the raw dendrogram with its associated lists of groups was a difficult and rate-limiting step in the analysis. We therefore created a visual prototype for an automated and interactive solution.
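Complete-linkage hierarchical clustering with Euclidean distance, as named above, can be illustrated with a small pure-Python sketch. The original analysis used OmniViz Pro and S-PLUS; this naive version is only meant to show the technique on a small feature matrix, and all function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature rows of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(rows):
    """Agglomerative clustering of feature rows.

    Returns the merges in order as (members_a, members_b, distance),
    where members are tuples of row indices.
    """
    n = len(rows)
    dist = [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # pairwise distance between their members.
                d = max(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = tuple(sorted(clusters[x] + clusters[y]))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


def clusters_at_level(n, merges, k):
    """Cluster membership when the tree is cut to leave k clusters."""
    clusters = {(i,) for i in range(n)}
    for a, b, _ in merges[: n - k]:
        clusters -= {a, b}
        clusters.add(tuple(sorted(a + b)))
    return sorted(clusters)
```

Cutting the tree with clusters_at_level corresponds to inspecting the dendrogram "at the level of k clusters". The naive search over cluster pairs is slow for hundreds of sequences, so a library implementation would be preferable at the scale of the full superfamily.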
Visual display of information is considered a ‘broad bandwidth’ pathway to the human brain. Powerful visual problem solving approaches have been applied to the analysis of complex hierarchical data (22–28). We used aspects of this research to create a new visual representation for our data. We applied a measure of cluster cohesion that we have named the maximum BLAST E (mBE) score to the clusters in the dendrogram as the basis for color coding groups of closely related proteins that may share a common function (Fig. 2). This mBE cluster cohesion metric is defined as the maximum of the BLAST E values in the subset of the original feature matrix defined by the proteins in the group under consideration. Therefore, our measure of the ‘tightness’ of any cluster is directly based on the largest of all the BLAST comparisons for the proteins in a group. Strong relationships in the dendrogram are depicted with colored blocks that appear when the mBE value is <1. Further divisions within groups are represented by the color scheme, with cool colors representing looser groups and warm colors tighter ones. This visual encoding essentially creates a level of abstraction on distracting details and facilitates interpretation of the results.
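The mBE metric defined above reduces to a maximum over a submatrix, as the following Python sketch shows. It assumes the matrix holds untransformed E values (with missing comparisons already set to a large value), since the mBE scores are quoted as E values. The function names and the handling of both query and subject directions are assumptions.

```python
def mbe_score(members, e_matrix):
    """Maximum BLAST E value among all comparisons within a group.

    members  -- row/column indices of the proteins in the cluster
    e_matrix -- square matrix (list of lists) of raw BLAST E values
    Both directions (i vs j and j vs i) are considered, because BLAST
    E values are not necessarily symmetric.
    """
    return max(e_matrix[i][j] for i in members for j in members)


def is_colored(members, e_matrix, threshold=1.0):
    """True when the cluster is tight enough to receive a colored block."""
    return mbe_score(members, e_matrix) < threshold
```

A lower mBE value indicates a tighter group, so a mapping from mBE to a cool-to-warm color scale could be layered on top of mbe_score.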
For example, at the level of eight clusters in the dendrogram a group of 40 HemN-related sequences appears with an mBE value of 0.028 (Fig. 2 and Table 1). This group, shown in blue (a cool color reflecting lower cohesion), divides again at 37 clusters, producing 19 HemN sequences (mBE 3e-24, yellow) and 21 HemN-like sequences (mBE 1e-8, green). Examining the sequences, we observed that the HemN proteins all contain an extra cysteine at the end of the conserved motif (CxxxCxxCxC) but the HemN-like proteins, such as in Bacillus subtilis (29), and the PhuW, HutW and ShuW virulence proteins (30) do not. Thus, our visual algorithm presents these two related but distinct groups of proteins in an intuitive fashion and facilitates integration of this motif information into the analysis.
Interactive visualization strategies are beginning to be part of the analyst’s basic toolkit in working with large-scale information. We rely on the SPIRE technology (31) to explore large sets of biomedical records through interactive topographical maps based on word frequency statistics (http://multimedia.pnl.gov:2080/infoviz/technologies.html). In a similar way, we envision an automated and interactive version of our dendrogram visualization as a data mining tool that supports biological problem solving by creating a map of superfamily sequences and providing a framework for the integration of diverse data types in the analysis of unknown proteins.
We explored the organized view of the Radical SAM proteins to find 30 distinct groups associated with at least some biochemical data (Table 1). Interestingly, the most distantly related clusters (diverging first in the hierarchical tree) seem to share an involvement with sulfur transfer; these include the NifB, MiaB, BioB and LipA proteins (7–9,32–35). A mechanism for sulfur transfer from the Fe-S cluster in BioB has been proposed (32). Like biotin synthase, the NifB proteins act as reagents and not catalysts in existing in vitro assays (7,33). A group of 53 sequences, including MiaB (appearing at the level of three clusters with an mBE value of 0.001), also contains a novel human Cdk5 activator-binding protein that binds the neuronal Cdc2-like kinase (Nclk) involved in regulation of neuronal differentiation and neuro-cytoskeletal dynamics (36).
Other Radical SAM proteins, such as ThiH of thiamin and MoaA of molybdopterin biosynthesis, are found in pathways with sulfur transfer, but most likely do not act in this role directly. Interestingly, however, the MoaA proteins contain the residues GG at the C-terminus, a motif that is adenylated for sulfur transfer in the MoaD proteins, as in ubiquitin (37). In thiazole biosynthesis, sulfur is mobilized from cysteine in a manner similar to the molybdopterin pathway, with adenylation/thiocarboxylate formation at a C-terminal GG motif in ThiS (32).
Radical SAM proteins often provide an anaerobic or oxygen-independent mechanism that is found as an aerobic reaction in other proteins, for example HemN (38), BchE (39) and, possibly, ThiH (32). The HemN protein catalyzes an oxygen-independent oxidation in anaerobic heme biosynthesis and has been shown to require NADH and either SAM or ATP and methionine for in vitro activity (40), which Thauer is reported to have explained with the hypothesis of deoxyadenosyl radical chemistry (38). In heme d1 biosynthesis the anaerobic production of oxo groups at positions C3 and C8 is of special interest as a possible role for the NirJ protein (41). The oxidation of serine or cysteine to formylglycine is catalyzed by sulfatase activation proteins similar to AslB; the activation of arylsulfatase has been observed under both aerobic and anaerobic growth conditions (42,43).
Radical SAM proteins are associated with several ring-forming reactions. ThiH functions in thiazole ring formation from tyrosine, cysteine and 1-deoxy-d-xylulose-5-phosphate (32), PqqE in cyclization of the tyrosine amino acid backbone with glutamate addition to form the cofactor PQQ (44) and BchE in isocyclic ring formation in bacteriochlorophyll (39). The MitD protein in the mitomycin C gene cluster may catalyze mitosane ring formation from the condensation of 3-amino-5-hydroxybenzoic acid, d-glucosamine and carbamoyl phosphate (45).
Three eukaryotic interferon-inducible members are found, including a rat gene (best5) expressed during osteoblast differentiation and bone formation and a candidate drug target for osteoporosis (46), a human gene (cig5) induced during cytomegalovirus infection (47) and a trout gene (vig1) induced during rhabdovirus infection (48).
Many examples from secondary metabolism pathways, such as antibiotic and herbicide biosynthesis, are found, including spectinomycin (49), subtilosin (50), nikkomycin (51), mitomycin C (45), oxetanocin (52), fortimicin, fosfomycin and bialaphos biosynthesis (53,54) and the desosamine moiety of erythromycin (55), oleandomycin (56), methymycin, neomethymycin, narbomycin and pikromycin (57). Biodegradation is represented by BssD in toluene catabolism (58) and DNA repair by spore photoproduct lyase (59).
Problem solving in the genomics era increasingly depends upon traversing complex data landscapes with computational and visualization approaches. We present two examples of functional prediction for unknown proteins in the Radical SAM superfamily based on a multi-dimensional approach to data mining. These analyses were performed in a semi-manual fashion as a preliminary effort in the large-scale automation of the approach.
Although a dendrogram for the Radical SAM core domains is a useful tool for classification and annotation, it essentially provides only a one-dimensional analysis of the superfamily proteins based on the single data type of sequence similarity. Features such as motif, domain, operon, biosynthetic pathway, chemical structure and properties described in the biochemical literature can all provide important functional clues to the biologist when viewed in an organized context. We use the similarity dendrogram as a framework for the integration of multiple data types for the purpose of gaining leverage in the functional prediction of unknown proteins.
Example 1. Many Radical SAM sequences have independent N- or C-terminal domains (Fig. 1). We correlated the sizes of these independent domains with cluster membership in the dendrogram. For example, at the level of 41 clusters in the dendrogram (Fig. 2) a group of 26 proteins appears that shows poor cohesion even after multiple divisions, a feature that often suggests divergent functions. However, 25 of these proteins possess a long N-terminal extension of ~200 residues. PSI-BLAST searches show that these N-terminal sequences have distant sequence similarity (Fig. 3). The proteins in this group, linked by both cluster membership and a shared N-terminal domain, include the fortimicin (FMT) and fosfomycin (PAMT) methyltransferases (53,54), OxsB (52) (oxetanocin) and MmcD (45) (mitomycin C). Also sharing the N-terminal domain but located in different clusters are the BchE proteins (38,39) and the bialaphos P-methylase (PMT) (53,54).
With further PSI-BLAST iterations of the N-terminal domain (with BchE for example) proteins outside the Radical SAM superfamily are detected, which share the property of binding cobalamin (60; Fig. 3). This result is interesting in the light of experimental evidence that the fortimicin, fosfomycin and bialaphos methyltransferases (53,54) and, recently, BchE (61) utilize a cobalamin cofactor. Methylation reactions commonly occur through electrophilic attack of a methyl cation. However, the fortimicin, fosfomycin and bialaphos biosynthesis proteins each transfer a methyl group to an electrophilic site (54). Taken together, these data (dendrogram, domain and biochemical) reinforce each other and suggest that the candidate methyltransferase proteins found at cluster level 41 in our analysis are likely to share an unusual chemical mechanism even as they have diverged in sequence as a result of acting on distinct substrates and pathways.
Example 2. Operon data can be powerfully integrated with clustering results and biochemical data in a similar way. Little is known about the Radical SAM member ExsD (62) except that proteins in this operon impact on succinoglycan biosynthesis. We used the neighboring ExsC protein to search an operon database (made by extracting protein links from Radical SAM nucleotide records) as a means of finding other superfamily members with this linkage. ExsC homologs are found adjacent to nine Radical SAM proteins, located in two clusters (Fig. 2), one containing ExsD and another with a Pyrococcus furiosus protein (63). Therefore, it appears likely that the location of ExsD next to ExsC is not fortuitous. These results suggest an interesting hypothesis. ExsC is strongly related to 6-pyruvoyl tetrahydrobiopterin synthase, which catalyzes the second step in tetrahydrobiopterin biosynthesis (64). The first step in tetrahydrobiopterin biosynthesis is the production of a pterin ring by GTP cyclohydrolase I. Interestingly, the MoaA proteins provide a unique mechanism for production of a pterin ring from GTP in molybdopterin biosynthesis (65). Therefore, by analogy, the ExsD Radical SAM protein and its neighbor in the operon ExsC could catalyze the first two steps of an unusual pterin synthesis pathway.
With over 600 unique sequences, 30 known functions and many additional unknowns, the existing biochemical and genetic data on the Radical SAM proteins easily represent over 1 000 000 person-hours of experimental work in the laboratory. With identification of the superfamily this knowledge base becomes a resource supporting the laboratory efforts of a newly defined community of experimental scientists. All the Radical SAM proteins can now be evaluated for radical chemistry as well as other properties. The usefulness of the classification is illustrated by experiments performed by Nicholson and co-workers based on the observation that spore photoproduct lyase contains the characteristic cysteine motif of the deoxyadenosyl radical proteins (59). They modified an assay for anaerobic ribonucleotide reductase and successfully measured spore photoproduct lyase activity in vitro for the first time.
Radical SAM represents a mechanistic solution for the catalysis of difficult chemical reactions. Robert H. Abeles, who uncovered many unusual enzymatic reactions, is reported to have said ‘if you can formulate on paper a mechanism in two-electron steps, then there is no radical involved’, and this comment is still a practical one (66). However, the many two-electron mechanisms proposed for proteins in this new superfamily can now be seen as too conservative and can be reasonably made more radical.
We thank Gus J. Calapristi and Jim J. Thomas for helpful discussions and Wanda F. Mar and Leigh K. Williams for technical assistance. This work was performed at the Pacific Northwest National Laboratory (PNNL), which is operated by Battelle for the US Department of Energy. Funding support was provided by PNNL Laboratory Directed Research and Development and by Battelle.