Starting from a seed set of helicase-like region sequences from 28 demonstrated Snf2p-related proteins or close homologues, we have carried out a broad survey of Snf2 family proteins. This was achieved by iterative cycles of manual curation of multiple alignments and neighbour-joining trees to identify Snf2 proteins by similarity, construction of an HMM profile from the multiple alignments of identified proteins, and scanning of global and model organism protein databases using the HMM profile to uncover further sequences for curation. Our current global Snf2 family profile scan revealed 3932 sequences with E-value under 1 (1879 with bitscore > 0) in Uniref100 release 5 [2.4 million entries (26)]. Of these, 1306 sequences were identified as belonging to the Snf2 family and to span the full helicase region from motifs I to VI without introducing large unique insertions or deletions. A further 220 sequences fall within the rapA group, while other hits appear to belong to more distantly related groups (see below) or were too highly truncated to be aligned. Neighbour-joining trees from multiple alignments of the set of 1306 sequences revealed a well-defined branching structure () and enabled their assignment to 24 distinct subfamilies ().
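The two reporting thresholds quoted for the profile scan (E-value under 1, and bitscore > 0) can be illustrated with a minimal Python sketch. The `Hit` record, the `filter_hits` helper and the example hits are hypothetical and are not taken from the survey; the sketch only shows how the two cut-offs would select nested sets of hits before manual curation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One profile-scan hit (hypothetical record layout)."""
    seq_id: str
    evalue: float
    bitscore: float


def filter_hits(hits, max_evalue=1.0, min_bitscore=None):
    """Keep hits with E-value strictly under max_evalue and, if
    min_bitscore is given, bitscore strictly above min_bitscore."""
    kept = []
    for hit in hits:
        if hit.evalue >= max_evalue:
            continue
        if min_bitscore is not None and hit.bitscore <= min_bitscore:
            continue
        kept.append(hit)
    return kept


# Invented example hits, for illustration only.
hits = [
    Hit("seqA", 1e-50, 310.2),
    Hit("seqB", 0.3, -4.1),
    Hit("seqC", 2.5, -20.0),
]

loose = filter_hits(hits)                     # E-value under 1
strict = filter_hits(hits, min_bitscore=0.0)  # additionally bitscore > 0
```

In this toy input the stricter set is a subset of the looser one, mirroring the 1879 of 3932 sequences reported above. Assignment to the Snf2 family itself still depended on manual curation of alignments and trees, which this sketch does not attempt.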
Subfamily-specific HMM profiles were constructed from these assignments and used to characterize the Snf2 family complement for 54 complete eukaryote genomes. The counts of predicted proteins and unique encoding genes for 21 selected genomes are listed in , part A and B, respectively (see Supplementary Table S1A for full analysis of eukaryotic genomes, and Supplementary Table S2 for gene IDs by subfamily for seven common model organisms). In addition, 24 complete archaeal and 269 bacterial genomes were scanned (Supplementary Tables S1B and S1C).
Subfamily occurrences in selected complete eukaryotic genomes
Subfamilies within the Snf2 family
The clear distinction and significant number of subfamilies based on the helicase-like region ( and ) reflects both a remarkable breadth and specificity in the Snf2 family. An additional level of similarity distinguishes apparent groupings of subfamilies (), which echo current understanding of their functional diversity (). Most of the best studied Snf2 family proteins fall into a grouping of ‘Snf2-like’ subfamilies including proteins such as S.cerevisiae Snf2p, D.melanogaster Iswi, mouse Chd1 and human Mi-2, which are core subunits of the well-known ATP-dependent chromatin remodelling complexes. A separate ‘Swr1-like’ grouping encompasses the Swr1, Ino80, EP400 and Etl1 subfamilies. The ‘Rad54-like’ grouping contains the Rad54 subfamily, relatives such as ATRX and Arip4, and also includes the recently recognized DRD1 and JBP2 proteins. A further, unexpected, ‘Rad5/16-like’ grouping links several poorly studied subfamilies, three of which contain RING finger insertions within the helicase-like region (see below). The ‘SSO1653-like’ grouping of Mot1, ERCC6 and SSO1653 is notable because all three subfamilies are thought to have non-chromatin substrates. Finally, we have labelled SMARCAL1 proteins as ‘distant’ because they lack several otherwise conserved sequence hallmarks of the Snf2 family (see below). Although some groupings are clear, further investigations will be required to verify those where the boundaries are less distinct.
Functional and sequence characteristics of subfamilies
Since the subfamily assignments are based only on the common helicase-like region, this suggests that the ‘motor’ at the core of even large multiprotein remodeller complexes is tuned to the mechanistic requirements of its function. Such properties are not unprecedented for motor protein subfamilies. The ubiquitous kinesin and myosin proteins are divided into at least 14 and 17 subfamilies, respectively (39), and those subfamilies are recognized to reflect tuning of the motors for enzymatic properties linked to particular functional roles. As this also appears to be true for Snf2 family proteins we can anticipate that mechanistic features of the motors will be shared within subfamilies and groupings. This may be useful in helping to predict function of poorly characterized proteins. For example, owing to the recent observation that Swr1 functions in histone exchange (41), it is tempting to speculate that the Snf2 motors within other subfamilies in the Swr1-like grouping may be adapted for related purposes.
Owing to the remarkable diversity revealed by this classification and the occurrence of many subfamilies which have not been intensively investigated, we briefly summarize current functional and biochemical understanding and characteristic features of each subfamily in .
Defining the Snf2 family
The survey of Snf2 family proteins enables detailed analysis of sequence conservation in the helicase-like region (). This reveals a number of unique features distinguishing them from other helicase superfamily SF2 members. First, the conserved helicase motifs show a highly conserved character across the Snf2 family, and some motifs are extended by juxtaposed residues such as conserved blocks E and G ( and Supplementary Figure S4). Second, the helicase-like region in the Snf2 family is significantly longer than for many other helicases, primarily due to an increased spacing between motifs III and IV of >160 residues compared to 38 and 78 for typical SF2 helicases NS3 and RecG, respectively (44). Third, a number of unique conserved blocks are found in Snf2 family proteins ( and Supplementary Table S5). Several of these blocks have been noted previously (20), with conserved block B having been confused in a number of early manuscripts with motif IV. Conserved blocks B, C and K are of particular interest because they are located within the characteristic extended inter-motif III–IV region ().
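The motif III–IV spacing criterion can be written as a small check. This is a sketch under stated assumptions: the function names and the residue coordinates are hypothetical, and spacing is taken here to mean the number of residues lying strictly between the end of motif III and the start of motif IV, a definition the text does not spell out. Only the values 160, 38 and 78 come from the paragraph above.

```python
# Reference values quoted in the text (residues between motifs III and IV).
SNF2_MIN_SPACING = 160
TYPICAL_SF2_SPACING = {"NS3": 38, "RecG": 78}


def motif_spacing(motif_iii_end, motif_iv_start):
    """Residues strictly between motif III and motif IV (1-based positions).

    Assumed definition; the source does not state how spacing was counted.
    """
    if motif_iv_start <= motif_iii_end:
        raise ValueError("motif IV must start after motif III ends")
    return motif_iv_start - motif_iii_end - 1


def has_extended_spacing(spacing, threshold=SNF2_MIN_SPACING):
    """True when the spacing exceeds the >160 residue figure quoted above."""
    return spacing > threshold


# Hypothetical coordinates for a Snf2-like sequence.
example_spacing = motif_spacing(motif_iii_end=300, motif_iv_start=470)
```

An extended spacing alone is not sufficient for family membership: as discussed below, SMARCAL1, rapA, NPH-I and MPH1/Hef proteins also show extended spacing while lacking some or all of the conserved blocks.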
Figure 2 Conserved residues within Snf2 helicase-like region. Sequence logo of global multiple alignment of 1306 Snf2 helicase-like regions for alignment positions with residues in >90% of proteins. Helicase motifs are indicated in solid black boxes with …
Figure 3 Conserved blocks contribute to distinctive structural features of Snf2 family proteins. Structural components of Snf2 family proteins relevant to the conservation are illustrated on the zebrafish Rad54A structure [pdb 1Z3I (153)]. (A) core recA-like domains …
The SMARCAL1 subfamily contains classical helicase motifs which are highly similar to the other subfamilies. It also has an extended motif III–IV spacing, but it nevertheless lacks conserved blocks within the motif III–IV region (Supplementary Figure S4). The rapA group has similar properties but is more diverse in overall sequence and retains less similarity in the classical motifs. It is unclear whether the SMARCAL1 subfamily and particularly the rapA group will maintain the structural features of the Snf2 family and they are therefore at the limit of the definition of the Snf2 family. We have also noticed further protein groupings with extended spacing between motifs III and IV and detectable similarity to the classical helicase-like motifs of the Snf2 family sequences (Supplementary Figure S4). These include poxvirus NPH-I related proteins involved in transcription termination (49) and the FANCM/MPH1/Hef group of helicases encompassing yeast Mph1p, archaeal Hef and human FANCM proteins involved in DNA repair (50). However, those proteins show low similarity to the Snf2 family between motifs III and IV and appear to lack the characteristic conserved blocks C, J and K of the Snf2 family. Interestingly, comparison of the recently determined Pyrococcus furiosus archaeal Hef helicase structure reveals that the MPH1/Hef group has a related structural organization to zebrafish Rad54, but contains only a single compact alpha-helical domain encoded between motifs III and IV (Supplementary Figure S6). It has been noted that this extra alpha-helical domain has some similarities with the thumb domain of Taq DNA polymerase which grips the DNA minor groove (53). It is therefore likely that the SMARCAL1 subfamily, rapA group, NPH-I and MPH1/Hef proteins reflect a continuum of diversity while sharing core features with the other Snf2 subfamilies.
Evolution of Snf2 family diversity
None of the 293 scanned archaeal or bacterial genomes contains a protein classified in any of the eukaryotic subfamilies (Supplementary Tables S1B and S1C). All identified archaeal and bacterial proteins belong to the SSO1653 subfamily and rapA group. Conversely, the SSO1653 subfamily and rapA group are likely to be specific to microbial organisms because the only two members of these families identified in eukaryotes (Supplementary Table S1A) appear to be false positives (data not shown). Over two-thirds of complete microbial genomes contain members of the SSO1653 subfamily and/or rapA group. This broad yet incomplete distribution suggests they perform non-essential functions that are sufficiently advantageous to maintain their prevalence.
Although rapA group proteins are distinguished by the lack of several features characteristic of eukaryotic Snf2 family members (see above), the SSO1653 subfamily carries all the Snf2 family sequence and structural hallmarks (Supplementary Figure S4). SSO1653 subfamily members are present in both bacteria and archaea, but they are not ubiquitous in archaeal genomes despite the presence of transcription, replication and repair mechanisms with significant similarity to those of eukaryotes (54). There is also no obvious linkage between the presence of histone-like proteins and SSO1653 subfamily members in archaeal genomes (Supplementary Table S1B). Furthermore, the SSO1653 subfamily falls in a grouping () with the eukaryotic ERCC6 and Mot1 subfamilies whose biochemical role appears not to involve chromatin directly. In contrast to the limited archaeal and bacterial distribution of Snf2 family proteins, all eukaryote genomes contain multiple Snf2 family proteins. The early branching Giardia lamblia and the minimal Encephalitozoon cuniculi genomes both encode six different Snf2 family genes falling into subfamilies represented across eukarya (Supplementary Table S1A), several of which have clear linkage to chromatin transactions. It is therefore possible that the microbial SSO1653 subfamily represents an ancestral Snf2-like form from which the eukaryotic subfamilies radiated. Such expansion of the Snf2 family early in eukaryote evolution (20) could have been coincident with the development of high-density nucleosomal packaging (56).
Distribution of Snf2 family members in complete genomes
The linkage between the primary sequence-based definitions of the subfamilies and distinct biological function is strongly supported by the presence of one or more subfamily members in each eukaryotic genome across large evolutionary ranges ( and Supplementary Table S1A). For example, a common set of subfamilies is found in almost all fungal, plant and animal genomes, comprising Snf2, Iswi, Chd1, Swr1, Etl1, Mot1, Ino80, Rad5/16, ERCC6, Rad54 and SHPRH. Increased genomic complexity is also paralleled by increasing numbers of subfamilies and members: E.cuniculi with a genome encoding some 2000 gene products has 6 Snf2 family members from 6 subfamilies, whereas the S.cerevisiae genome encoding some 6000 genes has 17 Snf2 family members from 13 subfamilies, and the human genome encoding some 25 000 genes has 32 Snf2 family genes from 20 subfamilies (, part B).
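The three genomes quoted above can be put on a common scale with a short calculation. The sketch below uses only the approximate gene counts and Snf2 family counts given in the text; the dictionary layout and the helper name are illustrative choices, and because the gene totals are round estimates ("some 2000", "some 6000", "some 25 000") the derived ratios should be read as rough indications only.

```python
# Approximate figures quoted in the text: (total genes, Snf2 members, subfamilies).
GENOMES = {
    "E.cuniculi": (2000, 6, 6),
    "S.cerevisiae": (6000, 17, 13),
    "human": (25000, 32, 20),
}


def members_per_thousand_genes(total_genes, snf2_members):
    """Snf2 family members per 1000 genes in a genome."""
    return 1000 * snf2_members / total_genes


density = {
    name: members_per_thousand_genes(genes, members)
    for name, (genes, members, _subfamilies) in GENOMES.items()
}

for name, (genes, members, subfamilies) in GENOMES.items():
    print(f"{name}: {members} members in {subfamilies} subfamilies, "
          f"{density[name]:.2f} per 1000 genes")
```

With these inputs the absolute numbers of members and subfamilies rise with genome size, as stated above, while the number of members per 1000 genes does not rise in step.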
The functional linkage across large evolutionary ranges suggests that each subfamily may have distinctive properties of its ATPase motor tuned to its function. This is supported by recent biochemical results demonstrating that helicase-like regions can be swapped within but not between subfamilies (57). However, a counterpoint is that functional redundancy can occur between subfamilies. For example, synthetic deletion of all three of the S.cerevisiae ISW1 genes together is required to generate a strong phenotype (58). Redundancy also provides an explanation for why some genomes lack certain members: the small genome of Schizosaccharomyces pombe lacks an Iswi subfamily member but maintains two Chd1 subfamily members. In addition to the 11 subfamilies represented broadly across eukaryotes, a number of others are restricted to specific taxonomic ranges. For example, CHD7 members are found almost exclusively in animals, and ATRX members are found only in animals and plants.
A number of specific features contribute to the distinction between subfamilies. First, the spacing between motifs III and IV is extended significantly beyond the minimal ~160 residues for a number of subfamilies (). For the Rad5/16, Ris1 and SHPRH subfamilies, the additional sequences all include RING fingers, whereas for the Swr1 and EP400 subfamilies they comprise highly proline and serine/threonine-rich spans. Ino80 and ATRX subfamilies also contain large, novel and distinct spans. Remarkably, all these large extra insertions occur at the same location in the primary sequence, between conserved blocks C and K which we term the ‘major insertion site’ ( and Supplementary Figure S7A). Even for the subfamilies without large insertions there is variation in the length of sequence in the major insertion site (). For example, the Zebrafish Rad54 structure contains some 25 more residues forming two additional small alpha helices compared to the Sulfolobus solfataricus SSO1653 structure. When Snf2 family members from different subfamilies are aligned, the variability of the major insertion region strongly disturbs the alignment such that a contiguous pattern becomes difficult to define. This has led to some of the Snf2 family proteins being described as having ‘split’ helicase-like ATPase regions. The discontinuity is also the cause of protein motif databases such as SMART and Pfam defining Snf2 family members as matching a bipartite combination of SNF2_N and Helicase_C profiles (). The C-terminal end of the SNF2_N profile corresponds to conserved block C.
Spacings between helicase motifs and major conserved blocks by subfamily
Second, subfamilies have characteristic small insertions at other sites (). Two such sites, also in the motif III–IV region, are located between conserved blocks H and B and between J and C (). These are likely to influence the length of the long alpha helical protrusions 1 and 2, respectively (see below, ), and there is a difference of some 40 residues between the shortest and longest subfamily lengths for each (). A ‘minor insertion site’ located between motifs I and Ia on the back of recA-like domain 1 is also occupied by recognizable domains in a few subfamilies from the Rad5/16-like grouping such as SHPRH (Supplementary Figure S3B). A number of other small insertions map to loops between various secondary structural elements (data not shown).
Third, although adhering to a general Snf2 family-specific pattern, individual subfamilies show characteristic patterns in the helicase motifs and in other conserved blocks (Supplementary Figure S4). For example, the well-known helicase motif II with typical DEAH pattern favours DEGH in the Snf2, Mot1 and Rad54 subfamilies, DEAQ in the Swr1, EP400, Ino80 and SSO1653 subfamilies or DESH in the SMARCAL1 subfamily. Likewise, for the typical conserved block E—motif I combined pattern ILADEMGLGKT all ATRX subfamily members have histidine instead of aspartate (i.e. ILAHEMGLGKT) and most Mot1 subfamily members have cysteine replacing alanine (i.e. ILCDEMGLGKT). It is also possible to identify other residues correlating with groups of subfamilies. For example, members of the Snf2, Iswi, Chd1, Mi-2, CHD7, ALC1, Rad54, ATRX and Arip4 subfamilies have an arginine immediately following the motif II DEAH. In the zebrafish Rad54 structure this residue R294 interacts with the sulphate which is suggested to mimic the ATP gamma phosphate.
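The subfamily-specific patterns listed above lend themselves to a simple lookup. The following Python sketch encodes only the motif II variants and the conserved block E–motif I patterns named in the text; the function names are hypothetical, and since the text says subfamilies "favour" a variant, the lookup returns candidates rather than a definitive assignment.

```python
# Motif II variants and the subfamilies reported to favour them (from the text).
MOTIF_II_PREFERENCES = {
    "DEGH": ("Snf2", "Mot1", "Rad54"),
    "DEAQ": ("Swr1", "EP400", "Ino80", "SSO1653"),
    "DESH": ("SMARCAL1",),
}

# Typical combined pattern for conserved block E and motif I (from the text).
TYPICAL_BLOCK_E_MOTIF_I = "ILADEMGLGKT"


def subfamilies_favouring(motif_ii):
    """Candidate subfamilies for a motif II variant; empty tuple if none listed."""
    return MOTIF_II_PREFERENCES.get(motif_ii.upper(), ())


def substitutions(observed, typical=TYPICAL_BLOCK_E_MOTIF_I):
    """List (position, typical residue, observed residue) for each mismatch.

    Positions are 1-based within the pattern, not within the protein.
    """
    if len(observed) != len(typical):
        raise ValueError("patterns must have equal length")
    return [
        (index + 1, expected, found)
        for index, (expected, found) in enumerate(zip(typical, observed))
        if expected != found
    ]


atrx_changes = substitutions("ILAHEMGLGKT")  # ATRX pattern quoted in the text
mot1_changes = substitutions("ILCDEMGLGKT")  # Mot1 pattern quoted in the text
```

A lookup of this kind is a shortcut for inspection only; the assignments in this survey rest on full-length alignments and subfamily HMM profiles rather than on single motifs.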
Conserved blocks encode the unique structural features of the Snf2 family
Two structural determinations of the helicase-like regions of Snf2 family members have been presented recently: zebrafish Rad54 (pdb code 1Z3I) (47) and S.solfataricus SSO1653 (pdb codes 1Z6A, 1Z63, 1Z5Z) (46). As expected for members of the Snf2 family, the fold of each core recA-like domain in the Rad54 and SSO1653 structures is substantially similar and related to those of other known SF1 and SF2 helicases. In the zebrafish Rad54 structure the two recA-like domains are oriented equivalently to those of other known helicase structures (Figure 3A and B), whereas in the S.solfataricus SSO1653 structures recA-like domain 2 is flipped by 180° to an arrangement never previously observed for a helicase (Supplementary Figure S7B). This unusual orientation in SSO1653 is observed for both the DNA-free and DNA-bound forms (46).
The most striking feature of the Snf2 family structures is the presence of several additional structural elements grafted onto the core helicase structure. These comprise antiparallel alpha helical protrusions from both recA-like domains 1 and 2 (), a structured linker between the recA-like domains (), the major insertion region at the back side of the domain 2 alpha helical protrusion () and a triangular brace packed against the domain 2 alpha helical protrusion (). The two alpha-helical protrusions and linker are all encoded within the enlarged span between motifs III and IV. The triangular brace is encoded immediately downstream of motif VI.
Remarkably, the primary sequence features of the Snf2 family correspond directly to the additional structural elements (). First, the bases of the protrusions from recA-like domains 1 and 2 are both fixed by conserved blocks. For protrusion 1, this involves conserved block H composed of a repeating pattern of aromatic residues, with additional involvement of aromatics from conserved block A. For protrusion 2 this involves the arrangement of conserved blocks C, J and K. Second, the protrusions themselves are relatively conserved in sequence and length within subfamilies but not across the whole Snf2 family. Although there is no obvious correlation between the lengths of the protrusions 1 and 2, the distribution of protrusion lengths adheres to multiples of the alpha helical repeat (Supplementary Figure S8), suggesting that protrusions retain structure while varying in extension. Third, the Q motif structure found in many SF2 proteins utilizes a different arrangement of residues to DEAD box helicases such as eIF4A, where an aromatic residue orients the adenine base ring for contacts with a downstream glutamine (4) (). In the Snf2 family, the aromatic residue is contributed by conserved block F downstream of the glutamine. The Q motif affects ATP hydrolysis in DEAD box helicases and mutation of the core glutamine in yeast Snf2 subfamily member Sth1p causes slow growth (4). Fourth, the linker connecting protrusions 1 and 2 contains highly conserved dual arginines in conserved block B. Their central location between the ATP-associating and DNA-associating structural elements suggests that they may play an important role in the mechanism of Snf2 family enzymes. Consistent with this, mutation of the second arginine of the pair in Snf2p leads to effectively complete loss of function of the protein in vivo. Finally, the brace is composed of a principal alpha helix anchored by conserved block M into the junction at the base of protrusion 2 composed of conserved blocks C, J and K.
The major insertion region is immediately behind protrusion 2, almost diametrically opposite the ATP-binding site in the zebrafish Rad54 structure (). The nearest residues of the major insertion region in Rad54 are some 15 Å from DNA phosphates for docked DNA (Supplementary Figure S7A). However, an appropriately oriented alpha helix of some 20 residues would be sufficient to reach into the major groove, so large insertions at the major insertion site could potentially interact with DNA or other DNA-binding proteins bound in the groove. In the flipped conformation of domain 2 observed in the SSO1653 structure, the major insertion region is juxtaposed immediately adjacent to the DNA such that two non-conserved arginines from the major insertion region make direct DNA phosphate contacts.
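The statement that a helix of some 20 residues could bridge the roughly 15 Å gap can be checked with simple arithmetic. The sketch below assumes an ideal, straight alpha helix with the standard rise of about 1.5 Å per residue and a helix axis pointing directly at the DNA; both are simplifying assumptions introduced here, and the function names are illustrative.

```python
import math

# Standard axial rise per residue of an ideal alpha helix, in angstroms.
ALPHA_HELIX_RISE_ANGSTROM = 1.5


def helix_length(n_residues):
    """Axial length in angstroms of an ideal alpha helix of n_residues."""
    return n_residues * ALPHA_HELIX_RISE_ANGSTROM


def min_residues_to_span(distance_angstrom):
    """Smallest number of helical residues whose axial length covers the distance."""
    return math.ceil(distance_angstrom / ALPHA_HELIX_RISE_ANGSTROM)


gap_to_phosphates = 15.0           # approximate distance quoted in the text
reach_20_residues = helix_length(20)
residues_for_gap = min_residues_to_span(gap_to_phosphates)
```

Under these assumptions a 20-residue helix extends about 30 Å, roughly twice the quoted distance to the DNA phosphates, which leaves margin for a less favourable orientation and for the additional depth of the major groove.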
As the distinctive structural features are defined by unique and highly conserved blocks, they are likely to confer properties on the ATPase motor that adapt the action of the core recA-like domains for a unique mechanism. We anticipate that while some features of the Snf2 family mechanism will be common to SF2 translocases, other aspects will be distinctive. Knowledge of the conserved residues and their structural location provides important information for understanding these distinctions.
Other levels of Snf2 family identity
We have demonstrated that the common helicase-like region is sufficient to enable classification of Snf2 family members. However, almost all Snf2 family polypeptides contain significant additional sequences likely to harbour accessory domains. For some subfamilies there is good correlation with the presence of particular accessory domain combinations (Supplementary Table S9). For example, almost all Snf2 subfamily members contain a bromodomain, Iswi members contain a SANT domain, and Chd1, Mi-2 and CHD7 members contain a chromodomain. However, many domain profiles in resources used for domain analysis have unidentified function or are unreliable in the context of Snf2 proteins. For example, Pfam lacks a SANT-specific profile and detects <10% of SANT domains with a more generic ‘Myb_DNA-binding’ profile. We are currently undertaking further analysis to improve the relevant profiles and analyse the linkage of Snf2 family accessory domains in detail.
Finally, many Snf2 family proteins are part of larger multi-protein complexes. Accessory motifs within these complexes are also likely to adapt the function of Snf2 motors for different purposes.