The homeodomain is a highly conserved DNA-binding motif that is found in numerous transcription factors throughout a large variety of species from yeast to humans. These gene-specific transcription factors play critical roles in development and adult homeostasis, and therefore, any germline mutations associated with these proteins can lead to a number of congenital abnormalities. Although much has been revealed concerning the molecular architecture and the mechanism of homeodomain-DNA interactions, the study of disease-causing mutations can further provide us with instructive information as to the role of particular residues in a conserved mode of action. In this paper, I have compiled the homeodomain missense mutations found in various human diseases and re-examined the functional role of the mutational “hot spot” residues in light of the structures obtained from crystallography. These findings should be useful in understanding the essential components of the homeodomain and in attempts to design agonists or antagonists to modulate their activity and to reverse the effects caused by the mutations.
The regulation of gene transcription is based on specific interactions between transcription factors and their target genes. These transcription factors play central roles in all developmental processes and also in adult homeostasis (Duboule 1994). Thus, numerous congenital syndromes have been shown to be caused by mutations in genes encoding transcription factors, and the numbers of mutations and congenital defects are expected to grow (Semenza 1989; Engelkamp and van Heyningen 1996). Indeed, a recent analysis of the complete human genome has revealed that transcription factors represent one of the four major functional groups of proteins whose germline mutations are the causes of known human diseases (Jimenez-Sanchez et al. 2001).
Among these transcription factors, the homeodomain has become one of the most studied eukaryotic DNA-binding motifs since its discovery when homeotic mutations, i.e., mutations leading to segmental transformations, were observed in Drosophila (Gehring 1966; Lewis 1978) and later localized in genes encoding a stable domain of about 60 residues (McGinnis et al. 1984; Scott and Weiner 1984). Since then, hundreds of homeodomains in a large variety of species have been found at all levels of the developmental hierarchy, establishing that genetic control based on homeoboxes is common both to various levels of the development of an organism and to a wide range of species (Duboule 1994).
Many human diseases ranging from developmental abnormalities to metabolic disorders have been linked to mutations in the genes encoding these homeodomain-containing proteins (D’Elia et al. 2001; Goodman and Scambler 2001; Zhao and Westphal 2002). Mutations affecting transcription factors lead to the breakdown or abnormal control of the transcriptional machinery because of loss of function, either through haploinsufficiency or in a dominant negative fashion (Seidman and Seidman 2002). These mutations include nonsense or frameshift mutations that result in truncated and non-functional proteins, or missense mutations giving rise to single amino acid substitutions that can cause subtle, yet detrimental effects in individuals. Whereas the effects of nonsense or frameshift mutations are readily understandable, disease-causing missense mutations and the encoded single amino acid substitutions can be more instructive as to the requirement and specific role of that particular residue for protein function. These missense mutations could affect protein expression levels, protein stability, protein localization, post-translational modification, and/or the specific activity of a protein including its physical interactions (Wang and Moult 2001). In this paper, I have compiled and re-examined the disease-causing mutations in homeodomains from a structural viewpoint and have addressed the role of key residues that are more frequently mutated in patients and the effects of these mutations.
There are 155 homeodomain-containing proteins in the UCSC Human Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway), many of which have more than one isoform. Many disease-causing mutations are found in these proteins, and the current information has been obtained from the available resources on the web, including the Human Gene Mutation Database and other bioinformatics databases. Among these, I have used the Online Mendelian Inheritance in Man in Baltimore (http://www.ncbi.nlm.nih.gov/Omim), the Human Gene Mutation Database in Cardiff (http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html), the Bioinformatic Harvester in Heidelberg (http://harvester.embl.de), and the NIH Homeodomain Resources (http://research.nhgri.nih.gov/homeodomain/), in addition to the literature. The missense mutations, gene products, associated human diseases, homeobox classes, and mutational effects on protein stability and function are tabulated in Table 1. A total of 119 independent homeodomain missense mutations have been documented for 26 different genes giving rise to various inherited human diseases. These are all monogenic causes of their respective diseases in which the direct relationship between sequence and protein function (missense mutations) and between protein function and disease state (monogenic causes) can be addressed. Throughout this text, the conventional numbering system for homeodomain residues has been used.
Remarkable features of structural and functional conservation have been observed within homeodomain family members. Compared with primary sequences, their three-dimensional structures are more conserved, which indicates the importance of proper architecture for correct functioning with respect to, for example, DNA recognition and protein-protein interactions. Some amino acids, such as Trp48, Phe49, Asn51, and Arg53, which are invariant among almost all homeodomains, are essential in maintaining structural integrity and/or making contacts with DNA, whereas other residues vary in order to provide DNA-binding specificity and other protein functions.
This high degree of conservation between the sequence and structure makes the homeodomain an ideal model for studying protein-DNA interactions and gene regulation. The homeodomain is composed of three helices, which are folded around a hydrophobic core in which the second and third helix adopt a helix-turn-helix motif for DNA recognition, and a flexible N-terminal arm with additional important functional roles (Gehring et al. 1994; Billeter 1996; Wolberger 1996). The third (recognition) helix and the N-terminal arm recognize the major groove and the adjacent minor groove of target DNAs, respectively. The N-terminal arm also contains a stretch of basic residues known as the nuclear localization signal (NLS). Unlike conventional helix-turn-helix motifs, which use the residues on the turn and the first loop of the third helix to contact DNA, homeodomains make these contacts with residues that are located toward the C-terminal end of the third helix. This structure is highly conserved among otherwise highly different species and different ways of recognizing target genes. These homeodomains are either found alone as a DNA-binding motif or in tandem with another module, such as paired-homeodomains (Wilson et al. 1995), LIM-homeodomains (Hobert and Westphal 2000), POU-homeodomains (Ryan and Rosenfeld 1997), or cut-homeodomains (Harada et al. 1994).
A vast body of knowledge about homeodomains has accumulated over the last 15 years, including the results from extensive in vitro binding studies of various DNA fragments and from crystallographic structures of both free and DNA-bound forms of homeodomains revealing the molecular mode of their interactions with DNA (Gehring et al. 1994; Billeter 1996; Wolberger 1996). In addition, some of these conjugate-homeodomain structures show diverse and subtle variations in homeodomain architecture and homeodomain-DNA interactions but display the highly conserved and universal mode of DNA recognition (Jacobson et al. 1997; Xu et al. 1999; Chi et al. 2002).
The trinity of the sequence-structure-function relationship is the core element of the natural world of proteins, and some degree of variation might be expected to be allowed for these proteins, particularly for transcription factors during adaptive evolution, in order to ensure DNA and protein-binding specificities for recognizing a larger spectrum of gene promoters and co-regulators. However, selective pressure maintains vital functions, and thus, the degree and pattern of sequence conservation among various members of a protein family are highly informative regarding the functional requirement of each residue.
The signature motif of the homeodomain is found within the DNA-recognition helix in which hydrophobic core aromatic residues and DNA-binding core residues are strictly conserved (Fig. 1). Two aromatic residues in helix 3, Trp48 and Phe49, are almost absolutely conserved and have become markers for identifying divergent homeoboxes. Likewise, Asn51 and Arg53, which are also in helix 3, are strictly conserved and form bidentate contacts with adenine and nonspecific interactions with backbone atoms, respectively. In addition, throughout the sequence, other residues are highly conserved for structural and functional roles, in accordance with findings related to the frequency of disease-causing mutations (Fig. 1). These findings indicate that the mutational “hot spot” residues serve as core residues for maintaining the overall architecture of the homeodomain and for optimally recognizing the site-specific target genes.
When the database information was compiled, three mutational “hot spots” were identified along the sequence of the homeodomain (Fig. 1a, b): Arg5, which recognizes the minor groove of DNA (Fig. 2a, b), and the successive residues of Arg52 and the strictly conserved Arg53, which recognize the major groove (Fig. 2a, c). These are all surface residues that are either directly or indirectly involved in DNA binding. These findings are contrary to a general observation that the relative probability of disease-causing mutations is highest in the protein interior and lowest on the protein surface, and that the dominant mechanism by which disease mutations damage protein function is a decrease in protein stability (Wang and Moult 2001; Ferrer-Costa et al. 2002), validating the significance of these residues for protein function. Structural descriptions and the functional roles for each “hot spot” residue are provided in the following sections.
A significant contribution to the optimal DNA binding of the homeodomains comes from the N-terminal arm (Gehring et al. 1994; Shang et al. 1994). The Arg5 residue is located in the middle of this N-terminal extension. Arg5 has the dual function of binding DNA through the minor groove and serving as part of the NLS. Even though the exact details of the DNA-binding mode vary among different homeodomains (Table 2), Arg5 always protrudes deep into the minor groove and makes an extensive and nondiscriminatory hydrogen bonding network with the base and sugar atoms (Fig. 2b). These interactions are often further stabilized by the basic residues at positions 2 and 3, which make additional hydrogen bonds with DNA backbone atoms. Thus, Arg5 appears to serve as a core element of the N-terminal arm in recognizing the minor groove without imposing DNA specificity. This structural finding is consistent with the biochemical data in which proteins mutated at this position are expressed at levels similar to wild-type proteins but have markedly reduced DNA-binding activity (McIntosh et al. 1998; Qu et al. 1998; Yamada et al. 1999). Partially impaired nuclear localization has also been observed for the R203C (R5C by the conventional numbering) mutant of HNF1α (Yamada et al. 1999).
This major mutational “hot spot” found on the recognition helix includes Arg52 and Arg53. Arg53 is strictly conserved in all homeodomains and makes direct hydrogen bonds with DNA backbones from two nonspecific nucleotides at the 5′ flanking region of the promoter recognition sequence in all cases (Table 2 and Fig. 2c). This acts as a claw hooking onto a rope and holding it tightly and serves as a clamp to anchor the recognition helix from one side for optimal interactions in the major groove. In addition, Arg52 is highly conserved and tethers the recognition helix for optimal DNA binding by forming a salt bridge with Glu17 on the first helix, except in hepatocyte nuclear factor 1α (HNF1α) in which the closest residue, Glu21, is 4.16 Å away and beyond the acceptable hydrogen bonding distance. In many cases, Arg52 further stabilizes the recognition helix by forming an additional salt bridge with Glu56 (Table 2, Fig. 2c). Thus, Arg52 appears to be required both for the conformational stability of the recognition helix and the entire homeodomain (Weiler et al. 1998) and for optimal DNA interactions. In some homeodomains, Arg52 is replaced by Lys52 (Table 2), but similar hydrogen bonding patterns are still maintained. This intricate network of interactions by Arg52 and Arg53 has been evolutionarily conserved to ensure the correct positioning of the recognition helix but does not dictate the sequence-specific promoter recognition. Biochemical data have confirmed greatly reduced DNA binding and transcriptional activity in “hot spot 2” mutants, despite normal protein expression levels and protein stability (Dattani et al. 1998; Wu et al. 1998; Swaroop et al. 1999; Vaxillaire et al. 1999; Yoshiuchi et al. 1999; Wilkie et al. 2000; Quentien et al. 2002).
Protein stability and correct folding are the foundations of protein function. The compact homeodomain is stabilized by the hydrophobic core, which holds all of its helices together. Highly conserved Val/Ile45 and strictly conserved Trp48 and Phe49 on the recognition helix take part in the formation of the core. Other notable highly conserved amino acids forming the hydrophobic core include Leu/Trp/Phe16 and Tyr/Phe20 from the first helix and Leu/Ile/Phe/Met34 from the second helix. A recent phage-display shotgun scanning method used on the engrailed homeodomain has revealed the importance of additional hydrophobic residues such as Phe20 and Tyr25 (Sato et al. 2004; Wolfe 2004). However, the frequencies of disease-causing mutations on these hydrophobic core residues are low compared with those occurring at DNA-binding residues (Fig. 1a, b). A similar pattern has been observed in p53, another well-known transcription factor in which the largest number of human disease mutations have been found within a single gene product (Bullock and Fersht 2001). These findings are in contrast to the generally held view that the majority of human disease-causing mutations disrupt protein stability (Wang and Moult 2001; Ferrer-Costa et al. 2002).
An exhaustive survey of transcription-factor-DNA interactions reveals a number of forces contributing to their strength and specificity. Some of these forces act locally in distinct regions of the interacting surfaces, whereas others exert a more global influence on complex formation. Local forces include hydrogen bonds, ionic salt bridges, hydrophobic interactions, and van der Waals contacts, whereas global forces include plasticity and sequence-dependent folding, conformational changes, and cooperativity gained through simultaneous DNA recognition by multiple protein modules (Ogata et al. 2003). As a rule, DNA-binding domains mediate: (1) nonspecific or “positioning contacts” that provide a general moderate affinity and (2) base-specific contacts that ensure high-affinity binding to specific target sequences. Nonspecific contacts are principally interactions with the DNA backbone of phosphate and sugar moieties and frequently involve electrostatic attractions (ionic salt bridges) between basic protein residues and the polyanionic DNA phosphoskeleton. Base specificity is governed by a network of local contacts of the types outlined above between flexible amino acid side chains that emanate from the binding domain and the exposed edges of the base pairs, primarily in the major groove of the DNA target sequence. The difference between the binding energies for the sequence-dependent and sequence-independent components of the interaction is the measure of the sequence selectivity of a DNA-binding domain (Ogata et al. 2003).
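The closing statement of this paragraph can be written compactly. The following is only a sketch using the standard thermodynamic relation between a binding free energy and a dissociation constant; the symbols are introduced here for illustration and are not taken from Ogata et al. (2003).

```latex
% Standard-state binding free energy from the dissociation constant K_d
\Delta G^{\circ}_{\mathrm{bind}} = RT \ln K_d

% Sequence selectivity as the difference between the specific and
% nonspecific binding free energies (more negative = more selective)
\Delta\Delta G_{\mathrm{sel}}
  = \Delta G^{\circ}_{\mathrm{specific}} - \Delta G^{\circ}_{\mathrm{nonspecific}}
  = RT \ln \frac{K_{d,\mathrm{specific}}}{K_{d,\mathrm{nonspecific}}}
```

Under this convention, a tenfold preference for the target site over nonspecific DNA corresponds to a difference of RT ln 10, roughly 1.4 kcal/mol at 25 °C.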
Homeodomains are no exception to these general rules of protein-DNA interactions. Key residues for specific and nonspecific interactions have been well characterized. Whereas target DNA sequences of respective homeodomains differ from each other, they share some common features such as the “TAAT” core sequences. In the major groove, base-specific recognitions are made primarily by residues Val/Ile47, Gln/Lys50, and Asn51 (Gehring et al. 1994; Billeter 1996; Wolberger 1996). Among these, the side chain of the invariant Asn51 from the recognition helix specifically contacts A3 of the TAAT core by accepting a hydrogen bond from adenine N6 and donating a hydrogen bond to adenine N7. This is conserved in all homeodomain proteins, and the N51A mutation in engrailed homeodomain abrogates binding to DNA (Ades and Sauer 1994).
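Because the TAAT core is the shared feature of homeodomain target sites, candidate sites in a DNA sequence can be located with a simple scan of both strands. The sketch below is illustrative only: the function names and the example sequence are hypothetical, and a real search would also weigh the flanking bases that Val/Ile47 and Gln/Lys50 read.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_core_sites(seq: str, core: str = "TAAT"):
    """List (0-based start, strand) for each occurrence of the core motif.

    A '+' hit means the core reads 5'->3' on the given strand; a '-' hit
    means it reads 5'->3' on the opposite strand at the same coordinates.
    """
    seq = seq.upper()
    core = core.upper()
    rc_core = reverse_complement(core)
    hits = []
    for start in range(len(seq) - len(core) + 1):
        window = seq[start:start + len(core)]
        if window == core:
            hits.append((start, "+"))
        if window == rc_core:
            hits.append((start, "-"))
    return hits


# Hypothetical sequence, made up purely to exercise the function.
example = "GCTAATGGATTAC"
print(find_core_sites(example))
```

Scanning for the reverse complement (ATTA) matters because a homeodomain can bind the core in either orientation relative to the strand that happens to be written down.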
DNA-binding specificity appears to be conferred primarily by Val/Ile47 and Gln/Lys50 (Ades and Sauer 1994; Pomerantz and Sharp 1994; Connolly et al. 1999), and earlier studies have indicated that mutations in the Val/Ile47 and Gln/Lys50 residues alter DNA target specificity (Treisman et al. 1989; Ades and Sauer 1994; Tucker-Kellogg et al. 1997; Grant et al. 2000; Simon and Shokat 2004). For example, in HNF1α, Val/Ile47 is replaced by Asn, which recognizes cytosine (lacking a methyl group) instead of thymine, and Gln/Lys50 is replaced by Ala, which does not take part in DNA binding. Val/Ile47 mostly recognizes T4 of the TAAT core via a van der Waals contact with the methyl group at the C5 position, whereas Gln/Lys50 mostly recognizes the nucleotides 3′ to the TAAT core. However, these residues do not appear to be essential for DNA binding because the replacement of Val/Ile47 by Arg, Asn, His, or Gly residues still yields comparable or better DNA binding (Pomerantz and Sharp 1994), and Q50K replacement enhances DNA-binding affinity (Ades and Sauer 1994). Furthermore, the crystal structures of the Q50K and Q50A mutants reveal only subtle changes at the protein-DNA interface (Tucker-Kellogg et al. 1997; Grant et al. 2000). Consistent with these findings, only a few mutations are found at these residues (Fig. 1a, b). DNA-binding specificity appears to be more tolerant of mutation than the binding affinity governed mostly by nonspecific interactions.
Nonspecific DNA interactions in homeodomains are made by the basic residues on the N-terminal arm, on the loop between the first and the second helices, and on the recognition helix (Gehring et al. 1994; Billeter 1996). The mutational “hot spot” residues are found among these nonspecific DNA-contacting residues. Arg5 is found in the N-terminal arm, and Arg52 and Arg53 are located on the recognition helix, and these residues are highly intolerant of any substitutions (Fig. 1a, b). Additionally, a moderate frequency of mutation is also observed at Arg31, which is located at the beginning of helix 2, serves as an anchor for the DNA recognition helix, and directly or indirectly participates in DNA backbone interactions. Whereas it makes a direct hydrogen bond with the DNA backbone atom in the MSX1 structure (PDB accession code 1IG7), it is not close enough (greater than 6 Å) to form a hydrogen bond in many other homeodomains, including that of HNF1α. Instead, it provides an overall positively charged environment favorable for DNA interactions. It also makes a salt bridge with the backbone carbonyl atom of Glu42 at the beginning of helix 3, which serves to anchor the recognition helix properly for optimal DNA binding and local stabilization. Thus, Arg31 at the beginning of the second helix also appears to have a significant structural and functional role as part of the general homeodomain-DNA backbone interactions.
Among the proteins listed in Table 1, HNF1α has the largest number of mutations found in a single protein, and the mutational “hot spot” residues are well represented (Table 1, Fig. 1a, b). Mutations in Hnf-1α are the most common monogenic causes of the form of diabetes known as maturity onset diabetes of the young (MODY). The recent crystal structure of HNF1α bound to DNA has revealed that HNF1α belongs to the POU transcription factor family, despite the lack of sequence homology in the POUSpecific domain region, and has unveiled the way in which HNF1α confers site-specific promoter recognition, thus telling us why function is lost by MODY3 mutations (Chi et al. 2002). Unlike nonsense and frameshift mutations that are found sporadically throughout the HNF1α sequence, missense mutations are clustered in the DNA-binding domains and are almost evenly distributed between the POUHomeo and POUSpecific domains. However, because information about disease-causing mutations of other POUSpecific domains is limited, I intend to confine the discussion in this review to homeodomains in which mutation information is more abundant.
Even though HNF1α displays moderate variation from other homeodomains in that a 21 amino acid insertion, important for extensive domain-domain interactions, has occurred between the second and the third helix (Chi et al. 2002), it still retains the conserved DNA-binding mode and can serve as a prototype for discussion and graphical representations of homeodomain-DNA interactions (Fig. 2).
The hallmarks of DNA-homeodomain interactions are present in HNF1α (Chi et al. 2002). The recognition helix is situated in the major groove, oriented perpendicular to the long axis of the DNA. As in all homeodomains, Asn51 (Asn270 in human HNF1α) forms bidentate contacts with adenine at the TAAT core, whereas Arg53 (Arg272) within the conserved WFXNXR motif of the recognition helix makes nonspecific interactions with the backbone atoms. In addition, Arg5 (Arg203) in the N-terminal arm of the POUHomeo domain forms hydrogen bonds with thymine, cytosine, and adenine in the minor groove.
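The paired residue numbers quoted in this review (Arg2/Arg200, Arg5/Arg203, position 51/270, Arg53/Arg272) imply a simple relation between the conventional homeodomain numbering and the HNF1α numbering: an offset of 198 before the 21-residue insertion and 198 + 21 = 219 after it. The sketch below encodes that relation. The constant and function names are hypothetical, the 60-residue domain length is the approximate figure given earlier, and placing the switch at position 42 (the start of helix 3) is an assumption, so positions in the loop just before helix 3 may map incorrectly.

```python
import re

N_TERMINAL_OFFSET = 198  # from Arg5 -> Arg203 and Arg2 -> Arg200
INSERTION_LENGTH = 21    # insertion between helix 2 and helix 3 (Chi et al. 2002)
HELIX3_START = 42        # ASSUMPTION: positions >= 42 lie after the insertion
DOMAIN_LENGTH = 60       # ASSUMPTION: "about 60 residues"


def conventional_to_hnf1a(position: int) -> int:
    """Map a conventional homeodomain position to HNF1alpha numbering."""
    if not 1 <= position <= DOMAIN_LENGTH:
        raise ValueError(f"position {position} is outside 1..{DOMAIN_LENGTH}")
    offset = N_TERMINAL_OFFSET
    if position >= HELIX3_START:
        offset += INSERTION_LENGTH
    return position + offset


def hnf1a_to_conventional(position: int) -> int:
    """Inverse mapping; raises if the residue falls inside the insertion."""
    for conventional in range(1, DOMAIN_LENGTH + 1):
        if conventional_to_hnf1a(conventional) == position:
            return conventional
    raise ValueError(f"HNF1alpha residue {position} has no conventional number")


def to_conventional_mutation(name: str) -> str:
    """Convert a missense name such as 'R203C' to conventional numbering."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", name)
    if match is None:
        raise ValueError(f"unrecognized mutation name: {name!r}")
    wild_type, position, mutant = match.groups()
    return f"{wild_type}{hnf1a_to_conventional(int(position))}{mutant}"


print(to_conventional_mutation("R203C"))  # the R5C example cited in the text
```

Raising an error for residues inside the insertion, rather than guessing, keeps the helper from assigning a conventional number to a residue that has no counterpart in other homeodomains.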
Many of the mutated residues in the POUHomeo domain are involved directly in DNA recognition, including those that normally create hydrogen-bonding networks with DNA, viz., basic residues Arg5 (Arg203) and Arg53 (Arg272). Other mutations appear to disrupt DNA recognition indirectly through perturbations in the local environment. The cluster of basic residues at the amino-terminus of the POUHomeo domain serves as an NLS. Mutation of Arg2 (Arg200) and Arg5 (Arg203) within the putative NLS probably hinders nuclear translocation; thus, the substitution of residues such as Arg5 (Arg203) has dual potential consequences for HNF1α function. Additional mutations interfere with intramolecular interactions between its POUSpecific and POUHomeo domains; these would distort their relative orientations and ability to interact cooperatively with DNA. Others are predicted to disrupt protein folding or stability, which may lead to the accumulation of misfolded protein or premature degradation.
Mutations are of fundamental importance for gene diversity and evolution but are also associated with diseases and death when they occur at critical sites. The study of naturally occurring missense mutations on protein-coding genes can be instructive. Even though mutations in a single protein might not be definitively informative because human mutations are not random and are influenced by the local DNA sequence environment (Antonarakis et al. 2000; Krawczak et al. 2000; Zhang and Gerstein 2003), accumulated occurrences on many functionally related proteins or a group of family members can yield information on the importance of each residue and the underlying functional mechanism.
Many residues of the homeodomain participate in DNA recognition, and the analysis of disease-causing mutations has revealed Arg5, Arg52, and Arg53 as key functional elements in this vital function. These mutation-intolerant arginine residues make nonspecific interactions with DNA backbone atoms, indicating that nonspecific DNA binding is a prerequisite for any further sequence-specific recognition and binding. Homeodomains have been shown to be capable of binding to DNA nonspecifically or atypically with reasonable binding affinity (Aishima and Wolberger 2003). Thus, these mutational “hot spot” residues appear to recognize DNA nonspecifically anywhere along the chain and maintain stable homeodomain-DNA complexes while translocating to their target sites at which point specific interactions can be made by other residues (Kalodimos et al. 2004).
All of these mutational “hot spot” residues are arginines. A similar finding has been made on p53 in which five out of six mutational “hot spot” residues are arginine residues that either directly or indirectly affect DNA binding (Bullock and Fersht 2001). Assuming that each base-pair has the same chance of naturally becoming modified, arginine would not be expected to be the amino acid with the highest mutation rate in a protein, because arginine has the highest number of possible codons. Nevertheless, arginine residues account for almost 15% of all human disease mutations (Vitkup et al. 2003). This high mutational recurrence of arginine residues is not unexpected and could be partially attributable to the high mutability of cytosine present in the CpG dinucleotide. CpG dinucleotides are believed to be hypermutable because of deamination when they are methylated (Cooper et al. 1997; Pfeifer 2000). However, several mutations of arginine codons of human homeodomain genes are not C to T transitions (D’Elia et al. 2001). This is true for many other proteins. Thus, the high frequency of arginine substitutions is believed to reflect the functional requirements of these residues as surface residues that play vital roles in catalysis, protein-protein interactions, and protein-DNA interactions, as in the case of the homeodomains.
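The CpG argument can be made concrete by enumerating the arginine codons of the standard genetic code. Four of the six (CGT, CGC, CGA, CGG) begin with a CpG dinucleotide, so deamination of the methylated cytosine on either strand yields a defined set of substitutions. The sketch below is illustrative: the function name is hypothetical, only CpGs lying within a single codon are considered, and the codons actually used in the homeodomain genes are not given in this review.

```python
ARG_CODONS = ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]

# Subset of the standard genetic code needed for the products below.
CODON_TABLE = {
    "TGT": "Cys", "TGC": "Cys", "TGA": "Stop", "TGG": "Trp",
    "CAT": "His", "CAC": "His", "CAA": "Gln", "CAG": "Gln",
}


def cpg_transition_products(codon: str) -> dict:
    """Products of a C>T transition at a CpG within an arginine codon.

    C>T on the coding strand changes the first base (CGN -> TGN);
    C>T on the opposite strand appears as G>A at the second base
    (CGN -> CAN). Codons that do not start with CG return an empty dict.
    """
    codon = codon.upper()
    products = {}
    if codon[:2] == "CG":
        coding_strand = "T" + codon[1:]
        opposite_strand = codon[0] + "A" + codon[2]
        products[coding_strand] = CODON_TABLE[coding_strand]
        products[opposite_strand] = CODON_TABLE[opposite_strand]
    return products


for arg_codon in ARG_CODONS:
    print(arg_codon, cpg_transition_products(arg_codon))
```

An Arg-to-Cys change such as the R203C substitution mentioned earlier is the kind of product expected from a CGC or CGT codon, whereas arginine substitutions outside the Cys, His, Gln, Trp, and Stop set cannot arise from this mechanism alone.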
Homeodomain-containing transcription factors often interact with other transcription factors binding to adjacent recognition sites, in addition to coactivators, to enhance transcriptional activity (Di Palma et al. 2003; Okada et al. 2003). These synergistic interactions with other transcription factors and coactivators serve as additional elements that control the specificity of the homeodomains (Gehring et al. 1994). Even though the molecular details of the combinatorial synergism and recruitment by each homeodomain-containing transcription factor are ill-defined, those residues found on the putative protein-protein interaction surfaces seem to have higher mutational tolerances than the core residues affecting nonspecific DNA-binding affinity.
A protein is made up of a large number of amino acid residues with unequal contributions to protein stability and various other functions. Even though alanine scanning mutagenesis (Shang et al. 1994; Acton et al. 2000; Morrison and Weiss 2001) or phage display (Connolly et al. 1999; Pabo et al. 2001; Sato et al. 2004; Simon et al. 2004) can be used systematically to assess the contributions of individual amino acid side chains to protein properties, better and more definite indications can be obtained from naturally occurring monogenic mutations resulting in altered phenotypes. Therefore, these findings should be valuable in attempts to design homeodomains with high affinity for the targeting of specific genes; similar approaches have been made for zinc-finger DNA-binding proteins with clinically important applications (Choo and Isalan 2000; Wolfe et al. 2000; Jamieson et al. 2003). These findings should also be useful in the design of small agents that can modulate the function of homeodomain-containing transcription factors and that can reverse the effects caused by disease-causing mutations.
Acknowledgments I wish to thank S. Shoelson for initiating the HNF1α project and for insightful discussions. I also thank K. Sarge and members of the Chi laboratory for comments on the manuscript. This work was funded in part by fellowships from the Juvenile Diabetes Research Foundation and the Mary Iacocca Foundation to Y.-I. Chi.