The ~60 amino acid homeobox domain or ‘homeodomain’ is a conserved DNA-binding protein domain best known for its role in transcription regulation during vertebrate development. The homeodomain can both bind DNA and mediate protein-protein interactions (
Wolberger, 1996); however, the precise mechanisms that dictate the physiological function and target range of individual homeodomain proteins are in general either unknown or incompletely delineated (
Banerjee-Basu et al., 2003;
Svingen and Tonissen, 2006). In several cases, functional specificity can be traced to the homeodomain itself (
Chan and Mann, 1993;
Furukubo-Tokunaga et al., 1993;
Lin and McGinnis, 1992), indicating that individual homeodomains have distinct protein- and/or DNA-binding activities. Since many homeodomains have similar DNA sequence preferences, much attention has been paid to the role of protein-protein interactions in target definition (
Svingen and Tonissen, 2006), despite evidence that the sequence specificity of monomers contributes to targeting specificity (
Ekker et al., 1992) and that binding sequences do vary, particularly among different subtypes (
Banerjee-Basu et al., 2003;
Ekker et al., 1994;
Sandelin et al., 2004). Indeed, it has been proposed that the DNA binding specificity of homeodomains is determined by a combinatorial molecular code among the DNA-contacting residues (
Damante et al., 1996).
Efforts to understand the physiological and biochemical functions of homeodomains have been hindered by the fact that most have only a few known binding sequences, if any. Position weight matrices (PWMs) have been compiled for 63 distinct homeodomain-containing proteins from human, mouse,
D. melanogaster, and
S. cerevisiae in the JASPAR (
Bryne et al., 2008) and TRANSFAC (
Matys et al., 2003) databases. These matrices are based on 5 to 138 individual sequences (median 18), presumably capturing only a subset of the permissible range of binding sites for these factors. Further, the accuracy of PWM models has been questioned (
Benos et al., 2002), and there are many examples in which transcription factors bind sets of sequences that cannot be described in a conventional PWM representation (
Blackwell et al., 1993;
Chen and Schwartz, 1995;
Overdier et al., 1994).
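To make the PWM representation discussed above concrete, the following is a minimal sketch of how a log-odds matrix can be built from a small set of aligned binding sites and then used to score a candidate sequence. The toy sites, the pseudocount of 1, the uniform background of 0.25 per base, and the function names `build_pwm` and `score` are all assumptions made for illustration; they are not taken from JASPAR, TRANSFAC, or any analysis described here.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per position) from aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score(pwm, seq):
    """Sum per-position log-odds; this treats positions as independent."""
    return sum(col[base] for col, base in zip(pwm, seq))

# Hypothetical toy sites sharing a TAAT-like core (illustrative only).
toy_sites = ["TAATTA", "TAATTG", "TAATGA", "TAATTA", "CAATTA"]
pwm = build_pwm(toy_sites)

# Highest-scoring base at each position.
consensus = "".join(max(col, key=col.get) for col in pwm)
print(consensus, round(score(pwm, consensus), 2))
```

Because the score is a sum over positions, a matrix of this kind cannot express dependencies between positions, and a matrix built from only a handful of sites is dominated by the pseudocount; both points bear on the limitations noted above.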
Moreover, the sequence preferences of the individual proteins can, in some cases, be altered by the binding context: for instance, the binding specificity of the complex of
Drosophila Hox-Exd homeodomain proteins is remarkably different from that of the individual monomers (
Joshi et al., 2007), raising the prospect that the monomeric binding preferences may not always be relevant to targeting
in vivo. There is evidence that the sequence preferences of individual Hox proteins in
Drosophila and mammals are significantly altered by physical interactions with protein co-factors in the PBC and Meis subfamilies, presumably through contacts to the Hox N-terminal arm that change the way the homeodomain contacts DNA (
Mann and Chan, 1996;
Wilson and Desplan, 1999). Other evidence, however, suggests that these examples of co-factor alterations to the monomer binding specificities are likely to be the exception rather than the rule. Carr and Biggin demonstrated that there is a good correlation between monomer binding
in vitro and
in vivo for four fly homeodomain-containing proteins: Eve, Ftz, Bcd, and Prd (
Carr and Biggin, 1999). Carroll and colleagues further showed that Ubx activity in promoting haltere development is independent of protein co-factors and that the promoters of its target genes in this pathway contain clusters of individual Ubx binding sites (
Galant et al., 2002). Liberzon
et al. showed not only that the specificity of the Hox-like mouse protein Pdx1 extends beyond the TAAT core, but also that the preferences at these flanking positions
in vitro correlate with the ability of these sequences to stimulate transcription
in vivo (
Liberzon et al., 2004). In addition, for many domain classes, and in organisms ranging from yeast to human,
in vivo binding sites detected by ChIP-chip typically contain sequences that reflect those preferred
in vitro (
Carroll et al., 2005;
Harbison et al., 2004).
The mouse genome encodes a larger number of homeodomains than do the genomes of most other vertebrates, including the human genome, and contains representatives of both ancient (NK, Hox) and young (Rhox, Obox) homeodomain families, encompassing striking examples of both purifying and diversifying selection (
Jackson et al., 2006;
Larroux et al., 2007;
Rajkovic et al., 2002). The mouse homeodomain complement, estimated at 260 distinct proteins and 275 individual homeodomains (
Bult et al., 2004), is broadly conserved across animals. For example, most mouse homeodomains (172/275 or 63%) have an identical human counterpart, and among these, most (107/172) have fewer than ten amino acid differences from their
Drosophila counterpart. In contrast to their relative invariance over evolutionary time, however, most homeodomains within a genome are very different from other homeodomains within the same genome: although there are 22 instances of mouse proteins with identical homeodomains, the median number of amino acid differences between any two mouse homeodomains is 37.
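The pairwise comparison summarized above can be sketched as follows: count the positions at which two aligned homeodomain sequences differ, then take the median over all pairs. This is only an illustration under stated assumptions. The sequences are hypothetical 10-residue stand-ins rather than real ~60 amino acid homeodomains, they are assumed to be pre-aligned and of equal length, and the names `n_differences` and `median_pairwise_differences` are introduced here for the example.

```python
from itertools import combinations
from statistics import median

def n_differences(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def median_pairwise_differences(seqs):
    """Median number of differences over all unordered pairs of sequences."""
    return median(n_differences(a, b) for a, b in combinations(seqs, 2))

# Hypothetical toy sequences (not real homeodomains); hdA and hdB are identical.
toy = {
    "hdA": "RRKRTAYTRY",
    "hdB": "RRKRTAYTRY",
    "hdC": "KRKRTSFTQE",
    "hdD": "GKKHRVTFSP",
}

seqs = list(toy.values())
identical_pairs = sum(
    1 for a, b in combinations(seqs, 2) if n_differences(a, b) == 0
)
print(identical_pairs, median_pairwise_differences(seqs))
```

Counting mismatches position by position is the simplest possible measure; it ignores insertions, deletions, and the chemical similarity of the substituted residues, so it should be read as a rough indicator of divergence only.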
In this analysis, we sought to fully characterize the sequence preferences of mouse homeodomains in order to ask whether the binding activity is unique to each homeodomain and whether the full activity profile can be predicted from the primary amino acid sequence of the homeodomain, in a way consistent with a molecular code. We also explore the relevance of the monomeric binding preferences to binding sites in vivo. Since the mouse homeodomains exemplify the functional diversity inherited from the common ancestor of all animals, as well as the potential for homeodomain expansion and divergence, our results and conclusions are extendible across the animal kingdom.