How many homeobox genes and pseudogenes?
Using exhaustive database screening followed by manual examination of sequences, we identified 300 homeobox loci in the human genome. In some cases it was difficult to distinguish functional genes from non-functional pseudogenes. Most loci classified as pseudogenes in this study are integrated reverse-transcribed transcripts, readily recognized by their dispersed genomic location, complete lack of intron sequences and (in some cases) a 3' homopolymeric run of adenine residues. A small minority are duplicated copies of genes, recognized by physical linkage to their functional counterparts and the same (or a similar) exon-intron arrangement. In general, retrotransposed gene copies are non-functional (and therefore pseudogenes) from the moment of integration because they lack the 5' promoter regions necessary for transcription. However, such sequences can occasionally acquire new promoters and become functional as 'retrogenes'. Duplicated gene copies often possess 5' promoter regions (as these are often encompassed by the duplication event); most degenerate to pseudogenes due to redundancy, in a process known as non-functionalization, but some are preserved as functional genes through sub- or neo-functionalization. Thus, in both instances, we sought reliable indicators of non-functionality in order to assign pseudogene status, notably frameshift mutations, premature stop codons and non-synonymous substitutions at otherwise conserved sites in the original coding region.
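The decision rules described above can be summarized as a small sketch. This is an illustration only, not the procedure used in this study: the `Locus` record, its field names and the returned labels are hypothetical, and in practice each call was checked by manual examination of the sequence.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    """Observed features of a candidate gene copy (all field names are hypothetical)."""
    name: str
    has_introns: bool               # are intron sequences retained?
    linked_to_parent: bool          # physically linked to its functional counterpart?
    has_poly_a_tract: bool = False  # supporting evidence only; seen in some cases
    frameshift: bool = False
    premature_stop: bool = False
    conserved_site_substitution: bool = False


def origin(locus: Locus) -> str:
    """Infer how the copy arose from its genomic location and intron content."""
    if not locus.has_introns and not locus.linked_to_parent:
        return "retrotransposition"
    if locus.has_introns and locus.linked_to_parent:
        return "duplication"
    return "unresolved"


def has_disabling_mutation(locus: Locus) -> bool:
    """True if any of the three indicators of non-functionality is present."""
    return (locus.frameshift
            or locus.premature_stop
            or locus.conserved_site_substitution)


def classify(locus: Locus) -> str:
    """Combine inferred origin with the indicators of non-functionality."""
    how = origin(locus)
    if has_disabling_mutation(locus):
        return f"probable pseudogene ({how})"
    if how == "retrotransposition":
        # Intact retrotransposed copies usually lack a promoter, so function
        # still has to be demonstrated before calling them retrogenes.
        return "candidate retrogene (needs evidence of function)"
    return "probable functional gene"
```

An intronless, dispersed copy carrying a premature stop codon is returned as a probable pseudogene, whereas an intact linked duplicate is returned as a probable functional gene.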
We currently estimate that the 300 human homeobox loci comprise 235 functional genes and 65 pseudogenes (Table 1). These figures include three functional genes that possess partial homeobox sequences (PAX2, PAX5 and PAX8) and retrotransposed pseudogenes that correspond to only part of the original transcript, whether or not it includes the homeobox region or indeed any of the original coding region. Consequently, 13 retrotransposed pseudogenes that lack homeobox sequences are included (NANOGP11, TPRX1P1, TPRX1P2, POU5F1P7, POU5F1P8, IRX4P1, TGIF2P2, TGIF2P3, TGIF2P4, CUX2P1, CUX2P2, SATB1P1, ZEB2P1). We do not include PAX1, PAX9 and CERS1; these are functional genes without homeobox motifs, albeit closely related to true homeobox genes (the other PAX and CERS genes).
Table 1. Numbers of human genes, pseudogenes and gene families in each homeobox gene class. The human homeobox gene superclass contains a total of 235 probable functional genes and 65 probable pseudogenes. These are divided between 102 gene families.
The total number of homeobox sequences in the human genome is higher than 300 for two reasons. First, several genes and pseudogenes possess more than one homeobox sequence, notably members of the Dux (double homeobox), Zfhx and Zhx/Homez gene families. Second, we have excluded a set of sequences related to human DUX4, which have become part of 3.3 kb repetitive DNA elements present in multiple copies in the genome [12]. Few of these tandemly repeated sequences are likely to be functional as expressed proteins, and all were probably derived by retrotransposition from functional DUX gene transcripts (see below). Their exclusion from the total count is therefore likely to have limited bearing on understanding the diversity and normal function of human homeobox genes. Hence, our figure of 300 homeobox loci is the most useful current estimate of the repertoire of human homeobox genes and pseudogenes.
We propose a simple classification scheme for homeobox genes, based on two principal ranks: gene class and gene family. A gene class contains one or more gene families, which in turn will contain one or more genes. In a few cases, it is useful to erect an intermediate rank between these levels, and for this we use the term subclass. For the entire set of homeobox genes, we use the term superclass.
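The ranks proposed above nest naturally, and a minimal sketch of one possible representation follows. The class names, field names and the `locate` helper are hypothetical and for illustration only; the entries are a small subset of the gene families discussed in this paper, not the full catalogue.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class GeneFamily:
    name: str                       # title case, e.g. "En"
    genes: List[str]
    subclass: Optional[str] = None  # intermediate rank, used only where helpful


@dataclass
class GeneClass:
    name: str                       # upper case, e.g. "ANTP"
    families: List[GeneFamily] = field(default_factory=list)


# The superclass is the entire set of homeobox genes.
# Only a few families are listed here as an illustrative subset.
SUPERCLASS: Dict[str, GeneClass] = {
    "ANTP": GeneClass("ANTP", [
        GeneFamily("En", ["EN1", "EN2"], subclass="NKL"),
    ]),
    "PRD": GeneClass("PRD", [
        GeneFamily("Pax2/5/8", ["PAX2", "PAX5", "PAX8"], subclass="PAX"),
        GeneFamily("Pax3/7", ["PAX3", "PAX7"], subclass="PAX"),
        GeneFamily("Pax4/6", ["PAX4", "PAX6"], subclass="PAX"),
    ]),
}


def locate(gene: str) -> Optional[Tuple[str, Optional[str], str]]:
    """Return (class, subclass, family) for a gene symbol, or None if not listed."""
    for gene_class in SUPERCLASS.values():
        for family in gene_class.families:
            if gene in family.genes:
                return (gene_class.name, family.subclass, family.name)
    return None
```

For example, `locate("PAX6")` walks from superclass to class to family and reports the PRD class, PAX subclass and Pax4/6 gene family.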
For the rank of gene family, we use a specific evolutionary definition that follows common practice in the fields of comparative genomics and developmental biology. We define a gene family as a set of genes derived from a single gene in the most recent common ancestor of bilaterian animals (here defined as the latest common ancestor of Drosophila and human). This definition has been made explicitly in previous work [2] but is actually a principle that has been in widespread, but rather inconsistent, use for over a decade [15]. For example, amongst the homeobox genes, the En (engrailed) gene family was originally defined to include human EN1
and EN2, plus Drosophila en and inv; these four genes arose by independent duplication from a single gene in the most recent common ancestor of insects and vertebrates. Moving outside the homeobox genes, this principle is also widespread; for example, the Hh (hedgehog) gene family was defined to include mouse Shh, Ihh and Dhh, plus Drosophila hh. To clarify boundaries between gene families, we conducted molecular phylogenetic analyses of human homeodomain sequences, using a range of protostome and occasionally cnidarian homeodomain sequences as outgroups (Additional files 1 and 2).
While the gene family definition described above is generally workable for homeobox genes, by necessity there are some exceptions. One type of exception relates to genes with an unknown ancestral number. For example, there is uncertainty as to whether there were one or two Dlx (distal-less) genes in the most recent common ancestor of bilaterians; however, it is common practice to refer to a single Dlx gene family [18]. Thus, we follow convention for this set of genes. There is similar uncertainty over the ancestral number of Irx (iroquois) genes [19], and again we treat these as a single gene family. The HOX genes are an interesting case, as their precise number in the most recent common ancestor of bilaterians is unknown due to lack of phylogenetic resolution between 'central' genes [20]. Here we divide the HOX genes into seven gene families: the 'anterior' Hox1 and Hox2 gene families, the 'group 3' Hox3 gene family, the 'central' Hox4, Hox5 and Hox6-8 gene families, and the 'posterior' Hox9-13 gene family. Another type of exception relates to 'orphan' genes. These are genes that have been found in one species (for example human) but not in other species, or at least not in a wide diversity of Metazoa. Some of these will be ancient genes that have been secondarily lost from the genomes of some species, in which case they comply with our evolutionary definition of a gene family. Others, however, will be rapidly evolving genes that originated from another homeobox gene and then diverged to such an extent that their origins are unclear [21]. Whenever origins are unclear, we must define a new gene family to encompass those genes, even though they may not date back to the latest common ancestor of bilaterians. In these cases, the gene family is erected to recognize a set of distinct genes on the basis of DNA and protein sequence, rather than on evolutionary origins.
Using the aforementioned criteria, we recognize 102 homeobox gene families in the human genome (Table 1). We are aware that other homeobox gene families exist in bilaterians but have been lost from humans (for example, Nk7, Ro, Hbn, Repo and Cmp; [7]), and we recognize that some gene family boundaries will alter as new information is obtained. Nonetheless, at the present time the 102 gene families provide a sound framework for the study of human homeobox genes.
It is much more difficult to propose a rigorous evolutionary definition for the rank of gene class. Every attempt to classify genes above the level of gene family involves a degree of arbitrariness. We define gene classes by taking two principal criteria into account. First, gene classes should ideally be monophyletic assemblages of gene families. To identify probable monophyletic groups of gene families, we conducted molecular phylogenetic analyses of homeodomain sequences, and looked for sets of gene families that group together stably, regardless of the precise composition of the dataset used (Figures 1, 2 and 3; Additional files 3). Some gene families were difficult to place from sequence data alone, and were found in different gene classes (or subclasses) depending on the precise dataset analyzed or the phylogenetic method employed. This is perhaps not surprising, as trees that encompass many homeobox genes can only be built with a short sequence alignment (the homeodomain); under these conditions, phylogenetic trees can only be used as a guide to possible classification, not as the absolute truth. In ambiguous cases, we used the chromosomal location of genes to guide possible resolution between alternative hypotheses. Second, some homeobox gene classes can be characterized by the presence of additional protein domains outside of the homeodomain [2]. Recognized protein domains associated with homeodomains include the PRD domain, LIM domain, POU-specific domain, POU-like domain, SIX domain, various MEINOX-related domains, the CUT domain, PROS domain, and various ZF domains [2].
Figure 1. Maximum likelihood phylogenetic tree of human ANTP-class homeodomains. Arbitrarily rooted phylogenetic tree of human ANTP-class homeodomains constructed using the maximum likelihood method. Bootstrap values supporting internal nodes with over 70% are shown.
Figure 2. Maximum likelihood phylogenetic tree of human PRD-class homeodomains. Arbitrarily rooted phylogenetic tree of human PRD-class homeodomains constructed using the maximum likelihood method. Bootstrap values supporting internal nodes with over 70% are shown.
Figure 3. Maximum likelihood phylogenetic tree of human homeodomains excluding ANTP and PRD classes. Arbitrarily rooted phylogenetic tree of human homeodomains excluding the ANTP and PRD classes constructed using the maximum likelihood method.
Using the aforementioned criteria, we recognize eleven homeobox gene classes in the human genome: ANTP, PRD, LIM, POU, HNF, SINE, TALE, CUT, PROS, ZF and CERS (Table 1). There is no expectation that the eleven gene classes will be of similar size, simply because some classes will have undergone more expansion by gene duplication than others. In the human genome, the ANTP and PRD classes are much larger than the other classes. Although gene classes should ideally be monophyletic, it is possible that the ZF homeobox gene class, characterized by the presence of zinc finger motifs in most of its members, is polyphyletic (Figure 3; Additional file 5). In other words, domain shuffling may have brought together a homeobox sequence and a zinc finger sequence on more than one occasion. The same may also be true for the LIM class; alternatively, the apparent polyphyly of LIM-class homeodomains could be a consequence of LIM domain loss or artefactual placement of some ZF-class homeodomains in phylogenetic analyses (Figure 3; Additional file 5).
In theory, it is possible to recognize higher level associations above the level of the gene class, because the diversification of homeobox genes will have taken place by a continual series of gene duplication events. We do not propose names for hierarchical levels above the rank of class, and consider that gene name, gene family and gene class (and occasionally subclass) convey sufficient information for most purposes.
We use a consistent convention for writing gene classes and gene families. We present the names of all gene classes in abbreviated non-italicized upper case (for example, the ANTP and PRD classes) to avoid confusion with gene symbols (Antp) or indeed gene names (Antennapedia). In contrast, we present the names of all gene families in non-italicized title case; for example, the Cdx, En and Gsc gene families. We have used this style consistently in recent work [6] and note that several other authors have done likewise [4]. We suggest that this style, and most of these gene family names, can be used in other bilaterian genomes. Extending the scheme to non-bilaterians is more difficult, however, and awaits clarification of the relationship between the homeobox genes of sponges, placozoans, cnidarians and bilaterians [7].
The ANTP homeobox class
The ANTP class derives its name from the Antennapedia (Antp) gene, one of the HOX genes within the ANT-C homeotic complex of Drosophila melanogaster. The human genome has 39 HOX genes, arranged into four Hox clusters. Here we divide the HOX genes into seven gene families: Hox1, Hox2, Hox3, Hox4, Hox5, Hox6-8 and Hox9-13. The HOX genes are not the only ANTP-class genes, and we recognize a total of 37 gene families in this class (Table 1). We divide these 37 gene families between two subclasses that are relatively well-supported in phylogenetic analyses: the HOXL and the NKL subclasses (Figure 1; Additional file 3). As previously discussed, the subclasses are largely consistent with the chromosomal positions of genes [26]. The HOXL (HOX-Like or HOX-Linked) genes primarily map to two fourfold paralogous regions: the Hox paralogon (2q, 7p/q, 12q and 17q) and the ParaHox paralogon (4q, 5q, 13q and Xq) (Figure 4). The NKL (NK-Like or NK-Linked) genes are more dispersed, but there is a concentration on the NKL or MetaHox paralogon (2p/8p, 4p, 5q and 10q) (Figure 4). Somewhat aberrantly, the Dlx and En gene families group with the NKL subclass in phylogenetic analyses (Figure 1; Additional file 3), but with the HOXL subclass on the basis of chromosomal positions (Figure 4).
Figure 4. Chromosomal distribution of human homeobox genes. Ideograms of human chromosomes showing the locations of human homeobox genes. Hox clusters are each shown as a single line for simplicity. Probable pseudogenes are not shown.
Most of the 37 gene families in the ANTP class have been clearly defined before. We draw attention here to several cases that could cause confusion. Other details can be found in Table .
Human ANTP class homeobox genes and pseudogenes
Cdx, Gsx and Pdx gene families. Some authors refer to the Pdx gene family as the Xlox gene family [28]. One gene from each of these families (including CDX2) forms the ParaHox cluster at 13q12.2 (Figure 4), and clustering of Cdx, Gsx and Pdx genes is ancestral for chordates [28].
Mnx gene family. This gene family name derives from a previous study [29]. The family includes one gene in the human genome, MNX1, and two genes in the chicken genome, Mnx1 and Mnx2. Some authors refer to the Mnx gene family as the Exex gene family, after the Drosophila gene.
Dlx gene family. It is currently unclear if this gene family is derived from one or more genes in the common ancestor of bilaterians [18]. Phylogenetic analyses place this gene family firmly within the NKL subclass (Figure 1; Additional file 3), but chromosomal positions (on the Hox chromosomes 2, 7 and 17) place it within the HOXL subclass (Figure 4). Here we favor placement of the Dlx gene family within the NKL subclass due to strong phylogenetic support.
En gene family. Phylogenetic analyses place this gene family either within the NKL subclass (maximum likelihood; Figure 1) or close to the division between the NKL and HOXL subclasses (neighbor-joining; Additional file 3). Here we place the En gene family within the NKL subclass, although we note that human EN2 maps close to clear HOXL-subclass genes, including GBX1, on chromosome 7 (Figure 4).
Nk2.1 and Nk2.2 gene families. The genes NKX2-1
divide into two distinct gene families each with an invertebrate ortholog, not a single Nk2 gene family. NKX2-1
are collectively orthologous to Drosophila scro
and amphioxus AmphiNk2-1
]; these comprise one gene family: Nk2.1. NKX2-2
are collectively orthologous to Drosophila vnd
and amphioxus AmphiNk2-2
]; these comprise a second gene family: Nk2.2.
Nk4 gene family. The genes NKX2-3
form a gene family, quite distinct from other human genes that confusingly share the prefix NKX2
. These three genes are actually orthologs of Drosophila tin
); they are not orthologs of Drosophila vnd
) or scro
]. Therefore, they do not belong to the Nk2.1 or Nk2.2 gene families, but belong to a separate Nk4 gene family. As the three gene names are in very extensive current use, it may be difficult for revised names to be adopted consistently. We therefore do not alter the current names, but raise for discussion the possibility of these genes being renamed to the more logical NKX4-1
) and NKX4-3
), or to CSX1
) and CSX3
), based on the alternative name CSX1
Noto gene family. This gene family falls close to the division between the ANTP and PRD classes in phylogenetic analyses (Additional files 1 and 2). We favor placement within the ANTP class as the human NOTO gene is chromosomally linked to clear ANTP-class (NKL-subclass) genes, including EMX1, on chromosome 2 (Figure 4), suggesting ancestry by ancient tandem duplication.
Most of the 100 genes in the ANTP class have been adequately named previously. However, several genes were unnamed or misnamed prior to this study. We have updated these as follows.
GSX2 [Entrez Gene ID: 170825] is the second of two human members of the Gsx gene family. This previously unnamed gene has clear orthology to mouse Gsh2, inferred from sequence identity and synteny. We designate the gene GSX2 and revise the nomenclature of the other human member of the family from GSH1 to GSX1 [Entrez Gene ID: 219409], in accordance with homeobox gene nomenclature convention.
MNX1 [Entrez Gene ID: 3110] is the only member of the Mnx gene family in the human genome. This gene was previously known as HLXB9; we rename it MNX1 because it is not part of a series of at least nine related genes.
PDX1 [Entrez Gene ID: 3651] is the only member of the Pdx gene family in the human genome. This gene was previously known as IPF1; we rename it PDX1 because the majority of published studies use this as the gene symbol.
BSX [Entrez Gene ID: 390259] is the only member of the Bsx gene family in the human genome. We designate this previously unnamed gene BSX on the basis of clear orthology to the mouse Bsx gene, inferred from sequence identity and synteny.
DBX1 [Entrez Gene ID: 120237] and DBX2 [Entrez Gene ID: 440097] are the only two members of the Dbx gene family in the human genome. We designate these previously unnamed genes DBX1 and DBX2 on the basis of clear orthology to mouse Dbx1 and Dbx2, inferred from sequence identity and synteny.
NKX1-1 [Entrez Gene ID: 54729] and NKX1-2
[Entrez Gene ID: 390010] are the only two members of the Nk1 gene family in the human genome. These genes were previously known as HSPX153
respectively; we rename them NKX1-1 and NKX1-2 on the basis of clear orthology to mouse Nkx1-1 and Nkx1-2, inferred from sequence identity and synteny.
NKX2-1 [Entrez Gene ID: 7080] is the first of two human members of the Nk2.1 gene family. This gene was previously known as TITF1; we rename it NKX2-1 to show that it is a member of the Nk2.1 gene family.
NKX2-6 [Entrez Gene ID: 137814] is the third of three human members of the Nk4 gene family. We designate this previously unnamed gene NKX2-6 on the basis of clear orthology to mouse Nkx2-6, inferred from sequence identity and synteny, although nomenclature revision for the entire Nk4 gene family should be discussed (see above).
NKX3-2 [Entrez Gene ID: 579] is the second of two human members of the Nk3 gene family. This gene was previously known as BAPX1; we rename it NKX3-2 to show that it is a member of the Nk3 gene family.
NKX6-3 [Entrez Gene ID: 157848] is the third of three human members of the Nk6 gene family. We designate this previously unnamed gene NKX6-3 on the basis of clear orthology to mouse Nkx6-3, inferred from sequence identity and synteny.
VENTX [Entrez Gene ID: 27287] is the only functional member of the Ventx gene family in the human genome. This gene was previously known as VENTX2. We remove the numerical suffix from this gene symbol because we discovered that the sequence formerly known as VENTX1 is actually a retrotransposed pseudogene derived from this gene. Accordingly, we also replace the VENTX1 symbol with VENTXP7.
In contrast to the probable functional genes described above, there has been much less research on pseudogenes within the ANTP class. Eleven pseudogenes derived from the human NANOG gene have been described previously [22], while four pseudogenes in the Ventx gene family have been reported following routine annotation of the human genome. We have identified two additional Ventx-family pseudogenes (VENTXP5 and VENTXP6), and also found two cases of pseudogenes that were originally mistaken for functional genes (MSX2P1 and VENTXP7). In all cases, we have clarified the origins and organization of these pseudogenes. This research brings the total number of ANTP-class pseudogenes in the human genome to 19.
MSX2P1 [Entrez Gene ID: 55545]. A short cDNA sequence [EMBL: X74862] related to the Msx gene family was reported previously [35]; the former Entrez Gene record labeled HSHPX5 was based on this sequence. This locus was later provisionally called MSX4, as it was distinct from human MSX1, and by synteny it was clearly not the ortholog of mouse Msx3. It is now clear that this locus was formed by retrotransposition of mRNA from MSX2 and hence we name it MSX2P1. The genomic sequence of MSX2P1 can now be accessed via the Reference Sequence collection [RefSeq: NR_002307]. The pseudogene shares 91% sequence identity with MSX2 mRNA, lacks intronic sequence, and has remnants of a 3' poly(A) tail. It is intriguing, but probably coincidental, that the MSX2P1 pseudogene has integrated at 17q23.2, close to several ANTP-class genes (including the HOXB cluster and MEOX1).
NANOGP1 [Entrez Gene ID: 404635]. We follow Booth and Holland [22] and classify NANOGP1 as a pseudogene that arose by tandem duplication of NANOG. The alternative view, argued by Hart et al. [36], is that this locus is a functional gene, and should be named NANOG2. There is evidence for transcription of this locus in human embryonic stem cells [36], and for selection-driven conservation of the open reading frame [37], but as yet no clear evidence for function.
NANOGP8 [Entrez Gene ID: 388112]. We follow Booth and Holland [22] and classify NANOGP8 as a retrotransposed pseudogene. The alternative view, argued by Zhang et al. [38], is that this locus is a functional retrogene. There is evidence for transcription and translation of this locus in cancer cell lines and tumors [38], but no evidence yet for a role in normal tissues.
VENTXP1 [Entrez Gene ID: 139538], VENTXP2 [Entrez Gene ID: 347975], VENTXP3 [Entrez Gene ID: 349814] and VENTXP4 [Entrez Gene ID: 152101]. These four VENTX retrotransposed pseudogenes have been reported previously, and were originally known as VENTX2P1 to VENTX2P4. The correction of the VENTX2 gene symbol to simply VENTX (see above) means that each of the pseudogene names should also change; we rename them VENTXP1 to VENTXP4. One of these pseudogenes is transcribed but, due to mutations, can no longer encode a homeodomain protein; it can, however, encode an antigenic peptide (NA88A) responsible for T-cell stimulation in response to melanoma [39].
VENTXP5 [Entrez Gene ID: 442384]. We designate this previously unnamed sequence VENTXP5 because it is clearly a retrotransposed pseudogene of VENTX. The genomic sequence of VENTXP5 can now be accessed via the Reference Sequence collection [RefSeq: NG_005091]. The pseudogene shares 83% identity with VENTX mRNA (after masking of an Alu element in the parental mRNA sequence), lacks intronic sequence, and has remnants of a 3' poly(A) tail.
VENTXP6 [Entrez Gene ID: 552879]. We designate this previously unannotated sequence VENTXP6 because it is clearly a retrotransposed pseudogene of VENTX. Its lack of annotation may reflect the fact that it is located within an intron of an unrelated and well characterized gene, STAU2. The genomic sequence of VENTXP6 can now be accessed via the Reference Sequence collection [RefSeq: NG_005090]. The pseudogene shares 87% identity with VENTX mRNA (after masking of an Alu element in the parental mRNA sequence) and lacks intronic sequence.
VENTXP7 [Entrez Gene ID: 391518]. A short cDNA sequence [EMBL: X74864] was reported previously and named HPX42. This was later renamed the VENTX1 gene, after it was found to be related to Xenopus Ventx-family genes. Our analysis of the genomic sequence at this locus reveals that it is actually a retrotransposed pseudogene of the VENTX gene (formerly VENTX2); thus we designate it VENTXP7. The genomic sequence of VENTXP7 can now be accessed via the Reference Sequence collection [RefSeq: NR_002311]. The pseudogene shares 86% identity with VENTX mRNA (after masking of an Alu element in the parental mRNA sequence), lacks intronic sequence, and has remnants of a 3' poly(A) tail.
One other gene could conceivably be included in the ANTP class, but is excluded from our survey. This gene [Entrez Gene ID: 360030; GenBank: AY151139] has been annotated as a homeobox gene and is located just 20 kb from NANOG. However, no homeodomain was detected when the deduced protein was analyzed for conserved domains, and secondary structure prediction did not recover the expected organization of alpha helices. Alignment with the NANOG homeodomain reveals identity of the KQ and WF motifs, either side of the same intron position (44/45), but few other shared residues. It is possible, but unproven, that the locus arose by tandem duplication of part, or all, of the NANOG homeobox gene. This gene has generated two retrotransposed pseudogenes: one at 2q11.2 and another at 12q24.33.
The PRD homeobox class
The PRD class derives its name from the paired (prd) gene of Drosophila melanogaster. In previous studies, the PRD class has been subdivided in several different ways, often based on the identity of the amino acid at residue 50 in the homeodomain, for example S50, K50 and Q50. These categories are not monophyletic groupings of genes and so can be misleading if we aim for a classification scheme that reflects evolution [5]. Here we divide the PRD class into two subclasses of unequal size: the PAX subclass (containing seven PAX genes, excluding PAX1 and PAX9), and the PAXL subclass (containing 43 non-PAX genes and many pseudogenes) (Table 1). PAX genes are defined by possession of a conserved paired-box motif, distinct from the homeobox, coding for the 128-amino-acid PRD domain. Of the nine human genes possessing a paired-box (PAX1 to PAX9), only four also contain a complete homeobox (PAX3, PAX4, PAX6 and PAX7). Three genes have a partial homeobox (PAX2, PAX5 and PAX8), while two lack a homeobox entirely (PAX1 and PAX9). Phylogenetic analyses using PAX genes from a range of species suggest that these are secondary conditions, and that the ancestral PAX gene probably possessed both motifs [40]. The PAX genes do not constitute a single gene family, because it is clear that the latest common ancestor of the Bilateria contained four PAX genes. Three of these are ancestors of the PRD-class homeobox gene families Pax2/5/8, Pax3/7 and Pax4/6; the fourth is the ancestor of PAX1 and PAX9. Thus the PAX subclass contains three gene families. We divide the PAXL subclass into 28 gene families, although as explained below not all of these date to the base of the Bilateria. Thus, we recognize a total of 31 gene families in the PRD class (Table 1).
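The counts in the preceding paragraph can be cross-checked with a short tally. This is a bookkeeping sketch only: the dictionary layout and variable names are hypothetical, and the assignment of each PAX gene to a homeobox state follows the gene lists and gene family names given in this paper.

```python
from collections import Counter

# Homeobox status of the nine human paired-box genes, as described in the text.
PAX_HOMEOBOX = {
    "PAX1": "absent", "PAX9": "absent",
    "PAX2": "partial", "PAX5": "partial", "PAX8": "partial",
    "PAX3": "complete", "PAX4": "complete",
    "PAX6": "complete", "PAX7": "complete",
}

# Number of genes in each homeobox state.
tally = Counter(PAX_HOMEOBOX.values())

# Only PAX genes retaining at least part of a homeobox are counted
# within the PAX subclass of the PRD homeobox class.
pax_subclass = sorted(g for g, state in PAX_HOMEOBOX.items() if state != "absent")

PAX_FAMILIES = 3     # Pax2/5/8, Pax3/7 and Pax4/6
PAXL_FAMILIES = 28
PAXL_GENES = 43

prd_families = PAX_FAMILIES + PAXL_FAMILIES
prd_genes = len(pax_subclass) + PAXL_GENES

print(dict(tally))
print(pax_subclass)
print(prd_families, prd_genes)
```

The tally reproduces the four complete, three partial and two absent homeoboxes, the seven genes of the PAX subclass, the 31 gene families of the PRD class, and a total of 50 PRD-class genes.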
Many of the 31 gene families in the PRD class have been clearly defined before. We draw attention here to newly defined gene families and cases that could cause confusion. Other details can be found in Table .
Human PRD class homeobox genes and pseudogenes
Argfx, Dprx and Tprx gene families. There are no known invertebrate members of these three gene families. Therefore, these are exceptions to the rule defining gene families as dating to the base of the Bilateria. The Dprx and Tprx gene families may have arisen by duplication and very extensive divergence from CRX, a member of the Otx gene family, during mammalian evolution; the origins of ARGFX are obscure [21].
Dux gene family. Members of this gene family are characterized by the presence of two closely-linked homeobox motifs. Most members are intronless sequences present in multiple polymorphic copies within the 3.3 kb family of tandemly repeated elements associated with heterochromatin. These comprise sequences such as DUX1 reported in previous studies [12] and numerous DUX4 copies detected in this study (see below). The absence of introns suggests that these sequences may have originated by retrotransposition from an mRNA transcript; thus they are probably non-functional. There are two notable exceptions, including DUXA, that possess introns; thus either one could be the progenitor for the large number of intronless Dux-family sequences found in the human genome. DUXA has spawned 10 retrotransposed pseudogenes and has been described previously [21]; the other is described here (see below).
Hopx gene family. Phylogenetic analyses place this gene family, containing a single very divergent homeobox gene, HOPX, either within the PRD class (maximum likelihood; Additional file 1) or close to Zhx/Homez-family genes (neighbor-joining; Additional file 2). We favor placement in the PRD class for three reasons. First, the HOPX homeodomain has highest sequence identity with PRD-class homeodomains (GSC: 38% and PAX6: 36%). Second, the HOPX homeodomain possesses the same combination of residues that are invariably conserved across human PRD-class homeodomains (Additional file 6). Third, the HOPX homeodomain shares the 46/47 intron position seen in many PRD-class homeodomains. HOPX does not map particularly near any other homeobox genes, although the closest is GSX2 in the ANTP class at 4q12 (Figure 4). HOPX is not a typical PRD-class homeobox gene; the homeodomain has a single amino acid insertion between helix I and helix II (Additional file 6), and lacks the ability to bind DNA [41].
Leutx gene family. This gene family contains a single gene in the human genome, LEUTX, and no known invertebrate members. We place LEUTX in the PRD class for four reasons. First, there is weak phylogenetic support for this placement (Additional files 1 and 2). Second, the LEUTX homeodomain possesses the same combination of residues that are invariably conserved across human PRD-class homeodomains (except for a leucine at position 20; Additional file 6). Third, the LEUTX homeodomain shares the 46/47 intron position seen in many PRD-class homeodomains. Fourth, the LEUTX gene is located close to PRD-class genes, including TPRX1, on the distal end of the long arm of chromosome 19 (Figure 4). This fourth observation leads us to hypothesize that this gene family arose by tandem duplication and extensive divergence during mammalian evolution.
Nobox gene family. This gene family falls close to the division between the ANTP and PRD classes in both maximum likelihood and neighbor-joining phylogenetic analyses (Additional files 1 and 2). We favor placement within the PRD class because the NOBOX homeodomain has higher sequence identity with PRD-class homeodomains (up to 55%) than with ANTP-class homeodomains (up to 46%). Chromosomal position does not shed light on the issue, as its location at 7q35 is close to both ANTP- and PRD-class genes (Figure 4).
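The identity comparisons used for divergent homeodomains such as HOPX and NOBOX can be illustrated with a minimal sketch. It assumes pre-aligned sequences of equal length and counts only ungapped columns; the exact identity measure used in this study is not specified here, and the function names `percent_identity` and `best_class` are hypothetical.

```python
from typing import Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned, ungapped columns that carry identical residues."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def best_class(query: str, references: Dict[str, List[str]]) -> Tuple[str, float]:
    """Return the class whose best-matching reference is most similar to the query.

    `references` maps a class name to a list of aligned reference sequences.
    """
    best = {name: max(percent_identity(query, seq) for seq in seqs)
            for name, seqs in references.items()}
    winner = max(best, key=best.get)
    return winner, best[winner]


# Made-up toy sequences, far shorter than a real 60-residue homeodomain.
toy_references = {
    "PRD": ["RRNRTTFT", "QRHRTRFS"],
    "ANTP": ["RKRRQAFS"],
}
print(best_class("RRHRTTFT", toy_references))
```

As the text stresses, identity over so short an alignment is only a guide, which is why it is weighed alongside conserved residues, intron positions and chromosomal location.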
Otx gene family. This very well known gene family was originally considered to contain human OTX1 and OTX2 (and their mouse orthologs) and the Drosophila otd gene. Later, it was shown that the CRX gene is a member of the same gene family, deriving from the same ancestral gene. Thus, CRX could be considered the true OTX3. Unfortunately, the OTX3 symbol was formerly used erroneously for a gene in a different family, now called DMBX1, thus complicating its future use. The gene family name Otx is derived by majority rule from the constituent genes.
Pax2/5/8 gene family. This gene family is also known as Pax group II; it contains PAX2
, clearly derived from a single ancestral gene [45
]. These genes have partial homeoboxes.
Pax3/7 gene family. This gene family is also known as Pax group III; it contains PAX3
, clearly derived from a single ancestral gene [46
Pax4/6 gene family. This gene family is also known as Pax group IV; it contains PAX4
. There is confusion as to whether this should be split into two gene families, because invertebrate homologs generally group with PAX6
in phylogenetic analyses and not as an outgroup to the two genes as might be expected. We follow the generally accepted view and group PAX4 and PAX6
into a single gene family, proposing that PAX4
is a divergent member, not an ancient gene [40
Rhox gene family. The mouse Rhox cluster was first described as comprising twelve X-linked homeobox genes, all selectively expressed in reproductive tissues [47
]. Subsequent studies reported a total of 32 genes in the cluster, with the additional genes attributed to recent tandem duplications [48
]. The human genome contains three homeobox genes at Xq24 that are clearly members of the Rhox gene family based on sequence identity, molecular phylogenetics, intron positions and chromosomal location. These are RHOXF1
) and RHOXF2B
Most of the 50 genes in the PRD class have been adequately named previously. However, several genes were unnamed or misnamed prior to this study. We have updated these as follows.
ALX1 [Entrez Gene ID: 8092] is the first of three human members of the Alx gene family. This gene was previously known as CART1
; we rename it ALX1
because it is related to ALX3 and ALX4
; all three genes were formed by duplication from a single ancestral invertebrate gene [52
DRGX [Entrez Gene ID: 117065] is the only member of the newly defined Drgx gene family in the human genome. This gene was previously known as PRRXL1
, and there is a clear mouse ortholog (Prrxl1
). The symbol PRRXL1
is misleading because it implies membership of the Prrx gene family, containing PRRX1 and PRRX2
in the human genome. Several lines of evidence suggest it belongs to a different gene family. First, this gene (at 10q11.23) is not located in the same paralogon as PRRX1
(1q24.3) and PRRX2
(9q34.11) so they are not three paralogs generated during genome duplication in early vertebrate evolution. Second, it has a completely different exon-intron structure from the Prrx-family genes, and it does not contain a Prrx domain or an OAR domain (present in PRRX1
]). Third, the homeodomain is only 73% identical to PRRX1 and PRRX2 homeodomains, much lower than the 80-100% usually encountered for members of the same gene family in humans. Finally, we have identified the Drosophila
. The homeodomains of Drosophila
IP09201 and human DRGX form a highly supported monophyletic group in our maximum likelihood (90%; Additional file 1
) and neighbor-joining (97%; Additional file 2
) phylogenetic analyses. The new symbol DRGX
(dorsal root ganglia homeobox
) incorporates the root of the former symbol DRG11
, referring to expression of the rodent ortholog in dorsal root ganglia neurons [54
DUXB [Entrez Gene ID: 100033411] is a human member of the Dux (double homeobox) gene family. As previously discussed, most members of this gene family are intronless and are probably derived by retrotransposition of an mRNA transcript from a functional intron-containing Dux gene (or duplication of such an integrant). Booth and Holland [21
] described the DUXA
gene containing five introns (including one within each homeobox), and noted the existence of a second intron-containing human Dux-family gene provisionally designated DUXB
. The DUXB
nomenclature is endorsed here. No cDNA or EST sequences have been reported for DUXB
GSC2 [Entrez Gene ID: 2928] is the second of two human members of the Gsc gene family. This gene was previously known as GSCL
; we rename it GSC2
to remove the inadvertent implication that it is not a true gene, and also to reflect the clear orthology to chick Gsc2
as inferred by phylogenetic analysis and synteny.
HOPX [Entrez Gene ID: 84525] is the only member of the newly defined Hopx gene family in the human genome. The mouse version of the gene was identified first and named Hop
(homeodomain only protein
) because the encoded protein is just 73 amino acids long, with 61 of these making up the homeodomain [41
]. The HOP
gene symbol is not ideal as it is also used for unrelated genes, including hopscotch
in mouse. Therefore, we revise the gene symbol from HOP
) in accordance with homeobox gene nomenclature convention.
LEUTX [Entrez Gene ID: 342900] is the only member of the newly defined Leutx gene family in the human genome. We designate this previously unnamed gene LEUTX
(leucine twenty homeobox) to reflect the presence of a leucine residue at the otherwise highly conserved homeodomain position 20; other PRD-class homeodomains have a phenylalanine at this position (Additional file 6
). Studies of mutations in other homeobox genes suggest that mutation to leucine alters transcriptional activity of a homeodomain protein [55
RAX2 [Entrez Gene ID: 84839] is the second of two human members of the Rax gene family. This gene was previously known as RAXL1
; we rename it RAX2
to standardize nomenclature.
RHOXF1 [Entrez Gene ID: 158800] and RHOXF2
[Entrez Gene ID: 84528] are two of three human members of the Rhox gene family. These genes were previously known as OTEX/PEPP1
respectively. The prefix PEPP
is not suitable as it is used for numerous aminopeptidase P-encoding genes. Thus, we replace the gene symbols OTEX/PEPP1
respectively, to reflect their orthologous relationship with the mouse Rhox cluster (containing 32 genes, see above) whilst avoiding inadvertent equivalence to specific genes within the cluster.
RHOXF2B [Entrez Gene ID: 727940] is the third human member of the Rhox gene family. This locus was referred to in previous studies as PEPP2b
] and PEPP3
]. The prefix PEPP
cannot be approved for reasons noted above. RHOXF2B
is located very close to RHOXF1
at Xq24 and is clearly a very recent duplicate of RHOXF2
. The genomic sequences at these two loci share 99% identity over exonic, intronic and approximately 20 kb flanking regions. Over the coding region, there are just two nucleotide substitutions (both nonsynonymous); one of these results in an unusual change within the homeodomain (arginine to cysteine at position 18). We currently list RHOXF2B
as a functional gene, although it is possible that it is a duplicated pseudogene.
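The comparison of coding regions described above (counting substitutions and asking whether each changes the encoded amino acid) can be sketched as follows. The function name `classify_substitutions` and the nine-nucleotide toy sequences are assumptions made for illustration; they are not the RHOXF2 or RHOXF2B sequences. The sketch assumes two equal-length, in-frame coding sequences and the standard genetic code.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitutions(ref_cds: str, alt_cds: str):
    """List codons that differ between two in-frame coding sequences.

    Each entry is (codon number, ref codon, alt codon, ref aa, alt aa, kind).
    A codon carrying more than one nucleotide change is reported once.
    """
    ref_cds, alt_cds = ref_cds.upper(), alt_cds.upper()
    if len(ref_cds) != len(alt_cds) or len(ref_cds) % 3:
        raise ValueError("sequences must be equal length and a multiple of 3")
    changes = []
    for i in range(0, len(ref_cds), 3):
        r, a = ref_cds[i:i + 3], alt_cds[i:i + 3]
        if r == a:
            continue
        r_aa, a_aa = CODON_TABLE[r], CODON_TABLE[a]
        kind = "synonymous" if r_aa == a_aa else "nonsynonymous"
        changes.append((i // 3 + 1, r, a, r_aa, a_aa, kind))
    return changes


# Toy example: codon 2 changes arginine (CGT) to cysteine (TGT);
# codon 3 changes CTG to CTA, both of which encode leucine.
for change in classify_substitutions("ATGCGTCTG", "ATGTGTCTA"):
    print(change)
```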
SEBOX [Entrez Gene ID: 645832] is the only member of the Sebox gene family in the human genome. The human gene is the ortholog of mouse Sebox
based on their locations in syntenic chromosomal regions (17q11.2 and 11B5 respectively) and presence of the same intron positions. However, sequence identity is lower than normal for orthologous genes in mouse and human (78% amino acid identity over the homeodomain) and there is evidence that the human gene has undergone divergence. Most surprisingly, the human sequence has two unusual substitutions in the homeodomain [57
]. At homeodomain position 51, the human sequence codes for lysine whereas mouse has asparagine; an earlier analysis of 346 homeodomain sequences found asparagine to be invariant at this position [1
]. Similarly, at homeodomain position 53, human has tryptophan whereas mouse has arginine; this position is almost invariably arginine [1
]. These sequence changes in the important third helix raise the possibility that human SEBOX
could have accumulated mutations as a non-functional pseudogene. Until this is shown more clearly we consider it to be a functional, but divergent, gene. This gene was previously known as OG9X
as the alternative symbol; we favor SEBOX
because the OG
prefix was originally used for several unrelated homeobox genes.
UNCX [Entrez Gene ID: 340260] is the only member of the Uncx gene family in the human genome. This gene was previously known as UNCX4.1
; we remove the numerals to give UNCX
as these do not denote a series within a gene family.
VSX2 [Entrez Gene ID: 338917] is the second of two human members of the Vsx gene family. This gene was previously known as CHX10
; we rename it VSX2
to better reflect its paralogous relationship to VSX1
has been used as an alias for this gene in other vertebrate species and the gene symbol CHX10
has the disadvantage of implicitly suggesting presence of at least nine paralogs in human (CHX1
), which do not exist.
Unlike the situation with the ANTP class, many of the pseudogenes within the PRD class have been well characterized. A previous study has described and named two pseudogenes in the Argfx gene family, seven pseudogenes in the Dprx gene family, four pseudogenes in the Tprx gene family, and 10 pseudogenes derived from the DUXA
]. There is also a possibility that the SEBOX and RHOXF2B
loci are non-functional pseudogenes, as described above. We have identified a previously undescribed pseudogene from the Otx gene family (OTX2P1
), and argue that the majority of Dux-family sequences are pseudogenes.
OTX2P1 [Entrez Gene ID: 100033409]. We designate this previously undescribed sequence OTX2P1
because it is clearly a retrotransposed pseudogene of OTX2
. The genomic DNA sequence of OTX2P1
shares significant homology with OTX2
transcript variant 2 [RefSeq: NM_172337]. There is an Alu element (AluSx subfamily) insertion, a Made1 (Mariner derived element 1) insertion, and a 1182-nucleotide deletion in OTX2P1
compared to OTX2
. The OTX2P1
sequence lacks introns, ends with a poly(A) tail, and harbors critical sequence alterations (including a three-nucleotide insertion introducing a stop codon into the deduced homeodomain).
] and DUX5
]. These sequences have been cloned in previous studies [12
]. We detected no matches with 100% identity to DUX1
in build 35.1 of the human genome sequence, which covers the euchromatic regions of each chromosome. This concurs with previous studies indicating that DUX1
are found in heterochromatin on human acrocentric chromosomes; each is apparently present in multiple copies within members of the 3.3 kb family of tandemly repeated DNA elements [12
]. Because the majority of human heterochromatin has not been sequenced, and may be variable between individuals, the exact number of copies of DUX1
is unknown. It is also debatable whether these loci encode functional proteins. These sequences lack introns and, as discussed above, are most likely derived from intron-containing genes in the Dux family, such as DUXA
]. This sequence has been extensively studied as some of its multiple copies exist within the 3.3 kb repetitive elements of the D4Z4 locus at 4q35 [14
]. The polymorphic D4Z4 locus is linked to facioscapulohumeral muscular dystrophy (FSHD); between 12 and 96 tandem copies of 3.3 kb elements are present in unaffected individuals and deletions leaving a maximum of eight such elements have been associated with FSHD [58
]. In build 35.1 of the human genome sequence, we identified 35 loci at 10 chromosomal locations containing a total of 58 DUX4
(and highly similar) homeobox sequences. This should not be taken as a precise figure due to copy number polymorphism and the possibility of additional copies existing in currently unsequenced heterochromatic regions. Some of the copies are 100% identical to the previously reported DUX4
sequence over the homeobox regions, others have single nucleotide polymorphisms, some have critical sequence mutations, and others have just a single homeobox. Most of the copies are located in tandemly repeated arrays (for example, on chromosomes 4, 10 and 16) and others are alone in the genome (for example, a single copy resides at 3p12.3). The majority of DUX4
copies are unlikely to encode functional proteins as suggested by their intronless, mutated and tandemly repeated nature. The lack of introns indicates they are most likely derived from intron-containing genes in the Dux family, such as DUXA
The POU homeobox class
The POU class generally encodes proteins with a POU-specific domain (named from the mammalian genes Pit1
), and nematode unc-86
) N-terminal to a typical homeodomain. The POU-specific domain is a DNA-binding domain of approximately 75 amino acids; the POU-specific domain and the homeodomain are collectively known as the bipartite POU domain [61
We have identified a total of 16 POU-class homeobox genes in the human genome (Tables and ). The genes form a distinct grouping even if the POU-specific domain is disregarded – phylogenetic analyses of homeodomains recover the POU class as a monophyletic group (Figure ; Additional files 1
). There are six widely recognized gene families within the POU class (Pou1 to Pou6), and nomenclature revisions approximately 10 years ago clarified which genes belong to which gene family [62
]. We have placed two additional genes (HDX
) in the POU class on the basis of their deduced homeodomain sequences, even though one of these genes (HDX
) does not encode a POU-specific domain. We have erected a new gene family for this gene, bringing the total number of gene families in the POU class to seven. We have also identified a total of eight POU-class pseudogenes in the human genome (Tables and ); we have named six of these (POU5F1P2
), and revised the nomenclature of one other (POU5F1P3
HDX [Entrez Gene ID: 139324]. This gene was previously known as CXorf43
. The gene encodes a highly divergent atypical (68-amino-acid) homeodomain but not a POU-specific domain, and thus it is debatable whether it should be placed within the POU class. Phylogenetic analyses of homeodomains place it basally in a clade with the POU class (Figure ; Additional files 1
), or within the POU class (Additional file 2
), suggesting that the HDX protein either diverged before the POU-specific domain became associated with the homeodomain or lost the POU-specific domain during evolution. Further information on this gene may allow this tentative classification to be revisited.
POU5F2 [Entrez Gene ID: 134187]. We designate this previously unnamed gene POU5F2
on the basis of clear orthology to the mouse Sprm1
gene, which has been assigned as the second member of the Pou5 gene family [63
]. The symbol POU5F2
ensures the gene conforms with standardized nomenclature for the POU class.
POU5F1P2 [GeneID: 100009665], POU5F1P3
) [GeneID: 5461], POU5F1P4
[GeneID: 100009666], POU5F1P5
[GeneID: 100009667], POU5F1P6
[GeneID: 100009668], POU5F1P7
[GeneID: 100009669] and POU5F1P8
[GeneID: 100009670]. Prior to this study, a single retrotransposed pseudogene of the POU5F1
gene had been annotated and designated POU5F1P1
[Entrez Gene ID: 5462]. Another POU5F1
-related sequence of unknown status had been annotated and designated POUF5F1L
[GeneID: 5461]. We replace the gene symbol POUF5F1L
with POU5F1P3 as this sequence is a retrotransposed pseudogene of POU5F1
. Our analyses of the human genome sequence identified a further six pseudogenes of POU5F1
, which we name sequentially POU5F1P2
through to POU5F1P8
. Each clearly aligns to the mRNA sequence of POU5F1
but with sequence alterations, indicating origin by retrotransposition. POU5F1P2
have frameshift mutations in the homeobox. POU5F1P5
have stop codons in the homeobox. POU5F1P7 and POU5F1P8
are partial integrants of POU5F1
mRNA excluding the homeobox – POU5F1P7
covers part of the 3' untranslated region and POU5F1P8
a short region around the start codon.
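The indicators of non-functionality used here (frameshift mutations and premature stop codons) can be screened for in a simple way once a locus has been aligned to the coding region of its parent mRNA. The sketch below is a deliberately minimal illustration under stated assumptions: the function names and toy sequences are hypothetical, the locus is assumed to start at the parent start codon, and a frameshift is inferred only from the net length difference, so compensating insertions and deletions would be missed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds: str):
    """Return the 1-based number of the first in-frame stop codon that
    precedes the final complete codon, or None if there is none."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    for k in range(n_codons - 1):  # the final codon is the expected stop
        if cds[3 * k:3 * k + 3] in STOP_CODONS:
            return k + 1
    return None


def disabling_features(ref_cds: str, locus_cds: str):
    """List simple indicators that a locus no longer encodes the parent protein."""
    features = []
    if (len(locus_cds) - len(ref_cds)) % 3 != 0:
        features.append("frameshift")
    stop = first_premature_stop(locus_cds)
    if stop is not None:
        features.append(f"premature stop at codon {stop}")
    return features


# Toy sequences invented for illustration only.
parent = "ATGGCTAAACGTTAA"       # ATG GCT AAA CGT TAA
with_stop = "ATGGCTTAACGTTAA"    # codon 3 AAA -> TAA
with_shift = "ATGGCAAACGTTAA"    # one nucleotide deleted
print(disabling_features(parent, with_stop))
print(disabling_features(parent, with_shift))
```

Partial integrants that exclude the homeobox, such as those described above, would need a separate check on which part of the parent transcript the locus covers.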
The TALE homeobox class
TALE (three amino acid loop extension) class genes are distinguished by the presence of three extra amino acids between the first and second alpha helices of the encoded homeodomain [1
]. Genes belonging to the TALE class encode proteins with various domains outside of the atypical homeodomain.
We have identified a total of 20 TALE-class homeobox genes in the human genome (Tables and ). The genes form a distinct grouping in phylogenetic analyses even when the three extra homeodomain residues are excluded from the sequence alignment (Figure ; Additional file 5
). Bürglin [2
] has given the TALE group the rank of 'superclass' and distinguished between several 'classes' by the presence of distinct domains outside of the homeodomain. These are the IRX domain, MKX domains, the MEIS domain, the PBC domain and TGIF domains [2
]. Along with some others [4
], we have given the TALE group the rank of 'class' containing several 'gene families'; this maintains consistent terminology throughout the present paper. Phylogenetic analyses of homeodomains divide the TALE class into six gene families (Figure ; Additional files 1
), including an Mkx family containing the recently described MKX
gene, which is distinguished from Irx-family genes phylogenetically and by absence of an IRX domain [73
]. It should be noted that the established name of the Pknox gene family does not indicate orthology with Knox-family genes of plants. We have also identified a total of 10 TALE-class pseudogenes in the human genome (Tables and ); we have named six of these (IRX4P1
), and revised the nomenclature of two others (IRX1P1
IRX1P1 [Entrez Gene ID: 646390]. This sequence was previously known as IRXA1
; we rename it IRX1P1
because it is clearly a retrotransposed pseudogene of IRX1
and not a functional gene. The IRX1P1
sequence aligns to the mRNA of IRX1
but has a frameshift mutation and two stop codons in the homeobox.
IRX4P1 [Entrez Gene ID: 100009671]. We designate this previously unannotated sequence IRX4P1
because it is clearly a retrotransposed pseudogene of IRX4
. The IRX4P1
sequence is a partial integrant derived from a region of the IRX4
mRNA around the stop codon; it lacks the homeobox.
PBX2P1 [Entrez Gene ID: 5088]. This sequence was previously known as PBXP1
; we rename it PBX2P1
because it is clearly a retrotransposed pseudogene of PBX2
. The former name of PBXP1
did not indicate its transcript of origin. The PBX2P1
sequence aligns to the mRNA of PBX2
but has a frameshift mutation in the coding region.
TGIF1P1 [Entrez Gene ID: 126052]. We designate this previously unannotated sequence TGIF1P1
because it is clearly a retrotransposed pseudogene of TGIF1
. The locus has many sequence alterations when compared to TGIF1
mRNA, including a 48-nucleotide insertion within the homeobox.
TGIF2P1 [GeneID: 126826], TGIF2P2
[GeneID: 100009674], TGIF2P3
[GeneID: 100009672] and TGIF2P4
[GeneID: 100009673]. These four sequences were unannotated prior to this study. We designate them TGIF2P1 to TGIF2P4
because they are clearly pseudogenes of TGIF2
. Each aligns to the mRNA sequence of TGIF2
but with sequence alterations, indicating origin by retrotransposition. TGIF2P1
has many sequence alterations, including a frameshift mutation in the homeobox. TGIF2P2 and TGIF2P3
are very similar neighboring loci that must have originated by tandem duplication of a retrotransposed TGIF2
mRNA; neither includes the homeobox. TGIF2P4
is a short partial integrant derived from part of the 3' untranslated region of TGIF2
Chromosomal distribution of human homeobox genes
The chromosomal locations of genes can give clues to evolutionary ancestry, including patterns of gene duplication, and the possible existence of gene clusters. In Figure , we show the chromosomal locations of all human homeobox genes. We do not include probable pseudogenes on these ideograms, because most of these have originated by reverse transcription of mRNA and secondary integration into the genome, and hence give no insight into ancestral locations of genes. The highly repetitive DUX1 to DUX5 sequences are also not shown, as these have undergone secondary amplification and are also most likely non-functional (see above).
The first observation is that there are homeobox genes on every human chromosome. Even the two sex chromosomes harbor homeobox genes, with SHOX
(short stature homeobox
) in the PAR1 pseudoautosomal region at the tip of the short arms of X and Y being the best known. Haploinsufficiency of SHOX
is implicated in the short stature phenotype of Turner syndrome patients who lack one copy of the X chromosome [83
]. There are also nine other homeobox genes in non-pseudoautosomal regions of the X chromosome, including three tandemly arranged members of the Rhox gene family, collectively homologous to the multiple Rhox (reproductive homeobox) genes of mouse. Only one of the homeobox genes on the X chromosome, the TALE-class gene TGIF2LX
, has a distinct homolog on the Y chromosome, called TGIF2LY
. These genes map to the largest homology block shared by the unique regions of the X and Y chromosomes, spanning 3.5 Mb. It has been proposed that the ancestor of these two genes arose by retrotransposition of TGIF2
The autosomes with the lowest number of homeobox genes are chromosomes 21 (with just PKNOX1
) and 22 (with GSC2
). Examination of the remaining autosomes reveals that homeobox genes are quite dispersed with some interesting regional accumulations. The best known examples of close linkage between homeobox genes are the four Hox clusters on human chromosomes 2, 7, 12 and 17, comprising 9, 11, 9 and 10 genes respectively; each of these is shown as just a single line on each ideogram for simplicity (Figure ). These should not be considered in isolation, however, because many other ANTP-class genes map in the vicinity of the Hox clusters [26
]. These include genes very tightly linked to the Hox clusters, notably the Evx-family genes (on chromosomes 2 and 7), Dlx-family genes (on chromosomes 2 and 17), and Meox-family genes (on chromosomes 2 and 17).
There are other concentrations of ANTP-class genes away from the Hox clusters. These are the ParaHox cluster (GSX1, PDX1 and CDX2
) on chromosome 13, and four sets of NKL-subclass genes on 2p/8p (split), 4p, 5q and 10q, hypothesized to be derived from an ancestral array by duplication [26
]. The accumulation on the distal half of the long arm of chromosome 10 is particularly striking, comprising eleven ANTP-class genes from 10 gene families. This is not a tight gene cluster, but it is compatible with ancestry by extensive tandem gene duplication followed by dispersal. Discounting the rather aberrant case of the Hox clusters, this region of the long arm of chromosome 10 is the most homeobox-rich region of the human genome.
There are additional groupings of homeobox genes outside the ANTP class. These include two TALE-class Irx clusters on chromosomes 5 and 16 homologous to the described mouse Irx clusters [19
], and a set of PRD-class genes on chromosome 19 proposed to be derived from the CRX
homeobox gene by duplication and rapid divergence [21
]. Perhaps the most interesting case, however, is found on the tip of the long arm of chromosome 9, where there is a concentration of homeobox genes from disparate gene classes. Four LIM-class genes, one ANTP-class gene, one PRD-class gene and one TALE-class gene are found in this location. Although dispersed over a large region, and not forming a tight gene cluster, the linkages are nonetheless intriguing. It is possible that these linkages reflect ancestry from the very ancient gene duplications that must have generated the distinctive homeobox gene classes found within animal genomes.