The Homeodomain Resource is a curated collection of sequence, structure, interaction, genomic and functional information on the homeodomain family. The current version builds upon previous versions by the addition of new, complete sets of homeodomain sequences from fully sequenced genomes, the expansion of existing curated homeodomain information and the improvement of data accessibility through better search tools and more complete data integration. This release contains 1534 full-length homeodomain-containing sequences, 93 experimentally derived homeodomain structures, 101 homeodomain protein–protein interactions, 107 homeodomain DNA-binding sites and 206 homeodomain proteins implicated in human genetic disorders.
Database URL: The Homeodomain Resource is freely available and can be accessed at http://research.nhgri.nih.gov/homeodomain/
Homeodomain-containing proteins are transcription factors that play a critical role in various cellular processes, including body plan specification, pattern formation and cell fate determination during metazoan development (1). Members of this family are characterized by a helix-turn-helix DNA binding motif known as the homeodomain. X-ray crystallographic and NMR spectroscopic studies on several homeodomain-containing proteins (2–6) show that this motif comprises three α-helices that are folded into a compact globular structure with an N-terminal extension. Helices I and II lie parallel to each other and across from the third helix. This third helix is also referred to as the ‘recognition helix’, as it confers DNA-binding specificity on individual homeodomain proteins. Homeodomain-containing proteins may interact with each other to enhance or mediate transcriptional activity, either by the binding of multiple proteins to the same segment of DNA or through the formation of DNA-independent complexes. Nucleotide- and protein-level mutations associated with homeodomain proteins can lead to a number of congenital abnormalities [cf. (7,8)]. The homeodomain structural motif is highly conserved across eukaryotic species, and the expansion and diversification of this family of proteins in various lineages have been shown to coincide with the advent of major morphological innovations (9–12).
In recent years, studies utilizing high-throughput techniques have generated an extraordinary amount of information about these homeodomain proteins, but this information is not always easily accessible to the working biologist. For instance, recent large-scale genome sequencing efforts have led to the availability of complete collections of homeodomain proteins from an evolutionarily diverse set of species, but retrieving complete sets of homeodomain sequences from a particular species is not trivial. Likewise, while several large-scale projects aimed at computationally predicting protein–protein interactions through text mining and other similar approaches have been largely successful in terms of identifying potential relationships between proteins, identifying interactions specific to homeodomains remains an arduous task. In addition, progress in determining 3D structures, identifying protein binding sites and understanding the role of specific homeodomain proteins in disease causation has been steady, so keeping abreast of these discoveries remains challenging.
The Homeodomain Resource uses a combination of automated and manually verified extraction methods to yield a comprehensive collection of sequence, structure, interaction, genomic and functional information on the homeodomain family (13,14). In addition to a complete collection of homeodomains for 24 species (Table 1), the Homeodomain Resource contains information on DNA-binding targets, protein–protein interactions, 3D structures and homeodomains implicated in human disorders. Each annotation is manually curated, mapped to a specific protein and organism and fully cross-referenced to various external databases, including its primary citation in PubMed. Data are presented in an intuitive, user-friendly format and are keyword-searchable across all tables. Each reference in this database is rigorously selected to assure non-redundancy, and updates are performed on a continuous basis.
Examples of how data from the Homeodomain Resource have been used in various biological contexts to date include studies on the prediction of specific DNA-binding sites for homeodomain proteins (15), the analysis of non-conserved co-evolving positions within functional sites in a variety of protein families (16) and the interpretation of phage display selection experiments aimed at identifying elements within the engrailed homeodomain responsible for sequence-specific DNA binding (17). These data have also been used to help interpret features found within the structures of the stem cell transcription factor Nanog (18) and the Drosophila Bicoid–DNA complex (19). Finally, information from the Homeodomain Resource has been used as a reference to aid in understanding mutation data from patients with disorders such as idiopathic short stature and Leri-Weill dyschondrosteosis (20) and brachydactyly types D and E (21).
The Homeodomain Resource has expanded significantly since its last release [Tables 1 and 2; (13)], and substantial enhancements have been made to the user interface to allow for easier navigation and overall usability. Unlike previous versions of the database, the current version connects all annotations in a relational framework (Figure 1), providing an integrated view of all the analyses associated with a particular homeodomain protein. This new system allows for a more powerful query engine that enables a user to query across multiple annotations in a single search (Figure 2). Homeodomain Resource accession numbers are assigned to each entry in the database to facilitate data sharing amongst the user community. These accession numbers take the format HDRxn, where x indicates the data category for the entry (e.g. s=structures) and n is a three-digit number identifying the entry. In addition, the database is more genome-centric, with an eye towards evolutionary studies. Whereas previous versions relied heavily on choosing proteins that had annotations from Swiss-Prot associated with them, this new edition places more emphasis on compiling complete sets of homeodomains from a diverse range of species. The combination of additional sequences, more comprehensive datasets, and greater data connectivity provides a much more powerful and robust resource to biologists.
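As an illustration of the HDRxn accession format described above, the following Python sketch splits an accession into its category letter and numeric part. It is not code from the Homeodomain Resource: the function and class names are hypothetical, and the pattern accepts any number of digits because the example accession HDRp1895 cited in this article has four. Only the category letter s (structures) is defined in the text, so no other letters are interpreted here.

```python
import re
from typing import NamedTuple

# Pattern assumed from the description "HDRxn": the literal prefix HDR,
# one lowercase category letter, then a run of digits.
_ACCESSION_RE = re.compile(r"^HDR([a-z])(\d+)$")


class HdrAccession(NamedTuple):
    category: str  # single-letter data category code, e.g. "s" for structures
    number: int    # numeric identifier of the entry


def parse_hdr_accession(accession: str) -> HdrAccession:
    """Split an HDR-style accession into (category letter, number)."""
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an HDR-style accession: {accession!r}")
    return HdrAccession(match.group(1), int(match.group(2)))
```

Calling `parse_hdr_accession("HDRp1895")` yields the category letter `p` and the number `1895`; a string that does not follow the pattern raises `ValueError`.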
The sequence dataset in the Homeodomain Resource was assembled by first utilizing data from a series of homeodomain surveys of metazoan genomes (22–24). Next, a hidden Markov model (HMM) was generated from these aligned sequences using the HMMer Toolkit (25), and the HMM was subsequently used to search RefSeq (26) to identify additional members of the homeodomain family. Alignments produced by HMMsearch (25) were parsed using Perl scripts; this was followed by manual alignment to the HMMsearch alignment using GeneDoc (27). Inspection (and manual adjustment) of the alignments becomes necessary if HMMsearch introduces gaps in biologically implausible locations within the sequence. One such example involves the sequence of HDRp1895, which is truncated at its N-terminus. HMMsearch introduced a gap of length 7 between the next-to-last (R52) and the last (Q53) residue in the sequence; in this case, the gap was removed, placing R52 directly next to Q53, thereby producing a better-quality alignment. These alignments were then added, along with annotations from Entrez Gene (28), to the Homeodomain Resource. The International Protein Index (29) was used to match Entrez Gene identifiers with entries from other external resources, such as the Mouse Genome Database (MGD; 30) and the Zebrafish Information Network (ZFIN; 31), where possible. As of December 2008, 24 fully sequenced genomes have been sampled: 8 metazoan (4 vertebrate and 4 invertebrate), 11 fungi, 4 protozoan and 1 plant (Table 1). This process yielded 1534 protein entries. Individual protein entries are hyperlinked to a detailed view that presents gene- and protein-level annotation, full-lineage taxonomy and both the full-length and homeodomain-only sequences. Annotations that refer to external resources are hyperlinked to their source database (e.g. Entrez Gene).
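The kind of adjustment described for HDRp1895, in which a gap separating the last residue from the one before it is closed, can be sketched for a single aligned row as follows. This is a minimal Python illustration under stated assumptions, not the procedure used to build the database, where the adjustment was made manually in GeneDoc. The function name is hypothetical, gaps are assumed to be written as `-` or `.`, and the row is padded on the right so that its length is unchanged.

```python
GAP_CHARS = "-."  # assumed gap symbols in the aligned row


def close_gap_before_last_residue(aligned: str, pad: str = "-") -> str:
    """Move the final residue next to the preceding residue.

    The removed gap columns are re-added after the final residue so the
    row keeps its original length.
    """
    # Index of the last residue (non-gap character) in the row, or -1.
    last = max(
        (i for i, ch in enumerate(aligned) if ch not in GAP_CHARS),
        default=-1,
    )
    if last <= 0:
        return aligned  # empty row, all gaps, or a lone leading residue

    # Walk left across the gap run that precedes the last residue.
    prev = last - 1
    while prev >= 0 and aligned[prev] in GAP_CHARS:
        prev -= 1
    if prev < 0 or prev == last - 1:
        return aligned  # no earlier residue, or no gap to close

    gap_len = last - prev - 1
    return (
        aligned[: prev + 1]      # everything up to the next-to-last residue
        + aligned[last]          # the last residue, now adjacent
        + pad * gap_len          # gap columns moved to the right
        + aligned[last + 1 :]    # any original trailing gaps
    )
```

For a toy row such as `KR-------Q`, the function returns `KRQ-------`. Whether closing such a gap is biologically appropriate remains a judgement for the curator; the sketch only performs the mechanical shift.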
The complete set of homeodomain proteins can be downloaded in FASTA format as either full-length sequences or homeodomain alignments. Alternatively, a customized dataset can be built either by selecting sequences resulting from a query or by manually selecting sequences from the entire dataset. Query results can be sorted to facilitate the construction of custom datasets. The ability to retrieve a complete set of aligned homeodomains from a range of species makes the Homeodomain Resource an invaluable first step in a phylogenetic analysis. For example, a researcher wanting to know the phylogenetic affinity of a previously undescribed homeodomain from a fungus could download an aligned dataset of homeodomains from several fungal species, align the undescribed homeodomain to this dataset and then run one or more phylogenetic algorithms on this alignment. Users interested in an evolution-based classification of homeodomain-containing proteins are also encouraged to explore HomeoDB (32), a complementary database focusing on homeobox gene phylogenetic classification.
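A user who has downloaded the FASTA file might assemble a custom dataset locally along the following lines. This Python sketch is only illustrative: the function names are hypothetical, and because the header layout of the downloaded file is not described here, the selection criterion is left to a caller-supplied function rather than tied to an assumed header format.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_records(
    records: Iterable[Record], keep: Callable[[str], bool]
) -> List[Record]:
    """Keep the records whose header satisfies the caller's predicate."""
    return [(header, seq) for header, seq in records if keep(header)]


def write_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Format records as FASTA text, wrapping sequences at `width`."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i : i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

For example, `select_records(read_fasta(open(path)), lambda h: "Saccharomyces" in h)` would retain entries whose header mentions that genus, assuming the species name appears in the header and `path` points to the downloaded file. Gap characters in aligned sequences pass through unchanged, so the selected subset stays aligned.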
The homeodomain structures are manually compiled from the NCBI Entrez Structure database (33) and the Protein Data Bank (PDB; 34). Each structural entry is manually inspected to ensure that the solved structure contains the homeodomain region of the protein. Also noted is the experimental technique used to determine its structure (either X-ray diffraction or NMR spectroscopy). Information on solved 3D structures of both homeodomain proteins and protein–DNA complexes is available in a concise, columnar format. Protein name, PDB and MMDB accessions and the source organism are given for each entry, and the table can be sorted, as needed. For each entry, a link is also provided to a detailed view of that structure record, providing additional information such as experimental technique, PDB title and its primary PubMed reference. From this detailed view, users can follow links to the source Entrez Structure and PDB records, where one can view still images of the structure and download the 3D coordinates of a structure of interest. The detailed view also provides a link to the protein annotation within the Homeodomain Resource itself, as well as to the PubMed abstract corresponding to the primary literature citation listed in PDB.
The Homeodomain Resource contains a systematically and thoroughly curated catalogue of experimentally determined protein–protein interaction data for the homeodomain protein family. To the best of our knowledge, this is the most comprehensive collection of protein–protein interaction annotations specific to the homeodomain family. Interaction data were collected through manual literature searches; essential information about the nature of the specific protein–protein interactions was then extracted from the experimental data presented in these manuscripts. The identification of articles containing relevant biological information from PubMed required the use of discriminatory MeSH terms, from specific to more general keyword search combinations. PubMed titles, abstracts and full text were searched for keywords that would be indicative of relevant protein–protein interactions (e.g. ‘DNA-independent interaction’). Interacting proteins were annotated and cross-linked to their corresponding protein entry within the Homeodomain Resource.
Protein–protein interaction data can be searched by publication information, interaction description and keyword data associated with their corresponding protein entries. Interaction data are returned in columnar format, listing the interacting proteins, the primary citation from the literature, the corresponding Biomolecular Interaction Network Database (BIND; 35) identifier and a link to a detailed view of the interaction. The detailed view provides additional information describing the homeodomain protein interacting regions, interacting residue locations and a description of the mechanism of interaction derived from the primary publication, as well as internal links to details on each of the interacting proteins within the HDR.
A new feature of this release is the cross-referencing of homeodomain protein–protein interaction data to their respective BIND interaction entries. BIND was queried for previously unreported homeodomain protein–protein interactions in parallel with the aforementioned PubMed literature searches, using general (e.g. ‘homeobox OR homeodomain AND interaction’) to more specific (e.g. ‘homeobox OR homeodomain AND interaction_object_type=protein AND NOT=DNA’) search criteria. Following manual removal of false positives, the remaining interactions from BIND were deposited in the Homeodomain Resource. All protein–protein interaction data derived from manual curation of PubMed have also been deposited into the BIND database. Each interaction derived from the Homeodomain Resource has been assigned a unique BIND accession number and is hyperlinked from BIND back to the Homeodomain Resource (Figure 3).
DNA binding sites for homeodomain proteins have been obtained through extensive review of the published literature, citations in Online Mendelian Inheritance in Man (OMIM; 36,37) and entries for DNA-bound homeodomain structures from PDB. As with the interaction data described above, binding site data can be searched by publication information and by keyword data associated with their corresponding protein entries. Binding site data are returned in columnar format; the columns include homeodomain names, their respective DNA-binding sequences and references to the primary citation from which the information was retrieved. The core regions of each of the DNA binding sites are shown in bold type. A detailed view of the binding site record displays the consensus DNA sequence, the corresponding PubMed reference and a link to details about the protein; the protein details include the Protein HDR identifier, the common name of the protein, the gene symbol listed in Entrez Gene and the UniProt protein accession.
Information on human genetic and genomic disorders linked to homeodomain proteins has been compiled from manual searches of both OMIM and the Human Gene Mutation Database (HGMD; 38). Any false positives resulting from the OMIM and HGMD searches were manually removed from the dataset. Manually derived entries from the previous Homeodomain Resource release were automatically compared and updated, while new automated entries were manually verified.
Each entry in the Disorders and Mutations dataset represents a single homeobox gene associated with one or more disease(s) or disorder(s). For each, the corresponding OMIM nucleotide- (e.g. 1-BP DEL, 504T) and/or protein-level (e.g. GLN140TER) mutations are shown. This dataset can be queried using any of the aforementioned fields, and the results can be sorted by clicking on the appropriate column field heading. Gene symbols are hyperlinked to the corresponding entry in the proteins table as well as to entries in HGMD (registration required).
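Protein-level mutation labels of the kind shown above (e.g. GLN140TER) can be converted to the shorter one-letter notation with a few lines of code. The Python sketch below is an illustration only: the function names are hypothetical, and the pattern (three-letter residue, position, three-letter residue or TER for a stop) is inferred from the single example given here, so other label styles, including the nucleotide-level labels, are rejected rather than interpreted.

```python
import re

# Standard three-letter to one-letter amino acid codes; TER marks a stop.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "TER": "*",
}

# Assumed label layout: reference residue, position, replacement residue.
_PROTEIN_CHANGE_RE = re.compile(r"^([A-Z]{3})(\d+)([A-Z]{3})$")


def parse_protein_change(label: str):
    """Return (reference, position, replacement) in one-letter codes."""
    match = _PROTEIN_CHANGE_RE.match(label.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized protein-level label: {label!r}")
    ref, pos, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        raise ValueError(f"unknown residue code in label: {label!r}")
    return THREE_TO_ONE[ref], int(pos), THREE_TO_ONE[alt]


def to_short_form(label: str) -> str:
    """Convert e.g. 'GLN140TER' to 'Q140*'."""
    ref, pos, alt = parse_protein_change(label)
    return f"{ref}{pos}{alt}"
```

With this sketch, `to_short_form("GLN140TER")` gives `Q140*`, that is, a glutamine at position 140 replaced by a stop codon.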
In addition to an overhaul of the interface, a number of back-end technical modifications have been made to improve data collection, storage and automation. A number of new Perl scripts have been developed for this release which facilitate the automation and updating of external annotation sources linked to the database, thereby eliminating a number of manual steps previously required for these processes. For example, a new set of Perl scripts uses a list of existing gene symbols obtained from Swiss-Prot to automatically search Entrez Gene, pairing protein-centric annotation of existing homeodomain entries with their gene-centric equivalent. A second set of Perl scripts parses Entrez data via E-utilities, mapping a homeodomain entry to its corresponding Disease and Disorders entry at OMIM. Each of the new entries is examined manually and either added to the database or designated as false positive. The search and update functions are executed quarterly to update the disorders and mutations annotation. Another Perl script was developed to parse the output of HMMsearch, retrieve sequence and annotation information from Entrez, and insert unique hits into the Homeodomain Resource. This approach results in a relatively simple pipeline for adding new sequence entries, thereby keeping this database current.
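The scripts described above were written in Perl and are not reproduced here. As a rough illustration of the first step, searching Entrez Gene for a gene symbol through E-utilities, the Python sketch below builds an ESearch request URL without sending it. The function name is hypothetical, and the `[sym]` and `[orgn]` field tags are our assumption of a reasonable query; they should be checked against current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_esearch_url(
    symbol: str,
    organism: str,
    symbol_field: str = "sym",     # assumed field tag for gene symbols
    organism_field: str = "orgn",  # assumed field tag for organism names
    retmax: int = 20,
) -> str:
    """Build (but do not send) an ESearch URL for the Entrez Gene database."""
    term = f"{symbol}[{symbol_field}] AND {organism}[{organism_field}]"
    query = urlencode({"db": "gene", "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"
```

A call such as `gene_esearch_url("PAX6", "Homo sapiens")` produces a URL whose response lists matching Entrez Gene identifiers. As in the pipeline described above, each candidate match would still need manual review before being added to the database.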
With these new tools in hand for importing complete sets of homeodomain sequences from fully sequenced genomes, we intend to continue to add sequence data from already-sequenced species. We also intend to include additional homeodomain sequence data from newly sequenced genomes, fully anticipating a new wave of such data becoming available with the advent of new, next-generation sequencing technologies.
It is becoming increasingly evident that homeodomain transcription factors have played and continue to play key roles in the evolution of eukaryotic species. Likewise, research in this area continually shows that disruptions in the wild-type function of this class of proteins underlie a significant number of devastating human disorders, as evidenced by the extensive list of genetic and genomic disorders catalogued in the Disorders and Mutations section of the Homeodomain Resource Web site. As a result, both the amount of homeodomain-related data being generated and the ability of biologists to process and consider these data will be critical to the advancement of our understanding of these proteins. It is our intention to continue to maintain and update the Homeodomain Resource in the future, so as to provide a solid discovery framework for biologists and clinicians studying this important class of proteins.
This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.
Conflict of interest. None declared.