Chromatin modification (CM) is a set of epigenetic processes that govern many aspects of DNA replication, transcription and repair. CM is carried out by groups of physically interacting proteins, and their disruption has been linked to a number of complex human diseases. CM remains largely unexplored, however, especially in higher eukaryotes such as human. Here we present the DAnCER resource, which integrates information on genes with CM function from five model organisms, including human. Currently integrated are gene functional annotations, Pfam domain architecture, protein interaction networks and associated human diseases. Additional supporting evidence includes orthology relationships across organisms, membership in protein complexes, and information on protein 3D structure. These data are available for 962 experimentally confirmed and manually curated CM genes and for over 5000 genes with predicted CM function on the basis of orthology and domain composition. DAnCER allows visual explorations of the integrated data and flexible query capabilities using a variety of data filters. In particular, disease information and functional annotations are mapped onto the protein interaction networks, enabling the user to formulate new hypotheses on the function and disease associations of a given gene based on those of its interaction partners. DAnCER is freely available at http://wodaklab.org/dancer/.
Epigenetics plays a key role in DNA replication, transcription and repair (1,2), and its disruption is implicated in the development of many forms of cancer and other complex human diseases (3,4). As a result, there are now a growing number of projects dedicated to the study of chromatin modification—a crucial component of epigenetic processes (5). Chromatin modification (CM) is defined as the alteration of DNA or protein in chromatin, which may result in changing the chromatin structure (6). It encompasses chromatin remodeling (eviction, deposition or sliding of nucleosomes along DNA), histone exchange (substitution of core histones with histone variants) and covalent modification of histones (acetylation, methylation, ubiquitylation, phosphorylation, etc.).
Similarly to other cellular processes, CM is carried out by groups of physically interacting proteins (7,8). Anomalies in protein interactions often lead to disease phenotypes (9). Yet there remains a dearth of public databases and analysis tools that explore the relationship between the chromatin machinery and human disease, especially in the context of protein-interaction networks.
ChromDB (10) is perhaps the best known and most comprehensive chromatin database, but it provides no direct links to human disease annotations or to data on protein interactions. ChromatinDB (11) contains only data on CM genes from the yeast Saccharomyces cerevisiae and is therefore ill-suited for analyzing links of CM proteins to disease in human. The recent Human Histone Modification Database (12) provides detailed information on specific types of chromatin modifications and their relationship to cancer, but it offers neither data on interaction partners nor links to diseases other than cancer. The Network of Cancer Genes resource (NCG) (13) maps cancer-related phenotypes onto the human protein-interaction network, but focuses entirely on cancer and is not specific to CM and related epigenetic processes. Other related resources focus either on DNA methylation rather than chromatin machinery [MethyCancer (14)], or on specific diseases [liver cancer in OncoDB.HCC (15)], or on disease-related interactions of proteins with chemicals in the environment rather than on protein networks [Comparative Toxicogenomics Database (16)].
Thus, most of the existing resources devoted to CM focus mainly on detailed information about individual genes and proteins, and less on their interaction partners in the cell or their associated disease phenotypes. To fill this gap, we developed DAnCER (disease-annotated chromatin epigenetics resource), publicly available at: http://wodaklab.org/dancer.
Molecular interactions between genes and proteins underpin all biological processes, and in particular those of CM. Our research effort therefore strives to explore CM-related genes in the context of their protein-interaction network, their partnership in multi-protein complexes and cellular pathways, as well as their gene expression profiles. To gain additional insights into the CM process in human cells, we also explore patterns of evolutionary conservation across model organisms for properties ranging from amino acid sequence, domain composition and 3D structure to interaction patterns and regulatory mechanisms.
DAnCER collates records of CM-related genes from human and four model organisms. Genes are represented in DAnCER using the NCBI Entrez Gene identifiers (17). Individual gene pages also contain links to matching records in model organism databases (18–22).
The collection of genes stored in DAnCER is based on a core set of genes whose CM function is deemed ‘confirmed’. This set was derived using in-house manual curation from the existing literature on CM (23–25), from the analysis of protein complexes and their associated function (26,27), and from external genomic databases that provided experimental annotations (6,28). The ‘confirmed’ CM genes include those whose products are histones, various categories of histone-modifying enzymes, and members of protein complexes known to regulate CM processes.
This collection has been expanded with additional genes whose CM-related function is predicted using computational methods. Two main methods have been used so far to predict CM-related function. One is an evolutionary analysis of the CM machinery (29), which uses an in-house version of the InParanoid algorithm (30) to uncover protein homology relationships in CM genes across more than 100 different organisms. The second method relies on Pfam domain composition and domain co-occurrence in CM genes in human and yeast. Having observed that CM domains tend to co-occur with a limited number of partner domains in CM genes, we exploit this property to establish patterns of domain composition characteristic of CM function. We then train a Support Vector Machine predictive model (31) using the domain annotations of confirmed CM genes and use it to predict putative CM function in additional human genes, which were not previously known to be CM-related (Pu S. et al., manuscript in revision).
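The domain-based method above can be illustrated with a minimal sketch, which is not the authors' pipeline. It assumes each gene is described by a list of domain names, counts how often pairs of domains co-occur in the same gene, and encodes a gene as a binary presence/absence vector that could serve as classifier input. The function names, the gene and domain labels, and the choice of a binary encoding are all hypothetical; the actual features used to train the Support Vector Machine are not specified here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(gene_domains):
    """Count how often each unordered pair of domains occurs in the same gene.

    gene_domains maps a gene name to its list of domain names
    (hypothetical input format, assumed for illustration).
    """
    counts = Counter()
    for domains in gene_domains.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(domains)), 2):
            counts[pair] += 1
    return counts


def domain_vector(domains, vocabulary):
    """Encode a gene as a binary presence/absence vector over a fixed vocabulary."""
    present = set(domains)
    return [1 if name in present else 0 for name in vocabulary]


# Toy data with placeholder names; not real Pfam annotations.
toy_genes = {
    "gene_1": ["dom_A", "dom_B"],
    "gene_2": ["dom_A", "dom_B", "dom_C"],
    "gene_3": ["dom_C"],
}
toy_counts = cooccurrence_counts(toy_genes)
toy_vocabulary = sorted({d for ds in toy_genes.values() for d in ds})
toy_vectors = {g: domain_vector(ds, toy_vocabulary) for g, ds in toy_genes.items()}
```

Vectors of this kind, labeled as confirmed CM or not, are one plausible input to a standard SVM library; the co-occurrence counts indicate which partner domains recur alongside a given domain.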
For each confirmed or putative CM gene, DAnCER displays several types of supporting evidence and additional annotations collected from both in-house and external resources. Disease annotations for human genes are obtained from the well-curated Online Mendelian Inheritance in Man (OMIM) resource (32). Functional annotations from the Gene Ontology (GO) (6) are retrieved on the fly for each gene, which ensures that the most current annotations are displayed; only annotations with experimental evidence codes are retrieved. Other collected data include information on proteins from UniProt (33), homologous gene clusters from InParanoid (30,34), domain composition from Pfam (35), and membership in protein complexes (27,28). Supporting evidence includes PubMed identifiers, a range of metrics associated with the reliability of the predicted information, and web links to external resources. The DAnCER search allows filtering the data using a range of these attributes.
DAnCER allows users to visually explore the full interaction neighborhoods of individual CM genes, including protein complexes. The integrated visual display represents genes, diseases, physical protein–protein interactions, protein complexes, genetic interactions, domains and other information present in the interaction neighborhood. The displays are generated using Cytoscape Web (36). A customized Cytoscape plugin, OrthoNets, with additional visualization features has also been developed (http://wodaklab.org/orthonets/).
The interaction data are consolidated from 10 major public databases: BIND (37), BioGRID (38), CORUM (28), DIP (39), HPRD (40), IntAct (41), MINT (42), MPact (43), MPPI (44), OPHID (45). This collection contains 404384 interactions from over a thousand different organisms, of which 263479 are physical protein–protein interactions curated from literature. The consolidation was performed using the iRefIndex process (46). All interactions, along with their supporting evidence, are available on the iRefWeb resource (http://wodaklab.org/iRefWeb), and are seamlessly merged with the data in DAnCER.
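The idea of consolidating interactions reported by several databases can be sketched as follows. This is a deliberately simplified illustration and not the iRefIndex procedure: it assumes every record is a pairwise interaction given as two protein identifiers already mapped to a shared namespace, plus the name of the source database. The function name, the record layout and the protein labels are hypothetical.

```python
def consolidate(records):
    """Merge (protein_a, protein_b, source) records into one entry per unordered pair.

    Returns a dict mapping a canonical (sorted) protein pair to the set of
    source databases that report it.
    """
    merged = {}
    for protein_a, protein_b, source in records:
        # Sort the pair so that A-B and B-A collapse into the same key.
        key = tuple(sorted((protein_a, protein_b)))
        merged.setdefault(key, set()).add(source)
    return merged


# Toy records with placeholder protein names; the database names come from
# the list above, but the interactions themselves are invented.
toy_records = [
    ("P1", "P2", "BioGRID"),
    ("P2", "P1", "IntAct"),
    ("P1", "P3", "DIP"),
]
toy_merged = consolidate(toy_records)
```

Keeping the set of sources for each pair preserves the supporting evidence after merging, so a pair reported by several databases can be distinguished from one reported only once.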
Knowledge of a protein's 3D structure is essential for the examination of the spatial effect of specific mutations on protein interactions, especially if such mutations lead to phenotypic abnormalities. Therefore, we scan the entire Protein Data Bank (PDB) (47) using the BLASTp sequence alignment algorithm (48), and retrieve all known 3D protein structures that match any of our CM proteins within a certain threshold of sequence similarity. To ensure broad coverage, we retain two types of PDB matches: longer but possibly imprecise alignments, such as those between homologs (BLAST e-value ≤1e–15 and sequence identity at least 50%); and relatively short but near-exact alignments, targeting 3D structures of isolated domains and short motifs (e-value ≤1e–4 but sequence identity at least 90%). For each gene, the retrieved 3D structures are grouped into families according to the SCOP classification (49) where possible, and can be sorted by standard BLAST metrics. Original BLAST alignments are also shown.
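The two retention rules for PDB matches can be written as a short filter. The sketch below assumes that each BLAST hit has already been parsed into an e-value and a percent identity; the class, field and label names are hypothetical, and only the numeric thresholds come from the text above. A hit that satisfies both rules is labeled by the first rule checked, which is an arbitrary choice made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastHit:
    """One parsed BLAST hit (hypothetical field names)."""
    pdb_id: str
    evalue: float
    identity_pct: float


def classify_hit(hit):
    """Return 'homolog', 'near_exact', or None according to the two retention rules."""
    # Rule 1: longer, possibly imprecise alignments (e.g. between homologs).
    if hit.evalue <= 1e-15 and hit.identity_pct >= 50.0:
        return "homolog"
    # Rule 2: short but near-exact alignments (isolated domains, short motifs).
    if hit.evalue <= 1e-4 and hit.identity_pct >= 90.0:
        return "near_exact"
    return None


def retain_hits(hits):
    """Group the retained hits by the rule that admitted them; drop the rest."""
    kept = {}
    for hit in hits:
        label = classify_hit(hit)
        if label is not None:
            kept.setdefault(label, []).append(hit)
    return kept


# Toy hits with made-up identifiers and scores.
toy_hits = [
    BlastHit("XXX1", 1e-30, 62.0),   # passes rule 1
    BlastHit("XXX2", 1e-6, 95.0),    # passes rule 2 only
    BlastHit("XXX3", 1e-6, 70.0),    # fails both rules
    BlastHit("XXX4", 1e-20, 40.0),   # strong e-value but identity too low
]
toy_kept = retain_hits(toy_hits)
```

Using two complementary rules rather than a single cutoff keeps full-length homolog structures while still admitting short, highly similar matches whose e-values are weaker simply because the alignment is short.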
DAnCER currently contains information on 5976 genes from human, mouse Mus musculus, yeast S. cerevisiae, fruit fly Drosophila melanogaster and worm Caenorhabditis elegans. Among the 1924 human genes, 422 have been confirmed as related to CM. The rest have domain composition indicative of a CM function and/or are linked through homology to known CM genes from other organisms. For 1202 human genes (62% of all human genes) we are able to find relevant 3D structure information, in most cases from multiple PDB records. This constitutes a reasonable coverage of the human data, although many of the PDB matches are to homologous proteins in other species. This and other statistical information is available at the DAnCER search page and statistics page (Figure 1).
A distinctive feature of DAnCER is its neighborhood-based approach to disease representation. For each human CM gene, DAnCER shows diseases that are associated not only with that gene but also with any of its interacting partners, which includes both pairwise interactions and membership in multi-protein complexes. Given that experimental annotations of mammalian genomes are often incomplete, this important feature will help users to hypothesize potentially novel disease associations that can be inferred from known protein interactions and complexes.
For example, only 205 human genes in DAnCER have known disease associations—but as many as 1036 have protein interaction neighbors implicated in various diseases. The existence of such interactions per se does not necessarily imply a shared disease association, but may warrant further analysis, especially if multiple partners in the interaction neighborhood share similar diseases. Figure 2 illustrates the case of SUV420H2, a known human histone methyltransferase (50). Although OMIM does not provide a direct disease association for SUV420H2, this protein interacts with several members of the Retinoblastoma 1 family (51), with the corresponding interactions curated by HPRD and the disease associations of the RB1 protein mapped visually in DAnCER. Users alerted to the presence of these interaction neighbor diseases may trace the supporting evidence and observe, for example, that the authors of the study viewed their protein-interaction experiments as ‘linking tumor suppression and the epigenetic definition of chromatin’. Furthermore, despite the lack of a direct disease association, the OMIM record for SUV420H2 references a study that correlates aberrant expression of SUV420H2 and the associated changes in histone H4 trimethylation with tumor progression (52).
DAnCER allows users to quickly identify these and other interesting patterns by selecting appropriate search filters. These cases may then be prioritized for a more focused analysis using either experimental or computational approaches. To facilitate this task, all DAnCER search results show a brief statistical summary of disease associations for each retrieved gene, its interaction neighborhood size, and diseases found in its neighborhood.
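The neighborhood-based view of disease associations described above can be sketched in a few lines. The example assumes pairwise interactions given as gene pairs, complexes given as sets of member genes, and a mapping from genes to their directly annotated diseases. All function names, gene labels and disease labels are hypothetical placeholders, and the sketch does not reflect how DAnCER itself is implemented.

```python
from collections import defaultdict


def neighborhood(gene, interactions, complexes=()):
    """Return partners of `gene` from pairwise interactions and shared complexes."""
    partners = set()
    for gene_a, gene_b in interactions:
        if gene in (gene_a, gene_b):
            partners.update((gene_a, gene_b))
    for members in complexes:
        if gene in members:
            partners.update(members)
    partners.discard(gene)  # a gene is not its own neighbor
    return partners


def neighbor_diseases(gene, interactions, disease_map, complexes=()):
    """Map each disease found in the neighborhood to the partners that carry it."""
    found = defaultdict(set)
    for partner in neighborhood(gene, interactions, complexes):
        for disease in disease_map.get(partner, ()):
            found[disease].add(partner)
    return dict(found)


# Toy network with placeholder names; no real annotations are implied.
toy_interactions = [("G1", "G2"), ("G3", "G1"), ("G2", "G4")]
toy_complexes = [{"G1", "G5"}]
toy_disease_map = {
    "G2": {"disease_A"},
    "G3": {"disease_A", "disease_B"},
    "G4": {"disease_C"},
}
toy_summary = neighbor_diseases("G1", toy_interactions, toy_disease_map, toy_complexes)
```

Recording which partners carry each disease makes it easy to spot cases where several neighbors share the same disease, the situation flagged above as warranting further analysis.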
The detailed DAnCER records present various types of annotations and supporting evidence for the known CM-related genes, as well as for genes whose CM function has not yet been experimentally confirmed. The user may examine the homologs of each gene across the four model organisms, compare their annotations (such as similarity of protein structures, or experimentally supported GO terms), and visualize their interaction neighborhoods. DAnCER allows single-gene as well as gene-list queries. Search filters enable retrieval of the genes that match user-specified restrictions for the organism of interest, the number of interacting partners, the level of support (confirmed versus putative), availability of known 3D structures and a range of other attributes.
The mapping of DAnCER data onto the protein-interaction networks from iRefWeb provides a rich visual context for exploring interaction neighborhoods of CM-related proteins. The web-enabled graphs are customizable and support different node types for proteins, diseases, complexes, domains, etc. This allows a highly intuitive visual examination of various patterns in the integrated data. All graph nodes and edges are linked to the original data records in DAnCER and iRefWeb, which contain further annotation details and links to primary data sources and literature.
To our knowledge, the simultaneous focus of DAnCER on CM, molecular networks and human disease is unique among existing databases. As a result, it provides the biomedical community with a combination of useful features that are either unique, or are not available together in any other resource and would be very time-consuming for a user to assemble. Therefore, we believe that this resource would be a welcome addition to the tools currently available to molecular biologists as well as to medical researchers interested in epigenetics and the genetic mechanisms of diseases.
Several types of predictions and analyses presented in DAnCER were produced in-house and hence are unavailable elsewhere, such as the systematic fuzzy-match search for 3D structures over the entire Protein Data Bank, as well as the predictions of genes and domains with CM function, based on evolutionary and domain-architecture analyses. DAnCER also provides added value by mapping the supporting data from public resources into a single environment, and offers users a convenient web-accessible visual interface to the integrated data.
DAnCER is part of an interdisciplinary project involving seven Canadian laboratories, whose objective is to elucidate the mechanisms of CM and remodeling processes and to gain an understanding of how their disruption may lead to human disease.
During the course of our project we intend to expand DAnCER to include data on gene expression profiles, transcriptional regulation and signal transduction pathways. In addition, we plan to store data on protein interactions and complexes derived by the project team using affinity purification and mass spectrometry (53,54), as well as results of knockdown experiments using RNA interference to probe functional roles of predicted CM genes and their interacting partners. We are also planning to consolidate disease annotations from multiple sources, and to develop tools in iRefWeb for automatically comparing human interactions to their counterparts in model organisms.
Canadian Institutes of Health Research (MOP#82940); the Sickkids Foundation (to S.J.W.); the Ontario Research Fund (to S.J.W.); CIHR New Investigators award (to J.P.); an Early Researcher award from the Ontario Ministry of Research and Innovation (to J.P.). S.J.W. is Canada Research Chair, Tier 1. Funding for open access charge: Canadian Institutes of Health Research (MOP#82940).
Conflict of interest statement. None declared.
J. Vlasblom is gratefully acknowledged for helpful discussions regarding the protein interaction data. Mostafa Abdellateef, Jaimin Patel and Negar Tootoonchian are acknowledged for their early investigation of the Protein Data Bank data format.