Database (Oxford). 2011; 2011: bar048.
Published online 2011 October 22. doi:  10.1093/database/bar048
PMCID: PMC3199919

The Histone Database: an integrated resource for histones and histone fold-containing proteins

Abstract

Eukaryotic chromatin is composed of DNA and protein components—core histones—that act to compactly pack the DNA into nucleosomes, the fundamental building blocks of chromatin. These nucleosomes are connected to adjacent nucleosomes by linker histones. Nucleosomes are highly dynamic and, through various core histone post-translational modifications and incorporation of diverse histone variants, can serve as epigenetic marks to control processes such as gene expression and recombination. The Histone Sequence Database is a curated collection of sequences and structures of histones and non-histone proteins containing histone folds, assembled from major public databases. Here, we report a substantial increase in the number of sequences and taxonomic coverage for histone and histone fold-containing proteins available in the database. Additionally, the database now contains an expanded dataset that includes archaeal histone sequences. The database also provides comprehensive multiple sequence alignments for each of the four core histones (H2A, H2B, H3 and H4), the linker histones (H1/H5) and the archaeal histones. The database also includes current information on solved histone fold-containing structures. The Histone Sequence Database is an inclusive resource for the analysis of chromatin structure and function focused on histones and histone fold-containing proteins.

Database URL: The Histone Sequence Database is freely available and can be accessed at http://research.nhgri.nih.gov/histones/.

Introduction

Histones play central roles in both chromatin organization and gene regulation, as they constitute the fundamental protein units of the nucleosome (1). The nucleosome consists of DNA wrapped around an octameric core histone complex, composed of a central H3–H4 tetramer and two adjacent H2A–H2B dimers; the nucleosome is commonly identified as the first order of compaction of eukaryotic chromatin (2). Core histone genes also display conserved expression patterns that show periodic expression across the eukaryotic cell cycle, with a pronounced peak during S-phase (3). This allows for histone proteins to be produced at the same time DNA is being synthesized. Thus, the histone proteins can be readily assembled into nucleosomes and then compacted into chromatin.

Core histones are highly conserved across eukaryotes in terms of sequence and structure. Despite overall sequence conservation, extensive histone tail post-translational modifications, in addition to histone variants present during development, contribute to epigenetic mechanisms that signal transcriptional activation, repression and recombination events. Histone proteins and their variants have an essential function in gene regulation (4–6). Nucleosomes are disassembled at transcriptionally active promoters via histone post-translational modifications (7), specific histone variants are known to mark active promoters and regulatory regions (8), and other variants are involved in the transition between transcriptionally active or silent chromatin (9). In recent years, much progress has been made toward genome-wide profiling of chromatin modifications (10), where histones play critical roles in defining the overall structure and function of chromatin and, by extension, in gene regulation.

The histone fold is a common structural motif shared by each of the four core histones, which mediates interactions between the individual core histones. The histone fold is structurally composed of three α-helices connected by two loops, and this overall architecture allows for heterodimeric interactions between core histones (11). Interestingly, even though each individual histone protein family is highly conserved, the histone fold is not conserved at the sequence level but, rather, at the structural level (12). Higher-resolution crystal structures of the nucleosome core particle have revealed the detailed structure of the histone fold in each of the histones (13, 14). The DNA wrapped around each nucleosome is held in place by linker histones (called H1, or H5 in avian species). The linker histones, which do not contain the histone fold motif and have a different evolutionary origin from the core histones (15), are critical to chromatin higher-order compaction and facilitate internucleosomal interactions (16). In addition, H1 variants have been shown to be involved in the regulation of developmental genes (17). The overall structural state of chromatin controls DNA replication, recombination and gene expression, with histones playing critical roles during these processes (18).

Interestingly, despite the conservation of core histone gene expression patterns, the regulatory machinery that controls core histone gene expression has changed greatly among eukaryotic evolutionary lineages. Specifically, the identity of the core histone gene cis-regulatory sequence motifs and the protein factors that bind these motifs are distinct for the yeast Saccharomyces cerevisiae, as well as for other fungi, plants, insects and mammals (19). Therefore, different species have developed unique gene regulatory mechanisms for core histone genes that converge on the same gene expression phenotype: high expression levels specifically during S-phase, concomitant with DNA replication.

Although the core histones are among the most slowly evolving eukaryotic proteins, members of the histone H2A and H3 families have diversified extensively, assuming specialized roles in DNA repair, gene silencing, gene expression and centromere function (5, 6). Interestingly, the centromere H3 variant appears to form tetrameric nucleosomes that induce positive supercoils, and these specialized ‘centromeric nucleosomes’ have been proposed as the epigenetic inheritance mechanism for centromeres (20).

The histone fold motif, common to all core histones, has also been found in a variety of non-histone proteins. The large majority of these non-histone proteins are localized in the nucleus and their functions are related to DNA metabolism; they include nuclear factor Y (NF-Y) and the TFIIB transcription factors (12). A few histone fold-containing proteins are localized in the cytoplasm. One example is the Ras activator Son of Sevenless (SOS) (21): SOS1 is localized primarily in the nucleus, whereas SOS2 is localized in the cytoplasm (22). Huntingtin interacting protein M (CXorf27) also contains a histone fold and is localized in the cytoplasm. We hypothesize that histone folds in cytoplasm-localized proteins are used to mediate protein–protein interactions.

Given the central role of histones and related proteins in a wide variety of critical cellular functions, we feel the need to continue to provide a centralized, curated source of important information on these proteins to the biomedical community. To this end, the Histone Sequence Database represents an organized collection of all histones and histone fold-containing proteins (23). The information presented in this Database includes a list of published three-dimensional structures for histones and histone fold-containing proteins, as well as manually curated multiple sequence alignments for each histone family.

Database and software

Data tables

The Histone Sequence Database, which has been developed and expanded significantly since its last release (23), has three tables stored in a relational database schema using Oracle 10g (Figure 1). The HISTONES table stores information about the histone category, its accession, the sequence string, the submitting database, as well as NCBI's taxonomic information on the sequence. The ORGANISM table contains detailed taxonomic information for the sequences contained in the Histone Sequence Database. The STRUCTURES table stores information on the experimentally determined structures of proteins contained in the database, including the method of determination (i.e. X-ray crystallography or NMR spectroscopy).
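As a rough illustration of how three such tables could relate to one another, the sketch below builds a toy version of the data model in Python. It is not the production schema: SQLite stands in for Oracle, every column name is an assumption made for illustration, and the inserted row (accession, sequence fragment, structure identifier) is placeholder data.

```python
import sqlite3

# Hypothetical column names; SQLite stands in for the Oracle backend.
SCHEMA = """
CREATE TABLE organism (
    taxid     INTEGER PRIMARY KEY,   -- NCBI taxonomic identifier
    species   TEXT NOT NULL,
    lineage   TEXT
);
CREATE TABLE histones (
    accession TEXT PRIMARY KEY,
    category  TEXT NOT NULL,         -- e.g. H2A, H2B, H3, H4, H1
    sequence  TEXT NOT NULL,
    source_db TEXT NOT NULL,         -- submitting database
    taxid     INTEGER NOT NULL REFERENCES organism(taxid)
);
CREATE TABLE structures (
    pdb_id    TEXT PRIMARY KEY,
    accession TEXT NOT NULL REFERENCES histones(accession),
    method    TEXT NOT NULL CHECK (method IN ('X-ray', 'NMR'))
);
"""


def build_demo_db():
    """Create an in-memory database holding one placeholder record per table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO organism VALUES (?, ?, ?)",
        (9606, "Homo sapiens", "Eukaryota; Metazoa; Chordata"),
    )
    conn.execute(
        "INSERT INTO histones VALUES (?, ?, ?, ?, ?)",
        ("DEMO_H4", "H4", "SGRGKGGKGLGKGGAKRHRK", "demo", 9606),
    )
    # "0XXX" is a placeholder, not a real PDB identifier.
    conn.execute("INSERT INTO structures VALUES (?, ?, ?)", ("0XXX", "DEMO_H4", "X-ray"))
    conn.commit()
    return conn


def sequences_with_structures(conn, category):
    """Return (accession, species, pdb_id, method) for one histone category."""
    return conn.execute(
        """SELECT h.accession, o.species, s.pdb_id, s.method
           FROM histones h
           JOIN organism o ON o.taxid = h.taxid
           JOIN structures s ON s.accession = h.accession
           WHERE h.category = ?
           ORDER BY h.accession""",
        (category,),
    ).fetchall()


if __name__ == "__main__":
    print(sequences_with_structures(build_demo_db(), "H4"))
```

Keying the sequence table on accession and the organism table on the NCBI taxonomic identifier is one plausible design; it lets a single join answer questions such as "which H4 sequences have a solved structure, and in which species".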

Figure 1.
Histone Database data model. The Histone Database stores selected manually curated information from GenBank records. The information stored as part of each record includes the GenBank unique identifier (GI), accession number, definition line, sequence ...

Software

The Histone Sequence Database uses Common Gateway Interfaces (CGIs) written in the Perl programming language that communicate with the relational database software. The connectivity between the CGIs and the Oracle 10g Relational Database Management Software was implemented using Perl's Database Interface (DBI) and the Oracle database driver for the DBI module (DBD::Oracle), available through the Comprehensive Perl Archive Network (CPAN; http://search.cpan.org/). The use of object-oriented design methodologies and Perl modules that are both open source and developed in-house allows for flexibility and scalability. The Web pages displaying data, such as the summary of contents, non-redundant sets and search pages, are dynamically generated using CGI. Comments concerning the Web front-end are welcomed and encouraged.

Data sources and histone protein identification

The protein database searched for the update and curation of the Histone Sequence Database was the NCBI non-redundant (nr) database (18 November 2010). The nr database includes sequences of all non-redundant GenBank CDS translations (24), as well as the sequences of RefSeq proteins, sequences of structures represented in the Protein Data Bank (PDB) (25), and sequences from UniProtKB/Swiss-Prot (26), the Protein Information Resource (PIR) (27), and the Protein Research Foundation (PRF) (http://www.prf.or.jp/index-e.html). The collection of histones was extended and revised using the HMMER3 software package (28). We constructed hidden Markov models (HMMs) for each of the four core histones and the linker histone H1 from the alignments generated in the last release of the Histone Database. Additional HMMs were generated for archaeal histones (29) and bacterial proteins that contain a histone-like fold (30). Only the protein entries that had a complete domain hit with an E-value < 0.01 were collected for further analysis. For each histone family, multiple sequence alignments were generated using MUSCLE (31). The manually curated alignments, which include only proteins with complete folds, are also available in PDF format and are color-coded to allow easy identification of amino acid variants. The Histone Database uses a color scheme designed to highlight the specific amino acid differences that a particular group of sequences may have inside the core or linker histone alignments by coloring amino acids with similar physicochemical properties differently. A summary table of the number of sequences found, grouped by family, and the species represented in the database is provided (Table 1).
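The filtering step described above (keep only complete domain hits with an E-value below 0.01) can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes the hits are available in HMMER3's per-domain tabular output (`--domtblout`), it assumes the independent domain E-value is the one being thresholded, and it treats a hit as "complete" when the alignment covers at least 90% of the model length. The coverage cutoff, the function names and the sample rows are all hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class DomainHit(NamedTuple):
    target: str      # sequence identifier
    hmm_name: str    # profile HMM that produced the hit
    hmm_len: int     # length of the profile HMM
    i_evalue: float  # independent domain E-value
    hmm_from: int    # first model position covered by the alignment
    hmm_to: int      # last model position covered by the alignment


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per data row of hmmsearch --domtblout output."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        f = line.split(maxsplit=22)  # the final field is a free-text description
        yield DomainHit(
            target=f[0],
            hmm_name=f[3],
            hmm_len=int(f[5]),
            i_evalue=float(f[12]),
            hmm_from=int(f[15]),
            hmm_to=int(f[16]),
        )


def complete_hits(hits: Iterable[DomainHit],
                  max_evalue: float = 0.01,
                  min_coverage: float = 0.9) -> List[DomainHit]:
    """Keep hits below the E-value cutoff that span most of the model.

    min_coverage is an assumed stand-in for "complete domain hit".
    """
    kept = []
    for h in hits:
        coverage = (h.hmm_to - h.hmm_from + 1) / h.hmm_len
        if h.i_evalue < max_evalue and coverage >= min_coverage:
            kept.append(h)
    return kept


# Fabricated rows in domtblout column order, for demonstration only.
SAMPLE = """\
# target name  accession  tlen  query name  accession  qlen  ...
seqA - 103 H4_model - 100 1e-40 140.0 0.1 1 1 2e-44 1e-40 139.8 0.1 1 100 2 101 1 103 0.99 full-length demo hit
seqB - 70 H4_model - 100 1e-10 40.0 0.1 1 1 2e-14 1e-10 39.8 0.1 40 100 5 65 1 70 0.95 partial demo hit
seqC - 105 H4_model - 100 0.4 5.0 0.1 1 1 0.0001 0.5 4.8 0.1 1 100 3 102 1 105 0.60 weak demo hit
"""

if __name__ == "__main__":
    for hit in complete_hits(parse_domtblout(SAMPLE.splitlines())):
        print(hit.target, hit.i_evalue)
```

In this toy input only `seqA` survives: `seqB` covers just part of the model and `seqC` exceeds the E-value cutoff.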

Table 1.
Histone Database content

Identification of histone fold-containing proteins

Histone fold-containing proteins were identified using a different search strategy. We used the sequences from each of the four core histone MUSCLE alignments (H2A, H2B, H3 and H4) as seeds for PSI-BLAST (32) searches. The PSI-BLAST searches were run to convergence with an E-value inclusion threshold of 0.01; the core histone seeds were excluded from the final list of histone fold-containing proteins. Additionally, related structures were identified using NCBI's VAST-related structures searches (33, 34), in an effort to identify more distant histone fold-containing proteins that could not be identified through PSI-BLAST searches. Using this strategy, we were able to identify a total of 2180 histone fold-containing proteins.
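The control flow of this search (iterate until no new sequences pass the inclusion threshold, then drop the seeds) can be sketched in a few lines. The example below is only a schematic of the iteration logic and does not reimplement PSI-BLAST, which rebuilds a position-specific scoring matrix at every round. The `search_round` callable, the toy hit table and all sequence names are hypothetical.

```python
def run_to_convergence(seeds, search_round, max_rounds=20):
    """Repeat a search until a round adds no new sequences.

    search_round is a hypothetical callable that takes the set of currently
    included sequence identifiers and returns the identifiers passing the
    inclusion threshold. Seeds are removed from the returned list.
    """
    seed_set = set(seeds)
    included = set(seed_set)
    for _ in range(max_rounds):
        new = set(search_round(included)) - included
        if not new:
            break  # converged: this round found nothing new
        included |= new
    return sorted(included - seed_set)


# Toy "database": query identifier -> {hit identifier: E-value}. Fabricated values.
TOY_HITS = {
    "H3_seed": {"cenH3_like": 1e-5, "weak_hit": 0.5},
    "cenH3_like": {"NFY_like": 1e-3},
    "NFY_like": {"H3_seed": 1e-4},
}


def toy_round(included, threshold=0.01):
    """Return every hit of the included queries with E-value below threshold."""
    hits = set()
    for query in included:
        for target, evalue in TOY_HITS.get(query, {}).items():
            if evalue < threshold:
                hits.add(target)
    return hits


if __name__ == "__main__":
    print(run_to_convergence(["H3_seed"], toy_round))
```

Here the more distant `NFY_like` entry is reached only in the second round, through `cenH3_like`, which is the reason for iterating; `weak_hit` never passes the 0.01 threshold and the seed itself is excluded from the result.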

Results and Discussion

The computational approach presented here has identified proteins throughout a wider evolutionary spread of genomes. Currently, the Histone Database contains entries that represent a total of 7356 unique NCBI taxonomic identifiers, which correspond to approximately the same number of organisms. The sequences of core histones, linker histones and archaeal histones are available in FASTA format. They are also available as a series of multiple sequence alignments, one for each class of proteins. A number of search engines can be used to query the database in several different ways: by protein family, organism, keyword or based on a sequence pattern (Figure 2). Each histone sequence for which three-dimensional structure data is available is linked to the corresponding entry in both PDB and the Molecular Modeling Database (MMDB) (35).
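For readers who download the FASTA files, a sequence-pattern query similar in spirit to the one offered on the Web site can be run locally. The sketch below is an assumption-laden illustration: it uses Python regular expressions as the pattern syntax, which may differ from the syntax accepted by the database's own search page, and the two records are short placeholder fragments rather than database entries.

```python
import re


def parse_fasta(text):
    """Return {identifier: sequence}, using the first header token as identifier."""
    parts = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = (line[1:].split() or [""])[0]
            parts[current] = []
        elif current is not None:
            parts[current].append(line)  # sequences may wrap over several lines
    return {name: "".join(chunks) for name, chunks in parts.items()}


def find_pattern(records, pattern):
    """Return {identifier: [1-based start positions]} for non-overlapping matches."""
    rx = re.compile(pattern)
    found = {}
    for name, seq in records.items():
        starts = [m.start() + 1 for m in rx.finditer(seq)]
        if starts:
            found[name] = starts
    return found


# Placeholder records for demonstration only.
DEMO = """\
>demo_H4 placeholder fragment
SGRGKGGKGLGKGG
AKRHRK
>demo_other placeholder fragment
MAAAPLLTT
"""

if __name__ == "__main__":
    print(find_pattern(parse_fasta(DEMO), "GKGG"))
```

Positions are reported 1-based, following the usual convention for protein residues, and overlapping occurrences are not reported because `re.finditer` scans for non-overlapping matches.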

Figure 2.
Histone Database query and results. The Histone Database main page displays a search engine that allows users to find histone sequences from a large variety of organisms. Additionally, users have the possibility of exploring other features to access complete ...

The Histone Sequence Database has been expanded significantly since its last update (23) (Table 1). However, the expansion is not proportional for each of the core histones. The H3 sequences, which contain a large number of variants with specialized roles in chromosome segregation and transcription, show an increase of over 400% since the last database update. Similarly, the H2A core histone sequences, which include variants with specialized functions in DNA repair and transcription regulation, show an increase of over 200% since the last update. In contrast, we observe a more modest growth in sequence numbers for the relatively invariant H4 and H2B core histones.

The Histone Sequence Database now includes archaeal histone sequences. The current update contains 182 sequences from 89 archaeal organisms, which include members of all classified archaeal phyla (i.e. euryarchaeota, crenarchaeota, nanoarchaeota, korarchaeota and the newly proposed phylum thaumarchaeota). The presence of histone folds in all classified archaeal phyla indicates that the histone fold originated before the archaeal and eukaryotic lineage divergence (29). Most of the archaeal histones have a single histone fold domain; however, there are a number of sequences that contain two histone folds, with the C-terminal histone fold sharing higher sequence similarity with archaeal histones with a single histone fold. Archaeal histones containing two histone folds have been proposed as intermediates between archaeal and eukaryotic histones (36, 37), where both core histones H3 and H4 would have originated at the same time, followed by a second event that gave rise to core histones H2A and H2B. In the current release of the Histone Sequence Database, archaeal histones with two histone folds are confined to two distinct branches: Halobacteriaceae and the hyperthermophilic methanogen Methanopyrus kandleri. Although archaeal histones containing two histone folds have been previously identified in these lineages, it is not clear how these histones could also contribute to packing DNA in extremely high-temperature or high-salinity environments.

Structural comparisons confirmed the presence of the histone fold in the extreme bacterial thermophile Aquifex aeolicus (30). Additionally, the RIKEN Structural Genomics/Proteomics Initiative (RSGI) (38) has solved two Thermus thermophilus structures for a protein that also contains the histone fold (PDB:1WWI and PDB:1WWS). The histone fold was also found in diverse types of bacteria, including aquificales, ε-proteobacteria, thermaceae, actinobacteria and nostocaceae. This suggests that the histone fold appeared in bacteria by lateral gene transfer (29, 39). Interestingly, the structure from T. thermophilus (PDB:1WWS), predicted to be a dimer, is strikingly similar to the H3–H4 tetramer. However, an analysis of the electrostatic surface potential for protein Aq_328 from the hyperthermophilic bacterium A. aeolicus (PDB:1R4V) and archaeal histone from M. kandleri (PDB:1F1E) (Figure 3) reveals the DNA binding surface in the archaeal histone (Figure 3F) but shows no conservation of any of the DNA-binding residues present in both archaeal and eukaryotic histones (Figures 3A and 3D) (29). Therefore, it is possible that histone fold-like bacterial proteins have functions unrelated to DNA binding. It is nevertheless likely that the histone-like fold is used as a dimerization domain in these species.

Figure 3.
Histone-like folds in A. aeolicus and M. kandleri. Protein Aq_328 from the hyperthermophilic bacterium A. aeolicus (PDB:1R4V) and archaeal histone from M. kandleri (PDB:1F1E) have two histone-like folds. These are colored as dark blue and dark green (for ...

Conclusions

Researchers studying chromatin structure and function have traditionally relied on the Histone Sequence Database to explore the taxonomic breadth of histones and their variants (40–43). Others have focused on epigenetics and transcriptional regulation and use the database to discover newly reported core histones and histone fold-containing proteins (44–48). The Histone Database continues to be a comprehensive bioinformatic resource that organizes and stores histone sequences and groups them into families (which now include archaeal histones), maintains a collection of histone fold-containing sequences, and provides information on three-dimensional structures available in PDB. In the future, we will enhance our histone fold identification pipeline with state-of-the-art sequence- and structure-based methods to continue to identify new members of this biologically critical family of proteins. We also plan to integrate functional information from other publicly available Web resources.

Funding

Funding for open access charge: The Intramural Research Programs of the National Center for Biotechnology Information, National Library of Medicine and the National Human Genome Research Institute, both at the National Institutes of Health.

Conflict of interest. None declared.

Acknowledgments

The Histone Database update utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Maryland (http://biowulf.nih.gov/).

References

1. van Holde KE. Chromatin. New York: Springer; 1988.
2. Eickbush TH, Moudrianakis EN. The histone core complex: an octamer assembled by two sets of protein-protein interactions. Biochemistry. 1978;17:4955–4964.
3. Cho RJ, Campbell MJ, Winzeler EA, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell. 1998;2:65–73.
4. Ausio J. Histone variants—the structure behind the function. Brief. Funct. Genomic Proteomic. 2006;5:228–243.
5. Elsaesser SJ, Goldberg AD, Allis CD. New functions for an old variant: no substitute for histone H3.3. Curr. Opin. Genet. Dev. 2010;20:110–117.
6. Talbert PB, Henikoff S. Histone variants—ancient wrap artists of the epigenome. Nat. Rev. Mol. Cell Biol. 2010;11:264–275.
7. Luebben WR, Sharma N, Nyborg JK. Nucleosome eviction and activated transcription require p300 acetylation of histone H3 lysine 14. Proc. Natl Acad. Sci. USA. 2010;107:19254–19259.
8. Jin C, Zang C, Wei G, et al. H3.3/H2A.Z double variant-containing nucleosomes mark ‘nucleosome-free regions’ of active promoters and other regulatory regions. Nat. Genet. 2009;41:941–945.
9. Kouzarides T. Chromatin modifications and their function. Cell. 2007;128:693–705.
10. Schones DE, Zhao K. Genome-wide approaches to studying chromatin modifications. Nat. Rev. Genet. 2008;9:179–191.
11. Arents G, Burlingame RW, Wang BC, et al. The nucleosomal core histone octamer at 3.1 Å resolution: a tripartite protein assembly and a left-handed superhelix. Proc. Natl Acad. Sci. USA. 1991;88:10148–10152.
12. Baxevanis AD, Arents G, Moudrianakis EN, et al. A variety of DNA-binding and multimeric proteins contain the histone fold motif. Nucleic Acids Res. 1995;23:2685–2691.
13. Luger K, Mader AW, Richmond RK, et al. Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature. 1997;389:251–260.
14. Davey CA, Sargent DF, Luger K, et al. Solvent mediated interactions in the structure of the nucleosome core particle at 1.9 Å resolution. J. Mol. Biol. 2002;319:1097–1113.
15. Kasinsky HE, Lewis JD, Dacks JB, et al. Origin of H1 linker histones. FASEB J. 2001;15:34–42.
16. Bustin M, Catez F, Lim JH. The dynamics of histone H1 function in chromatin. Mol. Cell. 2005;17:617–620.
17. Khochbin S. Histone H1 diversity: bridging regulatory signals to linker histone function. Gene. 2001;271:1–12.
18. Marino-Ramirez L, Kann MG, Shoemaker BA, et al. Histone structure and nucleosome stability. Expert Rev. Proteomics. 2005;2:719–729.
19. Marino-Ramirez L, Jordan IK, Landsman D. Multiple independent evolutionary solutions to core histone gene regulation. Genome Biol. 2006;7:R122.
20. Henikoff S, Furuyama T. Epigenetic inheritance of centromeres. Cold Spring Harb. Symp. Quant. Biol. 2010.
21. Sondermann H, Soisson SM, Bar-Sagi D, et al. Tandem histone folds in the structure of the N-terminal segment of the ras activator Son of Sevenless. Structure. 2003;11:1583–1593.
22. Berglund L, Bjorling E, Oksvold P, et al. A genecentric Human Protein Atlas for expression profiles based on antibodies. Mol. Cell Proteomics. 2008;7:2019–2027.
23. Marino-Ramirez L, Hsu B, Baxevanis AD, et al. The Histone Database: a comprehensive resource for histones and histone fold-containing proteins. Proteins. 2006;62:838–842.
24. Benson DA, Karsch-Mizrachi I, Lipman DJ, et al. GenBank. Nucleic Acids Res. 2011;39:D32–D37.
25. Berman HM, Westbrook J, Feng Z, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242.
26. UniProt. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2007;35:D193–D197.
27. Wu CH, Yeh LS, Huang H, et al. The Protein Information Resource. Nucleic Acids Res. 2003;31:345–347.
28. Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23:205–211.
29. Sandman K, Reeve JN. Archaeal histones and the origin of the histone fold. Curr. Opin. Microbiol. 2006;9:520–525.
30. Qiu Y, Tereshko V, Kim Y, et al. The crystal structure of Aq_328 from the hyperthermophilic bacteria Aquifex aeolicus shows an ancestral histone fold. Proteins. 2006;62:8–16.
31. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797.
32. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
33. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 1996;6:377–385.
34. Madej T, Gibrat JF, Bryant SH. Threading a database of protein cores. Proteins. 1995;23:356–369.
35. Wang Y, Addess KJ, Chen J, et al. MMDB: annotating protein sequences with Entrez's 3D-structure database. Nucleic Acids Res. 2007;35:D298–D300.
36. Fahrner RL, Cascio D, Lake JA, et al. An ancestral nuclear protein assembly: crystal structure of the Methanopyrus kandleri histone. Protein Sci. 2001;10:2002–2007.
37. Malik HS, Henikoff S. Phylogenomics of the nucleosome. Nat. Struct. Biol. 2003;10:882–891.
38. Aoki M, Matsuda T, Tomo Y, et al. Automated system for high-throughput protein production using the dialysis cell-free method. Protein Expr. Purif. 2009;68:128–136.
39. Alva V, Ammelburg M, Soding J, et al. On the origin of the histone fold. BMC Struct. Biol. 2007;7:17.
40. Gonzalez-Romero R, Rivera-Casas C, Ausio J, et al. Birth-and-death long-term evolution promotes histone H2B variant diversification in the male germinal cell line. Mol. Biol. Evol. 2010;27:1802–1812.
41. Eirin-Lopez JM, Gonzalez-Romero R, Dryhurst D, et al. The evolutionary differentiation of two histone H2A.Z variants in chordates (H2A.Z-1 and H2A.Z-2) is mediated by a stepwise mutation process that affects three amino acid residues. BMC Evol. Biol. 2009;9:31.
42. Potoyan DA, Papoian GA. Energy landscape analyses of disordered histone tails reveal special organization of their conformational dynamics. J. Am. Chem. Soc. 2011;133:7405–7415.
43. Ozboyaci M, Gursoy A, Erman B, et al. Molecular recognition of H3/H4 histone tails by the tudor domains of JMJD2A: a comparative molecular dynamics simulations study. PLoS One. 2011;6:e14765.
44. Kolarik C, Klinger R, Hofmann-Apitius M. Identification of histone modifications in biomedical text for supporting epigenomic research. BMC Bioinformatics. 2009;10(Suppl. 1):S28.
45. Sun XJ, Xu PF, Zhou T, et al. Genome-wide survey and developmental expression mapping of zebrafish SET domain-containing genes. PLoS One. 2008;3:e1499.
46. Shultz RW, Tatineni VM, Hanley-Bowdoin L, et al. Genome-wide analysis of the core DNA replication machinery in the higher plants Arabidopsis and rice. Plant Physiol. 2007;144:1697–1714.
47. Weidenbach K, Gloer J, Ehlers C, et al. Deletion of the archaeal histone in Methanosarcina mazei Go1 results in reduced growth and genomic transcription. Mol. Microbiol. 2008;67:662–671.
48. Huda A, Marino-Ramirez L, Jordan IK. Epigenetic histone modifications of human transposable elements: genome defense versus exaptation. Mob. DNA. 2010;1:2.
49. Schrödinger. The PyMOL Molecular Graphics System. 2010. Version 1.3.
50. Lerner MG, Carlson HA. APBS Plugin for PyMOL. 2009. http://www.pymolwiki.org/index.php/APBS.

Articles from Database: The Journal of Biological Databases and Curation are provided here courtesy of Oxford University Press