JASPAR (http://jaspar.genereg.net) is the leading open-access database of matrix profiles describing the DNA-binding patterns of transcription factors (TFs) and other proteins interacting with DNA in a sequence-specific manner. Its fourth major release is the largest expansion of the core database to date: the database now holds 457 non-redundant, curated profiles. The new entries include the first batch of profiles derived from ChIP-seq and ChIP-chip whole-genome binding experiments, and 177 yeast TF binding profiles. The introduction of a yeast division brings the convenience of JASPAR to an active research community. As binding models are refined by newer data, the JASPAR database now uses versioning of matrices: in this release, 12% of the older models were updated to improved versions. Classification of TF families has been improved by adopting a new DNA-binding domain nomenclature. A curated catalog of mammalian TFs is provided, extending the use of the JASPAR profiles to additional TFs belonging to the same structural family. The changes in the database set the system ready for more rapid acquisition of new high-throughput data sources. Additionally, three new special collections provide matrix profile data produced by recent alternative high-throughput approaches.
The wide availability of TF affinity data is becoming essential for an increasing number of research efforts to understand gene regulation in the post-genomic era. The growing number of assembled genome sequences and the growing volume of transcriptome data (1), together with high-throughput studies revealing genome-wide locations of core promoters (2) and enhancer elements (3,4), have resulted in the greatest demand to date for TF binding site content analyses.
TF binding affinities are typically modeled as position frequency matrices (PFMs, also known as raw count matrices or simply binding profiles), summarizing nucleotide counts in an alignment of active binding sites. These can be used to scan genomes for new binding sites (5). Since the first official release of JASPAR in 2004 (6), the research community has embraced it as the leading open-access database of such matrix profiles for TF binding sites. From the beginning, the aim of its core collection has been to provide a non-redundant set of curated, high-quality PFMs derived from experimental binding data (7); in other words, the goal is to present the best currently available DNA binding model for a given TF, as decided by expert curators.
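As an illustration of how a PFM can be used to scan a sequence, the following is a minimal sketch and not the procedure used by JASPAR or its associated tools. It converts counts to log2-odds scores and reports windows above a cutoff. The pseudocount, the uniform background of 0.25, the threshold and the toy matrix are all assumptions made for the example.

```python
import math

BASES = "ACGT"

def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert a count matrix {base: [count per position]} to log2-odds scores.

    The pseudocount and uniform background are illustrative assumptions.
    """
    width = len(pfm["A"])
    pwm = []
    for i in range(width):
        total = sum(pfm[b][i] for b in BASES) + 4 * pseudocount
        column = {}
        for b in BASES:
            frequency = (pfm[b][i] + pseudocount) / total
            column[b] = math.log2(frequency / background)
        pwm.append(column)
    return pwm

def scan(sequence, pwm, threshold):
    """Return (offset, score) for each window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(column[b] for column, b in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical 4-bp profile built from 10 aligned sites, all reading GATA.
toy_pfm = {
    "A": [0, 10, 0, 10],
    "C": [0, 0, 0, 0],
    "G": [10, 0, 0, 0],
    "T": [0, 0, 10, 0],
}
pwm = pfm_to_pwm(toy_pfm)
hits = scan("CCGATACC", pwm, threshold=4.0)
print(hits)
```

Only the forward strand is scanned here; a practical scan would also score the reverse complement and choose the threshold relative to the matrix's score range.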
The availability of potentially useful matrices derived by other means (e.g. using a number of genome-wide computational approaches) as well as non-TF binding profiles, prompted the addition of separate JASPAR Collections in the second release (8): the intention was to provide those matrix profiles in the same format and hence usable with the same tools as the core JASPAR database, while keeping the latter reserved for profiles representing experimentally derived data.
While the community has valued the open-access policy and non-redundant nature of JASPAR, a common complaint was that the size of the core collection was small compared to the commercial TransFac database (9), currently the only comprehensive alternative to JASPAR. In this update, our goal was to narrow this gap by performing a major expansion of the core database, while maintaining the popular non-redundant, curated quality. As a result, this fourth major release introduces a wealth of new and improved matrix profiles and represents the largest expansion of the core database since its inception, with new data coming either from high-throughput methods like ChIP-seq, or assembled from TF binding site databases, particularly PAZAR (10), as described below.
Several recent genome-wide studies have revealed thousands of TF binding sites for individual TFs. Compared to the original matrices, the larger number of representative target sequences provides potentially more accurate profiles and brings the added benefit that, unlike in DNA SELEX, all the binding sites come from the actual genome sequence to which the TFs in question are bound in vivo.
To make the derivation of matrices uniform, we extracted the original sets of bound regions from published experiments (11–19). We retrieved 200 bp sequences centered on each peak and performed de novo motif discovery on them using parallelized MEME (20) on a Cray XT4 supercomputing platform, which can handle inputs of many thousands of sequences in manageable time. In most cases, the resulting matrices closely resemble those reported in the original publications, produced using various motif discovery tools. The single exception was the Zfx profile, where our profile obtained with MEME from sites reported in (13) differed reproducibly from the profile reported therein. In this case, we chose to include the newly derived matrix.
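The window-extraction step described above can be sketched as follows. This is an illustrative outline only: the function names, the 0-based summit coordinates, the FASTA header format and the choice to skip peaks too close to a sequence end are assumptions, since the text does not specify these details.

```python
def centered_window(sequence, summit, width=200):
    """Return the width-bp slice centered on a 0-based summit position.

    Returns None when the window would run past either end of the sequence
    (an assumed policy; the source does not say how such peaks were handled).
    """
    start = summit - width // 2
    end = start + width
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]

def peaks_to_fasta(chromosomes, peaks, width=200):
    """Build FASTA text for motif discovery from (chromosome, summit) pairs."""
    records = []
    for chrom, summit in peaks:
        window = centered_window(chromosomes[chrom], summit, width)
        if window is None:
            continue
        start = summit - width // 2
        records.append(f">{chrom}:{start}-{start + width}\n{window}")
    return "\n".join(records)

# Hypothetical 400-bp "chromosome" and two peak summits.
toy_chromosomes = {"chrToy": "ACGT" * 100}
toy_peaks = [("chrToy", 150), ("chrToy", 20)]  # the second is too close to the end
fasta = peaks_to_fasta(toy_chromosomes, toy_peaks)
print(fasta.splitlines()[0])
```

The resulting FASTA text is the kind of input a motif discovery program such as MEME takes; running MEME itself is outside the scope of this sketch.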
In most cases, the ChIP-seq data resulted in improved matrices with higher information content than the original ones derived from either compiled single promoter assays or from DNA SELEX (Figure 1). This contradicts the widely held view that SELEX is prone to producing over-specified models since many selection rounds are commonly used. Also, somewhat surprisingly, the resulting matrices did not differ much as thresholds were varied for the inclusion of ChIP identified regions (e.g. top 100 highest confidence bound regions versus top 1000).
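For reference, information content is commonly computed per matrix position as shown below. The text does not state which formula underlies Figure 1, so this standard definition, with a uniform background over the four nucleotides, is given as an assumption.

```latex
\[
IC_i = 2 + \sum_{b \in \{A,C,G,T\}} p_{b,i} \log_2 p_{b,i},
\qquad
IC_{\mathrm{total}} = \sum_{i=1}^{w} IC_i
\]
```

Here $p_{b,i}$ is the frequency of nucleotide $b$ at position $i$, $w$ is the matrix width, and $0 \log_2 0$ is taken as 0. A position ranges from 0 bits (all four nucleotides equally frequent) to 2 bits (a single nucleotide always observed).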
The ChIP-chip derived TF binding sites, while not providing the resolution of the ChIP-seq data, are a rich source of binding data. Even though ChIP-chip is currently being superseded by ChIP-seq (21), the published sets contain a considerable amount of high-quality binding data not yet available from ChIP-seq experiments. As with ChIP-seq, we use the enriched regions reported by the authors of the study in question, and then apply MEME to find the pattern.
Previous versions of JASPAR did not include any matrix profiles for yeast TFs. Responding to community requests, we have compiled results from several large-scale binding profile projects to produce a non-redundant set of matrix profiles for TFs from Saccharomyces cerevisiae. The sources used, in order of preference, were a recent in vitro binding screen (22), a protein-binding microarray (PBM) experiment (23), the compiled SCPD binding profile database (24), the SwissRegulon computational re-analysis of multiple data collections (25) and a motif discovery-based collection from a widely used ChIP-chip data collection (26). The prioritization of the contributions, as well as the indicated deviations, reflects the curators’ personal perspective. The preferred set, from Badis et al. (22), appeared to offer matrices of consistently high quality, likely reflecting the curated nature of the effort (new experimental data were compared against existing data for consistency). All matrices were manually curated to remove redundancies and converted to count matrices. In curating the collection, the curators identified a few instances in which a profile from a lower-priority source was preferred: GAL4 (SwissRegulon), GCR1 (SwissRegulon), MATALPHA2 (SCPD), PHO4 (UniPROBE with the six leftmost and rightmost nucleotides trimmed) and ROX1 (SCPD). The resulting non-redundant set represents a comprehensive open-access compilation of yeast binding profiles, facilitating genome-wide computational studies of yeast regulatory inputs. We are grateful for the commitment of all of the data providers to open information, without which the compilation would have been impossible.
Recently, annotations of hundreds of experimentally validated TF binding sites from published studies have accumulated in the PAZAR database (27), allowing us to produce additional matrices similar in nature to those of the original JASPAR release (derived from DNA SELEX or compiled from multiple studies on individual binding sites). The PAZAR database was mined to identify TFs with more than 15 annotated binding sites. The resulting data were manually curated, selecting only the results from the highest-quality data collections (i.e. collections manually annotated from the literature by specialists) and discarding any redundant sequences to build the profiles. The resulting set of compiled binding sites for each TF was used as input to the MEME software to obtain a profile. If non-informative positions were obtained on the edges of the matrices, the profiles were trimmed accordingly.
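The edge-trimming step can be illustrated with a short sketch. The criterion used by the curators is not stated, so the per-column information content measure (uniform background assumed), the 0.3-bit cutoff and the function names below are hypothetical choices for the example.

```python
import math

def column_information(counts):
    """Information content in bits of one [A, C, G, T] count column.

    Assumes a uniform background; an empty column is treated as 0 bits.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    bits = 2.0
    for count in counts:
        if count > 0:
            p = count / total
            bits += p * math.log2(p)
    return bits

def trim_edges(columns, min_bits=0.3):
    """Drop low-information columns from both ends, keeping interior ones.

    min_bits is an assumed cutoff, not a value taken from the source.
    """
    start, end = 0, len(columns)
    while start < end and column_information(columns[start]) < min_bits:
        start += 1
    while end > start and column_information(columns[end - 1]) < min_bits:
        end -= 1
    return columns[start:end]

# Hypothetical profile: flat columns at both edges and one in the interior.
toy_columns = [
    [5, 5, 5, 5],
    [20, 0, 0, 0],
    [6, 5, 4, 5],
    [0, 0, 20, 0],
    [5, 6, 5, 4],
]
trimmed = trim_edges(toy_columns)
print(len(trimmed))
```

Only flanking columns are removed; a weakly informative position inside the motif is kept so that the spacing between informative positions is preserved.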
For this new release, two major sources of Drosophila melanogaster matrix profiles have been used: DNaseI footprinting data by Bergman et al. (28) and bacterial one-hybrid data by Wolfe and colleagues (29–31). The profiles from these data sets have been curated by the authors to remove redundancies among the results and with the existing profiles in the previous version of the JASPAR database. In addition, any profile based on fewer than 10 sequences has been discarded. This new insect sub-section of the JASPAR core includes 123 curated profiles; however, these are heavily dominated by the homeodomain profiles (29). For Caenorhabditis elegans, no large sources of data are currently available. Through literature searches, we identified only five profiles suitable for inclusion in the core database (32–36).
In addition to the expansion of the core database, we remain committed to providing other collections of matrix profiles within JASPAR.
Recently, the PBM technology has emerged as a new in vitro method for the characterization of TF binding affinities (37). The UniPROBE database hosts the PBM datasets and makes the derived matrix profiles available to the community (38). We have selected three of these new datasets as new collections in JASPAR:
With these additions, JASPAR now holds 840 profiles within collections outside of the core database.
In line with our goal of presenting the best currently available binding model for any TF, we updated some previous JASPAR entries in light of newly available data. Seventeen entries of the previous release were updated. The replacement of existing matrices with new ones led us to introduce version numbers in matrix IDs, in a manner equivalent to the management of sequence versions in GenBank. For example, the old GATA1 profile MA0035 is replaced with a new one, and the full identifier of the new matrix is MA0035.2, while the old one becomes MA0035.1. By default, the latest version of the non-redundant database includes the latest version of each profile. A search for ‘MA0035’ also retrieves the newest version, with an option to view older versions. Older versions can also be downloaded from the JASPAR web site.
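The versioning behaviour described above (a bare ID resolving to the newest version, a full ID resolving to that exact version) can be sketched as follows. This is not the JASPAR implementation: the function names are hypothetical, and the pattern only covers the MA-prefixed, four-digit form shown in the example.

```python
import re

# Assumed ID shape, based only on the MA0035 / MA0035.2 example in the text.
_ID_PATTERN = re.compile(r"^(MA\d{4})(?:\.(\d+))?$")

def parse_matrix_id(matrix_id):
    """Split 'MA0035.2' into ('MA0035', 2); a bare 'MA0035' gives ('MA0035', None)."""
    match = _ID_PATTERN.match(matrix_id)
    if match is None:
        raise ValueError(f"unrecognized matrix ID: {matrix_id!r}")
    base, version = match.groups()
    return base, (int(version) if version else None)

def resolve(matrix_id, available):
    """Return the full versioned ID, defaulting to the newest version."""
    base, version = parse_matrix_id(matrix_id)
    versions = sorted(
        v for b, v in map(parse_matrix_id, available) if b == base and v is not None
    )
    if not versions:
        raise KeyError(f"no profile with base ID {base}")
    if version is None:
        version = versions[-1]  # bare ID: pick the latest version
    elif version not in versions:
        raise KeyError(f"{base} has no version {version}")
    return f"{base}.{version}"

# Hypothetical list of stored profile IDs.
available = ["MA0035.1", "MA0035.2", "MA0004.1"]
latest = resolve("MA0035", available)
older = resolve("MA0035.1", available)
print(latest, older)
```

Versions are compared as integers rather than strings, so that a version 10 would sort after version 9.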
The addition of 177 yeast matrices to the core collection means that the JASPAR matrices now span the entire eukaryote crown group. Even before that, a typical user scenario included the selection of only a subset of matrices derived from a particular taxonomic category of organisms, across which the TFs are strictly orthologous and their binding activities largely unchanged (e.g. vertebrates). For that reason, both the JASPAR web interface and the download section now present the database content split into major taxonomic categories—vertebrates, insects, nematodes, (higher) plants and fungi—within which most of the binding sites are transferable across species. The option to search with and download the entire core collection is still available and behaves as before.
Up to now, JASPAR used an ad hoc structural class annotation for the TFs associated with each matrix profile. In this release, we have updated the structural class annotation using our recently published catalog of mouse and human TFs (42), in which DNA-binding proteins are associated with a structural classification system. We adopted the two-level classification described by Luscombe et al. (43) and extended it to accommodate additional binding domain structures. For the TFs from other species, we extrapolated the structural class and family based on the PFAM annotation of the DNA-binding domains. This addition to JASPAR provides a standardized system for the classification of TFs and allows a better grouping into families (or sub-families) with potentially similar binding preferences. A curated list of putative mouse/human DNA-binding proteins is provided at the JASPAR web site. It is also possible to browse the catalog by structure within the web interface, to see which profiles are available.
The underlying database schema was updated to accommodate matrix versions and to allow multiple species and TF accession numbers, as well as to allow the storage of multiple collections in the same SQL database. A Perl API (JASPAR5) for the new schema is available as part of the open-source TFBS Perl framework (44).
In the forthcoming months and years, a large amount of whole-genome binding data from ChIP-seq and related techniques will become available. We have taken the first steps towards a standardized way of including these new data in JASPAR, which is expected to expand significantly with a concomitant increase in the quality of matrix data. At the same time, JASPAR collections outside the core will continue to include interesting matrix sets derived by other means.
Supplementary Data are available at NAR Online.
EU Framework Programme 6 integrated project EuTRACC (to S.T.); YFF grant 180435 from the Norwegian Research Council (NFR) and Bergen Research Foundation (BFS) (to B.L.); Novo Nordisk Foundation to the Bioinformatics Centre (to X.Z., E.V. and A.S.); The European Research Council under the EU 7th Framework Programme (FP7/2007-2013)/ERC grant agreement 204135 (to A.S.); Scholar of the Michael Smith Foundation for Health Research (to W.W.); Canadian Institutes for Health Research, GenomeCanada (via the Pleiades Promoter Project), GenomeBritishColumbia and the Canada Foundation for Innovation (to W.W. research laboratory). Funding for open access charge: Norwegian Research Council (NFR) (project no. 180435).
Conflict of interest statement. None declared.
We thank Frank Grosveld and Eric Soler for permission to include ChIP-seq-derived profiles in JASPAR prior to publication. We thank the laboratories of Yair Benita, Martha Bulyk, Richard Gronostajski, Steven Jones, and Zhiping Weng for suggestions and/or contributions of data reviewed for inclusion in the new release. We are grateful to Debra Fulton for her efforts to make the TFCat catalog available to the community.