Nucleic Acids Res. 2011 January; 39(Database issue): D124–D128.
Published online 2010 October 30. doi: 10.1093/nar/gkq992
PMCID: PMC3013812

UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein–DNA interactions

Abstract

The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) database is a centralized repository of information on the DNA-binding preferences of proteins as determined by universal protein-binding microarray (PBM) technology. Each entry for a protein (or protein complex) in UniPROBE provides the quantitative preferences for all possible nucleotide sequence variants (‘words’) of length k (‘k-mers’), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In this update, we describe >130% expansion of the database content, incorporation of a protein BLAST (blastp) tool for finding protein sequence matches in UniPROBE, the introduction of UniPROBE accession numbers and additional database enhancements. The UniPROBE database is available at http://uniprobe.org.

INTRODUCTION

A comprehensive understanding of gene-expression regulation requires the thorough characterization of transcription factor (TF)–DNA binding properties. TFs play central roles in transcriptional regulatory networks by binding specific DNA sequences and activating or repressing gene expression. Consequently, TF–DNA-binding specificities have broad impact on cell physiology and development and in evolution (1,2).

Advances in DNA microarray synthesis and the development of protein-binding microarray (PBM) technology (3,4) led to the development of universal PBMs (5), which allow high-throughput measurement of comprehensive data on protein–DNA binding specificities, resulting in large data sets requiring curation and searchability. The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) (6) database was created to satisfy these requirements. Please refer to the original UniPROBE publication (6) for a description of major differences between UniPROBE and the JASPAR (7), TRANSFAC (8) and PAZAR (9) databases. The original UniPROBE publication (6) also provides a detailed description of PBM technology and data types.

Since its inception 2 years ago, the UniPROBE database has continued to expand in size, utility and user base. UniPROBE previously housed data for 177 non-redundant proteins (6). That number has recently grown to over 400 non-redundant proteins or protein complexes, with additional, unpublished PBM data sets already planned for future deposition. Currently, the UniPROBE database averages 933 unique visitors per month (classified by IP address) from over 40 different countries and 3558 page views per month. UniPROBE is the standard for curating universal PBM data, and we invite other researchers generating universal PBM data to contact us about depositing their data in UniPROBE.

DATABASE ADDITIONS

UniPROBE has more than doubled in size since its introduction in January 2009 (6) (Table 1). As of this writing, in addition to the data deposited from the initial set of four publications (5,10–12), the database includes PBM data from six newer publications (13–18); additional published (19) and soon-to-be-published data are planned for deposition. The new additions include data on TFs from Caenorhabditis elegans, Saccharomyces cerevisiae, Mus musculus and Homo sapiens. The UniPROBE database now houses PBM data for 415 individual proteins or protein complexes, nearly all of which are TFs, corresponding to 404 non-redundant proteins or protein complexes.

Table 1.
UniPROBE database contents, with indication of additions in PBM data sets since its introduction in 2009

NEW BLASTP SEARCH FEATURE

In the latest version of UniPROBE, the available online search features have been augmented with a new search tool that permits a user to perform a blastp (20) search of a protein sequence of interest (the ‘query protein’) against all protein sequences in the UniPROBE database (the ‘subject proteins’). This feature incorporates NCBI’s Protein–Protein BLAST tool (21), blastp v.2.2.23+, for accurate and efficient alignments. The blastp tool returns a list of links to the Details page of each subject protein that either exactly matches or is similar to the query protein(s) according to the user-specified search parameter settings. From each Details page, the user can explore the matching protein further and download its PBM data.

Query protein sequences may be entered manually into a web-page form or uploaded as a text file. The sequence is parsed using fail-safe rules to interpret the format. Multiple sequences can be processed in a batch, either by specifying one sequence per line or by entering FASTA-formatted sequences, which may span multiple lines but are separated by header lines. Numbers and unnecessary whitespace are stripped from each sequence before the search is performed.
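The parsing rules above can be illustrated with a minimal Python sketch. It is not the UniPROBE implementation: the function names, the `query_N` placeholder names for headerless input, and the choice to ignore text before the first FASTA header are all assumptions made for illustration.

```python
import re


def clean(seq):
    # Strip digits and whitespace, as described in the text.
    return re.sub(r"[\d\s]+", "", seq)


def parse_query_sequences(text):
    """Return a list of (name, sequence) pairs from pasted or uploaded text."""
    lines = text.splitlines()
    records = []
    if any(line.lstrip().startswith(">") for line in lines):
        # FASTA input: header lines separate records, and a sequence may
        # span several lines. Text before the first header is ignored
        # (an assumption of this sketch).
        name, chunks = None, []
        for line in lines:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, clean("".join(chunks))))
                name, chunks = line[1:].strip(), []
            elif name is not None:
                chunks.append(line)
        if name is not None:
            records.append((name, clean("".join(chunks))))
    else:
        # Plain input: one sequence per line; blank lines are skipped.
        for line in lines:
            seq = clean(line)
            if seq:
                # Hypothetical placeholder name for a headerless query.
                records.append(("query_%d" % (len(records) + 1), seq))
    # Drop records whose sequence is empty after cleaning.
    return [(name, seq) for name, seq in records if seq]
```

With this sketch, a FASTA record pasted with residue numbering (for example a line such as `  1 MKT AYI`) and a plain list of sequences both reduce to clean residue strings before any search is run.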

For the subject proteins, the blastp search tool uses a database comprising all the clone insert sequences corresponding to all the PBM experiments with data curated in UniPROBE. For example, consider a search for the human TF GATA4, which is not currently in UniPROBE. Running the blastp tool on the human GATA4 sequence with default parameter settings (Figure 1) results in eight hits, four from yeast and four from mouse, all with the GATA DNA-binding domain (Figure 2). Among the hits, the tool correctly retrieves two hits to Gata3, which is represented in the database by two proteins: the full-length TF and just the DNA-binding domain. The blastp search parameter settings (E-value threshold, species, substitution matrix and word size) are passed directly to a local instance of NCBI’s blastp executable.
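As a rough illustration of how such settings might be handed to a local blastp executable, the following Python sketch assembles a command line. The option names `-query`, `-db`, `-evalue`, `-matrix`, `-word_size` and `-outfmt` are standard BLAST+ blastp options; the function name, the file and database paths, and the default values shown are assumptions and are not taken from UniPROBE. The species setting is omitted because the text does not say how it is applied.

```python
def build_blastp_command(query_fasta, db_path, evalue=10.0,
                         matrix="BLOSUM62", word_size=3):
    """Assemble an argument list for a local BLAST+ blastp run.

    The default values here are illustrative only; they are not claimed
    to be the defaults used by the UniPROBE search form.
    """
    if evalue <= 0:
        raise ValueError("E-value threshold must be positive")
    if word_size < 2:
        raise ValueError("word size must be at least 2")
    return [
        "blastp",
        "-query", query_fasta,         # FASTA file holding the query protein(s)
        "-db", db_path,                # protein database of subject sequences
        "-evalue", str(evalue),        # E-value threshold
        "-matrix", matrix,             # substitution matrix
        "-word_size", str(word_size),  # word size
        "-outfmt", "6",                # tabular output, one hit per line
    ]
```

On a machine with BLAST+ installed and a formatted protein database, the returned list could be passed to `subprocess.run`; building the arguments as a list rather than a shell string avoids quoting problems with user-supplied values.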

Figure 1.
Blastp search of UniPROBE with human GATA4 protein sequence and default parameter settings for the advanced search options.

Figure 2.
Results from blastp search of all protein sequences in UniPROBE using the human full-length GATA4 protein sequence as the query.

In the results, the aligned region of each matching subject protein is highlighted in yellow. Also provided is the offset of the first aligned residue of the query protein. As defined by blastp, the score is a measure of similarity, and the E-value is the number of matches with an equal or better score that would be expected by chance, i.e. if the subject protein sequences were generated randomly.
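The relationship between score and E-value can be made concrete with the standard BLAST approximation E = m × n × 2^(−S′), where S′ is the bit score, m the query length and n the total length of the subject database. The Python sketch below applies that formula directly. It assumes the score is a bit score and uses raw rather than effective lengths, so its output will not exactly reproduce the E-values reported by blastp; the function name is hypothetical.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths; blastp applies effective-length corrections,
    so reported E-values will differ somewhat.
    """
    if query_len <= 0 or db_len <= 0:
        raise ValueError("lengths must be positive")
    return query_len * db_len * 2.0 ** (-bit_score)
```

Under this approximation each additional bit of score halves the expected number of chance matches, and the E-value grows in proportion to the size of the database searched.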

UNIPROBE ACCESSION NUMBERS

A significant new feature is the addition of UniPROBE accession numbers. Each TF PBM data set now has its own UniPROBE accession number, regardless of whether or not its protein is unique in the database. Accession numbers are five digits prefixed with ‘UP’ (abbreviation for ‘universal PBM’), e.g. UP00350. Accession numbers are returned as part of the search results and are also listed on each protein’s Details page. A user can use the ‘Quick Search’ tool to find TFs by accession number. Accession numbers can be requested prior to publication of new PBM data sets, such as for unpublished PBM data sets in new article submissions.
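For scripts that handle these identifiers, the stated format (‘UP’ followed by five digits) can be checked with a short Python sketch. The function names are hypothetical, and treating the prefix as strictly uppercase is an assumption.

```python
import re

# 'UP' followed by exactly five digits, e.g. UP00350.
_ACCESSION = re.compile(r"UP(\d{5})")


def parse_accession(text):
    """Return the numeric part of an accession as an int, or None if malformed."""
    match = _ACCESSION.fullmatch(text.strip())
    return int(match.group(1)) if match else None


def format_accession(number):
    """Format an integer as an accession string with zero padding."""
    if not 0 <= number <= 99999:
        raise ValueError("accession number must fit in five digits")
    return "UP%05d" % number
```

Here `parse_accession("UP00350")` returns 350 and `format_accession(350)` returns `"UP00350"`, while a string without the full five digits, such as `"UP350"`, is rejected.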

OTHER NEW FEATURES

New to this version of UniPROBE is the inclusion of PBM data for protein complexes. This functionality was implemented to accommodate homodimer and heterodimer data for bHLH TFs from C. elegans (13). This feature allows the Details page to render data sets for the protein of interest and for each of the proteins with which the protein of interest dimerizes.

The UniPROBE statistics cited here were derived with the aid of several minor but useful enhancements. It is now possible to use ‘Text Search’ to find TFs by publication; TFs can be searched by species using the same tool. The search results now include the total number of TFs returned. To easily distinguish between separate, published PBM data sets for the same protein, a reference to the publication for each separate data set has been added to the bottom of all TF Details pages, along with the array design number(s).

For convenience a new, shorter URL (http://uniprobe.org) has been registered, which redirects to the legacy UniPROBE URL (http://thebrain.bwh.harvard.edu/uniprobe).

FUTURE DIRECTIONS

Future updates planned for UniPROBE include additional user and administrative tools. Currently in development is a negative control sequence generator which, given an E-score threshold indicative of DNA-binding preference, will generate a random sequence of user-specified length that does not contain any 8-mer whose score exceeds the given threshold for the user-selected TFs and species in UniPROBE. Another planned feature is the display of the sequence alignments resulting from blastp searches of UniPROBE. Also under development are administrative tools to allow self-deposition of data and automated pre-publication requests for UniPROBE accession numbers. The template for the Details page will be generalized to support self-deposition of PBM data for protein complexes. A newly designed database schema will facilitate these and other tools and will improve general system performance. As always, we continue to encourage user registration and feedback for error reports and feature requests, some of which motivated the development of the new features described here.
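One way such a negative control generator could work is sketched below in Python. This is an illustration of the idea only, not the tool under development: the `escores` table (mapping each 8-mer to its highest E-score across the selected TFs), the function names, the check of both strands, and the treatment of 8-mers missing from the table as acceptable are all assumptions.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def exceeds(kmer, escores, threshold):
    # Reject a k-mer if it or its reverse complement scores above the
    # threshold. K-mers absent from the table are treated as acceptable
    # (an assumption of this sketch).
    missing = float("-inf")
    best = max(escores.get(kmer, missing),
               escores.get(reverse_complement(kmer), missing))
    return best > threshold


def negative_control(length, escores, threshold, k=8, rng=None,
                     max_restarts=1000):
    """Build a random DNA sequence containing no k-mer above the threshold."""
    rng = rng or random.Random()
    for _ in range(max_restarts):
        seq = []
        for _ in range(length):
            bases = list("ACGT")
            rng.shuffle(bases)
            for base in bases:
                seq.append(base)
                window = "".join(seq[-k:])
                if len(seq) < k or not exceeds(window, escores, threshold):
                    break      # base accepted, move to the next position
                seq.pop()      # base would complete a forbidden k-mer
            else:
                break          # no base fits here: abandon this attempt
        if len(seq) == length:
            return "".join(seq)
    raise RuntimeError("no valid sequence found within max_restarts")
```

The sequence is extended one base at a time and restarted when it reaches a dead end, which is simple but does not sample uniformly from all permitted sequences. Passing a seeded `random.Random` instance as `rng` makes the output reproducible.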

AVAILABILITY AND LICENSE

All data hosted by the PBM database are freely available for distribution at the database website. The sequences of the 60-mer DNA probes synthesized on the custom-designed universal arrays are available under the terms of the academic research use license available at http://thebrain.bwh.harvard.edu/uniprobe/academic-license.php.

FUNDING

Funding for open access charge: National Institutes of Health (grant number R01 HG003985 to M.L.B.).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors thank Ivan Adzhubey for technical assistance and Dan Newburger and Mike Berger for helpful discussions.

REFERENCES

1. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23.
2. Bulyk ML. Computational prediction of transcription-factor binding site locations. Genome Biol. 2003;5:201.
3. Bulyk ML, Huang X, Choo Y, Church GM. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl Acad. Sci. USA. 2001;98:7158–7163.
4. Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, Snyder M, Young RA, Bulyk ML. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat. Genet. 2004;36:1331–1339.
5. Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW, Bulyk ML. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 2006;24:1429–1435.
6. Newburger DE, Bulyk ML. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2009;37:D77–D82.
7. Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010;38:D105–D110.
8. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34:D108–D110.
9. Portales-Casamar E, Arenillas D, Lim J, Swanson MI, Jiang S, McCallum A, Kirov S, Wasserman WW. The PAZAR database of gene regulatory information coupled to the ORCA toolkit for the study of regulatory sequences. Nucleic Acids Res. 2009;37:D54–D60.
10. Berger M, Badis G, Gehrke A, Talukder S, Philippakis A, Penacastillo L, Alleyne T, Mnaimneh S, Botvinnik O, Chan E. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell. 2008;133:1266–1276.
11. Pompeani AJ, Irgon JJ, Berger MF, Bulyk ML, Wingreen NS, Bassler BL. The Vibrio harveyi master quorum-sensing regulator, LuxR, a TetR-type protein is both an activator and a repressor: DNA recognition and binding specificity at target promoters. Mol. Microbiol. 2008;70:76–88.
12. De Silva EK, Gehrke AR, Olszewski K, Leon I, Chahal JS, Bulyk ML, Llinas M. Specific DNA-binding by Apicomplexan AP2 transcription factors. Proc. Natl Acad. Sci. USA. 2008;105:8393–8398.
13. Grove CA, De Masi F, Barrasa MI, Newburger DE, Alkema MJ, Bulyk ML, Walhout AJM. A multiparameter network reveals extensive divergence between C. elegans bHLH transcription factors. Cell. 2009;138:314–327.
14. Scharer CD, McCabe CD, Ali-Seyed M, Berger MF, Bulyk ML, Moreno CS. Genome-wide promoter analysis of the SOX4 transcriptional network in prostate cancer cells. Cancer Res. 2009;69:709–717.
15. Lesch BJ, Gehrke AR, Bulyk ML, Bargmann CI. Transcriptional regulation and stabilization of left-right neuronal identity in C. elegans. Genes Dev. 2009;23:345–358.
16. Zhu C, Byers KJRP, McCord RP, Shi Z, Berger MF, Newburger DE, Saulrieta K, Smith Z, Shah MV, Radhakrishnan M, et al. High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome Res. 2009;19:556–566.
17. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324:1720–1723.
18. Wei G-H, Badis G, Berger MF, Kivioja T, Palin K, Enge M, Bonke M, Jolma A, Varjosalo M, Gehrke AR, et al. Genome-wide analysis of ETS-family DNA-binding in vitro and in vivo. EMBO J. 2010;29:2147–2160.
19. Alibés A, Nadra AD, De Masi F, Bulyk ML, Serrano L, Stricher F. Using protein design algorithms to understand the molecular basis of disease caused by protein-DNA interactions: the Pax6 example. Nucleic Acids Res. 2010 doi:10.1093/nar/gkq683 [Epub ahead of print, 4 August 2010].
20. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
21. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press