The development and expansion of the PAZAR information mall continues to progress. In addition to a growth in the number and size of data collections, we have described the new integrated ORCAtk to facilitate analyses based on the data in PAZAR. Enhancements to the XML format and the introduction of download and upload tools make it easier for researchers to interact with the system. While much work remains to achieve the goal of making PAZAR the primary open-access repository for transcriptional regulatory sequence annotation, the described updates move the system closer to the mark.
The ORCAtk has proven to be widely useful for applied analysis of regulatory sequence data. The underlying software has been used in a variety of applied projects, such as the oPOSSUM system for motif over-representation analysis (21), gene-centric studies [e.g. PIK3CA (22)] and the Pleiades Promoter Project (http://www.pleiades.org). With the toolkit integrated into the PAZAR system, researchers can build a binding profile for a TF and immediately apply the model to promoter sequence analysis. Moreover, users can now easily integrate and compare experimentally verified and predicted regulatory elements.
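The profile-then-scan workflow described above can be illustrated with a minimal sketch. It does not use the ORCAtk or PAZAR interfaces; the function names (`build_pwm`, `scan`), the pseudocount, the uniform background, the threshold and the example sequences are all assumptions chosen for illustration. It builds a log-odds position weight matrix from aligned binding sites and scores every window of a promoter sequence on the forward strand.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in "ACGT"})
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, score) for each forward-strand window scoring >= threshold."""
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical aligned binding sites and promoter fragment
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)
promoter = "GGCATGACTCAGGTTTACG"
hits = scan(promoter, pwm, threshold=5.0)
best_start, best_score = max(hits, key=lambda h: h[1])
```

A production analysis would also scan the reverse strand, use a background model fitted to the genome and restrict hits to conserved regions; those steps are omitted here for brevity.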
ORCAtk will continue to grow as new research approaches mature. Possible extensions include a greater focus on TFBS modules rather than individual binding sites, detection of clusters of binding sites involved in cooperative binding and integration of tools for the analysis of variably spaced half-sites such as nuclear receptor binding sites.
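To make the notion of variably spaced half-sites concrete, the following is a minimal sketch, not a description of any planned ORCAtk feature. It assumes the commonly cited nuclear receptor half-site consensus AGGTCA, exact string matching and direct repeats on the forward strand only; the function name `find_direct_repeats` and the example sequence are hypothetical. A real tool would more likely score each half-site with a binding profile and consider inverted and everted arrangements as well.

```python
import re

HALF_SITE = "AGGTCA"  # assumed consensus half-site, used for illustration only

def find_direct_repeats(sequence, half_site=HALF_SITE, max_spacer=5):
    """Return (start, spacer_length) for pairs of half-sites in direct repeat
    separated by 0..max_spacer bases (exact matches, forward strand only)."""
    sequence = sequence.upper()
    k = len(half_site)
    # A lookahead pattern reports overlapping occurrences as well
    pattern = "(?=" + re.escape(half_site) + ")"
    starts = [m.start() for m in re.finditer(pattern, sequence)]
    pairs = []
    for i, first in enumerate(starts):
        for second in starts[i + 1:]:
            spacer = second - (first + k)
            if 0 <= spacer <= max_spacer:
                pairs.append((first, spacer))
    return pairs

# Hypothetical sequence containing two half-sites separated by one base
seq = "TTAGGTCAAAGGTCACC"
repeats = find_direct_repeats(seq)
```

Reporting the spacer length alongside the position matters because, for this class of factors, the spacing between half-sites is part of what distinguishes one binding site type from another.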
A growing community of researchers needs a convenient system for sharing regulatory sequence data. As increasing numbers of researchers develop data collections, demand for a shared repository naturally grows. This has been evident for DNA sequences [GENBANK (23)], genome sequences [UCSC (7) and ENSEMBL (8)], microarrays [ArrayExpress (24) and GEO (25)] and polymorphisms [dbSNP (26)]. Unfortunately, experimental results of cis-regulatory sequence and TF analyses are often contained in disparate sites on the Internet and are not usually kept up-to-date, a fact that limits utility. In addition, there are many regulatory sequence databases, but none currently allows data producers to create and maintain their own collections. PAZAR was created to bring these disparate ‘boutique’ datasets together under one roof and maintain their relevancy through reference to current genomic coordinates. The association of data collections with individual research labs provides an important recognition of the work, promotes longer-term maintenance of the information and allows users to access the collections they deem to be reliable. In comparison to commercial systems, PAZAR is more likely to attract participation from researchers wishing to share their data with the research community. With the increasing production of high-throughput TFBS data, the depositing of data into PAZAR is likely to increase.
The advances described in this report largely address the curated annotation of regulatory sequences based on individual gene studies in the scientific literature. To make the system suitable for sequencing- or microarray-derived binding data, the data depositing process will be expanded. Bulk uploading of target sequence coordinates coupled to a common TF (or TF complex) would allow high-throughput data to be collected rapidly. The identification of the TF remains cumbersome in the current implementation. In the future, we will incorporate the transcription factor catalog (TFCat) (Fulton D.L. et al., under revision), which provides an organized structural classification of human and mouse DNA binding proteins. Future work will also provide access to the data in the system via web services.
The regulation of gene transcription is a fundamental process in health and disease. PAZAR serves as a data resource of growing importance to researchers committed to understanding how and when genes will be transcribed.