The stability, localization and translation rate of mRNAs are regulated by a multitude of RNA-binding proteins (RBPs) that find their targets directly or with the help of guide RNAs. Among the experimental methods for mapping RBP binding sites, cross-linking and immunoprecipitation (CLIP) coupled with deep sequencing provides transcriptome-wide coverage as well as high resolution. However, partly due to their vast volume, the data that were so far generated in CLIP experiments have not been put in a form that enables fast and interactive exploration of binding sites. To address this need, we have developed the CLIPZ database and analysis environment. Binding site data for RBPs such as Argonaute 1-4, Insulin-like growth factor II mRNA-binding protein 1-3, TNRC6 proteins A-C, Pumilio 2, Quaking and Polypyrimidine tract binding protein can be visualized at the level of the genome and of individual transcripts. Individual users can upload their own sequence data sets while being able to limit the access to these data to specific users, and analyses of the public and private data sets can be performed interactively. CLIPZ, available at http://www.clipz.unibas.ch, aims to provide an open access repository of information for post-transcriptional regulatory elements.
Almost all cellular RNAs interact with RNA-binding proteins (RBPs) to form ribonucleoprotein complexes (RNPs). The overall composition and precise architecture of these RNPs undergo dynamic remodeling in response to signals and cellular state. Initial annotation (1) indicated that the human genome contains ~300 genes that encode proteins with an RNA-recognition motif (RRM). This is only one of the over 40 distinct protein domains known to contact RNA. RBP–RNA interactions are highly context dependent and many RBPs carry out different functions in different cellular compartments. For instance, the T-cell intracellular antigen 1 (TIA-1) functions as a splicing factor in the nucleus; it binds to an intronic splice enhancer in the Fas pre-mRNA leading to the inclusion of the proximal exon (2). In the cytoplasm, TIA-1 regulates the stability of mature mRNAs: its binding to AU-rich elements located in the 3′ untranslated regions (3′UTRs) of mRNAs (such as that of transforming growth factor beta, TGFβ) attracts the mRNA degradation machinery. The same AU-rich element in the TGFβ 3′UTR when bound by the HuR RBP leads to mRNA stabilization (2). Thus, precise knowledge of spatio-temporal associations between RBPs and mRNAs under various conditions is key to understanding how the level, translation rate and cellular localization of those mRNAs are regulated during the life time of a cell.
With some exceptions, such as the knowledge-based potential function designed by Zheng et al. (3) to predict the specificity and relative binding energy of RNA-binding proteins, computational models describing the binding specificity of RBPs are lacking (4), in contrast, for instance, to the situation for transcription factors. Recently, however, experimental methods for high-throughput and high-resolution identification of RBP binding sites have been developed. They rely on cross-linking and immunoprecipitation (CLIP) of RBPs of interest (5) followed by deep sequencing (6–8). In a particular variant of CLIP, termed PAR-CLIP (photoactivatable-ribonucleoside-enhanced cross-linking and immunoprecipitation), the incorporation of photo-reactive nucleotides in mRNAs prior to cross-linking induces a specific mutational signature in the sequenced reads relative to the reference genome, thereby enabling the separation of cross-linked binding sites from other RNA fragments that are captured non-specifically during the experiment (7). Many questions concerning the function, specificity and modulation of activity of RBPs can be addressed through analyses of corresponding PAR-CLIP data sets. For example, the sites with the highest number of cross-linking events (indicated by T-to-C mutations in the sequenced reads) can be analyzed to uncover the sequence specificity of the RBP and to identify cellular pathways that are targeted by the RBP in a concerted manner. Moreover, with PAR-CLIP data available for multiple RBPs, one can begin to identify regions of cross-talk between multiple RBPs on individual mRNAs.
Here, we describe a database of binding sites that we constructed based on CLIP data for various proteins that are known to regulate mRNA splicing (polypyrimidine tract binding protein), stability and/or translation rate (Quaking, Pumilio 2, Argonautes 1–4, TNRC6 A–C, Insulin-like growth factor II mRNA-binding proteins 1–3). The data are presented through a web interface that supports not only visualizations but also further analyses of RBP binding sites. The platform also allows registered users to submit short reads resulting from CLIP, small RNA sequencing and mRNA sequencing experiments for functional annotation. Once uploaded, these data can be explored through various interactive analysis tools that we developed. Owing to its user- and dataset-management system, the platform can support collaborative projects involving private and public data and multiple users. This resource is intended for researchers who study the mechanisms regulating mRNA stability and translation.
The computational pipeline underlying the construction of the CLIPZ database takes as input fasta-formatted files of sequences that were obtained from CLIP samples through deep sequencing. These sequences are submitted to an initial annotation process that attempts to identify the origin (within the genome and within known transcripts) of individual sequence reads. The annotation procedure is described in detail elsewhere (9). Briefly, it consists of adaptor removal, mapping of sequenced reads to the genome and to known transcripts and functional annotation of each read based on its best mappings. A sketch of the data flow is shown in Figure 1.
During sample preparation, adaptors are ligated at both 5′ and 3′ ends of CLIP sequence fragments. Because most of the CLIP data that is currently available has been generated using the Solexa sequencing technology (10), our procedure for adaptor removal is specific to this technology (though other adaptor configurations can easily be taken into account). The 5′ adaptor serves as a sequencing primer, and we expect that only the 3′ adaptor (or part of it) is sequenced. We use an in-house ends-free local alignment algorithm (11) (parameters: 2 for match, −3 for mismatch, −5 for gap opening, −2 for gap extension) to align the 3′ adaptor to the reads. The part of the sequence read that aligns to the 5′ end of the 3′ adaptor is removed, and if the remainder of the read is longer than 15nt, it is retained for further analysis. Distinct sequences are deposited in the database together with their copy number in the sample under study.
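The trimming and collapsing steps described above can be sketched as follows. This is a simplified illustration rather than the in-house algorithm: it reuses the match and mismatch scores quoted in the text (2 and −3) but ignores gaps entirely, and the function names and the minimum alignment score `min_score` are assumptions introduced here.

```python
from collections import Counter

MATCH, MISMATCH = 2, -3   # scores quoted in the text; gaps are NOT modeled in this sketch
MIN_LENGTH = 15           # a trimmed read is kept only if it is longer than this (nt)


def trim_3p_adaptor(read, adaptor, min_score=10):
    """Remove the best-scoring match to the 5' end of the 3' adaptor.

    Ungapped simplification: the adaptor is slid along the read and every
    offset is scored over the overlapping bases only. `min_score` is a
    hypothetical cutoff, not a value taken from the paper.
    """
    best_score, best_start = 0, len(read)
    for start in range(len(read)):
        overlap = min(len(read) - start, len(adaptor))
        score = sum(
            MATCH if read[start + k] == adaptor[k] else MISMATCH
            for k in range(overlap)
        )
        if score > best_score:
            best_score, best_start = score, start
    return read[:best_start] if best_score >= min_score else read


def collapse_reads(reads, adaptor):
    """Trim every read and count the copies of each distinct retained sequence."""
    counts = Counter()
    for read in reads:
        insert = trim_3p_adaptor(read, adaptor)
        if len(insert) > MIN_LENGTH:
            counts[insert] += 1
    return counts
```

The returned counter corresponds to the distinct sequences and copy numbers that are deposited in the database; a gapped, affine-penalty alignment would be needed to reproduce the full parameter set given above.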
All distinct sequences are mapped to the genome assembly. Currently, the database contains CLIP samples obtained from human cells, for which we used the hg18 version of the human genome assembly from the University of California at Santa Cruz (http://genome.cse.ucsc.edu), but analyses of mouse data sets are also supported. Because not all transcripts that have been sequenced and are present in sequence databases can be mapped to the genome assembly and because various contaminants can be found in CLIP samples, we also map the reads to a database of sequences with known function (ribosomal, transfer, small cytoplasmic, small nuclear and small nucleolar RNAs, PIWI proteins-associated RNAs, miRNAs, messenger RNAs, miscellaneous non-coding RNAs obtained from sequencing projects, bacterial and fungal ribosomal RNAs, genomes of common bacteria, vector, adaptor and size marker sequences). The sources of these sequences are as follows.
To align sequence reads to target sequences, we use the ‘Oligomap’ algorithm (9) that exhaustively reports the mappings with 0 or 1 error (mismatch, insertion or deletion). The ‘Oligomap’ software can be downloaded from http://www.mirz.unibas.ch/software.php. In principle, we take into account all the possible loci for a given sequence read and we assume that the read originated from any of these loci with equal probability. Based on the GMAP (12) mappings of mRNAs to genome, we determine whether a genome-mapped read falls inside an intronic or exonic region. Based on the coding region annotation of transcripts in Genbank, we determine whether the exonic sequence reads originate from the 5′UTR, CDS or 3′UTR region of the individual transcripts. Sequence reads that map to regions with alternative splicing are assigned fractional numbers that denote the proportion of transcripts in which the region appeared in a particular section of the transcript.
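The equal-probability treatment of multiple loci and the fractional assignment of reads to transcript regions can be illustrated with a short sketch. The data layout (dictionaries keyed by read, loci given as strings) and the function names are assumptions made for this example; they do not reflect the actual CLIPZ data structures.

```python
from collections import Counter, defaultdict


def distribute_copies(mappings, copy_numbers):
    """Split each read's copy number equally among all loci it maps to.

    mappings:     read sequence -> list of loci (any hashable locus label)
    copy_numbers: read sequence -> number of copies in the sample
    """
    weights = defaultdict(float)
    for read, loci in mappings.items():
        if not loci:
            continue  # unmapped reads contribute nothing
        share = copy_numbers[read] / len(loci)
        for locus in loci:
            weights[locus] += share
    return dict(weights)


def region_fractions(regions):
    """Fraction of overlapping transcripts that place a read in each region.

    regions: one label per transcript (for example "5'UTR", "CDS", "3'UTR").
    """
    counts = Counter(regions)
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}
```

For example, a read with four copies that maps to two loci contributes two copies to each, and a read that lies in the CDS of three transcripts and in the 3′UTR of a fourth receives the fractions 0.75 and 0.25.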
Whenever an extracted sequence read maps to one or more known sequence(s) of the same functional category, that functional category is readily transferred to the sequence read. There are, however, sequence reads that map equally well to known sequences of different functional categories (e.g. tRNA, rRNA, mRNA and repeat). In these cases, we assign a functional annotation with the following priority scheme rRNA > tRNA > snRNA > snoRNA > scRNA > miRNA > piRNA > repeat > miscRNA > mRNA (reflecting roughly the abundance of various types of sequences in the cell).
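A minimal sketch of this priority scheme is given below. The category order is taken from the text; the function name and the "unknown" fallback label are assumptions added for illustration.

```python
# Priority order quoted in the text, from highest to lowest priority.
PRIORITY = ["rRNA", "tRNA", "snRNA", "snoRNA", "scRNA",
            "miRNA", "piRNA", "repeat", "miscRNA", "mRNA"]
RANK = {category: rank for rank, category in enumerate(PRIORITY)}


def annotate(categories):
    """Pick one functional category for a read that maps equally well to several.

    categories: the categories of all best-scoring target sequences.
    Returns "unknown" (a hypothetical label) if none of them is in the scheme.
    """
    known = [c for c in categories if c in RANK]
    if not known:
        return "unknown"
    return min(known, key=RANK.__getitem__)
```

A read whose best mappings hit a tRNA, an mRNA and a repeat element would thus be annotated as tRNA.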
Initial analysis of PAR-CLIP data indicated that the sequence reads obtained in individual experiments generally form well-delimited, relatively short (20–40nt) clusters. When the binding specificity of the protein was already known, the clusters obtained from PAR-CLIP data typically contained the sequence motif known to represent the binding site of the protein (7). We therefore use a cluster as the central unit for data analysis and visualization. Two sequences are placed in the same cluster if they overlap by at least one nucleotide in their genomic or transcript location. We note that in data sets obtained with other CLIP protocols, the correspondence between clusters that are generated this way and individual RBP binding sites may not be as clear as it is in PAR-CLIP. As more data generated with different variants of CLIP becomes available, the definition of the visualization unit (ideally the RBP binding site) may need to be revised accordingly. Furthermore, in PAR-CLIP experiments T-to-C mutations are indicative of cross-linked positions and our analysis has shown that clusters with the largest number of T-to-C mutations are most enriched in functional binding sites for the studied RBP. The number of T-to-C mutations as well as other statistics are therefore computed for each cluster and made available in the interface. The user can sort the clusters based on these computed features in order to extract the targets that are most frequently bound by the RBP of interest.
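The clustering rule (two reads belong to the same cluster if they overlap by at least one nucleotide) can be sketched as a single pass over sorted intervals. The tuple layout, the half-open coordinate convention and the names below are assumptions of this example; strand handling is omitted and could be folded into the sequence identifier.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    seq_id: str        # chromosome or transcript identifier
    start: int         # 0-based, inclusive
    end: int           # exclusive
    reads: int = 0     # number of reads in the cluster
    t_to_c: int = 0    # total T-to-C mutations over all reads


def cluster_reads(reads):
    """Group reads that overlap by at least one nucleotide.

    reads: iterable of (seq_id, start, end, t_to_c) tuples, one per read,
           with half-open [start, end) coordinates.
    """
    clusters = []
    for seq_id, start, end, t_to_c in sorted(reads):
        last = clusters[-1] if clusters else None
        if last is not None and last.seq_id == seq_id and start < last.end:
            last.end = max(last.end, end)      # extend the current cluster
        else:
            last = Cluster(seq_id, start, end)
            clusters.append(last)
        last.reads += 1
        last.t_to_c += t_to_c
    return clusters


def rank_by_crosslinks(clusters):
    """Sort clusters by decreasing number of T-to-C mutations."""
    return sorted(clusters, key=lambda c: c.t_to_c, reverse=True)
```

Sorting by `t_to_c` mirrors the ranking that the interface offers for extracting the sites most frequently cross-linked to the RBP; other per-cluster statistics could be accumulated in the same loop.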
We use a MySQL 5 database management system to store the results of the functional annotation process and to support downstream analyses. The database contains the following types of tables:
In order to maximize the efficiency of processing subsequent queries, database tables are generated for each individual sample (for the detailed description of the database schema see the ‘Help’ pages provided on the web site).
The software supporting the web-based queries has the following components (see Figure 2).
The web server is responsible for the validation of the user inputs and for rendering the results of various computations. It uses PHP 5 and a Model View Controller (MVC)-Framework that we developed. It communicates with the application server using a freely available PHP-Java bridge from http://php-java-bridge.sourceforge.net/pjb/.
The application server, implemented in Java 1.6, provides functions that can be accessed by the web server, such as applying the functional annotation pipeline to an uploaded sample. It is also responsible for process control, logging the job outputs and reporting the errors whenever jobs fail. Due to the large volume of typical CLIP data sets, we employ a PC-Cluster for parallel processing. The job distribution to the cluster and the handling of conflicts that may result from multiple parallel-running jobs requesting the same data/resource at the same time are also handled by the application server.
For each sample in the database, the user can browse the clusters of overlapping sequence reads which in the PAR-CLIP samples typically correspond to individual RBP binding sites. The clusters can be sorted by various criteria including the number of T-to-C mutations in all reads of a cluster, which in the PAR-CLIP experiments is indicative of the affinity of the protein for the RNA. To distinguish cross-link-induced mutations from single nucleotide polymorphisms (SNPs) we incorporated a track that shows the known SNPs, and for identifying the miRNAs that guide the Argonaute to the target RNA, we incorporated a track of predicted miRNA binding sites (14) (see Figure 3).
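As an illustration of how T-to-C mutations might be separated from known polymorphisms, the sketch below collects T-to-C mismatches from an ungapped read alignment and flags those that coincide with known SNP positions. In CLIPZ this comparison is made visually through the SNP track; the functions, their names and the ungapped-alignment assumption are introduced here for illustration only.

```python
def t_to_c_positions(reference, read, offset=0):
    """Reference coordinates at which the reference has T and the read has C.

    Assumes an ungapped alignment of `read` to `reference`, with the first
    aligned base located at coordinate `offset`.
    """
    return [
        offset + i
        for i, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base == "T" and read_base == "C"
    ]


def split_by_snp(positions, known_snps):
    """Separate candidate cross-link sites from positions that are known SNPs."""
    snps = set(known_snps)
    candidates = [p for p in positions if p not in snps]
    flagged = [p for p in positions if p in snps]
    return candidates, flagged
```

Positions returned in the second list would warrant caution, since a T-to-C difference at a known SNP may reflect genotype rather than cross-linking.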
The association of an RBP with a specific site and the downstream effects of this interaction frequently depend on other regulatory elements that are present in close vicinity and recruit other regulatory factors. Through the transcript and genome browsers, one can visualize the position of binding sites within transcripts, as well as the spatial relationship between binding sites determined in different experiments, as shown in Figure 4.
Many questions arising in the context of analyzing RBP binding sites can be phrased in terms of the spatial relationship between binding sites obtained in different experiments. For example, one would like to know whether experimental results for one protein are reproducible, in which case we expect that the sets of sites obtained in different experiments are largely identical. Alternatively, one may like to find out whether two proteins frequently compete for sites, in which case we would expect that the sites are occupied by one of the proteins in one condition and by the other protein in a different condition. The super-clustering tool enables the user to uncover such relationships. The available visualizations are very similar to those described for clusters of a single RBP, but they operate on super-clusters, which are built through single-linkage clustering of clusters that were obtained in different experiments and that either overlap or lie within a specified maximum distance of each other. The user may define complex operations between sites obtained in different experiments using logical operators such as OR, AND and NOT.
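The construction of super-clusters and the logical selection of sites can be sketched as follows. For intervals on a line, single-linkage clustering with a distance cutoff reduces to one sweep over the sorted intervals. The data layout, the function names and the keyword arguments `all_of`, `any_of` and `none_of` (standing in for AND, OR and NOT) are assumptions of this example, and the experiment labels in the usage note are placeholders.

```python
def build_super_clusters(sites, max_distance=0):
    """Single-linkage grouping of clusters from several experiments.

    sites: iterable of (seq_id, start, end, experiment), half-open coordinates.
    Two clusters are linked if they overlap or if the gap between them is at
    most `max_distance` nucleotides (a gap of 0 means directly adjacent).
    """
    supers = []
    for seq_id, start, end, experiment in sorted(sites):
        last = supers[-1] if supers else None
        if (last is not None and last["seq_id"] == seq_id
                and start - last["end"] <= max_distance):
            last["end"] = max(last["end"], end)
            last["experiments"].add(experiment)
        else:
            supers.append({"seq_id": seq_id, "start": start, "end": end,
                           "experiments": {experiment}})
    return supers


def select(supers, all_of=(), any_of=(), none_of=()):
    """Filter super-clusters with AND (all_of), OR (any_of) and NOT (none_of)."""
    chosen = []
    for sc in supers:
        found = sc["experiments"]
        if (set(all_of) <= found
                and (not any_of or found & set(any_of))
                and not found & set(none_of)):
            chosen.append(sc)
    return chosen
```

With this sketch, `select(supers, all_of=["exp_A", "exp_B"])` would return sites shared by two experiments, which addresses the reproducibility question, whereas `select(supers, all_of=["exp_A"], none_of=["exp_B"])` would return sites seen only in the first.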
Another common question is whether any binding sites are known for specific transcripts or genes that a user may be studying. To be able to answer this question, we implemented a search tool that allows the user to retrieve from the database a gene name or symbol, select an accession number associated with it and access the binding site information associated with the transcript in our database (see Figure 5).
Because the Argonaute/EIF2C proteins that are part of the RNA-induced silencing complex have been a major focus of the CLIP studies performed so far, we integrated in our server a set of tools that enable the user to explore the identity, abundance and predicted targets of the miRNAs that were isolated in the CLIP samples. These tools have been described extensively elsewhere (15).
Finally, one of the main reasons for performing CLIP studies is to determine the sequence specificity of a protein of interest. How a multi-domain RBP contacts RNAs is a challenging question that most likely requires complex computational as well as experimental analyses. However, to provide some preliminary insights we implemented a tool that identifies sequence motifs (defined as n-mers) that are over-represented in an input file (which could contain for instance the most abundant 1000 clusters obtained in an experiment) compared with randomized sequences with the same mono/di-nucleotide composition. We have previously used this tool to show that the motifs that are most over-represented in the clusters from Argonaute/EIF2C PAR-CLIP experiments correspond to the reverse complements of the 5′ end (‘seed’ region) of the most abundant miRNAs in the cell (7).
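The idea behind the motif tool can be illustrated with a small sketch that compares n-mer counts in the input sequences with counts in shuffled versions of the same sequences. This is not the CLIPZ implementation: the shuffling below preserves only the mononucleotide composition (preserving dinucleotide composition would require a different shuffling procedure), and the scoring formula with its pseudocount, the function names and the default parameters are assumptions made here.

```python
import random
from collections import Counter


def count_kmers(seqs, k):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def shuffled(seq, rng):
    """Random permutation of a sequence (same mononucleotide composition)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def enriched_kmers(seqs, k=6, n_shuffles=100, seed=0):
    """Rank k-mers by (observed + 1) / (mean count in shuffled sets + 1)."""
    rng = random.Random(seed)
    observed = count_kmers(seqs, k)
    background = Counter()
    for _ in range(n_shuffles):
        background.update(count_kmers([shuffled(s, rng) for s in seqs], k))
    scores = {
        kmer: (obs + 1) / (background[kmer] / n_shuffles + 1)
        for kmer, obs in observed.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The input could be, for instance, the sequences of the most abundant clusters of an experiment; n-mers at the top of the returned list are candidates for the sequence specificity of the protein and would still need statistical and experimental validation.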
Deciphering the post-transcriptional regulatory code that is implemented by regulatory RNAs and RBPs is a problem of great interest (5–8,16–18). The bottleneck in characterizing RBP binding sites is no longer the availability of an experimental approach, but rather the efficient analysis of the large volumes of data that result from such experiments. Here, we present a software system that we developed to analyze data resulting from CLIP experiments. With this system we constructed a database of RBP binding sites that were determined through CLIP and deep sequencing. Our system provides several views of the data, from the level of sequence reads to that of a whole-genome browser. Transcript regions that have the highest abundance in the CLIP data or that exhibit the highest number of cross-linking events can be easily extracted for further analyses. Both the database and the analysis environment can be easily extended. Registered users can expand the database by submitting their own sequence data sets, the repertoire of organisms can be expanded to include additional species for which a genome assembly is available, and the genome assemblies and transcript databases that are used in the analysis pipeline can be updated as necessary. In the future, we will continue to develop the platform in order to accommodate developments in the sequencing technologies. We expect for instance that the increase in sample size and sequence read length will require the use of heuristic algorithms for mapping short reads to the genome. Such algorithms are in fact already available (19–22), and integrating them will only require adapter programs that interface them with the database that stores the alignments. Thus, CLIPZ can eliminate many bottlenecks in the computational analysis of CLIP data and can form the basis for a repository of binding site data for RNA-binding proteins.
ProDoc program of the Swiss National Science Foundation [Grants PDAMP3_127218 and PDFMP3_123123]; Swiss Institute of Bioinformatics. Funding for open access charge: University of Basel.
Conflict of interest statement. None declared.
We are grateful to Lukasz Jaskiewicz, Shivendra Kishore and the other members of the Zavolan group, to Markus Hafner, Manuel Ascano and Markus Landthaler of the Tuschl group for providing input and feedback on individual tools and to Lukas Burger and Jean Hausser for critical comments on the manuscript.