Recombination signal sequences (RSSs) flanking V, D and J gene segments are recognized and cut by the VDJ recombinase during development of B and T lymphocytes. All RSSs are composed of seven conserved nucleotides, followed by a spacer (containing either 12 ± 1 or 23 ± 1 poorly conserved nucleotides) and a conserved nonamer. Errors in V(D)J recombination, including cleavage of cryptic RSSs outside the immunoglobulin and T cell receptor loci, are associated with oncogenic translocations observed in some lymphoid malignancies. We present in this paper the RSSsite web server, which is available from the address http://www.itb.cnr.it/rss. RSSsite consists of a web-accessible database, RSSdb, for the identification of pre-computed potential RSSs, and of the related search tool, DnaGrab, which allows the scoring of potential RSSs in user-supplied sequences. This latter algorithm makes use of probability models, which can be recast as Bayesian networks, taking into account correlations between groups of positions of a sequence, developed starting from specific reference sets of RSSs. In validation laboratory experiments, we selected 33 predicted cryptic RSSs (cRSSs) from 11 chromosomal regions outside the immunoglobulin and TCR loci for functional testing.
V(D)J recombination is a mechanism of vertebrate genetic recombination that assembles gene segments into functional immunoglobulin (Ig) and T-cell receptor (TCR) genes. This site-specific recombination reaction generates the enormous repertoire of TCR and Ig molecules that are necessary for the recognition of diverse antigens from bacterial, viral and parasitic invaders. This reaction is directed by recombination signal sequences (RSSs), which flank each of the hundreds of potential donor gene segments. The V(D)J recombinase, composed of the RAG1 and RAG2 proteins, introduces double-strand DNA breaks at the junction between an RSS and the flanking gene segment. DNA repair activity then re-joins breaks at two distant cuts to generate a functional gene through chromosomal rearrangement. Each RSS is composed of seven conserved nucleotides (a heptamer), residing next to the gene-encoding sequence, followed by a spacer (containing either 12 ± 1 or 23 ± 1 poorly conserved nucleotides) and a conserved nonamer (9 bp). The RSSs are present on the 3′-side of a V region, on both sides of D segments, and on the 5′-side of the J region. Assembly of the correct composition of gene segments is directed by spacer length; recombination only joins gene segments flanked by RSSs with different spacer lengths.
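The heptamer-spacer-nonamer layout and the spacer-length rule described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the heptamer `CACAGTG` and nonamer `ACAAAAACC` used in the example are the commonly cited consensus motifs, paired here with arbitrary spacers; none of this is taken from the RSSsite code.

```python
HEPTAMER_LEN = 7
NONAMER_LEN = 9


def spacer_class(spacer_len):
    """Return 12 or 23 if the spacer length is within +/-1 of that class, else None."""
    for nominal in (12, 23):
        if abs(spacer_len - nominal) <= 1:
            return nominal
    return None


def split_rss(rss):
    """Split an RSS string into (heptamer, spacer, nonamer)."""
    rss = rss.upper()
    spacer_len = len(rss) - HEPTAMER_LEN - NONAMER_LEN
    if spacer_class(spacer_len) is None:
        raise ValueError(f"spacer of {spacer_len} nt is not 12 +/- 1 or 23 +/- 1")
    return rss[:HEPTAMER_LEN], rss[HEPTAMER_LEN:-NONAMER_LEN], rss[-NONAMER_LEN:]


def can_recombine(rss_a, rss_b):
    """Spacer-length rule: the two RSSs must belong to different spacer classes."""
    class_a = spacer_class(len(split_rss(rss_a)[1]))
    class_b = spacer_class(len(split_rss(rss_b)[1]))
    return {class_a, class_b} == {12, 23}


# Illustrative sequences: consensus heptamer and nonamer, arbitrary spacers.
rss12 = "CACAGTG" + "ATCG" * 3 + "ACAAAAACC"
rss23 = "CACAGTG" + "ATCG" * 5 + "ATC" + "ACAAAAACC"
print(can_recombine(rss12, rss23), can_recombine(rss12, rss12))
```

The check only encodes the structural constraint; it says nothing about how efficiently a given pair would actually recombine.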
Aberrant V(D)J recombination activity has been associated with oncogenic chromosomal translocations in lymphoid leukemia and lymphomas. The mechanisms of translocation remain unclear, but appear to include aberrant cutting of RSS-like sequences (cryptic RSSs) by the V(D)J recombinase at sites outside the Ig and TCR loci. Joining of these breaks to recombinase cuts within the Ig or TCR loci potentially leads to high expression of an oncogene in lymphocytes and leukemic transformation (1,2).
Considering the impact of this process in the context of translocations related to leukemia, a tool for predicting cRSSs on a genome-wide scale would be very useful to identify potential recombination sites. This task is complicated by the modular nature of RSSs and the variability of nucleotide sequences within the heptamer, spacer and nonamer elements. Using correlations between nucleotides at several positions within the 12 and 23 spacers and correlations between the identities of nucleotides at key positions, it is possible to predict the overall recombination efficiency of an RSS (3–5). It is now recognized that while RSSs are defined by a strict requirement for highly conserved nucleotides in the heptamer and nonamer, the quality of RSS function is determined in an analog manner by numerous complex interactions between the RAG proteins and the less-well conserved nucleotides in the heptamer, the nonamer, and, importantly, the spacer. For the latter, the importance of consensus nucleotides at defined positions is emerging as a determinant for the efficient recognition of RSSs by RAG proteins (5).
To predict cryptic RSSs, we developed a software tool, DNAGrab, which scans the whole genome to identify candidate cRSSs. This algorithm makes use of probability models, which can be recast as Bayesian networks, taking into account the correlations between groups of positions of a sequence, developed starting from specific reference sets of physiological RSSs (3). These Bayesian models are created by searching for statistical correlations between each position in the RSS pattern and all the other positions, placing no restrictions on the number of correlations or on the spatial relationship between the positions in the sequence. This approach determines, from all possible combinations of disjoint probability distributions, the set of distributions that most effectively distinguishes functional sites from non-functional sequences. Although the family of models to consider is very large, the use of mutual information allows a fast model selection by maximizing the mean recombination information content (RIC) for physiological RSSs (3). The RIC score, defined as the natural logarithm of the joint probability function of mutually correlated positions, is finally used to predict the possible functional cryptic RSSs.
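As a rough illustration of the scoring idea, the sketch below computes a RIC-like score as the natural logarithm of a product of joint probabilities, one per disjoint group of positions. It is a toy under stated assumptions: the position groups, the pseudocount smoothing and the four-nucleotide reference set are invented for the example, and the mutual-information-based model selection of the published method (3) is not reproduced.

```python
import math
from collections import Counter


def fit_group_model(reference, groups, pseudocount=1.0):
    """Estimate one joint distribution per group of positions.

    reference: aligned, equal-length sequences (toy stand-in for a reference RSS set)
    groups: disjoint tuples of positions treated as mutually correlated (hypothetical)
    """
    model = []
    for group in groups:
        counts = Counter(tuple(seq[i] for i in group) for seq in reference)
        # Smoothed denominator: observations plus one pseudocount per possible state.
        total = len(reference) + pseudocount * 4 ** len(group)
        model.append((group, counts, total))
    return model


def ric_like_score(seq, model, pseudocount=1.0):
    """Natural log of the product of the per-group joint probabilities."""
    score = 0.0
    for group, counts, total in model:
        key = tuple(seq[i] for i in group)
        score += math.log((counts[key] + pseudocount) / total)
    return score


# Toy data: positions 0 and 1 are modelled jointly, positions 2 and 3 independently.
toy_reference = ["CACA", "CACA", "CACT", "CATA"]
toy_groups = [(0, 1), (2,), (3,)]
toy_model = fit_group_model(toy_reference, toy_groups)
print(ric_like_score("CACA", toy_model), ric_like_score("GGGG", toy_model))
```

Sequences resembling the reference set receive a score closer to zero, which is why pass thresholds on such a score are negative numbers.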
The ability of this score to predict the effective recombination efficiency of RSSs has long been investigated and validated with in vitro testing (3–5). However, some recent experiments that determined how well RIC scores correlate with the levels of RAG-mediated cleavage and V(D)J recombination activity demonstrate that the prediction capability of the RIC score with respect to in vivo testing is quite weak (6). This aspect is explained, at least partially, by experiments showing the effect of the chromatin structure on the RAG cleavage efficiency, which confirms the role of chromatin in discriminating the RSS functionality (7). This information is not included in the RIC score, whose functional prediction capability must therefore be considered, in the light of this limitation, only as a preliminary screening of genome-wide predictions.
The implementation of DNAGrab, written in C++ and based on modern computational optimization techniques, allows a time-efficient screening of genome-wide data. To perform a genome scale analysis, we started from the previously developed models for RSSs with 12 and 23 spacer nucleotides (4), which define the groups of spatial positions to be considered for correlation screening. While other models can be developed which take into account different nucleotide spacers, in a genome-wide perspective our choice was to focus on 12 and 23 spaced RSS, since only these models are experimentally validated.
Concerning the mouse genome, we used a non-redundant version of the reference datasets employed in the original work (4), for a total of 143 RSS12 and 145 RSS23 sequences. Regarding the human genome, the reference sets of 168 RSS12 and 178 RSS23 were compiled in the context of this study, selecting sequences available from the IMGT database (8). The human and mouse RSS reference datasets can be downloaded from the RSSsite homepage.
These models were used by DNAGrab to scan all human and mouse chromosomes, searching for candidate cRSS sequences. To provide a reasonable dataset of cRSSs predicted to be possible functional substrates for the V(D)J recombinase, we scored the DNAGrab-provided RSSs using the RIC score. In detail, the algorithm computes a score for each sequence starting with CA and compares it with a score that has been experimentally correlated with the RSS function. In the current version of the system, pass/fail RIC thresholds are set according to the work of Cowell and colleagues (3,4): RSS12 are scored as functional (pass) with RIC ≥ −38.81, while RSS23 pass with RIC ≥ −58.45.
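The scan-and-threshold step can be sketched as follows. This is not the DNAGrab implementation (which is written in C++): the function names are hypothetical, only the forward strand is scanned, the window lengths assume spacers of exactly 12 and 23 nucleotides, and `score_fn` is a placeholder for a real RIC scorer. Only the two pass thresholds are taken from the text.

```python
# Pass thresholds quoted in the text: RSS12 pass with RIC >= -38.81, RSS23 with RIC >= -58.45.
RIC_PASS = {12: -38.81, 23: -58.45}
# Window length = heptamer (7) + spacer + nonamer (9).
RSS_LEN = {12: 7 + 12 + 9, 23: 7 + 23 + 9}


def candidate_windows(seq, spacer):
    """Yield (start, window) for every full-length window that begins with CA."""
    seq = seq.upper()
    n = RSS_LEN[spacer]
    for start in range(len(seq) - n + 1):
        if seq.startswith("CA", start):
            yield start, seq[start:start + n]


def scan(seq, spacer, score_fn):
    """Score each candidate and flag it as pass/fail against the threshold."""
    hits = []
    for start, window in candidate_windows(seq, spacer):
        ric = score_fn(window)
        hits.append((start, window, ric, ric >= RIC_PASS[spacer]))
    return hits


def dummy_score(window):
    """Placeholder scorer for the demo only; a real model would return a RIC value."""
    return -30.0 if window.startswith("CACAGTG") else -80.0


demo = "GG" + "CACAGTG" + "T" * 12 + "ACAAAAACC" + "GG"
for start, window, ric, passed in scan(demo, 12, dummy_score):
    print(start, window, ric, "pass" if passed else "fail")
```

Keeping the failing candidates in the result mirrors the behaviour described for user-supplied sequences, where scores are reported even below the threshold.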
Using DNAGrab, the RSS reference models and the RIC pass thresholds, a total of 3 089 308 and 1 833 319 potential RSS12 sequences were identified in human and mouse genomes, respectively, while for RSS23 3 218 664 and 2 091 561 RSSs passing RIC were identified and mapped. Considering that ~5% of cRSS locations are scored as both 12 and 23 RSS in both genomes, these values indicate a global density of about 1 cRSS per 500 bp for human and 1 cRSS per 720 bp for mouse. These cRSS densities compare with the estimates from Lewis and colleagues (9) of 1 cRSS per 600 bp, which were based only on inference from functional testing of plasmid DNA. Statistics of the genome-wide distribution of cRSSs, which are predicted to be functional according to this scoring scheme, are reported in Tables 1 and 2. There is no significant difference in the number of 12 and 23 cRSSs within the human and the mouse genome. On the other hand, the difference in the frequency of 12 and 23 cRSSs between these two eukaryotic genomes can be attributed to the different reference datasets. There seems to be no apparent selection between intragenic and extragenic regions.
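The density figures can be checked with a short back-of-the-envelope calculation. The counts are those reported above; the genome sizes (about 3.1 Gb for human and 2.7 Gb for mouse) are rounded assumptions not stated in the text, and the ~5% overlap is interpreted here as the fraction of distinct locations counted once as RSS12 and once as RSS23.

```python
# Counts of cRSSs passing the RIC thresholds, as reported in the text.
COUNTS = {
    "human": {"rss12": 3_089_308, "rss23": 3_218_664},
    "mouse": {"rss12": 1_833_319, "rss23": 2_091_561},
}
# ASSUMPTION: approximate assembly sizes in bp, rounded; not taken from the text.
GENOME_BP = {"human": 3.1e9, "mouse": 2.7e9}
# ~5% of locations are scored as both RSS12 and RSS23 (interpretation, see lead-in).
OVERLAP = 0.05


def bp_per_crss(species):
    """Average number of base pairs per distinct predicted cRSS location."""
    total = sum(COUNTS[species].values())
    distinct = total / (1 + OVERLAP)  # remove the double-counted locations
    return GENOME_BP[species] / distinct


for species in COUNTS:
    print(species, round(bp_per_crss(species)))
```

Under these assumptions the result lands close to the reported values of roughly 1 cRSS per 500 bp for human and 1 per 720 bp for mouse; a different genome size or overlap interpretation shifts the numbers by a few percent.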
Data about cRSS predictions performed on the human chromosome sequences (Hg18) and mouse chromosome sequences (Mm9), as downloaded from the UCSC Genome Browser (10), have been collected in a MySQL database. A web interface has been developed using HTML and the perl scripting language. Using this interface, it is possible to query the database about predicted human and mouse cRSSs, both with 12 and 23 spacers, and also to analyze user-provided human or murine sequences using the DNAGrab algorithm directly.
Three different interfaces have been provided to query the database and retrieve predicted cRSS subsets: a genomic region ‘Cytoband search’, a position-related ‘Chromosomal search’ and a generic ‘Gene search’ on all chromosomes. Using the first option, the cRSS search is performed in a specific cytoband of the selected chromosome, according to annotations reported in our local database. Using the second option, the cRSS analysis is restricted to a specific region of a single chromosome, according to the UCSC genomic coordinates. Gene search in all chromosomes is based on the string entered by the user in a text area, which is compared with all the gene symbols, RefSeq and protein accession identifiers stored in the local annotation database. All the cRSSs predicted within the user-specified gene will be displayed with the relative genomic coordinates and RIC score. The query region extends from the transcription start site to the transcription end site of the gene and can be expanded upstream and downstream using the menu in the text area.
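The gene-based query described above amounts to an interval filter over the stored predictions. The sketch below illustrates that logic in Python; the record layout, the function names and the strand-aware handling of upstream and downstream flanks are assumptions made for the example and do not describe the actual RSSdb schema or its Perl interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CRSS:
    """Hypothetical record for one predicted cRSS."""
    chrom: str
    start: int
    end: int
    spacer: int
    ric: float


def gene_query_region(tx_start, tx_end, upstream=0, downstream=0, strand="+"):
    """Expand a transcript interval (tx_start < tx_end in genomic coordinates).

    On the minus strand, upstream lies at higher genomic coordinates.
    """
    if strand == "+":
        return max(0, tx_start - upstream), tx_end + downstream
    return max(0, tx_start - downstream), tx_end + upstream


def crss_in_region(records, chrom, start, end):
    """Keep the cRSSs that lie entirely within chrom:start-end."""
    return [r for r in records if r.chrom == chrom and r.start >= start and r.end <= end]


records = [
    CRSS("chr1", 600, 628, 12, -35.2),
    CRSS("chr1", 2050, 2089, 23, -55.0),
    CRSS("chr2", 700, 728, 12, -30.1),
]
region = gene_query_region(1000, 2000, upstream=500, downstream=100)
print(crss_in_region(records, "chr1", *region))
```

With no flanks, only predictions between the transcription start and end sites are returned; the flanks widen the window in the direction appropriate to the gene's strand.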
The output results can be obtained in two different formats: tab-delimited text (Figure 1) or UCSC-uploadable format. The results can also be downloaded as a compressed file. If the UCSC-uploadable format option is selected, it will be possible, from the generated report, to display the RSS predictions as user-generated tracks in the UCSC Genome Browser. The system creates a temporary text file in BED format and uploads it directly to the UCSC web site (10); hence, this option may be slow depending on the number of sequences selected. An example of a cRSS track displayed on the UCSC annotation background of the Gnb211 mouse gene is shown in Figure 2.
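A custom track of this kind could be written along the following lines. The sketch assumes a six-column BED layout (chrom, 0-based start, end, name, score, strand) preceded by a `track` line; the function name, the track name and the choice to carry the RIC value in the name field (because the BED score column is restricted to integers from 0 to 1000) are illustrative decisions, not a description of the file RSSsite actually generates.

```python
def to_bed_track(hits, track_name="cRSS_predictions"):
    """Format predicted cRSSs as a UCSC custom track in six-column BED.

    hits: iterable of (chrom, start, end, spacer, ric, strand), with 0-based
    start and exclusive end as required by BED.
    """
    lines = [f'track name="{track_name}" description="Predicted cRSSs"']
    for chrom, start, end, spacer, ric, strand in hits:
        # RIC scores are negative, so they go in the name, not the score column.
        name = f"RSS{spacer}_RIC{ric:.2f}"
        lines.append(f"{chrom}\t{start}\t{end}\t{name}\t0\t{strand}")
    return "\n".join(lines) + "\n"


example_hits = [
    ("chr5", 100, 128, 12, -35.2, "+"),
    ("chr5", 500, 539, 23, -57.913, "-"),
]
print(to_bed_track(example_hits), end="")
```

The resulting text can be pasted into the custom-track form of the UCSC Genome Browser or saved to a file for upload.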
The ‘Analyse your own sequence’ section of the web interface enables users to identify the presence of putative cRSSs within any sequence. The user can choose to use the human or the murine model for searching both for 12 and 23 spaced RSSs. For multiple analyses only sequences in FASTA format are accepted while, for single sequence analysis, sequences with or without the FASTA definition line are accepted. The DnaGrab algorithm is used to predict cRSSs within user-provided sequences. While predicted cRSSs stored in the database are all considered functional according to the defined thresholds, this section provides a RIC score, with respect to the selected model, for all the substrings starting with CA, providing information also for sequences that did not pass the RIC filter. The output is also available in tabular form from the web page (Figure 3).
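The input handling described above, where a single sequence may come with or without a FASTA definition line while multiple sequences must be in FASTA format, can be sketched with a small parser. The function name and the default record name `query` are hypothetical; the server's own parsing code is not shown in this paper.

```python
def parse_sequences(text, default_name="query"):
    """Parse FASTA text, also accepting a bare sequence without a definition line.

    Returns a list of (name, sequence) tuples with sequences in upper case.
    """
    records = []
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the record collected so far, if any.
            if name is not None or chunks:
                records.append((name or default_name, "".join(chunks).upper()))
            header = line[1:].strip()
            name = header.split()[0] if header else default_name
            chunks = []
        else:
            chunks.append(line)
    if name is not None or chunks:
        records.append((name or default_name, "".join(chunks).upper()))
    return records


print(parse_sequences("acgt\nCA"))
print(parse_sequences(">s1 first test\nAC\nGT\n>s2\nCA"))
```

Each parsed sequence could then be passed to a scorer for both the 12- and 23-spacer models of the chosen species.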
The RSSsite web server provides a valuable tool to build preliminary in silico hypotheses on the sequence-based mechanisms regulating both physiological and aberrant V(D)J recombination. For example, an earlier version of RSSsite was used in a study by Dik and colleagues (11) focusing on the t(11;14)(p13;q11) translocation, which is presumed to arise from an erroneous T-cell receptor delta (TCRD) V(D)J recombination and to result in LMO2 activation. Using our algorithm, this group was able to determine RIC scores for LMO2 cRSSs, which were used to drive functional experimentation. Furthermore, they analyzed the LMO2 locus (−10 kb to +30 kb) for the occurrence of 12- and 23-bp cRSSs predicted to be functional according to the RIC score.
In our laboratory experiments linked with this work, we selected 33 RSSs from 11 chromosomal regions outside the immunoglobulin and TCR loci for functional testing. The dataset is fully described in Table 3. Using ligation-mediated PCR (LM-PCR), V(D)J recombinase-mediated DNA breaks at RSSs were analyzed in genomic DNA prepared from mouse primary thymocytes, where RAG1 and RAG2 were expressed. We observed breaks of putative cRSSs in 12 out of the 33 sites tested (7 out of 15 for RSS12 and 5 out of 18 for RSS23). Concerning the distribution of these 12 RSSs with respect to genes, four of them are intragenic, while eight are outside genes (five upstream and three downstream). Breaks were detected in ~1% of genomic DNA. DNA sequencing of LM-PCR products confirmed that breaks had occurred precisely at the 5′ boundary of RSSs, which is consistent with bona fide V(D)J recombinase-mediated cuts. V(D)J recombinase-mediated breaks were detected at 8 out of the 11 chromosomal sites tested, including previously uncharacterized RSSs. No translocation products involving these break sites were detected in the assays performed so far. No selection was made for live cells; therefore, many unrepaired breaks may have led to cell death.
As discussed earlier in the ‘Algorithmic’ section of this work, while in vitro experiments have moderate correspondence with functional predictions achieved with the RIC score, in vivo tests demonstrate a lower accordance with the proposed results, which can be partially attributed to the influence of the chromatin structure on the recognition of recombination signal sequences by RAG proteins. Therefore, the RIC score provided by our algorithm should be considered only as a screening method for a preliminary identification of functional cRSSs in the context of genome-wide analyses, since it evaluates sequence patterns outside their chromosomal context. Nonetheless, we believe that our results, as confirmed by the presented experimental validation, can be valuable for the identification of potential recombination sites, although these predictions must be considered taking into account the actual limitations of the developed algorithm, which lacks, for example, information about the chromatin structure.
The identification and mapping of putative RSSs in the human and mouse genomes have significant potential for application, given the well-established observation of translocations at cRSS sites in lymphoid malignancies. We have described here RSSsite, a web-based database and search tool for the retrieval of predicted cRSSs from the human and mouse genomes, starting from a given chromosome region, a given gene identifier or user-supplied sequences. The software is freely available at the web address http://www.itb.cnr.it/rss.
The surprisingly high frequency of observed V(D)J breaks suggests that many cryptic RSSs may be cut by RAG proteins during lymphocyte development. However, V(D)J-mediated chromosomal translocations remain rare events. The control mechanisms that prevent more frequent involvement of aberrant V(D)J breaks in potentially oncogenic chromosomal translocations are currently unknown. We hope that the RSSsite web server will provide a valuable tool to systematically test genomic and eventually epigenetic mechanisms regulating RSS accessibility and usage at different chromosomal sites.
Italian Fund for Basic Research (FIRB-MIUR) project grants ‘ITALBIONET’ and ‘LITBIO’. Funding for open access charge: National Research Council.
Conflict of interest statement. None declared.
We gratefully acknowledge the help of Lindsay Cowell and Joe Volpe, who kindly provided an updated version of their RIC filter. We also thank the former students and collaborators who contributed to this project: Riccardo Fallini (MSc student at IFOM), Lizeta Gjanci (INGENIO Program fellow at ITB-CNR) and Chiara Bishop (Web Designer at ITB-CNR).