Alternative splicing is emerging as a major mechanism for the expansion of transcriptome and proteome diversity, particularly in humans and other vertebrates. However, the proportion of alternative transcripts and proteins actually endowed with functional activity is currently highly debated. We present here a new release of ASPicDB which now provides a unique annotation resource of human protein variants generated by alternative splicing. A total of 256939 protein variants from 17191 multi-exon genes have been extensively annotated through state-of-the-art machine learning tools providing information on the protein type (globular or transmembrane), localization, presence of PFAM domains, signal peptides, GPI-anchor propeptides, transmembrane and coiled-coil segments. Furthermore, full-length variants can now be specifically selected based on the annotation of CAGE tags and polyA signals and/or polyA sites, marking transcription initiation and termination sites, respectively. The retrieval can be carried out at the gene, transcript, exon, protein or splice site level, allowing the selection of data sets fulfilling one or more features set by the user. The retrieval interface also enables the selection of protein variants showing specific differences in the annotated features. ASPicDB is available at http://www.caspur.it/ASPicDB/.
Alternative splicing is a well characterized mechanism which, coupled with alternative initiation and termination of transcription (1), may expand the transcriptome and proteome complexity in human and other organisms by over one order of magnitude with respect to the number of annotated genes (2,3). In particular, it is now widely demonstrated that virtually all multi-exon genes may generate multiple transcripts and protein variants (3,4) and that the splicing process is tightly regulated in different physiological conditions, tissues or developmental stages (5). Furthermore, alterations of the splicing process can be observed in several genetic diseases and in cancer (6–10).
The huge amount of EST sequences (11) together with the relevant reference genome sequence has been used to carry out an extensive analysis of alternative splicing in human through the ASPIC algorithm (12–14). The alternative splicing pattern of human multi-exon genes, determined by ASPIC, has been collected in ASPicDB, a database resource which presents some unique features with respect to other similar databases (15). The ASPIC algorithm implements an optimization strategy that, performing a multiple alignment of all available transcript data (including full-length cDNA and EST sequences) to the relevant genome sequence, detects the set of introns that minimizes the number of splicing sites. It also generates through a directed-acyclic graph combinatorial procedure the minimal set of non-mergeable transcript isoforms compatible with the detected splicing events (14). The reliability of splicing isoforms detected by ASPIC has been recently established through a comparative assessment (16).
The advent of massive transcriptome sequence data generated by RNA-Seq (17) is steadily increasing the number of validated splicing sites and isoforms in human and other organisms, thus suggesting that a fraction of alternative splicing events are the result of background noise in the splicing process (18), which generates non-functional isoforms expressed at low level. Therefore, extensive research efforts are required to distinguish functional species-specific variants from non-functional ones originating from neutral drift in the splicing process, as well as to assess the biological role of functional isoforms.
The annotation of the protein variants predicted with ASPIC is an essential step for exploring the functional and structural diversity of the proteins originating from the same gene by means of alternative splicing and therefore for unraveling the complex physiological effects of alternative splicing events (19). Indeed, currently available databases, such as ASD (20), ASAP II (21), ASTALAVISTA (22) and H-DBAS (23), mostly collect information on alternative transcripts at the mRNA level, without considering the effect of alternative splicing on the protein structure and function. The ProSAS (24) database contains structural information as derived from comparative modeling procedure, but due to the limitations of the modeling techniques, only ~15% of the human transcripts are endowed with a reliable protein structure prediction.
ASPicDB aims at filling the gap of structural and functional annotation of protein splicing variants, by adopting a set of analysis and prediction tools that do not rely only on annotation transfer by sequence similarity. It provides a thorough computational annotation of predicted human protein variants including PFAM domains (25), N-terminal signal peptides, GPI-anchor propeptides, transmembrane domains, subcellular localization and other features, also reporting the relevant crosslinks to UniProtKB/SwissProt (26) and PDB databases (27). A comprehensive annotation of the domain architecture and other structural features could also be extremely useful to critically assess the reliability of the functional classification provided by the GO System (25), which still neglects much of the relevant information for alternative splicing products.
In addition, given the fragmented nature of the available transcript data, the new version of ASPicDB includes the annotation of CAGE tags (28) in order to identify true transcription initiation sites and to discriminate between full-length isoforms using alternative transcription initiation sites and 5′-partial transcripts, for which a full-length CDS and the encoded protein cannot be reliably predicted.
The computational pipeline implemented for supplementing the ASPicDB protein sequences with functional and structural annotations is represented in Figure 1 and integrates several state-of-the-art tools for similarity search and for machine-learning based prediction of protein features starting from residue sequence.
For each one of the 256939 protein variants coming from 17191 human genes, a first layer of annotation consists in the retrieval of similar sequences from the two major repositories containing well-characterized proteins, namely: (i) the UniProtKB/SwissProt database (26) (rel. 2010_07, June 2010), that contains 547011 protein sequences with curated annotations, including 517802 principal entries and 29209 splicing variants (UniProt Consortium, 2010); (ii) the Protein Data Bank (rel. July 2010), that contains resolved three-dimensional structures for 50171 different protein sequences (29).
Similarity searches were performed with BLAST (30) setting the E-value threshold to 10⁻³.
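As an illustration of the E-value filtering step, the following minimal Python sketch keeps only the hits at or below the 10⁻³ threshold stated above. It assumes the searches were exported in the standard 12-column tab-separated BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name `significant_hits` and all sequence identifiers are hypothetical and are not part of the ASPicDB pipeline.

```python
import csv
import io

EVALUE_THRESHOLD = 1e-3  # threshold stated in the text


def significant_hits(tabular_text, threshold=EVALUE_THRESHOLD):
    """Group hits with E-value <= threshold by query sequence.

    Returns {query_id: [(subject_id, percent_identity, evalue), ...]}.
    Assumes the 12-column tab-separated BLAST tabular layout.
    """
    hits = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity, evalue = float(row[2]), float(row[10])
        if evalue <= threshold:
            hits.setdefault(query, []).append((subject, identity, evalue))
    return hits


# Made-up example rows; identifiers and scores are illustrative only.
EXAMPLE = "\n".join([
    "\t".join(["variant_1", "subject_A", "100.0", "250", "0", "0",
               "1", "250", "1", "250", "1e-120", "480"]),
    "\t".join(["variant_1", "subject_B", "31.2", "80", "55", "2",
               "10", "89", "5", "84", "0.5", "28.1"]),
    "\t".join(["variant_2", "subject_C", "45.0", "120", "66", "1",
               "3", "122", "7", "126", "2e-10", "61.0"]),
])

if __name__ == "__main__":
    for query, query_hits in significant_hits(EXAMPLE).items():
        print(query, query_hits)
```

In this toy input the hit to `subject_B` (E-value 0.5) is discarded, while the other two hits are retained.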
A second layer of annotation is obtained by mapping the structural and functional domains collected in the PFAM-A database (rel. 24.0, October 2009) that contains curated multiple sequence alignments based on hidden Markov models (HMM) for 8691 families, 2985 domains, 162 repeats and 74 motifs (25). The PFAM models were mapped on the ASPicDB protein sequences by means of the pfam_scan.pl program (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/), based on HMMER3.0 (31).
The third layer of annotation results from the integration of several predictors based on machine learning tools, such as neural networks, hidden Markov models, support vector machines and conditional random fields. Since most of the methods take advantage of the evolutionary information encoded in sequence profiles, we compiled the profiles from the similar sequences retrieved with two PSI-BLAST iterations (setting the E-value threshold to 10⁻³) from the UniRef90 data set consisting of 6955504 sequences (July 2010). The first predicted features are the presence of N-terminal signal peptides and of C-terminal GPI-anchor propeptides, with SPEPlip (32) and PredGPI (33), respectively. Both methods are among the best available predictors, with accuracies as high as 95 and 88%, respectively. When present, the signal peptide and the propeptide are cleaved from the protein sequence. The presence of coiled-coil domains is predicted with CCHMM-PROF, which is able to locate coiled-coil segments in protein sequences with 80% accuracy (34). α-Helical transmembrane domains are then predicted with ENSEMBLE (35), which discriminates transmembrane from globular proteins with false positive and false negative rates both equal to 3%. The same tool is adopted for predicting the number and the position of transmembrane segments along the sequence, with an accuracy of 90% on a per-protein basis. The subcellular localization of globular proteins is predicted with BaCelLo (36), which discriminates four localizations in animals (secretory pathway, cytoplasm, nucleus and mitochondrion) with 74% accuracy.
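The order of the prediction steps described above can be sketched in Python as follows. This is only a schematic outline: the predictors are passed in as plain callables standing in for SPEPlip, PredGPI, CCHMM-PROF, ENSEMBLE and BaCelLo, whose real interfaces are not described here. The `Annotation` class, the `annotate` function, the return conventions of the callables, and the choice of feeding the cleaved (mature) sequence to every downstream predictor are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    signal_peptide_end: Optional[int] = None    # residues cleaved at the N-terminus
    gpi_propeptide_start: Optional[int] = None  # 0-based start of the C-terminal propeptide
    mature_sequence: str = ""
    coiled_coils: list = field(default_factory=list)
    protein_type: str = ""                      # "transmembrane" or "globular"
    tm_segments: list = field(default_factory=list)
    localization: Optional[str] = None


def annotate(seq, predict_signal, predict_gpi, predict_coiled_coil,
             predict_tm, predict_localization):
    """Run hypothetical predictor callables in the order given in the text."""
    ann = Annotation()

    # Step 1: N-terminal signal peptide and C-terminal GPI-anchor propeptide.
    ann.signal_peptide_end = predict_signal(seq)   # int or None (assumed convention)
    ann.gpi_propeptide_start = predict_gpi(seq)    # int or None (assumed convention)

    # Step 2: cleave both when present.
    start = ann.signal_peptide_end or 0
    end = len(seq) if ann.gpi_propeptide_start is None else ann.gpi_propeptide_start
    ann.mature_sequence = seq[start:end]

    # Step 3: coiled-coil segments (input sequence is an assumption).
    ann.coiled_coils = predict_coiled_coil(ann.mature_sequence)

    # Step 4: transmembrane versus globular; an empty segment list is
    # treated here as "globular", which simplifies the real classifier.
    ann.tm_segments = predict_tm(ann.mature_sequence)
    if ann.tm_segments:
        ann.protein_type = "transmembrane"
    else:
        ann.protein_type = "globular"
        # Step 5: localization is predicted for globular proteins only.
        ann.localization = predict_localization(ann.mature_sequence)
    return ann


if __name__ == "__main__":
    toy = "MMMMM" + "A" * 20 + "GGGGG"  # made-up sequence
    result = annotate(
        toy,
        predict_signal=lambda s: 5,
        predict_gpi=lambda s: 25,
        predict_coiled_coil=lambda s: [],
        predict_tm=lambda s: [],
        predict_localization=lambda s: "nucleus",
    )
    print(result.protein_type, result.localization, len(result.mature_sequence))
```

Passing the predictors as arguments keeps the control flow testable with stub functions, independently of how the actual tools are installed or invoked.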
Table 1 reports some statistics on the data contained in the current version of ASPicDB (version 2.0, August 2010) which refers only to human multi-exon genes annotated in NCBI Entrez Gene (37) with at least one RefSeq transcript (38) and the relevant Unigene cluster (39) collecting all available gene-specific cDNA and EST sequences.
In the current version of ASPicDB some more features are available, including the annotation of the CAGE tags (28), which define true transcription initiation sites, and a comprehensive protein annotation. A total of 12789394 CAGE tags have been mapped, thus supporting constitutive or alternative transcription start sites. A ‘unique identifier’ (16) has been associated with each transcript variant to allow unambiguous comparison with alternative transcripts collected in other databases.
All alternative proteins collected in ASPicDB have been compared with UniprotKB/SwissProt (26) and PDB (29) databases. The results of similarity searches are reported in Table 2. Only 17% of the ASPicDB protein sequences are identical to proteins deposited in UniProtKB/SwissProt database. However, 94% of the sequences share significant similarity with proteins annotated in the same database, prompting the possibility of a reliable annotation transfer. Moreover, 54% of ASPicDB sequences are similar to proteins deposited in the PDB suggesting that their structures can be modeled, at least partially.
A considerable number of PFAM models map on the ASPicDB sequences (Table 2). Overall, 71% of sequences match with at least one model. This result is in agreement with the reported sequence coverage on the human proteome of the current PFAM release, which is equal to 72.5% (25). It is worth noticing that, although all the models map with an E-value < 10⁻⁵, only 20% of the matches are complete (that is, involve the whole model). A note of caution is necessary when inferring features from partial matches, and the actual extent of the match has to be evaluated for each instance.
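One way to evaluate the extent of a match is to compare the matched region of the model with the full model length. The Python sketch below assumes that each match record carries the model start, model end and model length. The `PfamMatch` class, the function names and the criterion used for a "complete" match (the aligned region spans the model from its first to its last position) are illustrative assumptions, not the definition used to compute the 20% figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PfamMatch:
    model: str       # hypothetical model name
    hmm_start: int   # first matched model position (1-based, inclusive)
    hmm_end: int     # last matched model position (inclusive)
    hmm_length: int  # total length of the model
    evalue: float


def model_coverage(match):
    """Fraction of the model covered by the match."""
    return (match.hmm_end - match.hmm_start + 1) / match.hmm_length


def is_complete(match):
    """Assumed criterion: the match spans the whole model."""
    return match.hmm_start == 1 and match.hmm_end == match.hmm_length


def fraction_complete(matches):
    """Share of matches that are complete under the assumed criterion."""
    matches = list(matches)
    if not matches:
        return 0.0
    return sum(is_complete(m) for m in matches) / len(matches)


if __name__ == "__main__":
    # Made-up matches for illustration.
    full = PfamMatch("model_A", 1, 120, 120, 1e-30)
    partial = PfamMatch("model_B", 11, 60, 100, 1e-8)
    print(model_coverage(full), model_coverage(partial))
    print(fraction_complete([full, partial]))
```

A coverage threshold below 1.0 could be used instead of the strict criterion when a few unaligned positions at the model ends are considered acceptable.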
Table 3 summarizes the results of the annotation process performed with machine learning based predictors. Two percent of proteins were not predicted since they are shorter than 50 residues, 16% of proteins are predicted as transmembrane and 82% are predicted as globular. The globular proteins are further divided into secreted (12% of all proteins), cytoplasmic (35%), nuclear (27%) and mitochondrial (8%). Signal peptides and GPI-anchor propeptides are predicted in 12 and 0.7% of the sequences, respectively. Coiled-coil domains are predicted in 1.3% of the proteins. At the gene level, 30 and 92% of genes encode for transmembrane and globular proteins, respectively. Since the sum exceeds 100%, it follows that 22% of the genes encode for both globular and transmembrane variants. The same consideration holds for the other annotations as reported in Table 4. The fraction of genes predicted to encode for proteins with different subcellular localizations reaches 56%. This is partially explained by the fact that BaCelLo scores with an accuracy equal to 74%, which is the lowest among the methods included in the pipeline. Indeed, the discrimination between the ‘cytoplasmic’ and the ‘nuclear’ classes is still a difficult task for all subcellular localization predictors (40). When the two classes are merged together, the BaCelLo accuracy increases up to 91%, but the rate of genes encoding for proteins with different localizations is still as high as 44%, suggesting that localization diversity is inherent in the ASPicDB protein variants. The structure of PFAM annotations is also highly variable: 38% of genes encode for variants matching with a different number and/or type of PFAM models. Altogether, the results listed in Table 4 suggest that alternative transcripts can encode for proteins endowed with different structural and functional features.
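The 22% figure follows from the inclusion-exclusion principle. In the notation introduced here only for illustration, N is the number of genes, T the set of genes with at least one variant predicted as transmembrane, and G the set of genes with at least one variant predicted as globular. Assuming that every gene has at least one variant falling in one of the two classes:

```latex
\begin{align*}
|T \cap G| &= |T| + |G| - |T \cup G| \\
\frac{|T \cap G|}{N} &= 0.30 + 0.92 - 1.00 = 0.22
  \qquad \text{if } |T \cup G| = N
\end{align*}
```

If some genes had only variants too short to be predicted, the union would be smaller than N and the overlap would be correspondingly larger, so under that reading 22% is a lower bound.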
ASPicDB provides a unique resource reporting the annotation of alternative splicing variants at the protein level and an interface enabling the discovery of such differences.
ASPicDB can be accessed through simple or advanced query forms. The simple query form allows the user to obtain the splicing pattern of one or more genes selected according to several criteria (e.g. HGNC name, RefSeq or Unigene accession IDs, etc.). The advanced query form allows the user to search for (i) genes; (ii) transcripts; (iii) exons; (iv) splicing sites; and (v) proteins, fulfilling different criteria (e.g. exons in a given length range, etc.). Depending on the choice, separate query forms appear. The ‘gene’, ‘transcript’ and ‘splicing sites’ query forms have been described previously (15), whereas the ‘exon’ and ‘protein’ query forms are novel features of this version of ASPicDB. The exon query form allows the user to select exons in a given length range, belonging to a specific type (initial, internal or terminal), flanked by specific splicing sites or associated with one or more Affymetrix ExonArray probeset IDs.
The ‘protein’ query form allows the retrieval of transcripts encoding protein isoforms of a specific class (e.g. globular or transmembrane), subcellular localization (e.g. mitochondrion, nucleus, secretory, cytoplasm) or containing one or more features, including occurrence and number of PFAM or transmembrane domains, GPI-anchor propeptides and signal peptides. Finally, it is also possible to retrieve genes encoding for alternative proteins that show differences in the above-mentioned features.
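The last kind of query, retrieving genes whose protein variants differ in a given feature, can be mimicked offline on a downloaded annotation table. The Python sketch below assumes a list of per-variant records stored as dictionaries. The field names (`gene`, `protein_type`, `localization`, `pfam`), the function `genes_with_differing` and the example records are hypothetical and do not reflect the ASPicDB schema or interface.

```python
from collections import defaultdict


def genes_with_differing(records, feature):
    """Return the sorted gene IDs whose variants take more than one
    distinct value for the given feature."""
    values = defaultdict(set)
    for rec in records:
        values[rec["gene"]].add(rec[feature])
    return sorted(gene for gene, seen in values.items() if len(seen) > 1)


# Made-up variant records for illustration; PFAM annotations are stored
# as tuples so that they can be compared and placed in a set.
RECORDS = [
    {"gene": "gene_1", "variant": "v1", "protein_type": "globular",
     "localization": "nucleus", "pfam": ("model_A",)},
    {"gene": "gene_1", "variant": "v2", "protein_type": "transmembrane",
     "localization": None, "pfam": ("model_A",)},
    {"gene": "gene_2", "variant": "v1", "protein_type": "globular",
     "localization": "cytoplasm", "pfam": ("model_B",)},
    {"gene": "gene_2", "variant": "v2", "protein_type": "globular",
     "localization": "cytoplasm", "pfam": ("model_B", "model_C")},
]

if __name__ == "__main__":
    for feature in ("protein_type", "localization", "pfam"):
        print(feature, genes_with_differing(RECORDS, feature))
```

In this toy data set, `gene_1` differs in protein type and localization, while `gene_2` differs only in its PFAM domain architecture.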
After a simple or advanced query has been submitted, the output for each selected gene is shown, organized in eight panels.
After a query at the gene, transcript, exon, protein or splice site level has been completed, the user can also download specific sets of sequences in FASTA format for further analyses, e.g. genes, transcripts, exons, proteins, 5′-UTRs, coding sequences, 3′-UTRs, introns as well as sequence regions surrounding splice site boundaries.
ASPicDB is an ongoing project and we plan to further develop it in the next releases. In particular we plan to add specific annotations on splicing regulatory elements and their interacting RNA-binding proteins located both in exonic and intronic regions. We also plan to update alternative splicing prediction by using the huge amount of RNA-Seq data which are now being produced by next generation sequencing, possibly annotating splicing events as constitutive or tissue-specific. Furthermore, literature-screened splicing patterns related to diseases will be annotated as they represent potential molecular biomarkers and possible targets for therapy. Finally, the inclusion in the database of data related to other organisms will certainly favor a better understanding of the alternative splicing process through comparative analyses.
Ministero dell’Istruzione, dell’Università e della Ricerca: Fondo Italiano Ricerca di Base: ‘Laboratorio Internazionale di Bioinformatica’ (LIBI); Laboratorio di Bioinformatica per la Biodiversità Molecolare (MBLAB) and Telethon (project GGP01658). Funding for open access charge: Ministero dell’Università e della Ricerca: Fondo Italiano Ricerca di Base: ‘Laboratorio Internazionale di Bioinformatica’ (LIBI).
Conflict of interest statement. None declared.