Alternative messenger RNA (mRNA) splicing (Black, 2000; Gilbert, 1987) allows for the generation of a diverse range of mature RNAs. Studies have suggested that at least 60% of human genes can produce differently spliced mRNAs (Harrow et al., 2006; Scherer et al., 2006) and that alternative splicing has the potential to more than double the number of different proteins in the cell.
While alternative splicing can produce a range of differently spliced protein isoforms, there is conflicting evidence about their biological relevance. It has been suggested that the purpose of alternative splicing is to expand functional complexity and that the multiple variants are likely to encode functional proteins (Graveley, 2001; Hui and Bindereif, 2005). Proteins with new functions are generated at different stages of development and in different tissues by a sophisticated regulation of the splicing process (Florea, 2006; Smith and Valcarcel, 2000). Many splice variants are hypothesized to function as dominant negative isoforms that regulate the pathways in which the main functional form is involved (Arinobu et al., 1999; Stojic et al., 2007).
However, recent work (Rodriguez-Trelles et al., 2005) has suggested that gene expression is not as tightly related to protein function as had been thought. In addition, we have shown that, despite widespread evidence for the expression of alternative transcripts, there was little to indicate that this translated into an increase in the repertoire of protein functions (Tress et al., 2007). Many of the proteins that result from alternative exon use would almost certainly have substantially rearranged structures with respect to their constitutively spliced counterparts (Tress et al., 2007; Talavera et al., 2007), and these changes are likely to have profound effects on the location and function of these alternative gene products. Predicting the effect of these changes on the cell is complicated, not least because heavy selection pressure would not normally tolerate such large transformations (Xing and Lee, 2006).
At present the SwissProt database (Bairoch et al., 2004), part of UniProtKB (The UniProt Consortium, 2007), provides the best organization of the complicated web of alternative variants. Although the SwissProt database is relatively small, it is the de facto gold standard of protein databases because its entries are manually curated.
As part of the manual curation of the proteins in the SwissProt database, all UniProt variants from the same gene are merged into a single entry. One crucial step of the merging process is the selection of one of the merged sequences as the ‘display’ sequence for the entry. The display sequence is selected after careful inspection, and the remaining merged sequences are tagged as alternative splice variants of the corresponding display sequence. The longest variant is often chosen as the display sequence, not necessarily because it is the principal functional isoform but because this allows annotators to map more features to the sequence (Amos Bairoch, personal communication).
Each SwissProt gene entry brings together experimental and predicted information, including domain definitions, functional annotation, cellular location, post-translational modifications and disease association. The entries are extensively cross-referenced to a range of external sources. All this information is associated with a single display isoform.
SwissProt display sequences are ideally suited for the goals of annotators, but there are many purposes for which it is important to know which of a gene’s transcripts codes for the principal functional isoform.
Although many genes have been studied in depth, there are still a considerable number of genes with little experimental evidence. For these genes it is important to know which variant is likely to have the principal functional activity in order to design experiments to determine the structure and function of a protein. Labelling one of the splice variants as the principal isoform will allow research groups to concentrate their efforts on the main functional isoform.
In addition, identifying a principal splice isoform for a gene would allow bioinformatics groups to make more reliable predictions of function and structure. In particular, automatic prediction pipelines need reliable input data. One good example of this is the structure prediction for the protein PTPA_HUMAN by ModBase (Pieper et al., 2006). ModBase makes automatic predictions of structure using the SwissProt display sequence. The structure of SwissProt alternative isoform 2, which is missing the fourth protein coding exon of the display sequence, has already been solved. In order to model the structure of the display sequence with the inserted exon, ModBase is forced to squeeze the extra 35 residues into an extended, non-protein-like loop.
Defining a principal functional isoform for each gene presents two problems. The first is that considerable experimental work would be required for each gene and the second is that it may be difficult to define a principal isoform for those genes where two (or more) variants might be regarded as equally important.
In order to define the principal coding variant for each gene, we had to make two assumptions. The first was that each gene has a single variant that gives rise to a principal functional isoform. The remaining annotated variants would then be alternatively spliced. This is a general assumption and comparative studies usually suggest that one isoform has the principal function or is expressed in most tissues or in most stages of development. While this is likely to be true for most genes, it will not be true for all genes.
The second assumption is that this principal variant is evolutionarily conserved between species. Alternative exons tend to be recent evolutionary developments (Alekseyenko et al., 2007), so this is a reasonable assumption. Again, this may not be true for all genes: the principal variant may have evolved (possibly through alternative splicing) towards a function distinct from those performed by the orthologous gene products in neighbouring species.
For the purposes of this study, we have defined the principal functional isoform as the isoform that performs an orthologous functional role across a wide range of related organisms.
Most of the methods used here are based on conservation between related proteins or transcripts. The success of these conservation-based methods depends on the evolutionary diversity of the species studied and on alternative exons evolving at measurably different rates. In those cases where there was no clear difference in the evolutionary rates of competing alternative exons, it was not possible to determine a principal isoform.
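The decision rule implied by this paragraph can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function name, the per-variant conservation score (imagined here as a value between 0 and 1) and the `min_margin` threshold are all hypothetical, chosen only to show how a principal variant might be assigned when one variant is clearly more conserved, and left undetermined otherwise.

```python
from typing import Dict, Optional


def pick_principal_variant(
    conservation: Dict[str, float],
    min_margin: float = 0.1,
) -> Optional[str]:
    """Return the most conserved variant, or None when the call is ambiguous.

    `conservation` maps a variant identifier to a hypothetical conservation
    score (higher means more conserved across related species).
    `min_margin` is an assumed threshold, not a value from the study.
    """
    if not conservation:
        return None

    # Rank variants from most to least conserved.
    ranked = sorted(conservation.items(), key=lambda item: item[1], reverse=True)

    # A gene with a single annotated variant has nothing to compete with.
    if len(ranked) == 1:
        return ranked[0][0]

    best, runner_up = ranked[0], ranked[1]

    # No clear difference between the top two variants: leave undetermined.
    if best[1] - runner_up[1] < min_margin:
        return None

    return best[0]


if __name__ == "__main__":
    print(pick_principal_variant({"variant_1": 0.9, "variant_2": 0.4}))
    print(pick_principal_variant({"variant_1": 0.62, "variant_2": 0.60}))
```

In this sketch a gene whose two best variants score within `min_margin` of each other returns `None`, mirroring the cases in which no principal isoform could be determined.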
With the pipeline we were able to define a principal variant for 83% of the genes with multiple variants. Comparisons with SwissProt showed that the definitions from the pipeline concurred with the display sequences 75% of the time. In the majority of the cases where there was disagreement between our method and the SwissProt display sequences, the experimental and transcript evidence suggested that the definitions based on conservation point to the principal functional isoforms.