Figure S1: Distribution of GD Family Sizes
Distribution of human sequences across GD families as determined by different seq.id. cutoffs (40%, 60%, 80%, 90%). GD families of size 1 denote singletons, i.e., genes without paralogues (GD−).
(54 KB PPT)
Figure S2: The Distribution of Molecular Function and Biological Process
We tested for functional biases across proteins with AS and/or GD using the GO annotation available for humans from the GO database [70]. For a more detailed analysis of function characteristics, see Table S3.
Human genes were annotated with respect to biological process (A) and molecular function (B) using GO annotation [51]. GD families were determined according to an 80% seq.id. cutoff; AS family information was taken from the AltSplice database [43]. All sequences were assigned to one of the four sets, and the distribution of biological processes (A) and molecular functions (B) is shown for the four sets separately: AS−/GD−, no duplication or AS known; AS−/GD+, duplicates, but no AS known; AS+/GD−, no duplication, but AS known; and AS+/GD+, both duplicates and alternative splice variants known. There are no obvious biases in the function composition for any of the four constellations of AS/GD.
(749 KB PPT)
Figure S3: Chromosomal Location of the Duplicated Genes
We show the fraction of duplicated genes per gene family that have different chromosomal locations, using a 40% seq.id. cutoff (dark red). (Data for GD80 families are not shown because of the small amount of data.) In all except one group of families, on average >55% of the genes within a family have different chromosomal locations. This indicates different regulation between duplicates [71] and therefore no interchangeability between AS and GD, given that transcription and mRNA splicing are tightly coupled [72].
(55 KB PPT)
Figure S4: Analysis for Mouse (40% seq.id. Cutoff)
The four figures reproduce, for mouse, the analysis shown in A, , A, and 7, respectively. The results are qualitatively identical to those discussed for human in the main text, and support the idea that in general AS and GD are not interchangeable at the molecular level. Because of the smaller amount of data, the analysis at the 80% cutoff did not produce statistically meaningful results.
(A) Substitutions in AS have different effects on global versus local seq.id. Light and dark green correspond to global and local seq.id. for AS substitutions, respectively. Global seq.id. is obtained after aligning two isoforms for the same gene, for which the AS event involved a substitution. Local identity applies only to the substituted stretches. Dark red corresponds to the seq.id. distribution for GD families at 40%, after sequence alignment between paralogues. The global seq.id. between splice isoforms is very high while the local seq.id. in alternative splicing variants is very low. Both seq.id. distributions for AS contrast with those of GD families.
(B) Maximal mismatch distance between nonconservative substitutions is much smaller in AS than in GD. The maximal mismatch distance is the number of residues between the two most distant nonconservative substitutions, normalized by whole sequence length. Nonconservative mismatches are those with a negative value in the Blosum62 matrix; they were chosen for their stronger impact on protein structure and function. The plot depicts AS data in green, and GD data for families at 40% seq.id. in dark red. Substitutions in alternative splice variants are much more localized than those in gene duplicates.
(C) Size distribution for indels. The AS distribution is shown in green. Indels for GD are shown for the whole-gene model (dark red). Clear differences are found between both distributions.
(D) Frequency distribution of the amount of overlap between AS and GD indels, taking as reference the sequence of the AS indel (see Materials and Methods). Dark blue bars correspond to the case when indels of any size are considered. Light blue bars correspond to the case when only subdomain indels (≤30 aa) are considered.
(1.1 MB PPT)
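The maximal mismatch distance described in panel (B) of Figure S4 can be illustrated with a minimal Python sketch. It is not the code used for the analysis. It assumes the two sequences are already aligned, measures the distance as the difference between alignment column indices, and normalizes by the alignment length (the caption says "whole sequence length", so this choice is an assumption). The function names and the small score table are hypothetical; in practice the `score` callable would look up the full Blosum62 matrix.

```python
from typing import Callable, Optional


def max_mismatch_distance(aligned_a: str, aligned_b: str,
                          score: Callable[[str, str], float],
                          gap: str = "-") -> Optional[float]:
    """Normalized distance between the two most distant nonconservative
    substitutions in a pairwise alignment (None if fewer than two exist)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Columns where both residues are present, differ, and score below zero.
    positions = [
        i for i, (x, y) in enumerate(zip(aligned_a, aligned_b))
        if x != gap and y != gap and x != y and score(x, y) < 0
    ]
    if len(positions) < 2:
        return None
    # Assumption: normalize by alignment length.
    return (positions[-1] - positions[0]) / len(aligned_a)


# Toy score table for illustration only; replace with a full Blosum62 lookup.
_TOY_SCORES = {("A", "W"): -3, ("L", "D"): -4, ("D", "E"): 2}


def toy_score(x: str, y: str) -> float:
    """Symmetric lookup in the toy table; unknown pairs are treated as neutral."""
    return _TOY_SCORES.get((x, y), _TOY_SCORES.get((y, x), 0))


if __name__ == "__main__":
    print(max_mismatch_distance("ADKLMNPQRL", "WDKLMNPQRD", toy_score))
```

In this toy alignment the nonconservative substitutions sit in the first and last columns, so the distance spans nearly the whole alignment; a substituted AS stretch would typically give a much smaller value.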
Figure S5: Comparison between AS, Whole-Gene, and Domain-Based GD Families
To provide another definition of gene families, we estimated GD families based on domain families. We used domain annotations from the Pfam database [54] that mapped to one or several of the SwissProt [44] proteins in the AS dataset. Nonhuman sequences were removed from the alignment.
(A) Global seq.id. distribution. The distribution for human AS sequences is shown in green; that for GD whole-gene families (40% level) is shown in dark red; that for GD families defined by Pfam domains is shown in light red. We observe that the range of seq.id. for the latter is much lower than for AS and GD whole-gene families. At the local level (results not shown) the range of seq.id. for the Pfam model of GD is lower than that observed for AS. However, for the former the amino acid replacements are spread over the whole sequence, contrary to what we observe for AS.
(B) The indel size distribution of human AS sequences is shown in green. Indel sizes for GD whole-gene families (seq.id. cutoff of 40%) are shown in dark red; indel sizes for GD families defined by Pfam domains are shown in light red. In the former, whole sequences were compared within each family to obtain the indel size distribution. In the domain-based GD families, indels were obtained from the multiple sequence alignments of the Pfam databank [54]. Indels for both GD models show behaviour similar to that described by Benner and colleagues [69].
(C,D) Size distributions for external and internal indels, respectively, with the same colour code as in (B). These distributions indicate that indels from Pfam domains and GD families show similar trends when compared with AS indels. Overall, our results indicate that GD and AS are in general different in their sequence/structure changes, independently of the model representing GD.
(1.1 MB PPT)
Figure S6: Effect of Filtering Out Putative NMD Targets from the AS Data
No significant differences are found between the original results and those obtained after eliminating from the AS dataset all the isoforms that may be targets of the NMD machinery [61].
(A) Overall versus local seq.id. Original AS global and local seq.id. are shown in light and dark green, respectively. Overall and local seq.id. for NMD-filtered AS are shown in orange and yellow, respectively.
(B) Maximal mismatch distance between nonconservative substitutions. Original AS data are shown in dark green, NMD-filtered data are shown in orange.
(C) Indel size. Original AS data are shown in dark green, NMD-filtered data are shown in orange.
(D) Overlap between AS and GD indels. Original data are shown in violet, dark blue, and light blue, while the corresponding NMD-filtered data are shown in yellow, orange, and light green.
(1.3 MB PPT)
Figure S7: Excluding Potential Database Biases
To exclude biases in our results introduced by the use of the SwissProt database [44] (dark green), we compared some of the findings with those obtained from using the ASAP database [62] (dark violet). The data are for human. Here we show the indel size distribution obtained using data from both databases. No obvious differences are found between the SwissProt and ASAP distributions that may affect the validity of our results.
(35 KB PPT)
Figure S8: Number of Exons per Gene
To obtain the number of exons per gene, we followed the procedure employed by Saxonov and colleagues to build the EID database [74]. For each sequence, we obtained the exon information from the corresponding NCBI GenBank [75] entry, looking at the CDS join feature.
Three distributions show the number of exons per gene, corresponding to the following cases: singleton genes with AS (AS+/GD−, dark green); genes that are both duplicated and have AS (AS+/GD+, light green), and duplicated genes with no AS (AS−/GD+, dark blue). The results are obtained for gene families at both the 80% level (A) and the 40% level (B). In both cases we see that there is a trend for AS−/GD+ to have a smaller number of exons than AS+/GD+ and AS+/GD− genes.
(525 KB PPT)
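As an illustration of how exon counts can be read from a CDS join feature (Figure S8), the following Python sketch counts the coordinate segments in a GenBank-style location string. It is a simplified stand-in, not the EID procedure itself. The function name is hypothetical, and the sketch assumes that the location contains no remote accession references and that each segment corresponds to one coding exon.

```python
import re

# One segment is 'start..end' (with optional '<' or '>' for partial ends)
# or a single position.
_SEGMENT = re.compile(r"[<>]?\d+(?:\.\.[<>]?\d+)?")


def count_cds_segments(cds_location: str) -> int:
    """Count the coordinate segments in a GenBank CDS location string.

    Assumption: no remote references such as 'J00194.1:100..202' occur,
    because their accession digits would be miscounted as segments.
    """
    return len(_SEGMENT.findall(cds_location))


if __name__ == "__main__":
    for location in ("join(12..78,134..202,300..410)",
                     "complement(join(1..5,10..20))",
                     "1..206"):
        print(location, count_cds_segments(location))
```

A location without `join(...)` yields a count of 1, which corresponds to a single-exon coding sequence.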
Figure S9: Computation of the Local Sequence Identity Between Gene Duplicates
We describe the two procedures followed to compute the local seq.id. between duplicates (see Materials and Methods).
(A) The first procedure is based on a moving window whose size, N, equals that of the AS event. The window is moved along the aligned sequences of both duplicates, and at each position the seq.id. between them is computed (within the limits of the window).
(B) In the second procedure, we first aligned the sequence of the protein with known splicing to one of its duplicates. The former was always the sequence of the SwissProt [44] reference isoform. Then, we mapped location and length of the AS substituted stretch to the sequence of the duplicate and computed seq.id. between both sequence stretches.
(462 KB PPT)
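The two procedures of Figure S9 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the sequences are taken as already aligned, seq.id. is defined as identical non-gap columns divided by the number of columns in the stretch (so gaps count as mismatches), and in procedure (B) the AS stretch is given by a zero-based start and a length in ungapped reference coordinates. All function names are hypothetical.

```python
from typing import List


def seq_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of columns with identical, non-gap residues."""
    if len(a) != len(b):
        raise ValueError("stretches must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return matches / len(a)


def window_identities(aligned_a: str, aligned_b: str, n: int) -> List[float]:
    """Procedure (A): seq.id. in a window of size n at every position."""
    if n <= 0:
        raise ValueError("window size must be positive")
    return [seq_identity(aligned_a[i:i + n], aligned_b[i:i + n])
            for i in range(len(aligned_a) - n + 1)]


def mapped_stretch_identity(aligned_ref: str, aligned_dup: str,
                            start: int, length: int, gap: str = "-") -> float:
    """Procedure (B): map the AS stretch of the reference isoform onto the
    alignment and compute seq.id. against the duplicate over those columns."""
    columns = []
    pos = 0  # position in the ungapped reference sequence
    for col, residue in enumerate(aligned_ref):
        if residue == gap:
            continue
        if start <= pos < start + length:
            columns.append(col)
        pos += 1
    if not columns:
        raise ValueError("stretch lies outside the reference sequence")
    lo, hi = columns[0], columns[-1] + 1
    return seq_identity(aligned_ref[lo:hi], aligned_dup[lo:hi])


if __name__ == "__main__":
    print(window_identities("ACDEFGHIK", "ACDQFGHLK", 3))
    print(mapped_stretch_identity("AC-DEF", "ACGDQF", start=2, length=3))
```

Procedure (A) returns one value per window position, giving a distribution of local seq.id. along the pair of duplicates, whereas procedure (B) returns a single value for the stretch that corresponds to the AS event.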
Figure S10: Overview of the Expression Data Analysis
(A) Illustration of the basic comparisons of coexpression, whose results are shown in C and D. In C, we compare coexpression amongst gene duplicates (GD coexpression) of GD families with and without alternative splice variants. Data on expression of gene duplicates come from two “conventional” datasets [53] (GDS596 and GDS1096) and the data from Johnson et al. (GDS829–834; (B)) [17]. Data on the existence of alternative splice variants are from the AltSplice database [43]. In D, we compare coexpression amongst exon junctions [17], approximating the extent of AS (AS coexpression) of GD− families and singleton genes.
(B) In the datasets published by Johnson et al. [17], each of the 3,840 human genes is represented by a matrix of absolute expression values of all exon junctions across 44 different tissues. We estimate AS coexpression by analyzing the variation of expression values in each gene's matrix. The average expression value of all exon junctions across the different tissues forms a vector representing the gene's overall expression pattern.
For each gene family, we can produce a second matrix of gene expression patterns of the duplicates across different tissues. We estimate GD coexpression by analyzing the variation of expression values in each gene family's matrix. GD coexpression was analyzed for the dataset by Johnson et al. [17] and two conventional [53] gene expression datasets (see Table S4).
We tested the following measures for analysis of coexpression. (i) The average pairwise PC. We calculated the average PC between each pair of row vectors in the AS or GD matrix. PC close to 0 indicates no correlation in expression between exon junctions (representing AS) or gene duplicates, respectively. PC close to 1 indicates strong correlation between the row vectors and is indicative of little AS or little differential expression amongst gene duplicates. (ii) The number of unique binarized row vectors per matrix. To normalize for the number of exon junctions per gene or number of gene duplicates in a family, we divided the number of unique row vectors by the total number of row vectors per matrix. We also tested relative entropy (RE; mutual information) as a measure of coexpression. We calculated RE for each AS or GD matrix as the sum of p_obs × log2(p_obs/p_exp) calculated for each column, where p_obs is the observed frequency of the exon junctions or gene duplicates in one column and p_exp is the expected frequency of all exon junctions or gene duplicates across all experiments. However, relative entropy did not prove to be a useful measure of matrix variation in our case, as it did not capture differential expression patterns (row vectors) but only general entropy in the matrix.
While matrices in the figure show binary expression data, calculations were done on both raw and binary data. All results are similar irrespective of the cutoff for binarization (600 or 150). They are also similar irrespective of the cutoff for gene family definitions (40%, 60%, or 80% seq.id.) or of the underlying AS+ datasets employed (SwissProt or AltSplice).
(746 KB PPT)
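Measures (i) and (ii) described for Figure S10 can be sketched in Python as below. This is a minimal illustration rather than the code used for the analysis. It assumes that PC denotes the Pearson correlation coefficient, that each matrix is a list of rows (one per exon junction or gene duplicate) with one expression value per tissue, that constant rows (for which the correlation is undefined) are skipped, and that a value at or above the cutoff is binarized to 1. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import List, Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation of two equal-length vectors (None if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant vector has no defined correlation
    return cov / sqrt(var_x * var_y)


def average_pairwise_pc(matrix: List[Sequence[float]]) -> Optional[float]:
    """Measure (i): mean correlation over all pairs of row vectors."""
    values = []
    for row_a, row_b in combinations(matrix, 2):
        r = pearson(row_a, row_b)
        if r is not None:
            values.append(r)
    return sum(values) / len(values) if values else None


def unique_binarized_fraction(matrix: List[Sequence[float]],
                              cutoff: float) -> float:
    """Measure (ii): unique binarized rows divided by the number of rows."""
    # Assumption: values >= cutoff count as expressed (1), others as 0.
    binarized = {tuple(int(v >= cutoff) for v in row) for row in matrix}
    return len(binarized) / len(matrix)


if __name__ == "__main__":
    toy = [[100, 700, 800], [200, 650, 900], [900, 100, 50]]  # made-up values
    print(average_pairwise_pc(toy))
    print(unique_binarized_fraction(toy, cutoff=600))
```

With the toy matrix, the first two rows binarize to the same pattern at a cutoff of 600, so two of the three row vectors are unique; rerunning with a cutoff of 150 shows how the choice of threshold can change the count.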
Figure S11: Anticorrelation between Family Size and Percentage of Genes with AS
An anticorrelation between AS and GD [12] can also be produced using SwissProt [44].
(66 KB PPT)
Table S1: Genomic Data Overview
The table provides an overview of the genomic data from the Ensembl database (human release 37.35j, mouse release 37.34e) [1] and the AS data from the AltSplice database (release 2.0 for human and mouse AS) [2], SwissProt [3], and from the Ensembl annotations.
(54 KB DOC)
Table S2: The Distribution of Single-Exon Genes across Human Sequences
Retrotransposition produces duplicates that consist of only one exon. To test for possible bias in families of gene duplicates (GD+) stemming from retrotransposition, we examined the distribution of single-exon genes across AS and GD sets using the SEGE database [1], similar to an approach described by Kopelman and colleagues [2]. The procedure followed was: (i) all human genes were clustered according to 80% seq.id.; (ii) all genes were labelled according to their known AS; and (iii) the number of single-exon genes [1] amongst singletons, gene families, and genes with/without AS was calculated.
(53 KB DOC)
Table S3: Function Analysis
The table lists a selection of functions as obtained from the DAVID Web server [1], for the four protein sets (AS+/GD+, AS+/GD−, AS−/GD+, AS−/GD−) derived from SwissProt (A) and AltSplice (B), using an 80% seq.id. threshold to estimate GD. A more general overview of GO functions and biological processes across the datasets is shown in Figure S2.
All function annotations are significantly different from the background (E-value < 10^−10). We removed redundant annotations and annotations that were too broad to be meaningful (e.g., “binding”). Duplication of particular gene families that are depleted in AS, such as ribosomal proteins or some receptors, has contributed to the inverse relationship between AS and GD, but cannot explain it completely.
(96 KB DOC)
Table S4: Analysis of Expression Data
(375 KB DOC)
Table S5: Overview of the Dataset Employed in the Protein Sequence/Structure Analysis
The table shows the number of genes with AS, and the number of multiple gene families, together with the respective number of sequences. Information on AS was taken from SwissProt [1]. The data are provided for human and mouse, for 40% and 80% seq.id. clusters.
(50 KB DOC)