Derivation of Neural Progenitors from Embryonic Stem Cells
NPs were independently derived from two hESC lines, and RNA extracted from the cell lines was processed and hybridized onto Affymetrix Human 1.0 ST exon arrays. Immunohistochemical and reverse-transcriptase polymerase chain reaction (RT-PCR) analyses demonstrated that the hESCs expressed pluripotent marker genes, and the derived NPs expressed multipotent and neurogenic markers similar to hCNS-SCns. Undifferentiated Cythera (Cyt-ES) and HUES6 (HUES6-ES) hESC lines were maintained in culture as previously described [12
]. Utilizing specific antibodies, we observed that undifferentiated Cyt-ES and HUES6-ES cells were positive for the pluripotent markers Oct4
, and Tra-1–80
(unpublished data). NPs were derived from the hESC cell lines using protocols optimized for each line (see Materials and Methods
). Greater than 90% of derived NP cells (Cyt-NP from Cyt-ES and HUES6-NP from HUES6-ES) were positive for Sox1
, an early neuroectodermal marker, and Nestin
(A), and negative for Oct4 (unpublished data). As a natural benchmark for the derived NPs, we utilized hCNS-SCns, which were previously isolated from fresh human fetal brain tissues using antibodies to cell-surface markers and fluorescence-activated cell sorting [12
]. The hCNS-SCns form neurospheres in culture which are greater than 90% Nestin
positive, and differentiate into both neurons and glial cells in vitro [12
]. Immunohistochemical analysis confirmed that hCNS-SCns were negative for Oct4 (unpublished data) and positive for Sox1
Molecular Characterization of Human Embryonic Stem Cell Lines and Neuronal Progenitors
Here, known molecular markers were subjected to RT-PCR measurements, which were compared to gene-level signal estimates generated from the exon array data. Total RNA was extracted, and labeled cDNA targets were generated from three independent preparations of each cell type, namely Cyt-ES, HUES6-ES, Cyt-NP, HUES6-NP, and hCNS-SCns. To facilitate downstream analyses, instead of utilizing the meta-gene sets available from the manufacturers, we generated our own gene models by clustering alignments of ESTs and mRNAs to annotated known genes from the University of California Santa Cruz (UCSC) Genome Browser Database. After hybridization, scanning, and extraction of signal estimates for each probeset on the exon arrays, gene-level estimates were computed based on our gene models using available normalization and signal estimation software from Affymetrix. For every gene, a t
-statistic and corresponding p
-value were computed representing the relative enrichment of the expression of the gene in hESC versus NP, such as in Cyt-ES versus Cyt-NP. After correcting for multiple hypothesis testing using the Benjamini-Hochberg method, a p
-value cutoff of 0.01 was used to identify enriched genes. Close inspection of all pairs of hESC-NP comparisons revealed a generally significant overlap from 31% to 85% of the smaller of two compared sets of enriched genes (see Figure S1
). Thus for the purpose of identifying overall pluripotent and neural lineage-specific genes, the collective set of NPs (Cyt-NP, HUES6-NP, and hCNS-SCns) was compared to the collective set of hESCs (Cyt-ES and HUES6-ES).
, which are important in maintaining the pluripotent state of embryonic stem cells (ESCs), were highly expressed in hESCs but were significantly lower in NPs (B). RT-PCR of Oct4
mRNA levels accurately reflected the signal estimates from the array (C). Interestingly, Nestin
was not significantly higher in NPs as compared to the hESC from the gene-level estimates (p
-value 0.065), which was further confirmed by RT-PCR (C). Notch
was recently identified to be important in promoting the neural lineage entry in mouse ESCs [57
] and was shown to regulate stem cell proliferation in somatic mouse and hESC [58
]. Gene-level signal estimates suggested that Notch
was significantly higher in hCNS-SCns relative to hESCs, but levels of Notch
were not significantly different in the derived NPs compared to hESCs. Delta/Notch-like EGF-related receptor (DNER)
, a neuron-specific transmembrane protein, was recently shown to bind to Notch
at cell–cell contacts and activates Notch
signaling in vitro [59
]. RT-PCR validation of DNER
confirmed array-derived signal estimates, indicating an enrichment of DNER
in NPs relative to hESCs (C). Finally, Sox1
, a HMG-box protein related to SRY, was shown to be one of the earliest transcription factors expressed in cells committed to the neural fate [60
]. Here the gene-level estimates indicated that Sox1
was expressed significantly higher in NPs relative to hESCs (p
-value 0.00013, B), a finding that was confirmed by RT-PCR (C).
From these examples, we concluded that RT-PCR validation correlated well with gene-level estimates from the exon array. In addition, the derived NPs had decreased levels of pluripotent markers Oct4 and Nanog but had levels of Sox1 that were comparable to hCNS-SCns. This finding confirmed that the derived NPs were committed to a neural fate and validated the use of hCNS-SCns as a benchmark for NPs.
Next we asked whether the highest enriched genes in hESCs relative to NPs reflected our existing knowledge in the literature. Using the above-mentioned groupings of hESCs (Cyt-ES, HUES6-ES) and NPs (Cyt-NP, HUES6-NP, and hCNS-SCns), 2,945 genes were enriched in hESCs relative to NPs; and 552 genes were enriched in the NPs relative to hESCs, at a p
-value significance cutoff of 0.01 (correcting for multiple hypothesis testing using the Benjamini-Hochberg method). The 15 most enriched genes in hESCs included genes such as teratocarcinoma-derived growth factor 1 (TDGF1/cripto; p-
value < 10−12
), zinc finger protein 42 (Zfp42/Rex1
value < 10−12
value < 10−12
value < 10−10
), lin-28 homolog
value < 10−10
), cadherin 1 preprotein
value < 10−10
), claudin 6
value < 10−9
), ephrin receptor EphA1
value < 10−9
), and erbB3
value < 10−9
was first shown to stimulate DNA synthesis and cell proliferation of both undifferentiated and differentiated embryonic carcinoma cells [61
] and was later shown to be important for cardiomyocyte formation from mouse ESC [62
, reviewed in [63
], and Nanog
] are crucial for the pluripotency of hESCs. Recently, knockdown of Zfp42/Rex-1
in mouse ESC caused the cells to differentiate [65
]. Our gene-level exon array analysis confirmed that the hESCs and NPs were molecularly distinct.
To reveal global functional differences between the enriched genes in hESCs or NPs, the enriched genes were subjected to a Gene Ontology (GO, http://www.geneontology.org
) analysis as described previously [55
]. Enriched genes in hESCs were more likely to be in molecular function categories, such as “RNA binding” (p-
value < 10−12
), “structural constituent of ribosome” (p-
value < 10−51
), “exonuclease activity” (p-
value < 10−6
), “cytochrome-c oxidase activity” (p-
value < 10−5
), and “ATP binding” (p-
value < 10−6
), and in biological processes involved with “tRNA processing” (p-
value < 10−6
) and “protein biosynthesis” (p-
value < 10−48
), consistent with our knowledge of hESCs as a rapidly proliferating population of cells (A). Similar analysis of the enriched genes in NPs revealed an overrepresentation in molecular functional categories, such as “calcium ion binding” (p-
value < 10−8
) and “structural molecule activity” (p-
value < 10−5
), and in biological processes involved with “neurogenesis” (p-
value < 10−38
), “cell adhesion” (p-
value < 10−13
), “cell motility” (p-
value < 10−4
), “development” (p-
value < 10−6
), “neuropeptide signaling pathway” (p-
value < 10−4
), and “endocytosis” (p-
value < 10−4
) (B). Considering that these were the only categories that were significantly enriched out of more than 18,000 GO terms, and that randomly selected sets of similar numbers of genes did not reveal statistical differences in GO categories, our results confirmed that the global molecular profiles derived from exon array analysis were consistent with known differences between hESCs and NPs.
To summarize, firstly immunohistochemical and RT-PCR evidence validated that the cells exhibited expected characteristics; secondly, stage-specific marker gene differences by RT-PCR were reflected accurately by gene-level estimates from the exon arrays; thirdly, the hESC-enriched genes were coherent with known genes that controlled pluripotency and self-renewal; and lastly, the global functional profiles exemplified expected biological differences between hESC and NP cells.
Description of the Regression-Based Exon Array Protocol
Convinced that the signal estimates from the exon arrays reflected expected molecular and biological differences between hESCs and NPs, we sought to identify AS events. We compared Cyt-ES to hCNS-SCns to illustrate our approach. First we normalized the data and generated signal estimates with Robust Multichip Analysis (RMA) and estimated the probability that each probeset was detected above background (DABG) using publicly available Affymetrix Power Tools (APT). We analyzed probesets that (i) comprised three or more individual probes; (ii) were localized within the exons of our gene models with evidence from at least three sources (mRNA, EST, or full-length cDNA); and (iii) were detected above background in at least one of the cell lines. In total, 17,430 gene models were represented by probesets that satisfied these criteria.
Next we asked whether the probeset expression within each gene model was positively correlated for any two cell lines. To do this we calculated the Pearson correlation coefficient between the vectors of median signal estimates across replicates in Cyt-ES versus hCNS-SCns. The vast majority of genes (>80%) was found to have probeset-level Pearson correlation coefficients of greater than 0.8 (A). Next we randomly permuted the association between the median signal estimates and the probesets for each gene in hESCs (or hCNS-SCns) and observed that the distribution of Pearson correlation coefficients for the permuted sets was centered at zero, as expected (A). This indicated that the signal estimates for probesets between hESCs and hCNS-SCns were highly correlated and suggested that a scatter plot of probeset signal estimates between hESCs and hCNS-SCns would reveal a linear relationship for the majority of genes. We hypothesized that a linear regression to determine if some probesets behaved unexpectedly in one cell type compared to the other might be a reasonable approach to identify AS exons.
Description of the REAP Algorithm Comparing Exon Array Signal Estimates from hCNS-SCns and Cyt-ES
Here, a possible representation of the data was explored. If we had N replicates in one condition and M replicates in the other, we could consider N*M points if we analyzed every possible pairing. For instance, three replicate signal estimates for every probeset per cell line, such as signal estimates a, b, and c in hESCs and d, e, and f in hCNS-SCns, would translate to pairing every signal (d,a), (d,b), (d,c) … (f,a), (f,b), (f,c) for linear regression (B). Instead, pairing the signal estimates of all replicates in one condition to the median of the other would only require N + M − 1 points and would capture the variation of the signal estimates of each probeset. For example, we considered (d,b), (e,a), (e,b), (e,c), and (f,b) points where b and d were the median intensities for the replicates in Cyt-ES and hCNS-SCns, respectively (B). A scatter plot of all probesets of the EHBP1 (EH domain binding protein, RefSeq identifier NM_015252) is shown in C in the format described. Each probeset was represented by 5 points of log-transformed (base 2) values; and each point on the scatter plot reflected the extent of inclusion of an exon in hESCs and in hCNS-SCns (C).
A classical linear regression model could be proposed to fit the response variable yij, the log2 expression of probeset i in cell-type j (for example, j is Cyt-ESC) to explanatory variables xik, and the log2 expression of probeset i in cell type k (for example, k is hCNS-SCns). However, classical linear regression by least-squares estimation is biased because the least squares predictions are strongly influenced by the outliers, leading to completely incorrect regression line estimates, masking of the outliers, and incorrect predictions of outliers. Therefore, we applied M-estimation robust regression to estimate the line, which is less sensitive to outliers. Fitting was performed using an iterated, re-weighted least squares analysis. Our assumption was that most of the points were “correct,” i.e., that most of the exons were constitutively spliced. Thus, robust regression would find the line that was least dependent on outliers, which would be potential AS exons. This assumption was substantiated by our observation that, using publicly available ESTs and mRNAs, a minority of human exons (7%) have evidence for exon-skipping, the most common form of AS. Using robust regression, the regression line for Cyt-ESC versus hCNS-SCns in the EHBP1 gene is illustrated in C. The boxed points belonged to a probeset that was enriched in hESCs but depleted in hCNS-SCns, which was suspected to be due to AS. The difference between the actual and regression-based predicted value, normalized by the estimate of its standard deviation, is called the studentized residuals. Studentized residuals were computed for all probeset pairs in EHBP1, and the histogram depicting their distribution is illustrated in D. As expected, the mean of the distribution was close to zero, and the distribution was approximated by a t-distribution with n-p-1 degrees of freedom, where n was the number of points on the scatter plot, and the number of parameters p was 2. The boxed points had studentized residuals of 1.829, 3.104, 2.634, 3.012, and 2.125 with p-values of 0.034, 0.00119, 0.00477, 0.00158, and 0.01780, respectively, computed based on the t-distribution (C). At a stringent p-value cutoff of 0.01, four of the five studentized residuals were designated as significant “outliers,” indicating that the probeset was “unusual.” RT-PCR confirmed that the exon, represented by the probeset, was indeed differentially included in hESCs and skipped in hCNS-SCns (B). Applying this approach to all gene models revealed that, as expected, the majority of studentized residuals are centered at zero (E). Thus far in the example, our analysis was based on regression of hESCs (y-axis) versus hCNS-SCns (x-axis) (B–D). However, robust regression as described was not symmetrical, i.e., parameter estimation of y as a function of x was not the same as that of x as a function of y. The negative slope revealed that probesets enriched in hESCs versus hCNS-SCns (positive valued), were expectedly depleted when hCNS-SCns was compared to hESCs (negative valued; F). As our method for predicting candidate alternative exons was based on identification of outliers using robust regression, we named the method REAP.
RT-PCR Validation of REAP-Predicted Alternative Exons
Identification and Removal of False Positives
In the process of experimentally validating our predictions, we encountered three main sources of false positives (FP) from robust regression. First, we identified genes with probeset signal estimates that were poorly correlated and were not amenable to our method. As an example, the median probeset signal estimates in hESCs and hCNS-SCns of the FIP1L1
gene (gene identifiers BC011543, AL136910) had a Pearson correlation coefficient of 0.38, and the distribution of points was not amenable to robust regression (A). To avoid inappropriate application of REAP and generating false predictions, we empirically determined that a gene had to have a Pearson correlation coefficient cutoff of 0.6 before being amenable to REAP analysis. Next, we managed two additional sources of FPs, namely “high-leverage” and “high-influence” points, which we were able to identify by computing the following metrics. For every point, we computed (i) the studentized residual (as described above), (ii) the influence, and (iii) the leverage (see Materials and Methods
for more details). Leverage assessed how far away a value of the independent variable was from the mean value; the farther away the observation the more leverage it had. The influence of a point was related to its covariance ratio: a covariance ratio larger (or smaller) than 1 implied that the point was closer (or farther) than was typical to the regression line, so removing it would hurt (or help) the accuracy of the line and would increase (or decrease) the error term variance. Influence was computed as the absolute difference between the covariance ratio and unity. To illustrate further, a point was classified as an “outlier” if it had a large studentized residual (p
< 0.01) and low leverage (boxed point “a”); as a “high-leverage” point if it had a low studentized residual and high leverage (boxed point “b”); and as a “high-influence” point if it had a high studentized residual, high leverage, and high influence (boxed point “c”; B). Points that resembled boxed point “a” were designated as potential AS events. For example, four of the five boxed points in C were “outliers,” and RT-PCR validation indicated that the exon represented by the probeset was indeed skipped in hCNS-SCns (EHBP1
, B). Points that were “high-leverage,” such as the five points in the CLCN2
gene, were experimentally verified to be a FP (C; unpublished data). Points that were “high-influence,” such as the four of five boxed points in the ABCA3
gene were also experimentally verified to be a FP (D;
unpublished data). In conclusion, in order to reduce the FP rate, all points were evaluated according to the metrics described, and points that were significant “outliers” were considered putative AS events.
Global Identification and Characterization of REAP[+] Exons
REAP was applied to identify AS events in NPs compared to hESCs: Cyt-NP versus Cyt-ES; HUES6-NP versus HUES6-ES; hCNS-SCns versus Cyt-ES, and hCNS-SCns versus HUES6-ES. After removing potential FPs, 11,348 genes containing 158,657 probesets were scored by REAP.
As described above, for each pair of cell lines compared, each probset was represented by five points, where each point was defined a significant outlier if it had a high residual (p
< 0.01), low influence, and high leverage. Points per probeset should be correlated; in other words, if one point was a significant outlier, the other points were expected to be outliers as well. To ensure that this was the case, we counted the number of probesets with N significant outliers, where N was varied from 0 to 5. Next, the identity of the probesets and points derived from them were exchanged with other probesets, keeping constant the total number of points that were considered significant outliers. At N = 0, we observed approximately equal numbers of probesets in the actual versus shuffled controls. In contrast, we observed that there were 1.5 times more probesets with N = 2 significant outliers relative to shuffled controls; 12–31 times more probesets with N = 3; and 17–612 times more probesets that had N = 4 significant outliers (A; see Table S1
). For example, in hCNS-SCns compared to Cyt-ES, approximately 0.39% (490 of 124,604) of probesets had three significant outliers and 0.25% (308 probesets) had four significant outliers, relative to 0.02% and 0% of shuffled controls, respectively.
Next we asked whether the overlap between related comparisons was higher than expected. Comparing the significant probesets between hCNS-SCns versus Cyt-ES and hCNS-SCns versus HUES6-ES revealed 672 significant probesets (N ≥ 2), whereas if we shuffled the associations between probeset identity and significant outliers, only four significant probesets (N ≥ 2) were identified—a 168-fold enrichment (B, Table S1
). A total of 236 significant probesets overlapped when we compared the derived NPs to hESCs (Cyt-NP versus Cyt-ES and HUES6-NP versus HUES6-ES), relative to seven significant probesets (34-fold enrichment).
At a cutoff of two significant outliers, 1,737 probesets contained in internal exons were defined as positive REAP predictions (hereafter called REAP[+]) exons—candidate AS events that distinguished NP from hESC. Surprisingly, we observed that the majority of REAP[+] exons were specific to the pair of hESC and NP that was compared, likely reflecting differences in genetic origins and/or culturing and differentiation conditions of the cell lines: 614 REAP[+] events were unique to hCNS-SCns versus HUE6-ES; 220 were unique to hCNS-SCns versus Cyt-ES; 439 were unique to HUES6-NP versus HUES6-ES; and 250 were unique to Cyt-NP versus Cyt-ES. The shared events between pairs of comparisons made up a minority of the total number identified: 102 REAP[+] events were found to be in common between hCNS-SCns versus Cyt-ES and hCNS-SCns versus HUES6-ES; 48 between hCNS-SCns versus HUES6-ES and HUES6-NP versus HUES6-ES; and only 17 between hCNS-SCns versus Cyt-ES and Cyt-NP versus Cyt-ES (Table S2
Comparison of REAP to EST-Based Method and ACEScan
Traditionally, AS exons were discovered by using EST alignments to genomic loci, and also more recently by computational algorithms that used sequence information extracted from multiple genomes. Here, we compared REAP predictions to both approaches. In the first comparison, publicly available ESTs and mRNA transcripts were aligned to the human genome sequence. 13,934 exons with evidence for exon-skipping and/or inclusion (EST-SE for EST-verified skipped exons) were generated, comprising ~7% of all internal exons. First we analyzed Cyt-ES versus hCNS-SCns. If we required that none of the points per probeset (exon) was significant, 6% (4,402 of 71,731) of exons (after probeset mapping) had evidence for EST-SE (A). Shuffling the mapping between these probesets and exons resulted in 8% (5,777 of 71,731) of exons with evidence for EST-SE (A). These percentages were not significantly different from the 7% of exons with EST evidence for AS observed from using all exons. By raising the requirement that probesets had to contain at least one significant point to five significant points, the percentage of EST-SE increased dramatically from 11% (531 of 4,898 exons) to 26% (33 of 126). In comparison, the shuffled probesets at the same requirements remained at ~8%, rising slightly to 11% at five points, due to small sample sizes. Similar trends were observed with hCNS-SCns versus HUES6-ES and the derived NPs versus hESCs (A). Therefore, we concluded that REAP[+] exons were enriched for AS events independently identified by a transcript-based approach.
Comparison of REAP Predictions for hCNS-SCns versus Cyt-hES, hCNS-SCns versus HUES6-ES, Cyt-NP versus Cyt-ES, and HUES6-NPs versus HUES6-ES with Alternative Exons Identified by an EST-Based Method and ACEScan
Next, we compared REAP predictions to a computational approach of identifying exons with AS conserved in human and mouse, ACEScan [55
]. ACEScan receives as input orthologous human–mouse exon pairs and flanking intronic regions and computes sequence features and integrates the features into a machine-learning algorithm to assign a real-valued score to the exon. A positive score indicated a higher likelihood of being AS in both human and mouse. ACEScan was updated in the following ways. Firstly, instead of relying on orthology information by Ensembl, and then aligning flanking introns in “orthologous” exons, conserved exonic and intronic regions in human and mouse from genome-wide multiple alignments were extracted. Secondly, whereas in our previous analysis exons from the longest transcript in Ensembl were utilized, now we collapsed all the transcripts available at the UCSC genome browser and analyzed all exons in the entire gene loci. ACEScan was utilized to assign ACEScan scores to all ~162,000 internal exons in our genes. Exons annotated as first or last exons in Refseq mRNAs were excluded from our analysis, resulting in 4,487 positive-scoring exons, 2-fold more exons than originally published.
Here we repeated our analysis with exons with positive ACEScan scores (ACE[+]) instead of EST-SEs. If we required that none of the points per probeset (exon) was significant, 2% (1,645 of 71,731) of exons (after probeset mapping) were ACE[+] (B). Shuffling the mapping between these probesets and exons resulted in 3% (2,044 of 71,731) of exons being ACE[+] (B). These percentages were not significantly different from the 2.7% observed from all exons (4,487 of the 162,000 exons that were scored by ACEScan). By raising the requirement that probesets had to contain five significant points, the percentage of ACE[+] exons increased from 4% to 11%. However, the sample sizes were small. In comparison, the shuffled probesets at the same requirements remained at ~4%. Similar overall trends were observed with hCNS-SCns versus HUES6-ES and the derived NPs versus hESCs (B). In total, 7.5% (131 of 1,737) of REAP [+] exons were designated as ACEScan[+] compared to 2.4% (2,328 of 97,437) of REAP[−] exons. This result suggested that a small but significantly enriched fraction of AS events in hESCs versus NPs was likely to be evolutionarily conserved in human and mouse. In conclusion, our results suggested that REAP predictions were congruent with predictions from two independent, orthogonal methods.
Experimental Validation of Alternative Exons
The sensitivity and specificity of REAP in the identification of REAP[+] exons was tested by RT-PCR. To validate REAP[+] alternative exons, RT-PCR primers were designed in the flanking exons to amplify both isoforms. To be a positively validated candidate, the PCR products on a gel had to satisfy all of the following criteria: (i) at least one isoform with the expected size must be visible in each cell type; (ii) the relative abundance of the two isoforms must be altered between two cell types and the direction of change have to be consistent with the REAP studentized residuals: in our study positive residuals implied inclusion in hESCs and skipping in NPs, and negative residuals implied inclusion in NP and skipping in hESCs; and (iii) the results were replicable in at least two experiments.
For simplicity of design, we tested candidates predicted from Cyt-ES versus hCNS-SCns. Fifteen REAP[+] exons with at least two significant outliers (out of five) were randomly chosen as predicted alternative events and thirty-five exons with less than two significant outliers were randomly chosen as constitutive events (Table S3
). Nine of the fifteen exons (60%) were validated as AS events by our criteria. The sensitivity and specificity of the algorithm at the cutoff of two is 69% and 77%. Increasing the cutoff to three increased the specificity to 85%, with a slight decrease in sensitivity to 67% (A). The patterns of AS in hESCs were similar in both Cyt-ES and HUES6-ES for all AS events validated, but the NPs (Cyt-NP, HUES6-NP, and hCNS-SCns) had more varied AS. The pattern of AS in the REAP[+] exons in the SLK
(serine/threonine kinase 2) and POT1
(protection of telomeres 1) genes showed remarkable agreement within derived NPs and hCNS-SCns (B). The AS exon in SLK
was observed to be included in hESCs and completely excluded in NPs; the AS exon in the POT1
gene was included more in hESCs and a smaller isoform persisted in NPs. The AS patterns of the other verified REAP[+] exons were consistently similar in hESCs but were more varied in the NP. Interestingly, the patterns of AS in the derived NPs (Cyt-NP and HUES6-NP) were not always identical to those of hCNS-SCns. For example, the AS exon in the EHBP1
(EH domain binding protein 1) gene was included in hESCs but skipped in hCNS-SCns, and both isoforms were present in the derived NPs (B). As another example, the AS exon in the SORBS1
(sorbin and SH3 domain containing 1) gene was skipped in hESCs and included in hCNS-SCns, but exhibited an intermediate pattern in the derived NPs. However, in some cases, the AS patterns in the derived NPs were different from both hESCs and hCNS-SCns (such as in the AS exon in UNC84A, SIRT1
, and MLLT10
First, given three independent samples each from two conditions, we concluded that REAP was able to identify AS events with high specificity but with moderate sensitivity. Second, AS events in hESCs were more similar, whereas the AS events in derived NPs were consistent with or intermediate to the benchmark hCNS-SCns, likely reflecting differences in the cell lines and/or differentiation protocols. In addition, we tested the AS patterns of REAP[+] exons from EHBP1, SLK, and RAI14 in a panel of differentiated human tissues (C). The REAP[+] alternative exon in the RAI14 (retinoic acid induced 14) gene was observed to have the same AS pattern in NPs as in frontal and temporal cortex and in several other, non-brain adult tissues, such as heart and spleen. The AS pattern of the REAP[+] exon in the SLK gene in NPs was similar to most differentiated tissues; however, the relatively strong inclusion of the exon in hESCs was unique. Even in esophagus, kidney, liver, and prostate, both isoforms were present. The relative ratio of the exon-included to exon-skipped isoforms in SLK likely represents an ESC-specific AS signature. The alternative exon in the EHBP1 gene was unusual. The exon was included in hESCs but also in frontal cortex and temporal cortex, a finding that was unexpected given the exclusion of the exon in hCNS-SCns (C). The AS pattern in hCNS-SCns may represent a transient, early neuronal molecular change.
Functional and Expression Characteristics of REAP[+] Genes
In total, 1,500 genes were identified that contained 1,737 REAP[+] exons, 68% of which lacked prior transcript (EST/cDNA) evidence for AS. To determine whether genes that contained REAP[+] exons, which we refer to as REAP[+] genes, are biased toward particular biological activities, REAP[+] genes were compared to a set of REAP analyzed genes not found to have REAP[+] exons (REAP[−] genes). A Gene Ontology analysis revealed that REAP[+] genes are enriched for GO molecular function categories “ATP binding,” “helicase activity,” “protein serine/theronine kinase activity,” “small GTPase regulatory/interacting protein activity,” and “thyroid hormone receptor binding” (). In terms of GO biological process categories, REAP[+] genes were more frequently involved in “ubiquitin cycle.” Similar results were obtained when we compared REAP[+] genes to all human genes that did not contain REAP[+] exons () [55
Significantly Enriched Gene Ontology Terms in REAP[+] Genes (Cutoff of Two Significant “Outliers” per Probeset)
Next we asked if REAP[+] genes are differentially expressed in hESCs compared to NPs and vice versa. For this analysis, the t-statistics computed above measuring the enrichment of a gene in hESCs relative to NPs was utilized for only REAP-analyzed genes. At a defined absolute-valued cutoff, genes were divided into three categories: “enriched in hESCs,” “enriched in NP,” or “unchanged” (A). Increasing the t-statistic cutoff from one to five, the fraction of REAP[+] genes relative to REAP-analyzed genes remained constant in the “unchanged” categories (B). However, the fraction of REAP[+] exons decreased significantly in “enriched in hESCs” and “enriched in NPs” categories. If we increased the cutoffs on genes that were randomly assigned as REAP[+] and REAP[−], controlling for the same number of genes in each category, we observed that the fraction of REAP[+] exons remained unchanged for all three categories (C). To illustrate, at a cutoff of five, 10% (29 of 267) of enriched NP genes were REAP[+] genes and 8.8% (102 of 1,162) of enriched hESC genes were REAP[+], significantly different (p < 0.000005) from the random control where ~14% of enriched NP and enriched hESC genes were REAP[+]. At a cutoff of five, 14% (1,368 of 9,636) of genes that were expressed at similar levels between hESCs and NPs were REAP[+]. Our results suggested that a strategy of focusing on differentially expressed genes would miss at least 14% of transcriptionally unchanged genes that may nevertheless have functional AS differences between hESCs and NPs.
Analysis of REAP[+] Genes Relative to Transcriptional Differences
Conserved Intronic Splicing Regulatory Elements Proximal to REAP[+] hESC and NP Exons
Many, if not most, alternative exons undergo cell type–specific regulation by the binding of trans-factors to splicing regulatory cis-elements located proximal to or within the exons. As many tissue-specific splicing cis-regulatory elements were localized in intronic regions of AS exons, we focused on the identification of intronic splicing regulatory elements (ISREs) proximal to REAP[+] exons. In addition, we wanted to identify both common and cell type–specific ISREs. Three sets of exons were generated: (i) REAP[+] exons that were predicted to be included in NPs and skipped in hESCs (REAP[+]NP); (ii) REAP[+] exons that were predicted to be included in hESCs and skipped in NPs (REAP[+]hESC); and (iii) all REAP[−] exons. Regions of 400 base pairs flanking the exons were targeted for search. Initially, 5-mers that were significantly enriched between the upstream and downstream intronic regions of REAP[+]NP and REAP[+]ES relative to REAP[−] exons were enumerated. We were not able to identify 5-mers that were statistically significantly different.
Next, we focused on splicing signals that were conserved across mammalian genomes as a way of enhancing the signal of detecting functional splicing regulatory sequences [66
]. Exons that were orthologous across human, dog, rat, and mouse were obtained and the flanking intronic regions were aligned (400 bases upstream and downstream separately; A). We enumerated k-mers that were perfectly conserved across all four genomes in the upstream (and downstream) intronic regions. Each conserved k-mer was attributed a χ score representing its enrichment in a set of exons relative to another set of exons. The higher the score, the more frequent the conserved k-mer was in the first set relative to the second set. As a negative control, the associations between REAP scores and exons were shuffled. The enrichment scores for all downstream intronic 5-mers for shuffled REAP[+]NP
versus set REAP[−] exons (x
-axis), and for shuffled REAP[+]ES
exons versus REAP[−] exons (y
-axis) were displayed (B). At a χ cutoff of three, which corresponded to a p
-value of 0.0015, the majority of 5-mers were not significantly enriched in either shuffled set. Confident that no association of k-mers with shuffled REAP exons were found; we repeated the analyses for upstream and downstream intronic 5-mers for the original unshuffled sets. We identified 68 conserved 5-mers enriched upstream of REAP[+]NP
exons; and 34 5-mers enriched upstream of REAP[+]ES
exons (C; Table S4
). Of the 5-mers that were significantly enriched upstream of REAP[+]NP
exons, we identified a U-rich motif (UUUUU), a GU-rich motif (GUGUG), and a CU-rich motif (CCUCU, CUCUC, UCUCU, GCUCU). It is known that the heterogeneous ribonucleoprotein C (hnRNP C
) binding site obtained by SELEX is five “U”s [67
]. GU-rich sequences in flanking intronic regions were shown to bind to splicing factor ETR-3
to regulate AS [68
]. CU-rich sequences were shown to bind the splicing factor PTB
]. Of the 5-mers enriched upstream of REAP[+]ES
exons, we observed CUAAC, which resembled the splicing branch-signal. Of the six 5-mers that were enriched upstream of both REAP[+]NP
exons, we identified GCAUG, which was previously shown to be an intronic splicing cis
-element for the mammalian fibronectin and calcitonin/CGRP genes [70
]. More recently, both mammalian Fox1
have been demonstrated to regulate alternatively spliced exons via UGCAUG binding sites in neighboring introns in neuronal cell cultures [73
Conserved Intronic cis-Elements Enriched Proximal to REAP[+] Alternative Exons
Eighteen conserved 5-mers were significantly enriched in the downstream introns of REAP[+]ES
exons; and 76 5-mers were enriched downstream of REAP[+]NP
exons (Table S4
, D). We identified a motif CUCAU resembling the Nova binding site YCAY [74
], and a G-rich motif (AGGGG, GGGGA, GGGGC, GGGGG, GGGGU) enriched in the introns downstream of REAP[+]ES
exons. G-rich motifs had previously been shown to be part of a bipartite signal that silences AS exons [75
]. Of the five 5-mers that were enriched downstream of both REAP[+]NP
exons, GCAUG and a U-rich motif (UUUUU) were identified. We concluded that potential ISREs were enriched proximal to a subset of REAP[+] exons; in particular, the Fox1/2
binding site GCUAG may play a regulatory role in controlling AS events in hESCs and NPs.