|Home | About | Journals | Submit | Contact Us | Français|
Pseudogene transcripts can provide a novel tier of gene regulation through generation of endogenous siRNAs or miRNA-binding sites. Characterization of pseudogene expression, however, has remained confined to anecdotal observations due to analytical challenges posed by the extremely close sequence similarity with their counterpart coding genes. Here, we describe a systematic analysis of pseudogene “transcription” from an RNA-Seq resource of 293 samples, representing 13 cancer and normal tissue types, and observe a surprisingly prevalent, genome-wide expression of pseudogenes that could be categorized as ubiquitously expressed or lineage and/or cancer specific. Further, we explore disease subtype specificity and functions of selected expressed pseudogenes. Taken together, we provide evidence that transcribed pseudogenes are a significant contributor to the transcriptional landscape of cells and are positioned to play significant roles in cellular differentiation and cancer progression, especially in light of the recently described ceRNA networks. Our work provides a transcriptome resource that enables high-throughput analyses of pseudogene expression.
Pseudogenes are ancestral copies of protein-coding genes that arise from genomic duplication or retrotransposition of mRNA sequences into the genome followed by accumulation of deleterious mutations due to loss of selection pressure, degenerating eventually into so-called genetic fossils (Sasidharan and Gerstein, 2008). Pseudogenes pervade the genome, representing virtually every coding gene, and due to their extremely close sequence similarity with their cognate genes, complicate whole-genome sequencing and gene expression analyses. A growing body of evidence strongly suggests their potential roles in regulating cognate wild-type gene expression/function by serving as a source of endogenous siRNA (Tam et al., 2008; Watanabe et al., 2008), antisense transcripts (Zhou et al., 1992), competitive inhibitors of translation of wild-type transcripts (Kandouz et al., 2004), and perhaps dominant-negative peptides (Katoh and Katoh, 2003). Pseudogene transcription has also been shown to regulate cognate wild-type gene expression by sequestering miRNAs (Poliseno et al., 2010). The recently described competing endogenous RNA (ceRNA) networks comprising sets of coordinately expressed genes with shared miRNA response elements (MREs) provide an additional dimension of (post-) transcriptional regulation in which the role of pseudogenes might overlap with those of protein-coding genes (Salmena et al., 2011; Sumazin et al., 2011).
Previous genome-wide studies of pseudogenes focused on the identification of their chromosomal coordinates and annotations based on diverse computational approaches (Karro et al., 2007; Zhang and Gerstein, 2004), including PseudoPipe (Zhang et al., 2006), HAVANA (Solovyev et al., 2006), PseudoFinder (Lu and Haussler, 2006, ASHG, conference), and Retrofinder (Zheng and Gerstein, 2006). These individual pipelines were subsequently consolidated into an integrated consensus platform, ENCyclopedia Of DNA Elements (ENCODE), which now serves as the definitive database of manually curated and annotated pseudogenes as well as pseudogene transcripts (Zheng et al., 2007). By contrast, genome-wide analyses of pseudogene expression have been somewhat arbitrary, mainly relying upon evidence of pseudogene transcripts obtained from disparate gene expression platforms, including public mRNA and EST databases, cap analysis gene expression (CAGE) studies, and gene identification signature-paired end tags (GIS-PET) (Ruan et al., 2007). Given the essentially anecdotal observations of pseudogene expression, only 160 expressed human pseudogenes are currently documented in ENCODE. Though this could be due to a general lack of transcription of pseudogenes, as generally presumed, it may also be reflective of an insufficient and uneven depth of coverage afforded by early gene expression analysis tools.
In this context, the recent maturation of next-generation high-throughput sequencing platforms provides unprecedented access to genome-wide expression analyses previously not achievable (Han et al., 2011a; Morozova et al., 2009). Here, we analyzed a compendium of RNA-Seq transcriptome data specifically focusing on pseudogene transcripts from a total of 293 samples encompassing 13 different tissue types, including 248 cancer and 45 benign samples. In order to carry out a systematic analysis of pseudogene expression, we developed a bioinformatics pipeline focused on detecting pseudogene transcription. This integrative approach provided evidence of expression for 2,082 distinct pseudogenes, which displayed lineage-specific, cancer-specific, as well as ubiquitous expression patterns. Taken together, this Resource nominates a multitude of expressed pseudogenes that merit further investigation to determine their roles in biology and in human disease.
Paired-end RNA-Seq data from a compendium of 293 samples, representing both cancer and benign samples from 13 different tissue types recently generated in our laboratory, was utilized to build a pseudogene analysis pipeline (Figure 1 and Figure S1 and Table S1 available online). Sequencing reads were mapped to the human genome (hg18) and University of California Santa Cruz (UCSC) Genes using Efficient Alignment of Nucleotide Databases (ELAND) software of the Illumina Genome Analyzer Pipeline (Table S2). Reads showing mismatches to the reference genes but mapping perfectly to unannotated regions elsewhere in the genome were used as the primary data for pseudogene expression analysis. Two or more unique, high-quality overlapping reads nucleating at the loci of differences between wild-type genes and pseudogenes were used to define de novo “clusters” (ranging from 40 to 5,000 bp). These clusters were employed for gene expression analyses in a way analogous to the “probes” used in microarray gene expression studies, though unlike predesigned and fixed probes used in microarrays, the sequence clusters used here were formed de novo, solely based on the presence (and levels) of transcripts. Thus, one or more clusters (like one or more probes in microarrays) represented a transcript, whereas the number of reads mapping to a cluster (analogous to fluorescence intensity due to probe hybridization on microarrays) provided a measure of expression of the corresponding (pseudo)genes. For example, Figure 2 shows a schematic representation of the cluster alignments for two representative pseudogenes, ATP8A2-Ψ (Figure 2A) and CXADR-Ψ (Figure 2B). As can be seen, mutation-dense regions in the reference sequence provide foci of pseudogene-specific cluster formation. Naturally, pseudogenes with sparse and dispersed mutations nucleate fewer clusters and require higher depth of coverage for reliable detection.
Overall, 2,156 unique pseudogene transcript clusters were identified, and their genomic coordinates (start and end points) were compared with the coordinates of pseudogenes annotated in the ENCODE (Zheng et al., 2007) and Yale pseudogene resources (http://www.pseudogene.org) (Karro et al., 2007), the two most comprehensive pseudogene annotation databases. Genomic coordinates of 934 unique pseudogene transcript clusters in our data set were found to overlap with the pseudogene coordinates annotated in both Yale and ENCODE databases. In addition, 585 clusters overlapped with Yale and 92 with ENCODE databases, displaying a high degree of overall concordance between our data and the authentic resources and highlighting a level of difference between the two reference databases (that necessitated our consideration of both resources). Further, as multiple clusters can sometimes represent one distinct pseudogene transcript, the 2,156 transcript clusters provided evidence for 2,082 distinct transcripts. Of these, 1,506 transcripts overlap with the genomic coordinates of pseudogenes in Yale and/or ENCODE, and up to 576 transcripts are potentially novel (described below) (Figure S2A). The 2,082 pseudogene transcripts, in turn, correspond to 1,437 wild-type genes, clearly indicating that the transcripts of multiple pseudogenes arisen from the same wild-type genes are also detected in our compendium. Taken together, our study provides evidence of widespread transcription of pseudogenes unraveled by high-throughput transcriptome sequencing (Table S3).
Pseudogene clusters across the sample-wise compendium reveal that pseudogenes of housekeeping genes such as ribosomal proteins are widely expressed across tissue types. Additionally, pseudogene transcripts corresponding to CALM2 (calmodulin 2 phosphorylase kinase, delta), TOMM40 (translocase of outer mitochondrial membrane 40), NONO (non-POU domain-containing, octamer-binding), DUSP8 (dual-specificity phosphatase 8), PERP (TP53 apoptosis effector), and YES (v-yes-1 Yamaguchi sarcoma viral oncogene homolog 1), etc. were observed in more than 50 samples each, which were further validated by pseudogene-specific RT-PCR followed by Sanger sequencing (Table S4).
Further, because our RNA-Seq compendium comprises 35- to 45-mer short sequence reads that largely generated short sequence clusters not optimal for available pseudogene analysis tools such as Pseudopipe (Zhang et al., 2006) and Pseudofam (Lam et al., 2009) used in generating ENCODE and Yale databases, we carried out a direct query of individual clusters against the human genome (hg18) using the BLAT tool from UCSC, which is ideally suited for short sequence alignment searches (Kent, 2002). Based on this “custom” analysis, or simply BLAT (Figure S2A), we were able to independently assign 1,888 clusters representing 1,820 unique pseudogenes to unique genomic locations.
Comparing the genomic locations of the pseudogene clusters identified by BLAT analysis to those identified by Yale and ENCODE databases (Figure S2A), 762 clusters were found to be common to all three resources, but a remarkably large set of 585 clusters was uniquely defined by BLAT analysis alone. Some of the pseudogene transcripts thus identified included BAT1, BTBD1, COX7A2L, CTNND1, EIF5, PAPOLA, PARP11, SYT, ZBTB12, and others (n = 25) and were validated by Sanger sequencing (Table S4). Thus, analysis of RNA-Seq data provided a reliable assessment of expressed pseudogenes.
Though designating the BLAT-based pseudogene clusters as novel pseudogenes must await further sequence characterization (such as analysis of ORF structure and potential genesis of novel protein-coding gene family members, etc.), a small subset of clusters was seen to be localized in the vicinity of known pseudogenes. Thus, we found 92 clusters that resided adjacent (within 5 kb) to previously annotated pseudogenes (Figure S2B, left), and we hypothesize that these may represent pseudogenes with inaccurate annotations in the current databases. For example, the chromosomal coordinates of CENTG2-Ψ (OTTHUMT00000085288, Havana processed pseudogene) are defined in ENCODE as Chr1:177822463-177824935. As expected, we observed a cluster mapping to this locus; however, interestingly, we also observed a distinct cluster (Chr1:177825028-177826295) less than 100 base pairs away. Although unannotated in the current databases, the sequence of this adjacent locus shows a high degree of homology to the CENTG2 parental gene (Figure S2B, right), strongly suggesting that this cluster represents an extension of the existing genomic coordinates of CENTG2-Ψ annotation. Similar observations were made with HNRNPA1 and the HNRNPA1-Ψ on Chr6q27 (Figure S2B, right). 493 BLAT derived clusters that were not in close proximity to annotated pseudo-genes likely represent putative pseudogenes currently missing in the database annotations (Table S3B).
Next, we assessed the technical and analytical factors influencing the yield of pseudogene transcripts. As may be expected, a positive correlation was observed between the sequencing depth and total number of pseudogene transcripts (correlation coefficient, +0.65) (Figure S3A). However, no significant correlation was observed between the absolute measure of percent similarity between pseudogene-WT pairs and pseudogene yield. Importantly, the metric of overall percent similarity accounts for gap penalty and mismatches in BLAT search, but it is the “distribution” of the mismatches that is critical in resolving pseudogenes from nearly identical wild-type sequences; for example, a few mismatches, accumulated in a small stretch, are more effective in confidently distinguishing pseudogene expression from wild-types as compared to a higher number of mismatches that are scattered over long stretches of sequence (Figure 2). Thus, three primary factors determine the detection of pseudogene transcription by RNA-Seq: (1) the level of expression of the pseudogenes (i.e., the higher the level of expression, the higher the likelihood of detection), (2) the depth of RNA sequencing, and (3) overall distribution of mismatches with respect to the wild-type.
To explore the loci of transcription regulatory elements associated with pseudogene transcription, we carried out ChIP-Seq analysis of a breast cancer cell line MCF7 probed with H3K4me3, a histone mark associated with transcriptionally active chromosomal loci, and integrated the results with the MCF7 pseudogene transcript data. Interestingly, we observed a statistically significant enrichment of H3K4me3 peaks at expressed pseudogene loci as compared to nonexpressed pseudogenes (p = 0.0054) (Figure S3B), suggesting that the pseudogene transcripts observed by RNA-Seq are associated with transcriptionally active genomic loci. Interestingly, the pseudogene transcripts associated with H3K4me3 peaks encompass both unprocessed and processed pseudogenes, with no discernible differences in the pattern of expression. Considering the role of 3′ UTRs with MREs in ceRNA regulatory networks, we also looked at the frequency of 3′ UTR sequences retained in our set of pseudogene transcripts and observed that at least 71% of all pseudogene transcripts retain distinct 3′ UTR sequences similar to their cognate wild-type genes (Figure S3C). Interestingly, comparing the pseudogene transcripts with a list of genes implicated in ceRNA networks (Han et al., 2011b; Tay et al., 2011), we observed more than 400 overlapping transcripts (Table S5). The presence of noncoding pseudogene transcripts with similar 3′ UTRs (and MREs) adds a further level of complexity to ceRNA regulatory networks.
Next, we assessed a potential correlation between the expression of pseudogenes present within the introns of unrelated, expressed genes with their “host” genes. Interestingly, no significant association was observed, suggesting that pseudogenes are likely subject to independent regulatory mechanisms even when residing within other transcriptionally active genes. Further, our observations with the breast-specific unprocessed pseudogene ATP8A2 (likely arisen from duplication of wild-type ATP8A2, thus likely harboring similar promoter elements) also indicate that there is no apparent correlation between the pseudogene expression with the wild-type gene that is expressed ubiquitously (described later). Thus, in summary, although it is tempting to speculate that pseudogene expression may be regulated by the promoter elements from the cognate gene or the host genes, our data suggest that more complex/indirect factors may be at play. Next, we assessed a possible correlation between the expression of pseudogenes with that of cognate wild-type genes, and intriguingly, no significant pattern of correlation was observed (Figure S3D).
Focusing on the pseudogenes whose genomic coordinates are annotated in the reference databases, we next analyzed the expression profiles of the 1,056 unique transcripts.
Analyzing the expression data from 248 cancer and 45 benign samples from 13 different tissue types (total 293 samples), we observed broad patterns of pseudogene expression, including 1,056 pseudogenes that were detected in multiple samples (Table S6), which supports the hypothesis that transcribed pseudogenes contribute to the typical transcriptional repertoire of cells. In addition, we identified distinct patterns of pseudogene expression, akin to that of protein-coding genes, including 154 highly tissue/lineage-specific and 848 moderately tissue/lineage-specific (or enriched) pseudogenes (Figure 3A). Moreover, we found 165 pseudogenes exhibiting expression in more than 10 of the 13 tissue types examined, and these we classified as ubiquitous pseudogenes whose transcription is characteristic of most cell types (Figure 3A, bottom).
Of the 165 ubiquitous pseudogenes, a majority belonged to housekeeping genes, such as glyceraldehyde 3-phosphate dehydrogenase (GAPDH), ribosomal proteins, several cytokeratins, and other genes widely expressed in most cell types. This is expected, as these genes are known to have numerous pseudogenes, and it is likely that several of these pseudogenes retain the capacity for widespread transcription, mimicking their protein-coding counterparts.
A second set of pseudogenes exhibited near ubiquitous expression but were frequently transcribed at lower levels in most tissues and robustly transcribed in one or two tissues. These pseudogenes were termed “nonspecific,” and this group harbors more than 870 pseudogenes, comprising a large portion of our data set (Figure 3A, middle). Many of the pseudogenes previously shown to be expressed were found in this category, including some pseudogenes reported as tissue specific, such as CYP4Z2P, a pseudogene previously reported to be expressed only in breast cancer tissues (Rieger et al., 2004). Other candidates observed in this category include pseudogenes derived from Oct-4 (Kastler et al., 2010), Connexin-43 (Bier et al., 2009; Kandouz et al., 2004), and BRAF (Zou et al., 2009), among others (Table S6).
Though powerful, our approach is nevertheless limited to pseudogene transcripts that are expressed above the current threshold of detection by RNA-Seq and possess distinct stretches of sequence mismatches compared with their protein-coding parental genes. Thus, for example, PTENP1, a pseudogene of PTEN recently implicated in the biology of the phosphatidylinositol 3-kinase (PI3K) signaling pathway, was not detected in our compendium possibly due to the preponderance of cancer samples in our cohort, which tend to show low expression or deletion of this pseudogene (Poliseno et al., 2010).
Lineage-specific pseudogene transcripts may have the potential for lineage-specific functions and may represent novel elements that facilitate biological characteristics that are unique to distinct tissue types. In this regard, we observed 154 pseudogenes with highly specific expression patterns, including pseudogenes derived from AURKA (kidney samples), RHOB (colon samples), and HMGB1 (myeloproliferative neoplasms [MPNs]) (Figure 3A, top). Interestingly, however, lineage-specific pseudogenes tended to represent a small fraction of all pseudogenes expressed in a given tissue type, and the total number of lineage-specific pseudogenes observed in a tissue type did not show a correlation with the total number of samples analyzed. For example, B-lymphocyte cells (n = 19) and MPNs (n = 9) showed more lineage-specific pseudogenes than breast (n = 64) or prostate (n = 89). Conversely, we did observe more pseudogene transcripts in samples with longer read lengths and deeper coverage, as expected. Together, these data both confirm and formalize previous anecdotal observations of lineage-specific pseudogene expression patterns by exploiting the power of RNA-Seq to resolve individual transcripts (Figure 3B) (Bier et al., 2009; Lu et al., 2006; Rieger et al., 2004; Zou et al., 2009).
Because our sample compendium has a substantial number of cancer samples, we next focused on pseudogenes with cancer-specific expression. Though a majority of the pseudogenes examined were found in both cancer and benign samples, we observed 218 pseudogenes expressed only in cancer samples, of which 178 were observed in multiple cancers and 40 were found to have highly specific expression in a single cancer type only (Figure 4A and Table S7). Consistent with our previous results (Figure 3), we found that the number of cancer-type-specific pseudogenes did not correlate with the number of samples sequenced in a given cancer type. These results suggest that cancer samples harbor transcriptional patterns of pseudogenes that are both lineage and cancer specific.
Among the cancer-specific pseudogenes, a few noteworthy examples included pseudogenes derived from the eukaryotic translation initiation factors EIF4A1 and EIF4H, the heterogeneous nuclear ribonucleoprotein HNRPH2, and the small nuclear ribonucleoprotein SNRPG (Figure 4B). Moreover, we observed pseudogenes corresponding to known cancer-associated genes, including RAB-1, a Ras-related protein; VDAC1, a type-1 voltage-dependent anion-selective channel/porin; RCC2, a regulator of chromosome condensation 2; and PTMA, prothymosin alpha. Interestingly, the parental protein-coding PTMA gene has given rise to five processed pseudogenes that retain consensus TATA elements, individual transcriptional start sites, and intact open reading frames that may potentially code for proteins closely related to the parental PTMA protein. Importantly, we find expression of PTMA-derived pseudogenes in more than 30 cancer samples, but not in any benign cells, and these data suggest that PTMA-derived pseudogenes may not only contribute transcripts to cancer cell biology but potentially proteins as well, warranting further study of these pseudogenes in tumorigenesis.
To investigate individual pseudogenes in greater detail, we focused on pseudogenes associated with prostate and breast cancer, as our compendium has a substantial number of these two cancer types represented. Analysis of lineage-specific pseudogenes restricted to prostate cancers identified numerous pseudogenes, including several derived from parental genes known to be altered or dysregulated in cancer; for example, NDUFA9, which encodes an NADH oxidoreductase component of mitochondrial complex I that is reported to be upregulated in testicular germ cell tumors (Dormeyer et al., 2008); EPCAM, an epithelial cell adhesion molecule involved in cancer and stem cells signaling (Munz et al., 2009); and CES7, known to be expressed only in the male reproductive tract (Gang et al., 2011) (Figure 3B and Table S6). Among the prostate cancer specific pseudogenes, CXADR-Ψ, a processed pseudogene on chromosome 15, was of immediate interest, as the parental CXADR protein demonstrates putative tumor suppressor functions and its loss is implicated in α-catenin silencing (Pong et al., 2003). We therefore selected this pseudogene for further study in prostate cancer and first evaluated custom Taqman assays that could distinguish CXADR-Ψ from parental CXADR. The expression levels showed strong correlation with the RNA-Seq data (Figure S3E). CXADR-Ψ expression was found to be upregulated in ~25% of prostate cancer tissues, with minimal expression seen in benign prostate samples and nonprostate tissues (Figure 5A). No correlation was observed between CXADR-Ψ and parental CXADR expression, although parental CXADR also had some proclivity for prostate cancer-specific expression (Figure 5B). Interestingly, CXADR-Ψ expression was nearly restricted to prostate cancers lacking an ETS gene fusion, with few ETS-positive samples exhibiting expression of this pseudogene. By contrast, parental CXADR gene expression was found in both ETS-positive and ETS-negative samples (Figure 5C). Finally, we interrogated CXADR-Ψ and CXADR parental gene expression in a set of six prostate patients with matched cancer and benign tissues (including four ETS-negative and two ETS-positive pairs). Again, ETS-negative prostate cancer samples displayed marked upregulation of CXADR-Ψ compared to the ETS-positive patients, with parental CXADR expression being fairly constant between this set of patients (Figure 5D). To establish the expression of CXADR-Ψ transcript, we were able to clone CXADR-Ψ cDNA from two RNA-Seq-positive prostate cancer samples (Figure S5A), and as predicted, these clones showed perfect sequence similarity to the pseudogene CXADR-Ψ and only 84% to CXADR wild-type gene (Figure S5B).
In the course of these analyses, we also identified a prostate-cancer-specific readthrough transcript involving KLK4, an androgen-induced gene, and KLKP1, an adjacent pseudogene. This chimeric RNA transcript KLK4-KLKP1, combining the first two exons of KLK4 with the last two exons of KLKP1, retains an open reading frame incorporating 54 amino acids encoded by the KLKP1 pseudogene in the putative chimeric protein (Figure S6A). Curiously, this readthrough was recently described in the prostate cancer cell line LNCaP as a cis sense-antisense chimeric transcript (Lai et al., 2010). Intriguingly, the KLK4-KLKP1 transcript was highly expressed in 30%–50% of prostate cancer tissues, and this expression was lineage and cancer specific, with minimal expression seen in benign prostate and other tissues (Figure S6B). These data suggest that the KLK4-KLKP1 may warrant further study as a potential biomarker of prostate cancer as well as a candidate protein implicated in the biological complexity of this disease.
Among the pseudogene candidates in breast cancer, we identified a unprocessed pseudogene cognate to ATP8A2, a LIM domain-containing protein speculated to be associated with stress response and proliferative activity (Khoo et al., 1997) (Figure 3A, top, and Table S3). Because ATP8A2-Ψ on chromosome 10 displays substantial sequence divergence from the cognate ATP8A2-WT gene on chromosome 13, it lends high confidence to our computational identification, and we selected this candidate for further validation. Taqman assays distinguishing ATP8A2-WT transcripts from ATP8A2-Ψ showed a strong correlation (r2 = 0.98) with the expression pattern obtained by RNA-Seq (Figure S3E), with ATP8A2-Ψ expression found to be restricted to breast samples, the highest levels seen in a subset of breast cancer tissues and cell lines (Figures 6A and 6B). By contrast, ATP8A2-WT expression was highly variable across different tissue types and showed no correlation with ATP8A2-Ψ expression (Figure 6B).
We were further intrigued by the pattern of ATP8A2-Ψ expression within breast tumors, where ~25% of tumors demonstrate extremely high levels of this pseudogene, suggesting that ATP8A2-Ψ may contribute to a particular subtype of breast cancer. We therefore analyzed ATP8A2-Ψ expression with respect to luminal and basal breast subtypes, two prominent categories of breast cancer with distinct molecular and clinical characteristics. Unexpectedly, we found that ATP8A2-Ψ expression was restricted to tumors with luminal histology, whereas basal tumors showed minimal expression of this pseudogene (Figure 6A, right). The wild-type ATP8A2 transcript did not display this pattern of expression.
To investigate a potential role of ATP8A2-Ψ expression in breast cancer, first we carried out siRNA-based knockdown of both the wild-type and pseudogene RNA in two independent breast cancer cell lines that expressed both the transcripts (Figure S7A). Knockdown of ATP8A2-Ψ with two independent siRNAs was found to specifically inhibit the proliferation of over-expressing cell lines Cama-1 and HCC1806 (Figure 6C), but not the cell lines with no detectable levels of ATP8A2-Ψ, for example, the benign breast epithelial cell line H16N2 (Figure 6C, right) and a pancreatic cancer cell line, BXPC3 (Figure S7D). Knockdown of ATP8A2-Ψ (but not ATP8A2-WT) also resulted in reduced cell migration and invasion seen in in vitro Boyden Chamber assays (Figure 6D) as well as in in vivo intravasation and metastasis in chicken chorioallantoic membrane xenograft assay (Figure 6F). In contrast, knockdown of wild-type ATP8A2 had no effect on the proliferation of any of the cell lines tested, suggesting an unexpected growth regulatory role for ATP8A2-Ψ (Figure 6C). Surprisingly, though the knockdown of wild-type ATP8A2 had a minimal effect on the pseudogene transcript levels, ATP8A2-Ψ-specific siRNAs, apart from reducing the ATP8A2-Ψ transcript, also reduced the wild-type protein levels (Figures S7C and S7E). Thus clearly, unlike Oct4 and BRAF pseudogene transcripts having an inverse correlation with the wild-type transcript levels, ATP8A2-Ψ and wild-type ATP8A2 transcripts (Figures 6A and 6B) and protein (Figure S7E) do not seem to be regulated in this manner. Subsequently, to assess the phenotypic effect of ATP8A2-Ψ overexpression in benign cells, we cloned and over-expressed the full-length ATP8A2 pseudogene cDNA in benign breast epithelial cell line TERT-HMEC. Two independent pooled populations of ATP8A2-Ψ-overexpressing TERT-HMEC cells were found to undergo increased proliferation and migration (Figure 6E), indicating the potential oncogenic nature of this breast-specific pseudogene transcript.
The recent advances in high-throughput transcriptome sequencing have revealed widespread expression of noncoding RNAs in the context of development and differentiation (Khachane and Harrison, 2010; Nagalakshmi et al., 2008; Pickrell et al., 2010; Prensner et al., 2011; Wilhelm et al., 2008). These studies, however, do not include pseudogene expression analyses in their purview, likely due to the challenge of extremely close sequence similarity with wild-type cognate genes. Here, we interrogated the potential of RNA-Seq data to unambiguously detect pseudogene transcripts and to assess whether pseudogene expression is more common in the transcriptome than previously realized. Surprisingly, we found evidence of a widespread expression of pseudogenes in our cancer transcriptome resource, including 1,500 pseudogenes annotated in the Yale and ENCODE databases, redefined the genomic coordinates of ~100 pseudogenes in existing databases, and nominated more than 400 potentially novel pseudogenes. In aggregate, our analysis considerably expands the spectrum of expressed pseudogenes documented previously (Harrison et al., 2005; Yao et al., 2006; Zheng et al., 2007).
The extreme sequence similarity between pseudogenes and cognate wild-type genes suggests a functional role for pseudogene transcripts; indeed, pseudogene expression has been associated with both downregulation of cognate wild-type gene, such as eNOS in ovary, as well as a positive effect on the expression of the wild-type gene, as demonstrated recently, wherein PTENP1 expression upregulates PTEN expression in prostate cells (Poliseno et al., 2010). Interestingly, a class of pseudogenes called “unitary pseudogenes” does not have extant cognate wild-type genes (Zhang et al., 2010). Nevertheless, as most pseudogenes do have distinct cognate wild-type genes, we assessed the correlation between expressed pseudogenes and their cognate wild-type genes across multiple samples (of the same tissue type or across diverse tissue types) and did not observe a statistically significant correlation. This is not surprising, partly because our data set is comprised of a heterogeneous set of samples representing diverse tissue types. Further, the sensitivity of detection of individual pseudogene transcripts is limited by the degree and distribution of dissimilarity with the wild-type gene that determines the “effective” depth of coverage; this limits the number of samples showing measurable expression of individual pseudogene-wild-type pairs, making it difficult to conduct robust statistical analyses. Future studies involving larger sample sets with higher depth of coverage and longer read length may be better able to resolve this question.
Taken together, our study provides a systematic approach to analyze expressed pseudogenes using RNA-Seq data, enabling comparisons of cancer versus benign tissues in multiple solid tumors. Our efforts lend additional credence to the capacity of RNA-Seq to “re-define” the functional elements of the genome and “re-annotate” the population of pseudogenes implicated in human cell biology. Our approach overcomes the limitations of previous analyses of pseudogene expression, which were primarily anecdotal and heterogeneous in nature, and our methodologies suggest avenues to reconcile the difficulty in distinguishing pseudogene expression from parental protein-coding gene expression—a facet that is important for all RNA-Seq studies aiming to provide an accurate picture of gene expression. Finally, we describe ATP8A2-Ψ and CXADR-Ψ pseudogenes preferentially associated with distinct subsets of breast cancer and prostate cancer patients, respectively.
The recent description of intricate regulatory networks of protein-coding transcripts called competitive endogenous RNAs (ceRNAs) defined on the basis of coordinated regulation by common sets of microRNA response elements (MREs)—first intimated by Salmena et al. (Salmena et al., 2011) and subsequently supported by experimental results from multiple groups(Cesana et al., 2011; Han et al., 2011b; Karreth et al., 2011; Tay et al., 2011)—implicates potential noncoding functions for many protein-coding transcripts. In this context, pseudogene transcripts could provide an additional layer of complexity in conjunction with their cognate wild-type genes or independently.
The cancer/tissue-specific pseudogene expression signatures described here highlight the need to factor in pseudogene expression in all high-throughput gene expression studies and also show that pseudogene expression merits further exploration in its own right as an additional layer of transcriptional complexity. To facilitate further analyses, we provide here an extensive resource of RNA-Seq data of human cancer-related tissues and cell lines.
Paired-end transcriptome sequence reads (2 × 40 and 2 × 80 base pairs) were obtained from a total of more than 293 samples from 13 tissue types (Figure S1 and Table S1). Each sample was sequenced on an Illumina Genome Analyzer I or II according to protocols provided by Illumina as described earlier (Palanisamy et al., 2010).
Paired-end transcriptome reads were mapped to the human genome (NCBI36/hg18) and University of California Santa Cruz (UCSC) Genes using Efficient Alignment of Nucleotide Databases (ELAND) software of the Illumina Genome Analyzer Pipeline, using 32 bp seed length and allowing up to two mismatches; detailed mapping status is represented in Table S2. Passed purity filter reads obtained from Illumina export and extended output files (as described before) were parsed and binned into three major categories: (1) both of the paired reads map to annotated genes; (2) one or both of the paired reads map to unannotated regions in the genome; and (3) neither of the reads map (these include viral, bacterial, and other contaminant reads, as well as sequencing errors). The paired reads with one or both partners mapping to an unannotated region were clustered based on overlaps of aligned sequences using the chromosomal coordinates of the clusters. Singleton reads that did not cluster or stacked\duplicated reads with the same start and stop genomic coordinates (potential PCR artifacts) were filtered out. Passed filter clusters were defined as units of transcript expression (analogous to a “probe” on microarray platforms). These clusters were screened against two human pseudogene resources, Yale human pseudogene (Build 53, http://pseudogene.org/) (Karro et al., 2007) and Gencode (October 2009, http://genome.ucsc.edu/ENCODE/) (Zheng et al., 2007), to identify and annotate pseudogene clusters. The processed, duplicated, and fragmented categories of pseudogene entries from Yale and the entries corresponding to Level 1+2 (Manual Gene Annotations) and Level 3 (Automated Gene Annotations) from Gencode were used. The clusters were also subjected to homology search using the alignment tool BLAT (http://www.soe.ucsc.edu/~kent) (Kent, 2002) for an independent annotation. Sequence reads from individual samples were queried against the resultant clusters defined by the union of Yale, ENCODE, and BLAT output to assess the expression of pseudogenes (Figure 1 and Table S3). The cutoff value for pseudogene expression in a sample was set at five or more reads mapping to at least one cluster in a putative pseudogene transcript. Pseudogene transcripts (one or more probes overlapping with either Yale or ENCODE) detected in two or more samples in a tissue type and absent in all other tissue types were defined as tissue/lineage specific. Pseudogene probes detected in 10 out of 13 samples were designated as ubiquitous. All other cases were described as an intermediate category. Pseudogene transcripts detected in three or more cancer samples and absent in all benign samples were designated as cancer specific.
We carried out multiple correlation analyses (Figure S3), including: (1) passed filter reads (sequence yield) with total number of pseudogene transcripts observed per sequencing run (pseudogene transcript coverage); (2) expression of genes and pseudogenes carried out using 173 gene-pseudogene pairs in 64 samples that each show nonzero expression in at least ten samples across the data set; (3) expression levels of ATP8A2 and CXADR pseudogenes obtained from RNA-Seq and qPCR; (4) ChIP-Seq analysis of a breast cancer cell line MCF7 that was probed with H3K4me3 and compared with MCF7 pseudogene transcript data; and (5) pseudogene transcripts with 3′ UTR sequences (± 2 kb) that were compared with 3′ UTR sequences of their cognate genes using BLAT.
Pseudogene transcripts showing an overlap with transcripts involved in ceRNA network genes reported previously were tabulated (Sumazin et al., 2011 and Tay et al., 2011) (Table S5). The entire sequence data set will be submitted to dbGAP after securing requisite approvals.
Total RNA was isolated using Trizol and an RNeasy Kit (Invitrogen) with DNase I digestion according to the manufacturer’s instructions. RNA integrity was verified on an Agilent Bioanalyzer 2100 (Agilent Technologies, Palo Alto, CA). cDNA was synthesized from total RNA using Superscript III (Invitrogen) and random primers (Invitrogen).
Quantitative real-time PCR (qPCR) was performed using Taqman or SYBR green-based assays (Applied Biosystems, Foster City, CA) on an Applied Biosystems 7900HT Real-Time PCR System, according to standard protocols. The Taqman assays for CXADR and ATP8A2 assays were custom designed based on regions of differences between the wild-type and pseudogene sequences (Figure S4). Oligonucleotide primers for SYBR green assays were obtained from Integrated DNA Technologies (Coralville, IA). The housekeeping gene GAPDH was used as a loading control. Fold changes were calculated relative to GAPDH and normalized to the median value of the benign samples.
CXADR-Ψ _F CGGTTTCAGTGCTCTATGTTGTTTG; CXADR-Ψ _R TAAATT TAGGATTACATGTTTCTAGAACA; CXADR-Ψ _M 6FAM ATGCCATCCAA AACCA; ATP8A2-Ψ _F CTGGTGTTCTTTGGCATCTACTCA; ATP8A2-Ψ _R CAGCTCAGGATCACAGTTGCT; ATP8A2-Ψ _M 6FAM CTGGTCCACCATT CTC; ATP8A2-WT_F ATCCTATTGAAGGAGGACTCTTTGGA; ATP8A2-WT_R CCAGCAAATTCCCAAGGTCAGT; ATP8A2-WT_M 6FAM AAGGGCAGCCAT TACT; KLK4-KLKP1_F ATGGAAAACGAATTGTTCTG; and KLK4-KLKP1_R CAGTGTTCCGGGTGATGCAG.
Additionally, inventoried Taqman assays for CXADR-WT (Hs00154661_m1) and ATP8A2-WT (assay ID hs00185259_m1) were used.
Sequence stretches unique to pseudogene transcripts were identified by aligning the candidate pseudogene sequences with their corresponding wild-type genes. PCR primers specific to pseudogene transcripts (Table S4) were used to amplify pseudogene cDNAs from index samples followed by Sanger sequencing of the PCR products. The resultant sequences were analyzed using ClustalW to compare the identity between pseudogene and cognate wild-type sequences.
Experimental cells were transfected with siRNAs using oligofectamine reagent (Life Sciences), and 3 days posttransfection, the cells were plated for proliferation assays. At the indicated times, cell numbers were measured using Coulter Counter.
For the wound healing assay, vector control or ATP8A2 pseudogene-overexpressing cells were plated at high density, and 6 hr later, uniform scratch wounds were made using Woundmaker (Incucyte). Relative migration potential of the cells was assessed by confluence measurements at regular time intervals as indicated over the wound area.
The ATP8A2 pseudogene cDNA from breast cancer cell line HCC1806 was cloned into pENTR-D-TOPO entry vector (Invitrogen) following manufacturer’s instructions. Sequence-confirmed entry clones in correct orientation were recombined into Gateway pcDNA-DEST26 mammalian expression vector (Invitrogen) by LR Clonase II enzyme reaction following manufacturer’s instructions. HMEC-TERT cells were transfected using Fugene 6, and polyclonal populations of cells expressing ATP8A2 pseudogene cDNA or empty vector constructs were selected using geneticin. At the indicated times, cell numbers were measured using Coulter Counter.
Chicken chorioallantoic membrane (CAM) assay for tumor growth was carried out as follows. Fertilized eggs were incubated in a humidified incubator at 38°C for 10 days, and then CAM was dropped by drilling two holes: a small hole through the eggshell into the air sac and a second hole near the allantoic vein that penetrates the eggshell membrane but not the CAM. Subsequently, a cutoff wheel (Dremel) was used to cut a 1 cm2 window encompassing the second hole near the allantoic vein to expose the underlying CAM. When ready, CAM was gently abraded with a sterile cotton swab to provide access to the mesenchyme, and 2 × 106 cells in 50 μl volume were implanted on top. The windows were subsequently sealed and the eggs returned to the incubator. After 7 days, extraembryonic tumors were isolated and weighed. Five to ten eggs per group were used in each experiment.
The authors thank Ram-Shankar Mani, Kalpana Ramnarayanan, Joseph Mierwaza, and Sooryanarayana Varambally for technical help and Javed Siddiqui for providing samples. This work was supported, in part, by the National Institutes of Health (R01CA132874), Early Detection Research Network grant UO1 CA111275, Prostate SPORE grant P50CA69568, and the Department of Defense Era of Hope grant BC075023 (A.M.C.). A.M.C. is supported by the Doris Duke Charitable Foundation Clinical Scientist Award and the Prostate Cancer Foundation. A.M.C. is an American Cancer Society Research Professor and A. Alfred Taubman Scholar.
S.K.-S., C.K.-S., and A.M.C. conceived the experiments. S.K.-S. designed and developed the pseudogene analysis pipeline and carried out bioinformatics analyses. C.K.-S., D.R.R., N.P., X.C., and Y.-M.W. performed transcriptome sequencing. S.K.-S. and T.B. carried out ELAND mapping and provided database support. C.K.-S., V.K., and S.M.D. carried out pseudogene validations. S.S., C.K.-S., S.M.D., and I.A.A. performed the experiments. J.R.P and A.S. helped in manuscript review. S.K.-S. and R.J.L. carried out gene-pseudogene correlation analyses. S.K.-S. and M.K.I. performed ChIP-Seq analysis. C.K.-S., S.K.-S., and A.M.C. wrote the manuscript, which was reviewed by all authors.