|Home | About | Journals | Submit | Contact Us | Français|
MicroRNAs (miRNAs) are ~22-nt long, non-coding RNAs that regulate gene silencing. It is known that many human miRNAs are deregulated in numerous types of tumors. Here we report the sequencing of small RNAs (17–25 nt) from 23 breast, bladder, colon and lung tumor samples using high throughput sequencing. We identified 49 novel miRNA and miR-sized small RNAs. We further validated the expression of 10 novel small RNAs in 31 different types of blood, normal and tumor tissue samples using two independent platforms, namely microarray and RT–PCR. Some of the novel sequences show a large difference in expression between tumor and tumor-adjacent tissues, between different tumor stages, or between different tumor types. We also report the identification of novel small RNA classes in human: highly expressed small RNA derived from Y-RNA and endogenous siRNA. Finally, we identified dozens of new miRNA sequence variants that demonstrate the existence of miRNA-related SNP or post-transcriptional modifications. Our work extends the current knowledge of the tumor small RNA transcriptome and provides novel candidates for molecular biomarkers and drug targets.
MicroRNAs (miRNAs) are ~22-nt long non-coding RNA species that negatively regulate gene expression. In recent years the role miRNAs play in cancer has been extensively explored and these non-coding genes were implicated in numerous types of cancer as either oncogenes or tumor-suppressor genes (1). miRNAs are already used as diagnostic biomarkers in clinical assays designed for several types of cancer, such as lung cancer and cancer of unknown primary (2,3).
Next generation sequencing methods, also known as ‘deep sequencing’, have been widely used in recent years. These high throughput and highly sensitive sequencing methods include Roche Applied Sciences (454) GS, Illumina’s Solexa 1G sequencer, and Applied Biosystem’s SOLiD system. Deep sequencing can be used for the discovery of novel miRNA species and other small RNAs that are missed by traditional sequencing of small RNA libraries. Human miRNAs were previously identified using deep sequencing (4–8). However, the miRNA content of solid human tumors has only been partially explored using these methods and yet-unknown miRNAs and other small RNAs may be part of the tumor transcriptome.
Here, we present deep sequencing analysis of miRNAs from 23 solid tumor specimens of four different types: breast, bladder, colon and lung. A computational approach was used to identify known miRNA sequences, miRNA sequence variants (isomiRs), and novel small RNA species in these tumors. Forty-nine novel miRNA and small RNA candidates were identified including several novel miRNA sized RNA sequences: a human endogenous siRNA candidate and highly abundant 22–25 bp small RNAs derived from Y RNA. We also provide sequencing evidence for the existence of the recently identified human miRNA–offset RNAs (MORs) in human tumors. Subsequently, 31 normal and tumor samples from various tissue types were hybridized to a miRNA-microarray containing the novel miRNAs. Some of the novel miRNAs are abundantly expressed in different types of tumors and others are expressed differently between tumor and non-tumor samples, between different tumor stages or between different types of tumors. In addition, using RT–PCR as a third platform we confirmed the expression of five of the novel small RNAs in normal human serum. We identified numerous, abundant, new isomiRs (sequence variants of miRNAs) that may be related to carcinogenesis and may be derived from DNA alterations (SNPs or cancer-related mutations) or post-transcriptional modifications. These new cancer miRNA candidates can potentially be used as diagnostic biomarkers or therapeutic targets in different types of cancer.
Total RNA was extracted twice from each sample of 23 human formalin-fixed paraffin-embedded (FFPE) samples derived from cancerous tissue. RNA was isolated using 10 10-µm-thick tissue sections using the miRdicatorTM extraction protocol developed at Rosetta Genomics. Briefly, the sample was incubated repeatedly in xylene at 57°C to remove excess paraffin, followed by washing in ethanol. Proteins were degraded by incubation in proteinase K solution at 45°C for a few hours. The RNA was extracted with acid phenol:chloroform followed by ethanol precipitation and DNAse digestion. Total RNA quantity and quality were checked by spectrophotometry (Nanodrop ND-1000). Pools of samples of the small RNA fraction (~200 nt and lower) within the total RNA were labeled and hybridized on arrays. After ensuring the presence and expression of more than 100 miRNAs per cancerous tissue pool, tissues were pooled together, resulting in a bladder+breast tumor pool and a colon+lung pool. Array expression (Supplementary Figure S1) revealed the presence of 157 miRNAs from bladder cancer FFPEs, 260 miRNAs from breast cancer FFPEs, 135 miRNAs from lung cancer FFPEs, and 239 miRNAs from colon cancer FFPEs. Total RNA (75 µg) of seven different colon cancer FFPEs were pooled together with 75 µg of six different lung cancer FFPEs, while 75 µg total RNA of five different bladder cancer FFPEs were pooled together with 75 µg of five different breast cancer FFPEs.
The 3′ and 5′ cloning linkers (3′ Linker: 5′-rAppCTGTAGGCACCATCAAT/3ddC/-3′; 5′ Linker: 5′-TGGAATrUrCrUrCrGrGrGrCrArCrCrArArGrGrU-3′) were ligated to purified small RNA species in preparation for cDNA synthesis and amplification, using miRCat™ miRNA Cloning Kit according to the user manual, as described below.
Reverse transcription of the linkered RNA species was carried out followed by PCR amplification. Primer sequences were as follows:
The libraries were built using the MirCat kit, with several modifications, as described below. Enrichment of small RNA was carried out by recovering the small RNA fraction (17–25 nt), identified by internal size markers, from a slice of a 12% denaturing (7 M Urea) polyacrylamide gel. The synthetic RNA size markers run in the lane adjacent to the cancer samples were 15 nt (5′-GCAAAGCACACGGCC-3′), 22 nt (5′-UAUGUAUCGAAUUUAAGCUCAA-3′) and 38 nt (5′-GCAAGGAUGACACGCAAAUUCGUGAAGCGUUCCAUAUU-3′).
Cleaning the desired RNA from the gel was carried out by GeBAflex-tube-midi column (GeBAflex-tube Gel ext & Dialysis Kit Midi, Manufactured by DNR http://www.dnr-is.com/Product.asp?Par=2.215.219&id=228) using an electric current of 300 V for 40 min until the nucleic acid exited from the gel slice, followed by applying reverse polarity of the current for 120 s. This step releases the nucleic acid from the membrane. Isolated RNA was precipitated by adding 8 µl of glycogen, a one-tenth volume of NaOAc 3 M, pH 5.2, and three volumes of cold 100% ETOH, with vortexing after each addition. The isolated RNA was precipitated overnight at −20°C, centrifuged for 1 h at 4°C at 14 000 rpm, followed by washing with 1 ml cold 85% ETOH and subsequent centrifugation for 5 min at 14 000 rpm.
Following recovery of the enriched small RNA fraction from the acrylamide gel slice, the small RNAs were ligated with a 3′ and a 5′ linker in two separate reactions. First, 3′ ligation was performed in which the 3′ linker was ligated to the small RNAs using T4 RNA ligase in the absence of ATP in order to avoid circularization of the RNA fragments, as described in (9). The ligated product was purified by recovering the desired band, identified using size markers, from a slice of a 12% denaturing (7 M Urea) polyacrylamide gel. Two synthetic RNAs (24 and 38 nt, described previously) and two synthetic RNA transcripts (53 and 83 bp) were run adjacent to the cancer samples. Purification and precipitation were carried out as described previously.
The 5′ linker is ligated to the 3′ linkered small RNAs in the presence of 1.0 mM ATP, followed by recovering the desired band from a slice of a 12% denaturing (7 M Urea) polyacrylamide gel with the same size markers. Purification and precipitation done as described before.
The 5′ and 3′ ligated RNAs contained both RNA and DNA regions which were converted to DNA using reverse transcriptase with RT primer, according to MirCat protocol (http://eu.idtdna.com/CATALOG/smallRNAcloning/page1.aspx?display=mircatkit).
The PCR amplification step was carried out using primers different from those provided by the MirCat kit since the primers provided cause strong self- and heterodimers. PCR was carried out using PfuUltra high fidelity DNA polymerase (Stratagene #600380) and pairs of longer PCR primers (40–42-mers) containing sequences complementary to the linkers, tag sequences (small letters) and sequences which were suitable for the 454 platform (underlined). Tag1-flagged colon and lung library and Tag2-flagged breast and bladder library followed by 454 sequences that will convert the small RNA libraries made to ones that can be directly sequenced on the 454 platform.
Samples from five PCR reactions were pooled, extracted with phenol:chloroform, followed by recovery of the desired band from slices of an 8% native polyacrylamide gel. The resulting library was sent for sequencing on the 454 platform.
The deep sequencing process yielded over 200 000 sequences from both libraries. Adaptors were removed using a Perl script allowing internal polyN sequences within the adaptors and one mismatch. About 1000 sequences were removed since they were too short after adaptor removal (<10 bp). The sequences were mapped to the human genome (UCSC hg18 build) using BLAST, allowing maximum 3 bp mismatched to the genome and maximum insertion/deletion (indels) of 3 bp. For each aligned sequence the highest scoring hit was retrieved. All sequences with position overlap were clustered together using a Perl script. We assigned each genomic cluster of sequences the most abundant sequence in this cluster and demanded that for candidate miRNAs, the most abundant sequence was mapped precisely to the genome (not allowing any mismatches/indels). The next step was to annotate known sequences. The following datasets were used for this task: RNA genes, sno/miRNA, RefSeq genes and RepeatMasker tables were downloaded from the UCSC table browser (10), and known miRNA precursors were downloaded from miRBase (11) in order to mark whether the sequence is part of a non-coding gene, a snoRNA, a protein-coding gene exon, a genomic repeat, or a known miRNA precursor, respectively. The sequences of the novel miRNA candidates were extended by several hundred base pairs within their chromosomes in order to predict possible miRNA precursors. An extended sequence was intended to predict the folding of a pri-miRNA that contains a hairpin-folded pre-miRNA. The candidate pri-miRNAs were folded using the Vienna package (12) or mfold (13) programs. All hairpin structures that had at least 6 bp, were at least 55 nt long and had a loop not longer than 20 nt were extracted from the minimum free energy fold of the predicted pri-miRNA (excluding overlapping hairpins). Each hairpin was assigned a hairpin score (Palgrade) and conservation score as described before (14). Predicted miRNA precursors have either Palgrade > 0 meaning it has structural and sequence characteristics of known miRNA) or have absolute value of conservation score > 0.9 (conserved in mammals). These criteria have a sensitivity of 86% for known miRNA precursors from miRBase 13.0. In addition, only sequences with 10 or less genomic copies, with a length of 17–25 bp and a GC content in known miRNA range (15–90%) were chosen as miRNA candidates. Most (75%) miRbase miRs detected in this analysis were detected three or more times. Therefore we consider sequences counted three times or more to have a good chance of being functional. All sequences were deposited in the GEO, accession GSE20418.
Custom microarrays (Biochips) were manufactured by Agilent Technologies (http://www.agilent.com) by in situ synthesizing DNA oligonucleotide probes to 949 known miRNAs and 876 sequences from deep sequencing, printed in triplicate. Out of 49, 44 novel miRNA and small RNAs were used in the microarray (five sequences were identified as novel miRNAs/small RNAs after the design of the microarray). Sequences from deep sequencing were characterized by:
Each probe comprised an antisense sequence of the relevant sequence, followed by a tail sequence (GCAATGCTAGCTATTGCTTGCTATTAAAAA), and was trimmed to the length of 45 nt. Seventeen negative control probes were designed using the sense sequences of different miRNAs. Two groups of positive control probes were designed to hybridize to the array: (i) synthetic small RNA that were spiked to the sample RNA before labeling to verify the labeling efficiency, and (ii) probes for abundant small nuclear RNAs that were spotted on the array to verify RNA quality.
The microarray was hybridized to 38 different samples (Supplementary Table S9). The samples were from 17 different tissue types and blood samples and were divided to normal (n = 8), tumor (n = 15), tumor adjacent (n = 5) and metastasis indications (n = 8).
A total of 2–2.5 µg of total RNA was labeled by ligation of an RNA-linker, p-rCrU-Cy/dye (Eurogentec S.A.; Cy3 or Cy5), to the 3′ end. Synthetic small RNA was spiked into the RNA before labeling to verify the labeling efficiency. Slides were incubated with the labeled RNA for 12–16 h at 55°C and then washed according to Agilent GE washes. Arrays were scanned using Agilent DNA Microarray Scanner Bundle (Agilent Technologies, Santa Clara, CA) at a resolution of 5 µm, dual pass at 100 and 10% PMT power. Array images were analyzed using Agilent Feature Extraction software.
Array images were analyzed using the Feature Extraction software (FE) 9.5.1 (Agilent, Santa Clara, CA). Triplicate spots were combined to produce one signal for each probe by taking the logarithmic mean of reliable spots. All data were log-transformed (natural base) and the analysis was performed in log-space. A reference data vector for normalization R was calculated by taking the median expression level of a subset of all probes (all miRNAs in miRBase 10) across samples. For each sample data vector S, a second degree polynomial F was found so as to provide the best fit between the sample data and the reference data, such that R ≈ F(S). For each probe in the sample (element Si in the vector S), the normalized value (in log-space) Mi was calculated from the initial value Si by transforming it with the polynomial function F, so that Mi = F(Si). P-values were calculated using a two-sided t-test on the log-transformed normalized fluorescence signal. The fold-difference (ratio of the median normalized fluorescence) was calculated for each miRNA. The signal of a sequence is defined as differential between sample ‘A’ and sample ‘B’ if the fold change between the signal in sample A and sample B is either larger than the 95th percentile of fold changes of all sequences expressed in both samples, or >8. Results of all microarray experiments were deposited in Gene Expression Omnibus GSE20418.
RNA was subjected to a polyadenylation reaction as described previously (15). Briefly, RNA was incubated in the presence of poly (A) polymerase (PAP; Takara-2180A), MnCl2 and ATP for 1 h at 37°C. Then, using an oligodT primer harboring a consensus sequence, reverse transcription was performed on total RNA using SuperScript II RT (Invitrogen). Next, the cDNA was amplified by real time PCR; this reaction contained a miRNA-specific forward primer, and universal TaqMan probe complementary to the 3′ end of the oligodT plus part of the tail, and a universal reverse primer complementary to the consensus 3′ sequence of the oligodT tail. For each miR, expression signals were calculated by the formula 42 – Ct(miR-X).
Dual-luciferase assay was conducted using psiCHECK-2 dual luciferase plasmid (Promega) harboring the relevant sequencing in its 3′UTR (GeneScript). Hep3B cells were plated at a density of 4000 cells per well, on white collagen coated plates (with a transparent bottom). Cells were transfected the next day with plasmid baring the relevant 3′UTR, and with or without antisense oligo for the relevant small RNA sequence. Transfection was carried out using 0.3 µl Lipofectamine 2000/well (Invitrogen, Cat# 11668027). Luminescence was assayed 24 h after Transfection, using the Dual Luciferase reporter assay kit (Promega, Cat#E1961). Luminescence was read on Luminoskan Ascent (Thermo). Firefly luciferase from the same plasmid was used for normalization of transfection efficiency. A plasmid vector without 3′UTR alteration in the renilla 3′UTR was used as a reference for constitutive luciferase expression. Results were shown as the ratio between the various treatments and cells transfected with an empty vector.
In order to identify new cancer-related miRNAs, 23 formalin-fixed paraffin-embedded (FFPE) samples of primary solid tumors were obtained from the following tumor tissues: breast (n = 5), bladder (n = 5), colon (n = 7) and lung (n = 6) (samples are described in Supplementary Table S1). Total RNA enriched of small RNA was extracted, RNA quality was examined by hybridization to a custom miRNA microarray. Tissue-specific miRNAs of the tissues sampled were clearly expressed in the microarray experiment, supporting the high quality of the RNA (Supplementary Figure S1). Next, the RNA samples from breast and bladder were pooled together, and RNA samples from colon and lung were pooled together. Small RNA (17–25 nt) was separated and small RNA libraries were prepared and sequenced using 454 Life Sciences technology.
The sequencing process yielded 141 023 sequences from the bladder+breast tumor pool and 90 986 sequences from the colon+lung pool. After combining identical sequences, 27 968 unique sequences remained, 81% of which are 18- to 26-bp long, demonstrating that the libraries are highly enriched in small RNAs. This small RNA size fraction accounts for 93% of all redundant sequences. Sequences were mapped to the human genome using the Blast program (16), allowing up to three mismatched base pairs and indels of up to 3 bp. This process yielded 723 485 genomic loci of mapped sequences. Eighty-three percent of the unique sequences were mapped to the human genome using these criteria and 59% of the unique sequences were mapped with maximum 1 nt mismatch. We postulate that some mismatches, mainly in the 5′ and 3′ edges of the sequences, could result from inaccurate removal of the sequence-flanking adaptors. Subsequently, 565 224 clusters of sequences with genomic position overlap were created. For example, if sequence X was mapped to positions 1-20 within the plus strand of chromosome 1 and a sequence Y was mapped to positions 15–35 on the same chromosome and strand, then the two sequences were unified in the same genomic cluster of chromosome 1, plus strand, positions 1–35. The clusters of sequences represent segments of expressed genes. The statistics of the sequencing and genome mapping procedures are summarized in Table 1.
We first mapped the sequenced reads to known miRNAs from miRBase database (11) according to genomic position overlap, inclusion of sequenced reads in mature miRNA sequences or inclusion of mature miRNA sequences in the sequenced reads. The small RNA libraries were found to be enriched with human miRNAs. Known miRNAs occupy 61% (140 255/230 740) of the total small RNA reads. Out of 885, 387 (44%) human miRNAs were sequenced in at least one read in the different tumor libraries. However, the expression of known miRNAs is very heterogeneous, ranging from 1 read to 25 780 reads. The 10 most abundantly expressed miRNAs in both tumor libraries, are presented in Table 2. Most of these miRNAs were indeed expected to be highly expressed in the tumor tissues screened. hsa-miR-21 is a well known oncogene that is expected to be abundant in solid tumors. We also observed high expression of the miR-141-200 family that is known to be expressed in epithelial tumors similarly to the tumors we screened (3). In addition, we sequenced dozens of rare miRNAs that were only found to be lowly expressed in specific tissues and lack independent validation in other tissues or by other methods. These rare miRNAs include miRNAs that were identified specifically in chronic lymphocytic leukemia cells (17), embryonic stem cells (5,7), colorectal cancer cells (18), or cervical cancer cells (19), detailed in Supplementary Table S2. The expression of these rare miRNAs in tumor tissues enhances their reliability as true miRNAs.
Most miRNAs were sequenced in several sequence variants that were previously referred to as isomiRs (7). The different isomiRs were predominantly variable in the 3′ end of the mature miRNA sequence, a region which is less precisely defined than the miRNA 5′ end (20). Although we expected to find different isomiRs for most miRNAs, we found two surprising phenomena. First, for 74 known miRNAs the most abundant isomiR in our cancer tissue survey was much more abundant (at least 20%) than the reference miRNA sequence from miRBase database (Supplementary Table S3). This suggests that the relative abundance of isomiRs may be inherently different between normal tissue and tumors. Second, 59 known miRNAs had an abundant isomiR with at least one mismatch to the human genome sequence, suggestive of the discovery of novel miRNA-related SNPs/cancer mutations or post-transcriptional modification of the miRNAs (Supplementary Table S4). All these isomiRs were expressed in at least the same number of reads as the miRBase isomiR and seven of these miRNAs were supported by at least 10 reads. Examples are shown in Figure 1. Most of the sequence modifications (69%) occurred in the 3′-end of the miRNA and involved either DNA base modification, 3′ uridylation (Figure 1A and B) or 3′ adenylation (Figure 1C and D). 3′ additions of G or C were completely absent in our data. The high abundance and the specificity of the 3′ terminal single nucleotide insertions suggest that these are regulated post-transcriptional modifications and not DNA-level changes (SNPs/mutations), which are expected to occur in a more random manner. Interestingly, the common 3′ uridylation of hsa-miR-143 (Figure 1A) and the common 3′ adenylation of hsa-miR-100 (Figure 1C) were also identified by other independent methods of small RNA library preparation and sequencing (7,18,21) supporting the idea that these modifications are the result of meaningful cellular processes and not merely a technical artifact. The 3′ adenylation/uridylation described is unlikely to be the result of incorrect adaptor removal, as the 3′ adaptors that were used start with cytosine. We also noted several sequence modifications that occur internally within the miRNA sequence (Figure 1E). These isomiRs demonstrated primarily (77%) C → T or A → T nucleotide modifications, again suggesting involvement of post-transcriptional RNA editing by cytidine deaminase or ADAR enzymes, respectively, contrary to DNA level changes. We cannot exclude the possibility that the rare sequence modifications, sequenced only in few reads, are sequencing errors.
Next, we set out to identify novel miRNAs expressed in the tested cancer tissues. Our miRNA discovery pipeline (Figure 2) led to the identification of four different groups of novel miRNAs and miRNA sized small RNAs: (i) miRNAs derived from known miRNA precursors, (ii) miRNAs derived from novel miRNA precursors, (iii) miRNA sized small RNAs derived from annotated small RNA and genomic repeats, (iv) Endogenous siRNA sequences. The first group (Supplementary Table S5) mainly consisted of the complementary miRNA (miRNA star) sequences of known human miRNAs. These are ~22 nt RNA species nearly complementary to a known miRNA, which are located within the miRNA precursor and which may have an inhibitory activity (22). We identified 18 such novel miRNA star species. In several cases, e.g. hsa-miR-1307-5p and hsa-miR-412-5p, the novel complementary mature miRNA was more abundant than the known miRNA, suggesting that the miRNA identified here is the major active product of the miRNA precursor, at least in the tested tumor samples. In addition we identified seven cases of miRNA-offset RNAs (MORs), a miRNA-like group that was recently characterized (23). MORs are part of the miRNA precursor and are processed from a ~22 bp dsRNA region directly upstream to the miRNA-miRNA star dsRNA region (an example of a sequenced 5′-MOR can be seen in Figure 3A). All MORs sequenced in the human tumors are highly conserved, derived exclusively from the 5′ stem of the miRNA precursor directly upstream to the 5′ miRNA, and lowly expressed relative to the main miRNA product of the precursor. These findings are in accordance with previous results (24). The MORs identified here tend to be located in a region of lower dsRNA stability than the main miRNA-miRNA star pair of the miRNA precursor (Supplementary Figure S2). Therefore, the miRNA precursor of a MOR may switch between different folded RNA structures, only part of which accommodates the MOR in a dsRNA region that would be processed by the canonical miRNA pathway. This may explain the relatively low expression of MORs in comparison to the main mature miRNAs of the precursors. In the case of hsa-miR-410 and hsa-miR-326 we sequenced the 5′ MORs, whereas the miRNA stars of the precursors were not sequenced. Several of the new miRNA star and MOR species identified here were recently identified by an independent study (24), however they do not yet appear in miRBase.
The second group contained completely novel miRNAs from novel miRNA precursors. In this case we used only reads that were exactly mapped to the genome. Reads that were mapped to more than 10 loci were filtered out, since human miRNAs rarely map to more than a few genomic loci. Other reasons for which sequences were discarded include their rare occurrence (i.e. very few reads), length exceeding normal miRNA length and GC content (%GC) higher than the %GC of known miRNAs. After filtering out by these criteria, as well as filtering sequences located within already annotated sequences (known miRNAs, other small RNAs, transposons, coding exons), we predicted miRNA precursors by folding several hundred base pairs flanking the final miRNA candidates using RNAfold (25). In order to reduce the number of false positive predictions, we kept only predicted miRNA precursors that were either evolutionarily conserved or had structural features of known miRNAs. Such structural features include limited lengths of bulges and loops and low folding energy. A miRNA precursor score was computed by integrating these parameters (14). This process resulted in the identification of 20 novel miRNAs (Supplementary Table S6). An example of a novel miRNA precursor, which yields the mature miRNA MID-20989, can be seen in Figure 3B. miRNA MID-20989 is more abundant (16 reads) than hsa-miR-338 (five reads), the product of hsa-mir-338, which is located antisence to the MID-20989 precursor. Although we filtered out sequences within annotated small RNAs, we noticed that two new miRNAs that correspond to a miRNA–miRNA star pair of the same miRNA precursor (MID-16049 and MID-18078) are located within a HBII-82B snoRNA. Several miRNA have previously been identified by other groups as having been derived from snoRNAs (26–28), which makes it plausible that these two sequences indeed function as miRNAs.
The third group contained miRNA sized sequences derived from annotated small RNAs and genomic repeats. Several miRNAs (e.g. hsa-miR-28, hsa-miR-548 family) were previously described as having been derived from such genetic elements (26–30). Sequences whose length exceeded the conventional size of miRNAs (17–25 bp) were discarded. MiRNA precursors were predicted using RNAfold and mFold and the precursor score described above. Finally, only sequences with at least 10 reads were taken, in order to ensure that the identified novel miRNAs were likely to be consistent products of enzymatic excision and not rare degradation products. This strict criterion was used for this group only as these derive from known RNA species that are often highly expressed and their degradation products are expected to be found in the cell, therefore their re-annotation as miRNAs needs stronger evidence. This process revealed three novel miRNA sized sequences (Supplementary Table S7). One of the candidates in this group, MID-24078, is derived from a local hairpin-fold of an Alu repeat. The other two (MID-19433, MID-19434) are, interestingly, derived from Y RNAs. Y RNAs are relatively unexplored non-coding RNA species that are implicated in chromosomal DNA replication (31) and RNA quality control (32). MID-19434 is a 25-nt long RNA derived from a ~100-nt long hY3 RNA like sequence. MID-19434 was highly expressed, with 200 sequenced reads, which is more abundant than over 300 known miRNAs sequenced in the tumor samples analyzed here. The predicted well-folded precursor of this miRNA was precisely aligned to the hY3 RNA (Figure 3C), suggesting that the Y RNA is processed, possibly by Dicer, to yield a 25-bp mature miRNA. MID-19433 is derived from hairpin-folded hY1 Y RNA. Following this finding, we went back to miRBase and found that two relatively newly known miRNAs, hsa-miR-1975 and hsa-miR-1979, which were also sequenced in this study, are actually Y RNA-derived miRNAs. We next set out to explore whether these Y RNA-derived miRNA candidates had gene silencing activity, similarly to known miRNAs. Therefore, we designed a Luciferase assay experiment. This experiment aimed to verify reduced Luciferase activity of a transfected Luciferase gene carrying a complete complementary sequence to the Y RNA-derived miRNA candidates within its 3′ UTR. However, this experiment failed to show gene silencing activity of the miRNA candidates tested, at least in Hep3B cells that were endogenously expressing these sequences (Supplementary Figure S3). This result challenges our hypothesis that the Y RNA derived miRNA sized sequences (including hsa-miR-1975 and hsa-miR-1979) are bona fide miRNAs. A plausible explanation for the high abundance of these miRNA sized RNA species is that these are the specific 20–25 bp Y RNA apoptotic degradation products that remain intact since they are protected by the Ro60 protein (33). However, the high abundance and sequence stability of these small RNAs may suggest that these small RNAs may have a yet unknown function.
Finally, we have also identified a candidate human siRNA. Endogenous siRNA were recently described in mouse oocytes (34), but have not yet been identified in the human transcriptome. These are ~21-bp long RNA species that are processed from a dsRNA by Dicer and assembled in the RNA induced silencing complex (RISC). Figure 3D depicts the candidate human endogenous siRNA we identified in the studied cancer libraries (see also Supplementary Table S8). This is a ~20-nt dsRNA that could be derived from bi-directional transcription of the same locus. Six sequenced reads are transcribed in the same orientation as a mitochondrial tRNA as well as a tRNA-derived pseudogene in several chromosomes, which is the more likely source of the siRNA sequences. Their transcription starts in the transcription start site of the tRNA, suggesting that these sequences are processed from the tRNA transcripts. Two antisense reads create a dsRNA with a short 5′ overhang, as opposed to common siRNA which are characterized by a 3′ overhang. The sense and antisense reads are mapped to nine different genomic loci. Therefore, it is also possible that the complementary sequences were derived from independent single-stranded RNAs and not from a hybridized dsRNA.
In order to verify the expression of the novel miRNAs in another independent platform, and to identify differentially expressed miRNAs in specific tissues or tumors, we designed a custom microarray. The microarray contained probes designed for deep sequencing reads that passed minimal criteria (see Materials and methods section), including 44 of the novel miRNA and small RNAs identified above. The microarray also contained probes for known miRNAs. The microarray was hybridized to 38 different samples (Supplementary Table S9) from 17 different tissue types that included tumor, tumor adjacent, normal and metastasis samples, as well as blood samples. A total of 684 probes were expressed in at least one sample, out of which 584 had %GC<0.75, which would ensure more reliable hybridization, and 244 had a deep sequencing count of 2 or above (in the two libraries combined). Table 3 shows the distribution of miRNAs by each of the above criterion, for known miRNAs, and miRNAs detected by deep sequencing. Only 23% of the known miRNAs were detected in both platforms, i.e. by the microarray and the deep sequencing. Thus, although only a small fraction of the newly defined miRNAs were detected by both platforms, a larger fraction (roughly four times this fraction) could nonetheless be identified as true miRNAs.
The correlation between deep sequencing and microarray signals is significant for all sequence types, but is much higher for known miRNAs (Table 4). Sequences identified as novel miRNAs have a higher correlation than the overall correlation of deep sequencing reads. This strengthens the notion that many of these are indeed true miRNAs.
The expression pattern of the newly identified miRNAs in different tissues is shown in Figure 4. Some of the new miRNAs are expressed in similar levels as miRNAs with a known role in cancer, suggesting that some of the new miRNAs may also play an important role in cancer. For example, comparison of colon tumor versus adjacent normal colon tissue (Figure 4A) supports the known upregulation of hsa-miR-20a (35–37) and hsa-miR-17 (37) in colon cancer. In addition, nine new miRNAs were expressed in both colon tumor and tumor-adjacent colon tissues in high and comparable levels to known miRNAs. Four of these were expressed at least 1.5-fold higher in colon tumor. Figure 4B displays a comparison between lung cancers versus nine non-lung tumor types. hsa-miR-200c, hsa-miR-141 and hsa-miR-205 are miRNA that are known biomarkers of epithelial tumors (3). A new miRNA, MID-19667, which is the star sequence of hsa-miR-663, demonstrates lung tumor tissue-specificity, together with six additional novel miRNAs. Figure 4C depicts a comparison between primary breast tumor and breast metastasis to the lymph node. hsa-mir-130b is known to be upregulated in breast metastasis (38). Three novel miRNAs were expressed higher in non-metastatic breast tumor. Finally, several novel miRNAs were abundantly expressed across many tumor types (Figure 5). MID-19433 and MID-19434 (the Y RNA derived small RNAs) are, notably, expressed in comparable levels to the expression of hsa-miR-21, a well-established oncomiR, supporting their potential role as oncogenic small RNAs. Five of the new miRNAs were also expressed in the serum of healthy people (Figure 6) using a third platform (RT–PCR). Four of the novel miRNAs and miRNA sized small RNAs (MID-19433, MID-19434, MID-17356 and MID-16489) were expressed in all three platforms used.
Finally, we set out to identify the target genes of several of the novel miRNAs. We used the TargetScan program (39) in order to predict target genes. Since the novel miRNAs were identified in tumor tissues, we hypothesized that these may have a potential role in carcinogenesis. Therefore, we crossed the predicted target genes with cancer-related genes from the cancer gene census (40). This analysis yielded a list of 16 402 predicted targets for the 49 candidate miRNA and miRNA sized small RNAs from deep sequencing (Supplementary Table S10). We identified that some novel miRNAs have multiple binding sites in known tumor suppressor genes or oncogenes, e.g. TP53 (MID-22124, MID-16049, MID-18078 and MID-20830), NRAS (MID-16770) and KRAS (MID-16489, MID-17963, MID-24263 and MID-32019). All predicted targets, not only cancer related genes, are shown in Supplementary Table S10.
In this study a large set of 23 human tissue samples from four different tumor types was screened for small RNAs using deep sequencing. Nearly 400 known miRNAs were detected and 49 novel miRNAs and miRNA sized small RNA sequences were identified. Further support is provided for the expression of 10 novel sequences in a different platform (10 by microarray and five also by RT–PCR) and in a broader range of blood, normal and cancer tissues that were not surveyed by deep sequencing. Some of the novel sequences are expressed differently between different tissues, such as tumor and adjacent normal tissue. Novel types of miRNA sized sequences are reported here, revealing new small RNA groups, including Y-RNA-derived small RNAs, and putative endogenous siRNAs. In addition we identified a large variety of new isomiRs, many of which demonstrate DNA change (SNP/cancer-related mutation) or post-transcriptional modification.
Deep sequencing is a useful tool that was previously used to uncover unknown small RNA groups, such as piRNAs, murine endogenous siRNAs, and mirtrons (34,41–43). Here we report the identification of two novel small RNA groups from human solid tumors. We estimate that these novel sequences were ignored in former deep sequencing analyses due to the following reasons: (i) Low expression—the putative endogenous siRNA were sequenced in <10 reads each. (ii) Location within annotated sequences—as part of deep sequencing analysis, annotated reads are as a rule filtered out in order to identify new sequences. The endo-siRNA is located within tRNA-derived pseudogene; Y-RNA derived miRNAs are located within annotated small cytoplasmic Y-RNA. This former annotation may have caused the overlooking of these sequences when analyzed by others before. (iii) Deep sequencing of unexplored tumor tissues. Relatively few human tumor tissues were deeply sequenced till now. Specific expression of the identified sequences in these tissues may suggest that the described sequence novelties are related to carcinogenesis of solid tissues, e.g. Y-RNA derived small RNAs may be the result of aberrant processing of Y-RNA in cancerous tissue. In addition we identified other recently identified human miRNA sized species such as: MORs (24), antisense transcribed miRNA (4), and snoRNA-derived miRNA (26–28).
The two novel small RNAs that are most abundantly expressed in different tumors in all platforms (high throughput sequencing, microarray, and RT–PCR), MID-19433 and MID-19434, are derived from small cytoplasmic Y RNA. However, the microarray and RT–PCR cannot differentiate between intact Y RNA and small RNA derived from Y RNA. Therefore, further validation is needed in order to rule out the possibility that the bulk of the signals recorded in these platforms are due to unprocessed Y RNAs. The potential importance of these two novel miRNA sized sequences in tumorgenesis is supported by a recent work reporting that the Y RNAs hY1 and hY3, that are the unprocessed Y RNAs of MID-19433 and MID-19434, respectively, are overexpressed in carcinomas of the bladder, cervix, colon, kidney, lung and prostate (44). Therefore, measuring the highly expressed small RNAs from these Y RNAs can be used for molecular diagnostics of these cancers. The fact that these were also detected in serum confers their potential usage in non-invasive assays.
In this survey, we report the identification of dozens of new highly abundant isomiRs, including some isomiRs that demonstrate either existence of novel SNP or clear RNA modification. Several small-scale variations in miRNA genes were implicated before with carcinogenesis of various tissues, such as breast cancer (45,46), pancreatic cancer (47), lung cancer (48) and thyroid cancer (49). Additional data of isomiRs expression in normal tissues is needed in order to gain insight whether the novel isomiRs are related to the carcinogenesis of solid tumors. Some of the novel isomiRs were found to be significantly more abundant in this study than their reference miRNA sequences in miRBase database. Therefore, we suggest that in these cases the reference miRNA sequence should be considered for revision to the most abundantly expressed isomiR described here.
Although the number of novel miRNAs identified in this study is relatively high, it is important to note that the novel miRNAs are expressed on average in much lower levels than known miRNAs, with the exception of the Y-RNA derived small RNAs that are abundantly expressed in both tissue pools. Our work joins recently published works that reported a similar result regarding scarcity of newly identified miRNAs, which calls for additional verification in order to establish functional relevance (50). The fact that miRNAs discovered in the last two years using very sensitive methods are generally of low expression or are tissue-specific, suggests that the more abundant and non-specific miRNAs have mostly been identified already. However, as this work demonstrates, using a sensitive method of next generation sequencing on RNA extracted from tumors, in addition to careful computational analysis and followed by verification experiments can identify yet unknown sequences such as the new miRNAs, MORs, Y-RNA derived sequences and endogenous siRNAs presented in this analysis. The identification of such tumor-specific small RNAs, could lead to the development of new therapeutic targets, which may be utilized as a treatment more specific than the set of tools currently available.
GEO accession number: GSE20418.
Funding for open access charge: Internal funding.
Conflict of interest statement. Authors affiliated with Rosetta Genomics are full-time employees and/or hold equity in the company, which develops microRNA based diagnostic products and may stand to gain by publications of these findings.
Supplementary Data are available at NAR Online.
The authors thank the technicians and researchers at Rosetta Genomics for their assistance and contributions, and Ranit Aharonov, Moshe Hoshen and Shai Rosenwald from Rosetta Genomics for scientific discussions and comments on the manuscript.