To search for evidence of additional microRNAs in H1 human embryonic stem cell cultures (hESC), we prepared RNA samples for deep sequencing using the SOLiD system for massively-parallel sequencing by ligation 
. The differentiation status of hESC cultures was assessed by culture morphology and by immunohistochemistry for standard markers (). ESC cultures () were positive for SSEA4 and Oct4 but not SSEA1, consistent with pluripotency. NSC cultures (, grown in neural induction medium including bFGF) expressed nestin but lacked β-III tubulin (TUJ1), and NRP cultures (, grown in neuronal differentiation medium lacking bFGF but including BDNF) exhibited long processes and mostly expressed β-III tubulin and not nestin. To support this interpretation, Illumina human-6 microarrays were run on single biological replicates of three cell lines (H1, HSF1, and HSF6) under two conditions (ESC, NSC). HSF1 and HSF6 were previously studied for their divergent NSC differentiation patterns 
are consistent with a pluripotent status for ESC and a neural precursor status for NSC. Focusing on 60 genes that were significantly different between ESC and NSC conditions (t-test, 5% false discovery rate [FDR], ≥1.5-fold different), individual cell lines clustered appropriately by their culture condition (). Within the full set of results (available from NIH GEO accession GSE15206
), pluripotent-specific genes such as POU5F1 (OCT4), NANOG, LIN28, DNMT3B (ICF), GABRB3 (MGC9051), and GDF3 were all reproducibly expressed at higher levels in ESC than NSC cultures, agreeing with previous expression studies in hESC lines 
. NSC cultures expressed higher levels of many mRNAs previously identified as specific for human NSC, including PAX6, FOXG1B, TH, NKX2-2, CHAT, TPH2, HDAC6, IL11RA, COL2A1 
(including genes not found to be significantly different due to cell line to cell line variance). The highest relative NSC expression markers included NR2F1, NR2F2, SEPT5, ZFHX4, and GPR162. We conclude that the cultures sampled as ESC and NSC exhibit the appropriate phenotypes.
Characterization of hESC cultures.
Small RNA sequences were determined using SOLiD on libraries prepared from ESC or NSC staged H1 hESC cultures, producing an average of 55 million 25-nucleotide sequences for each of the two conditions (Table S1
, Experiment 1). To search for previously known microRNAs within these results, we ran a simple text-matching query to count the number of sequences perfectly matching each of the human mature microRNAs found in miRBase v11 
. Of the 733 available known mature microRNA and “star” sequences, 598 microRNAs were identified among our results.
The resulting counts of small RNA sequences detected ideally reflect expression levels of individual microRNAs. This technique of digital gene expression (DGE) has been found to be a sensitive method for detecting microRNAs but one with some sequence bias based on library preparation methods 
. Comparing various DGE protocols, ratios among counts of sequences were found to be consistent, meaning that differences between multiple samples assayed by the same method should be accurate 
. A total of four SOLiD assays were run on multiple hESC cell lines and stages (Table S1
). A second slide made use of the multiplexing capability of the SOLiD system to sequence multiple samples at about 7.5 million sequences each by using primers having different barcodes. A third slide sequenced three stages of the novel hESC line RG7, including a “terminally differentiated” (NRP) culture, with an average of 46 million sequences per sample. Is counting of SOLiD-detected sequences an accurate method for measuring microRNA concentrations? We assayed two developmental stages (ESC and NSC) of one cell line (RG7) using both SOLiD and TaqMan microRNA qPCR (File S1
), which is generally accepted as an accurate method for measuring relative RNA concentrations. We found good correlation between methods when comparing fold-change from ESC to NSC (r
0.730, Pearson). We conclude that counting SOLiD sequences exhibits similar accuracy as qPCR over a broad range of concentrations. Therefore, this method is acceptable for judging relative quantities of sequences as potential mature or star strands within a precursor structure.
We performed hierarchical clustering to characterize cell lines and culture stages using counts of all detected known microRNAs. Selecting a list of 82 microRNAs as being significantly different between ESC and NSC cultures across all cell lines (t-test, 5% false discovery rate [FDR], ≥1.5-fold), results were displayed as a heat map, retaining the dendrogram of relationships among samples calculated from all expressed microRNAs, to demonstrate that cultures were distinctly different by culture stage more than by cell line (Figure S1
). Within each culture condition group, the RG7 NRP sample was most dissimilar compared with NSC and NPC cultures, but the H1 embryoid body (EB) culture was more like H1 ESC than other ESC cultures, consistent with a predominant cell line difference. The H1 cell line has been found to contain several genomic deletions (see Methods
) and therefore may represent a true biological outlier compared with the other lines. This identifies known microRNAs that are differentially expressed between the NSC/NPC precursor stage and the terminally differentiated NRP stage.
H1 hESC small RNA sequencing results from Experiment 1 were used to predict genomic microRNA precursors with miRDeep 
. This algorithm was designed to screen deep sequencing reads that have been aligned to genome for model properties consistent with Drosha/Dicer substrate processing. SOLiD sequencing produces reads in “colorspace,” where adjacent pairs of nucleotides are detected using a color-coded fluorescent signal indicating a 2-nt sequence selected by ligation 
. The color-coded reads are interpreted as specifying the “interval” between two adjacent bases. In alignments with genomic sequences, two successive colors must be interpreted consistently since they share a common nucleotide that is interrogated twice. Inconsistent color patterns cannot be aligned to reference genome and thus are discarded, improving the quality of the aligned colorspace sequences. We aligned reads to human genome in colorspace using SHRiMP, which rapidly identifies potential homologies and extends them with a Smith Waterman algorithm 
. The Smith-Waterman method is a well-established dynamic programming algorithm for finding the optimal local alignment between two sequences using a substitution matrix and a gap-scoring system. Approximately 110 million reads (Table S1
, experiment 1) produced 591 million alignments. Alignments were distributed across all chromosomes (Figure S2
To focus on high-confidence alignments, we conservatively selected those having 10 or more overlapping SOLiD small RNA reads aligned with a single genomic location. This was done in two separate datasets—one derived from H1 ESC and the other from H1 NSC samples, reasoning that expression and alignment patterns of one condition might obscure those from the other condition. We do not believe that the sequences selected were due to PCR over-amplification because most sequences aligning in overlapping positions varied in their terminal nucleotide position (not shown). A large number of reads were excluded by selecting 10 or more reads per genomic location—31,773,857 reads were excluded for ESC and 37,779,985 for NSC. Selected alignments, condensed by location, were used to extract surrounding putative precursor sequences from genome. These precursors were folded with RNAFold 
to identify characteristic hairpin structures. Individual small RNA reads, now translated from colorspace to DNA sequence, were re-aligned with putative precursor segments using BLAST 
of all analyses were input to miRDeep 
. Using a cut-off log-odds score of 1.0 (that is, results that are 10-fold more likely than random sequence to match the form of a predicted microRNA precursor, according to the miRDeep algorithm 
), we obtained 1,216 raw candidates from ESC and 4,494 candidates from NSC (). A table of all consolidated sequences comprising all the putative precursors along with genomic alignment positions is found in File S2
. These predictions covered all chromosomes with NSC predictions exhibiting higher frequencies on chromosomes 1 and X (Figure S2
). One example alignment and predicted precursor position is shown in . The grayscale wiggle track of SHRiMP-aligned colorspace sequences identifies two adjacent regions—an interval with more aligned sequences (darker) near an interval with less aligned sequences (lighter), potentially representing mature and star sequences derived from a microRNA precursor. Indeed, after RNAFold and miRDeep computation, a single predicted microRNA and precursor (bottom) is selected.
As an internal control for the prediction process, we chose not to remove sequence reads matching known microRNAs during the prediction phase. Among the miRDeep-derived candidates there are 179 and 238 known microRNAs or snoRNAs from ESC and NSC, respectively. These were removed from predictions using the Tables function of the UCSC genome browser (http://genome.ucsc.edu
). Similarly, prediction sites matching the RNA Genes track of the browser were removed, eliminating fragments of rRNA, tRNA, snoRNA, or other classes of small RNA. Finally, prediction sites matching the RepeatMasker track of the genome browser were removed so that we could focus on non-repeated genomic loci. After filtering, a total of 818 predicted microRNA genes were identified from the two conditions, ESC and NSC. Of these, 75 predictions appeared in both conditions, leaving 287 predictions specific for ESC and 456 specific for NSC (). Comparing miRDeep log-odds score distributions for recovered known microRNAs and predicted microRNAs, there is a large degree of overlap but the predicted microRNA population includes a lower range of scores (Figure S3
). Perhaps some of these 818 predicted microRNAs could be other classes of small ncRNA.
As expected, a larger proportion of these 818 predicted microRNA genes were identified from data unique to a single developmental stage than was true for known microRNAs (, Files S2
). This agrees with the hypothesis that previously identified microRNAs are more commonly expressed in multiple tissues or stages of development. Newly-identified microRNAs are likely to be more specific to tissue or developmental stage.
If these 818 predicted microRNA genes are expressed and/or regulated during early stem cell differentiation, their genomic positions should be phased with respect to chromatin marks, particularly those characteristic of developmental regulation. Using publicly-available ChIP-Seq data from the ESC stage of the H9 cell line 
, we overlapped the start site and direction of all predicted precursors from each class (ESC or NSC; ). For this summary analysis we did not examine potential clusters of microRNAs, presence of microRNA-encoding loci in introns, or the possibility of distant transcription start sites, all of which could confound a more detailed interpretation. However, the precursors predicted from ESC had a phased peak of H3K4me3 with local maximum near ~100 bp upstream of the 5′ end of the precursor and local minima ~500 bp upstream and ~800 bp downstream. No peaks were observed for H3K27me3 or H3K36me3. This pattern is similar to a plot summarizing ChIP-Seq marks for 93 known microRNAs exclusively expressed in hESC (not shown). The ChIP alignment pattern was quite distinct for NSC-derived predictions in ESC cultures. The H3K4me3 track had local maxima at ~500 bp upstream and ~400 bp downstream of the precursor 5′ end with a dip in the track near the 5′ end. To a lesser degree, H3K27me3 and H3K36me3 tracks appear to peak upstream of the precursor. Since H3K4me3 is normally considered to be indicative of active expression, the presence of marks overlapping predicted transcriptional precursors for the ESC predictions in ESC cells is consistent with gene expression. However, there is no clear indication of transcription by the H3K36me3 marks for this class of predictions. On the other hand, increased numbers of marks for H3K27me3 for the NSC predictions in ESC might indicate repression of at least a subset of the predicted precursors. Since H3K36me3 is also found surrounding predicted precursor sites, another subset may be transcribed or it may represent stalled transcription. Overall, a clear presence of epigenetic marks aligning with predicted precursor locations on genome is consistent with regulated expression during early differentiation.
As a validation strategy for potential biological function of these predicted microRNAs, we searched for small RNAs that could be associated with RISC by immunoprecipitation (IP) with anti-Ago2. Ago2 is one member of the Argonaute family found in RISC complexes 
. Available evidence suggests that binding among Ago family members is not sequence-specific 
and so Ago2 binding was expected to serve as a proxy for binding with all Ago proteins. We prepared small RNAs from immunoprecipitated cytoplasmic extracts of hESC or NSC cultures from the RG7 line. Preliminary experiments used qPCR and primers specific for a subset of the predicted microRNAs to test Ago2 IP samples. Out of 87 predicted microRNAs tested, 46 were found to be enriched by Ago2 IP over IgG IP (File S3
). Based on this pilot experiment we expanded our search using deep sequencing of the IP samples. Small RNAs were sequenced and those predicted microRNAs that were enriched at least 4-fold over IgG controls were selected, yielding 146 new microRNAs (Files S2
). A similar analysis of known microRNAs found 609 that were at least 4-fold enriched by Ago2 IP (File S2
, also see Methods S1
). Interestingly, while the majority of Ago2 IP-selected microRNAs were expressed linearly with their concentration within the Ago2 IP sample, several outliers were observed (Figure S4
). From this observation we conclude that Ago2 IP does not always merely reflect cellular concentration and therefore it may be more indicative of microRNA function than measurements of microRNA concentrations in the cell.
If Ago2 IP reliably selected microRNAs among other varieties of small RNA sequences, we expect that these sequences would be reproducibly expressed across biological replicates—in the case of human stem cells, in several independently-isolated cell lines. We quantified sequencing results from small RNA libraries (not immunoprecipitated) from samples of either ESC or NSC staged cultures of H1, HSF1, HSF6, or RG7. Results
were adjusted for sequencing depth by converting to counts per million sequences detected (cpm). All 755 Ago2 IP-selected RNAs (146 predicted and 609 known) were detected (File S2
). Additionally, the computationally-predicted microRNAs would be expected to be expressed in the stage from which it was originally identified (i.e., ESC or NSC). For the 44 microRNAs predicted based on expression in ESC, 41 were present in at least 3 of the 4 ESC cell lines and only 1 was detected in only a single line. Similarly, from the 116 microRNAs predicted based on NSC expression, 115 were detected in at least 3 of the 4 NSC lines and one was found in only two lines. The measured expression levels clearly distinguish cultures by stage. This was independently observed for all 755 RNAs, only the 609 Ago2 IP-enriched known microRNAs and the 146 Ago2 IP-enriched predicted RNAs (). Examining the distribution of expression levels for known vs. predicted microRNAs in each culture stage, we find that in ESC the predicted microRNAs include lower concentrations than the known populations but the distributions mostly overlap in NSC (Figure S5
). This matches our prediction that newly-described microRNAs would be found at lower concentrations in conditions searched previously such as ESC, but that transient developmental stages such as NSC would be a rich source of new microRNAs at any concentration. We conclude that the 146 computationally-predicted microRNAs that are Ago2 IP-enriched are consistently expressed in a biologically-diverse group of samples.
To test for regulation of predicted microRNAs over development, we analyzed expression levels for 755 microRNA (609 known and 146 predicted) that were selected by the Ago2 IP enrichment strategy. RNA was prepared from RG7 hESC cultures at three stages (ESC, NSC, and NRP), induced pluripotent cells obtained from WiCell cultured at two stages (ESC, NSC), and three adult human tissues (brain, heart and kidney). Sequencing libraries were constructed and sequenced (Table S1
). Counted observations were normalized by calculating the counts per million beads detected (File S2
). In general, microRNA expression levels correlated samples according to their developmental stage (). That is, ESC samples from RG7 hESC were similar to an iPS culture obtained from WiCell. RG7 NSC samples were more like NRP and iPS NSC cultures than other tissues. Similar results were obtained by comparing iPS cultures with other hESC lines (not shown).Three adult tissues were least correlated. This pattern was similar for the Ago2 IP-enriched known microRNAs and for the Ago2 IP-enriched predicted microRNAs, except that the predicted microRNAs grouped brain and iPS NSC samples together. We observed that these iPS cells were more likely to spontaneously differentiate based on culture morphology and that these cells were more likely to produce neural markers upon differentiation (not shown), consistent with the increased correlation of iPS NSC cultures with adult brain with predicted (correlation coefficient of 0.830) but not known microRNAs (0.385, ). This is likely due to a more selective expression of newly-identified microRNAs in different tissues or developmental stages.
were also K-means clustered to group similar expression patterns for all 755 Ago2 IP-selected microRNAs, using the Pearson correlation of log values as the metric. From the optimal fit of 11 clusters (Figure S6
), we observe distinct expression patterns for developmental stages or tissues (). Figure S7
includes all individual expression patterns for all 755 microRNAs analyzed, color-coded by cluster membership. Two clusters (2 and 6) included microRNAs primarily expressed in pluripotent cultures and these clusters included mir-302 family members as expected. Other clusters are primarily specific for adult tissues or for neural precursor cultures. Each cluster included both known and predicted microRNAs. Therefore, the 146 predicted microRNAs selected by Ago2 IP are found to be regulated over differentiation as predicted, but in patterns similar to those of known microRNAs.
Since microRNAs were originally identified by strong evolutionary homology, we surmised that newly-described microRNAs observed in hESC could be less conserved, consistent with a more recent evolutionary appearance, as has been predicted 
. We employed a novel metric, SiPhy 
, to assess the substitution rate of each new microRNA across a genomic alignment of 29 mammalian species (manuscript in preparation
). The substitution rates (ω) were determined relative to the substitution rates of ancestral repeats, a model for neutral evolution (). As expected, miRDeep predicted microRNAs (, red) as a class demonstrate higher substitution rates than previously classified ‘known’ microRNAs (, black), suggesting that there is less evolutionary constraint on these sequences. A similar distribution of ω is observed for the subset of new microRNA sequences that were confirmed through direct association with Ago2 (, blue). The new microRNA sequences described here demonstrate a reduced substitution rate from a neutral model (, grey), but are significantly less conserved than the population of known microRNAs. Since previous microRNA annotation efforts have required a high degree of conservation 
, it is clear that the majority of these new sequences would not have met this stringent criterion, despite their physical association with the RISC complex.