It has long been assumed that DNA sequences and corresponding RNA transcripts are almost identical; a recent discovery, however, revealed widespread RNA-DNA differences (RDDs), which represent a largely unexplored aspect of human genome variation. It has been speculated that RDDs can affect disease susceptibility and manifestations; however, almost nothing is known about how RDDs are related to disease. Here, we show that RDDs are rarer in proto-oncogenes than in tumor suppressor genes; the number of RDDs in coding exons, but not in 3′UTR and 5′UTR, is significantly lower in the former than the latter, and this trend is especially pronounced in non-synonymous RDDs, i.e., those that cause amino acid changes. A potential mechanism is that, unlike proto-oncogenes, tumor suppressor genes require both alleles to be affected to cause a tumor, which ‘buffers’ these genes and lets them tolerate more RDDs.
For many years, the central dogma of molecular biology has been that RNA functions mainly as an informational intermediate between a DNA sequence and its encoded protein. But one of the great surprises of modern biology was the discovery that protein-coding genes represent less than 2% of the total genome sequence, and subsequently the fact that at least 90% of the human genome is actively transcribed. Thus, the human transcriptome was found to be more complex than a collection of protein-coding genes and their splice variants. Although initially argued to be spurious transcriptional noise or accumulated evolutionary debris arising from the early assembly of genes and/or the insertion of mobile genetic elements, recent evidence suggests that the non-coding RNAs (ncRNAs) may play major biological roles in cellular development, physiology and pathologies. NcRNAs can be grouped into two major classes based on transcript size: small ncRNAs and long ncRNAs. Each of these classes can be further divided, and novel subclasses are still being discovered and characterized. Although the small ncRNAs called microRNAs have been studied most frequently in recent years, with more than ten thousand hits in the PubMed database, evidence has recently begun to accumulate describing the molecular mechanisms by which a wide range of novel RNA species function, providing insight into their functional roles in cellular biology and in human disease. In this review, we summarize newly discovered classes of ncRNAs, and highlight their functioning in cancer biology and potential usage as biomarkers or therapeutic targets.
Non-coding RNAs; microRNAs; siRNAs; piRNAs; lncRNAs; Cancer
RNA editing is an important cellular process by which the nucleotides in a mature RNA transcript are altered to cause them to differ from the corresponding DNA sequence. While this process yields essential transcripts in humans and other organisms, it is believed to occur at a relatively small number of loci. The rarity of RNA editing has been challenged by a recent comparison of human RNA and DNA sequence data from 27 individuals, which revealed that over 10,000 human exonic sites appear to exhibit RNA-DNA differences (RDDs). Many of these differences could not have been caused by either of the two previously known human RNA editing mechanisms—ADAR-mediated A→G substitutions or APOBEC1-mediated C→U switches—suggesting that a previously unknown mechanism of RNA editing may be active in humans. Here, we reanalyze these data and demonstrate that genomic sequences exist in these same individuals or in the human genome that match the majority of RDDs. Our results suggest that the majority of these RDD events were observed due to accurate transcription of sequences paralogous to the apparently edited gene but differing at the edited site. In light of our results it seems prudent to conclude that if indeed an unknown mechanism is causing RDD events in humans, such events occur at a much lower frequency than originally proposed.
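The distinction drawn above, between RDDs that fit a known editing mechanism and those that do not, can be illustrated with a small Python sketch. It assumes each site is reported as a (DNA base, RNA base) pair on the transcribed strand, with U written as T as in cDNA reads. The names `classify_rdd` and `KNOWN_EDITING` are illustrative and do not come from the study, and the sketch does not attempt the paralog search described in the abstract.

```python
from itertools import permutations

BASES = "ACGT"

# Four bases give 4 * 3 = 12 ordered DNA>RNA substitution types.
ALL_RDD_TYPES = [f"{dna}>{rna}" for dna, rna in permutations(BASES, 2)]

# The two previously known human editing mechanisms named in the text.
# Inosine produced by ADAR is read as G; U is written as T in cDNA reads.
KNOWN_EDITING = {
    "A>G": "ADAR-type (A-to-I, read as G)",
    "C>T": "APOBEC1-type (C-to-U, read as T)",
}


def classify_rdd(dna_base, rna_base):
    """Label one site by whether a known editing mechanism could explain it."""
    dna_base = dna_base.upper()
    rna_base = rna_base.upper().replace("U", "T")
    if dna_base not in BASES or rna_base not in BASES:
        raise ValueError("bases must be one of A, C, G, T (or U for RNA)")
    if dna_base == rna_base:
        return "no difference"
    return KNOWN_EDITING.get(f"{dna_base}>{rna_base}",
                             "not explained by known editing")
```

A label of "not explained by known editing" only says that the substitution type does not match ADAR or APOBEC1 activity; as the abstract argues, such a site may still reflect accurate transcription of a paralogous sequence rather than a new editing mechanism.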
It has recently been shown that RNA 3′ end formation plays a more widespread role in controlling gene expression than previously thought. In order to examine the impact of regulated 3′ end formation genome-wide we applied direct RNA sequencing to A. thaliana. Here we show the authentic transcriptome in unprecedented detail and how 3′ end formation impacts genome organization. We reveal extreme heterogeneity in RNA 3′ ends, discover previously unrecognized non-coding RNAs and propose widespread re-annotation of the genome. We explain the origin of most poly(A)+ antisense RNAs and identify cis-elements that control 3′ end formation in different registers. These findings are essential to understand what the genome actually encodes, how it is organized and the impact of regulated 3′ end formation on these processes.
Comparative RNA-seq analysis of two related pathogenic and non-pathogenic bacterial strains reveals a hidden layer of divergence in the non-coding genome as well as conserved, widespread regulatory structures called ‘Excludons’, which mediate regulation through long non-coding antisense RNAs.
Comparative transcriptome sequencing of two closely related bacterial strains reveals a hidden layer of divergence in the non-coding genome. Pathogen-specific non-coding RNAs, which might contribute to virulence, are revealed. The Listeria genome contains a class of unusually long antisense RNAs (lasRNAs) which span divergent genes and repress expression of the genes located opposite to them while activating the other. The genetic organization of these lasRNAs and operons was named an excludon. The exhaustive transcriptome information from this publication is provided as an open resource with a web-accessible transcriptome browser.
Listeria monocytogenes is a human, food-borne pathogen. Genomic comparisons between L. monocytogenes and Listeria innocua, a closely related non-pathogenic species, were pivotal in the identification of protein-coding genes essential for virulence. However, no comprehensive comparison has focused on the non-coding genome. We used strand-specific cDNA sequencing to produce genome-wide transcription start site maps for both organisms, and developed a publicly available integrative browser to visualize and analyze both transcriptomes in different growth conditions and genetic backgrounds. Our data revealed conservation across most transcripts, but significant divergence between the species in a subset of non-coding RNAs. In L. monocytogenes, we identified 113 small RNAs (33 novel) and 70 antisense RNAs (53 novel), significantly increasing the repertoire of ncRNAs in this species. Remarkably, we identified a class of long antisense transcripts (lasRNAs) that overlap one gene while also serving as the 5′ UTR of the adjacent divergent gene. Experimental evidence suggests that lasRNA transcription inhibits expression of one operon while activating the expression of another. Such a lasRNA/operon structure, which we named ‘excludon’, might represent a novel form of regulation in bacteria.
comparative genomics; Listeria monocytogenes; RNA-seq; transcriptome; TSS map
In recent years, the introduction of massively parallel sequencing platforms for Next Generation Sequencing (NGS) protocols, able to simultaneously sequence hundreds of thousands of DNA fragments, has dramatically changed the landscape of genetics studies. RNA-Seq for transcriptome studies, ChIP-Seq for DNA-protein interactions, and CNV-Seq for large genome nucleotide variations are only some of the intriguing new applications supported by these innovative platforms. Among them, RNA-Seq is perhaps the most complex NGS application. Expression levels of specific genes, differential splicing, and allele-specific expression of transcripts can be accurately determined by RNA-Seq experiments to address many biological issues. These attributes are not readily achievable with the previously widespread hybridization-based or tag sequence-based approaches. However, the unprecedented level of sensitivity and the large amount of data produced by NGS platforms provide clear advantages as well as new challenges and issues. While this technology brings great power to make new biological observations and discoveries, it also requires considerable effort in the development of new bioinformatics tools to deal with these massive data files. This paper aims to give a survey of the RNA-Seq methodology, focusing particularly on the challenges that this application presents from both a biological and a bioinformatics point of view.
Neurons modulate gene expression with subcellular precision through excitation-coupled local protein synthesis, a process that is regulated in part through the involvement of microRNAs (miRNAs), a class of small non-coding RNAs. The biosynthesis of miRNAs is reviewed, with special emphasis on miRNA families, the subcellular localization of specific miRNAs in neurons, and their potential roles in the response to drugs of abuse. For over a decade, DNA microarrays have dominated genome-wide gene expression studies, revealing widespread effects of drug exposure on neuronal gene expression. We review a number of recent studies that explore the emerging role of miRNAs in the biochemical and behavioral responses to cocaine. The more powerful next-generation sequencing technology offers certain advantages and is supplanting microarrays for the analysis of complex transcriptomes. Next-generation sequencing is unparalleled in its ability to identify and quantify low-abundance transcripts without prior sequence knowledge, facilitating the accurate detection and quantification of miRNAs expressed in total tissue and miRNAs localized to postsynaptic densities (PSDs). We previously identified cocaine-responsive miRNAs, synaptically enriched and depleted miRNA families, and confirmed cocaine-induced changes in protein expression for several bioinformatically predicted target genes. The miR-8 family was found to be highly enriched and cocaine-regulated at the PSD, where its members may modulate expression of cell adhesion molecules. An integrative approach that combines mRNA, miRNA, and protein expression profiling in combination with focused single gene studies and innovative behavioral paradigms should facilitate the development of more effective therapeutic approaches to treat addiction.
cocaine; RNA-Seq; postsynaptic density; cell adhesion; miR-8; microRNAs; synaptic plasticity
Foot-and-mouth disease virus (FMDV) uses a highly conserved Arg-Gly-Asp (RGD) triplet for attachment to host cells and this motif is believed to be essential for virus viability. Previous sequence analyses of the 1D-encoding region of an FMDV field isolate (Asia1/JS/CHA/05) and its two derivatives indicated that two viruses, which contained an Arg-Asp-Asp (RDD) or an Arg-Ser-Asp (RSD) triplet instead of the RGD integrin recognition motif, were generated serendipitously upon short-term evolution of field isolate in different biological environments. To examine the influence of single amino acid substitutions in the receptor binding site of the RDD-containing FMD viral genome on virus viability and the ability of non-RGD FMDVs to cause disease in susceptible animals, we constructed an RDD-containing FMDV full-length cDNA clone and derived mutant molecules with RGD or RSD receptor recognition motifs. Following transfection of BSR cells with the full-length genome plasmids, the genetically engineered viruses were examined for their infectious potential in cell culture and susceptible animals.
Amino acid sequence analysis of the 1D-coding region of different derivatives derived from the Asia1/JS/CHA/05 field isolate revealed that the RDD mutants became dominant or achieved population equilibrium with coexistence of the RGD and RSD subpopulations at an early phase of type Asia1 FMDV quasispecies evolution. Furthermore, the RDD and RSD sequences remained genetically stable for at least 20 passages. Using reverse genetics, the RDD-, RSD-, and RGD-containing FMD viruses were rescued from full-length cDNA clones, and single amino acid substitution in RDD-containing FMD viral genome did not affect virus viability. The genetically engineered viruses replicated stably in BHK-21 cells and had similar growth properties to the parental virus. The RDD parental virus and two non-RGD recombinant viruses were virulent to pigs and bovines that developed typical clinical disease and viremia.
FMDV quasispecies evolving in a different biological environment gained the capability of selecting different receptor recognition sites. The RDD-containing FMD viral genome can accommodate substitutions in the receptor binding site without additional changes in the capsid. The viruses expressing non-RGD receptor binding sites can replicate stably in vitro and produce typical FMD clinical disease in susceptible animals.
In the mammalian cortex, neurons and glia form a patterned structure across six layers whose complex cytoarchitectonic arrangement is likely to contribute to cognition. We sequenced transcriptomes from layers 1-6b of different areas (primary and secondary) of the adult (postnatal day 56) mouse somatosensory cortex to understand the transcriptional levels and functional repertoires of coding and noncoding loci for cells constituting these layers. A total of 5,835 protein-coding genes and 66 noncoding RNA loci are differentially expressed (“patterned”) across the layers, on the basis of a machine-learning model (naive Bayes) approach. Layers 2-6b are each associated with specific functional and disease annotations that provide insights into their biological roles. This new resource (http://genserv.anat.ox.ac.uk/layers) greatly extends currently available resources, such as the Allen Mouse Brain Atlas and microarray data sets, by providing quantitative expression levels, by being genome-wide, by including novel loci, and by identifying candidate alternatively spliced transcripts that are differentially expressed across layers.
► Online atlas of genome-wide transcription across neocortical layers ► Significant, replicated associations between disease genes and specific layers ► Widespread isoform switching across layers ► LincRNAs conserved, coexpressed across layers with neighboring protein-coding genes
Splicing is a cellular mechanism, which dictates eukaryotic gene expression by removing the noncoding introns and ligating the coding exons in the form of a messenger RNA molecule. Alternative splicing (AS) adds a major level of complexity to this mechanism and thus to the regulation of gene expression. This widespread cellular phenomenon generates multiple messenger RNA isoforms from a single gene, by utilizing alternative splice sites and promoting different exon–intron inclusions and exclusions. AS greatly increases the coding potential of eukaryotic genomes and hence contributes to the diversity of eukaryotic proteomes. Mutations that lead to disruptions of either constitutive splicing or AS cause several diseases, among which are myotonic dystrophy and cystic fibrosis. Aberrant splicing is also well established in cancer states. Identification of rare novel mutations associated with splice-site recognition, and splicing regulation in general, could provide further insight into genetic mechanisms of rare diseases. Here, disease relevance of aberrant splicing is reviewed, and the new methodological approach of starting from disease phenotype, employing exome sequencing and identifying rare mutations affecting splicing regulation is described. Exome sequencing has emerged as a reliable method for finding sequence variations associated with various disease states. To date, genetic studies using exome sequencing to find disease-causing mutations have focused on the discovery of nonsynonymous single nucleotide polymorphisms that alter amino acids or introduce early stop codons, or on the use of exome sequencing as a means to genotype known single nucleotide polymorphisms. The involvement of splicing mutations in inherited diseases has received little attention and thus likely occurs more frequently than currently estimated. 
Studies of exome sequencing followed by molecular and bioinformatic analyses have great potential to reveal the high impact of splicing mutations underlying human disease.
Chlamydia trachomatis is an obligate intracellular pathogenic bacterium that has been refractory to genetic manipulations. Although the genomes of several strains have been sequenced, very little information is available on the gene structure of these bacteria. We used deep sequencing to define the transcriptome of purified elementary bodies (EB) and reticulate bodies (RB) of C. trachomatis L2b, respectively. Using an RNA-seq approach, we have mapped 363 transcriptional start sites (TSS) of annotated genes. Semi-quantitative analysis of mapped cDNA reads revealed differences in the RNA levels of 84 genes isolated from EB and RB, respectively. We have identified and in part confirmed 42 genome- and 1 plasmid-derived novel non-coding RNAs. The genome encoded non-coding RNA, ctrR0332 was one of the most abundantly and differentially expressed RNA in EB and RB, implying an important role in the developmental cycle of C. trachomatis. The detailed map of TSS in a thus far unprecedented resolution as a complement to the genome sequence will help to understand the organization, control and function of genes of this important pathogen.
The transcriptome of a cell is represented by a myriad of different RNA molecules with and without protein-coding capacities. In recent years, advances in sequencing technologies have allowed researchers to more fully appreciate the complexity of whole transcriptomes, showing that the vast majority of the genome is transcribed, producing a diverse population of non-protein coding RNAs (ncRNAs). Thus, the biological significance of ncRNAs has been largely underestimated. Amongst these multiple classes of ncRNAs, the long non-coding RNAs (lncRNAs) are apparently the most numerous and functionally diverse. A small but growing number of lncRNAs have been experimentally studied, and a view is emerging that these are key regulators of epigenetic gene regulation in mammalian cells. LncRNAs have already been implicated in human diseases such as cancer and neurodegeneration, highlighting the importance of this emergent field. In this article, we review the catalogs of annotated lncRNAs and the latest advances in our understanding of lncRNAs.
non-coding RNAs; regulation; long non-coding RNA; epigenetics
Human papillomaviruses (HPV) cause diseases ranging from benign warts to invasive tumours. A subset of these viruses termed “high risk” infects the cervix where persistent infection can lead to cervical cancer. Although many HPV genomes have been sequenced, knowledge of virus gene expression and its regulation is still incomplete. This is due in part to lack, until recently, of suitable systems for virus propagation in the laboratory. HPV gene expression is polycistronic initiating from multiple promoters. Gene regulation occurs at transcriptional, but particularly post-transcriptional levels, including RNA processing, nuclear export, mRNA stability and translation. A close association between the virus replication cycle and epithelial differentiation adds a further layer of complexity. Understanding HPV mRNA expression and its regulation in the different diseases associated with infection may lead to development of novel diagnostic approaches and will reveal key viral and cellular targets for development of novel antiviral therapies.
Human papillomavirus; gene expression; transcription; RNA processing; translation; diagnostics; antiviral therapy
Drosophila melanogaster is one of the most well studied genetic model organisms, nonetheless its genome still contains unannotated coding and non-coding genes, transcripts, exons, and RNA editing sites. Full discovery and annotation are prerequisites for understanding how the regulation of transcription, splicing, and RNA editing directs development of this complex organism. We used RNA-Seq, tiling microarrays, and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. Together, these data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.
The role of long non-coding RNAs (lncRNAs) in controlling gene expression has garnered increased interest in recent years. Sequencing projects, such as Fantom3 for mouse and H-InvDB for human, have generated abundant data on transcribed components of mammalian cells, the majority of which appear not to be protein-coding. However, much of the non-protein-coding transcriptome could merely be a consequence of ‘transcription noise’. It is therefore essential to use bioinformatic approaches to identify the likely functional candidates in a high throughput manner.
We derived a scheme for classifying and annotating likely functional lncRNAs in mammals. Using the available experimental full-length cDNA data sets for human and mouse, we identified 78 lncRNAs that are either syntenically conserved between human and mouse, or that originate from the same protein-coding genes. Of these, 11 have significant sequence homology. We found that these lncRNAs exhibit: (i) patterns of codon substitution typical of non-coding transcripts; (ii) preservation of sequences in distant mammals such as dog and cow, (iii) significant sequence conservation relative to their corresponding flanking regions (in 50% cases, flanking regions do not have homology at all; and in the remaining, the degree of conservation is significantly less); (iv) existence mostly as single-exon forms (8/11); and, (v) presence of conserved and stable secondary structure motifs within them. We further identified orthologous protein-coding genes that are contributing to the pool of lncRNAs; of which, genes implicated in carcinogenesis are significantly over-represented.
Our comparative mammalian genomics approach coupled with evolutionary analysis identified a small population of conserved long non-protein-coding RNAs (lncRNAs) that are potentially functional across Mammalia. Additionally, our analysis indicates that amongst the orthologous protein-coding genes that produce lncRNAs, those implicated in cancer pathogenesis are significantly over-represented, suggesting that these lncRNAs could play an important role in cancer pathomechanisms.
Mitochondrial genomes are a valuable source of data for analysing phylogenetic relationships. Besides sequence information, mitochondrial gene order may add phylogenetically useful information, too. Sipuncula are unsegmented marine worms, traditionally placed in their own phylum. Recent molecular and morphological findings suggest a close affinity to the segmented Annelida.
The first complete mitochondrial genome of a member of Sipuncula, Sipunculus nudus, is presented. All 37 genes characteristic for metazoan mtDNA were detected and are encoded on the same strand. The mitochondrial gene order (protein-coding and ribosomal RNA genes) resembles that of annelids, but shows several derivations so far found only in Sipuncula. Sequence based phylogenetic analysis of mitochondrial protein-coding genes results in significant bootstrap support for Annelida sensu lato, combining Annelida together with Sipuncula, Echiura, Pogonophora and Myzostomida.
The mitochondrial sequence data support a close relationship of Annelida and Sipuncula. Also the most parsimonious explanation of changes in gene order favours a derivation from the annelid gene order. These results complement findings from recent phylogenetic analyses of nuclear encoded genes as well as a report of a segmental neural patterning in Sipuncula.
Gene expression in mitochondria of kinetoplastid protozoa requires RNA editing, a post-transcriptional process which involves insertion or deletion of uridine residues at specific sites within mitochondrial pre-mRNAs. Sequence specificity of the RNA editing process is mediated by oligo-uridylated small, non-coding RNAs, designated as guide RNAs (gRNAs). In this study, we have analyzed the small ncRNA transcriptome from kinetoplast mitochondria of Leishmania tarentolae by generating specialized cDNA libraries encoding size-selected RNA species. Through this screen, a significant number of novel oligo-uridylated RNA species, which we have termed oU-RNAs, have been identified. Most novel oU-RNAs are present as stable RNA species in mitochondria as assessed by northern blot analysis. These novel oU-RNAs show expression levels and sizes similar to those previously reported for canonical gRNAs. Several oU-RNAs are transcribed from both strands of the maxicircle and minicircle components of the mitochondrial genome, from regions where no transcription has been reported until now. Two stable oU-RNAs exhibit an anchor sequence in antisense orientation to known gRNAs and thus might regulate editing of the respective pre-mRNAs. A number of oU-RNAs map in antisense orientation to non-edited protein-coding genes, suggesting that they might function by a different mechanism. In addition, our screen shows that all kinetoplast-derived RNAs are prone to some degree of uridylation.
Backtranslation is the process of decoding a sequence of amino acids into the corresponding codons. All synthetic gene design systems include a backtranslation module. The degeneracy of the genetic code makes backtranslation potentially ambiguous since most amino acids are encoded by multiple codons. The common approach to overcome this difficulty is based on imitation of codon usage within the target species.
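The ambiguity described above can be made concrete with a minimal Python sketch. It builds the standard genetic code, counts how many DNA sequences encode a given protein, and backtranslates by picking one codon per amino acid. The `preferred` mapping stands in for a species-specific codon usage table and is a hypothetical input; when it is absent the sketch falls back to an arbitrary codon (the first in table order), which is a placeholder and not real codon usage. This is not the EasyBack method, which uses a Hidden Markov Model.

```python
from itertools import product
from math import prod

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Invert the table: amino acid -> list of synonymous codons.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def count_backtranslations(protein):
    """Number of distinct DNA sequences that encode the protein."""
    return prod(len(CODONS_FOR[aa]) for aa in protein)


def backtranslate(protein, preferred=None):
    """Choose one codon per residue; `preferred` maps amino acid -> codon
    (hypothetical codon usage table). Falls back to an arbitrary codon."""
    preferred = preferred or {}
    return "".join(preferred.get(aa, CODONS_FOR[aa][0]) for aa in protein)


def translate(dna):
    """Translate a DNA coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
```

Because leucine and serine each have six codons while methionine has one, even the tripeptide `MLS` has 36 possible coding sequences, which is why a selection criterion such as codon usage or sequence similarity is needed.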
This paper describes EasyBack, a new parameter-free, fully-automated software for backtranslation using Hidden Markov Models. EasyBack is not based on imitation of codon usage within the target species, but instead uses a sequence-similarity criterion. The model is trained with a set of proteins with known cDNA coding sequences, constructed from the input protein by querying the NCBI databases with BLAST. Unlike existing software, the proposed method allows the quality of prediction to be estimated. When tested on a group of proteins that show different degrees of sequence conservation, EasyBack outperforms other published methods in terms of precision.
The prediction quality of a protein backtranslation method is markedly increased by replacing the criterion of the most used codon in the same species with a Hidden Markov Model trained with a set of the most similar sequences from all species. Moreover, the proposed method allows the quality of prediction to be estimated probabilistically.
The exploration of the non-protein-coding RNA (ncRNA) transcriptome is currently focused on profiling of microRNA expression and detection of novel ncRNA transcription units. However, recent studies suggest that RNA processing can be a multi-layer process leading to the generation of ncRNAs of diverse functions from a single primary transcript. To date, no methodology has been presented to distinguish stable functional RNA species from rapidly degraded side products of nucleases. Thus the correct assessment of widespread RNA processing events is one of the major obstacles in transcriptome research. Here, we present a novel automated computational pipeline, named APART, providing a complete workflow for the reliable detection of RNA processing products from next-generation-sequencing data. The major features include efficient handling of non-unique reads, detection of novel stable ncRNA transcripts and processing products, and annotation of known transcripts based on multiple sources of information. To demonstrate the potential of APART, we have analyzed a cDNA library derived from small ribosome-associated RNAs in Saccharomyces cerevisiae. By employing the APART pipeline, we were able to detect and confirm by independent experimental methods multiple novel stable RNA molecules differentially processed from well known ncRNAs, like rRNAs, tRNAs or snoRNAs, in a stress-dependent manner.
It was recently shown that a new class of small nuclear RNAs is encoded in introns of protein-coding genes and that they originate by processing of the pre-mRNA in which they are contained. Little is known about the mechanism and the factors involved in this new type of processing. The L1 ribosomal protein gene of Xenopus laevis is a well-suited system for studying this phenomenon: several different introns encode for two small nucleolar RNAs (snoRNAs; U16 and U18). In this paper, we analyzed the in vitro processing of these snoRNAs and showed that both are released from the pre-mRNA by a common mechanism: endonucleolytic cleavages convert the pre-mRNA into a precursor snoRNA with 5' and 3' trailer sequences. Subsequently, trimming converts the pre-snoRNAs into mature molecules. Oocyte and HeLa nuclear extracts are able to process X. laevis and human substrates in a similar manner, indicating that the processing of this class of snoRNAs relies on a common and evolutionarily conserved mechanism. In addition, we found that the cleavage activity is strongly enhanced in the presence of Mn2+ ions.
Non-coding RNA (ncRNA) transcripts are RNA molecules that do not code for proteins, but elicit function by other mechanisms. The vast majority of RNA produced in a cell is non-coding ribosomal RNA, produced from relatively few loci; however, more recent complementary DNA (cDNA) cloning, tag sequencing, and genome tiling array studies suggest that ncRNAs also account for the majority of RNA species produced by a cell. ncRNA-based regulation has been referred to as a ‘hidden layer’ of signals or ‘dark matter’ that controls gene expression in cellular processes by poorly described mechanisms. These terms arose because ncRNAs were, until recently, ignored by expression profiling and cDNA annotation projects, and because their modes of action are diverse (e.g. influencing chromatin structure and epigenetics, translational silencing, transcriptional silencing). Here, we highlight recent functional genomics strategies toward identifying and assigning function to ncRNA transcription.
non-coding RNA; Sequencing; transcription; annotation
The transmission of information from DNA to RNA is a critical process. We compared RNA sequences from human B cells of 27 individuals to the corresponding DNA sequences from the same individuals and uncovered more than 10,000 exonic sites where the RNA sequences do not match that of the DNA. All 12 possible categories of discordances were observed. These differences were nonrandom as many sites were found in multiple individuals and in different cell types, including primary skin cells and brain tissues. Using mass spectrometry, we detected peptides that are translated from the discordant RNA sequences and thus do not correspond exactly to the DNA sequences. These widespread RNA-DNA differences in the human transcriptome provide a yet unexplored aspect of genome variation.
The availability of sequencing technology has enabled understanding of transcriptomes through genome-wide approaches including RNA-sequencing. Contrary to the previous assumption that large tracts of eukaryotic genomes are not transcriptionally active, recent evidence from transcriptome sequencing approaches has revealed pervasive transcription in many genomes of higher eukaryotes. Many of these loci encode transcripts that have no obvious protein-coding potential and are designated as non-coding RNA (ncRNA). Non-coding RNAs are classified empirically as small and long non-coding RNAs based on the size of the functional RNAs. Each of these classes is further classified into functional subclasses. Although microRNAs (miRNA), one of the major subclasses of ncRNAs, have been extensively studied for their roles in regulation of gene expression and involvement in a large number of patho-physiological processes, the functions of a large proportion of long non-coding RNAs (lncRNA) still remain elusive. We hypothesized that some lncRNAs could potentially be processed to small RNA and thus could have a dual regulatory output.
Integration of large-scale independent experimental datasets in the public domain revealed that certain well-studied lncRNAs harbor small RNA clusters. Expression analysis of the small RNA clusters in different tissue and cell types reveals that they are differentially regulated, suggesting a regulated biogenesis mechanism.
Our analysis suggests the existence of a potentially novel pathway for lncRNA processing into small RNAs. Expression analysis further suggests that this pathway is regulated. We argue that this evidence supports our hypothesis, though limitations of the datasets and analysis mean that alternative possibilities cannot be completely ruled out. Further in-depth experimental verification of the observation could potentially reveal a novel pathway for biogenesis.
This article was reviewed by Dr Rory Johnson (nominated by Fyodor Kondrashov), Dr Raya Khanin (nominated by Dr Yuriy Gusev) and Prof Neil Smalheiser. For full reviews, please go to the Reviewer’s comment section.
With the advent of transcriptome data, it has become clear that mRNA-like noncoding RNAs (mlncRNAs) are widespread in eukaryotes. Although their functions are poorly understood, these transcripts may play an important role in development and could thus be involved in determining developmental complexity and phenotypic diversification. However, few studies have assessed their potential roles in the divergence of closely related species. Here, we identify and study patterns of sequence and expression divergence in ten novel candidate mlncRNAs from Drosophila pseudoobscura and its close relative D. persimilis. The candidate mlncRNAs were identified by randomly sequencing a group of 734 cDNA clones from a microarray that showed either no difference in expression (187 clones) or differential expression (547 clones) in comparisons between D. pseudoobscura and D. persimilis and between these two species and their F1 hybrids. Candidate mlncRNAs are overrepresented among differentially expressed transcripts between males of D. pseudoobscura and D. persimilis, and although they have high sequence conservation between these two species, seven of them have no putative homologs in any of the other ten Drosophila species whose genomes have been sequenced. Expression of eight of the ten candidate mlncRNAs was detected either in whole bodies (adults) or testes using a custom-designed oligonucleotide microarray. Three of the ten candidate mlncRNAs are highly expressed (in the top 4% of the male transcriptome), differentially expressed between species, and show extreme levels of sex-bias, with one transcript having the highest level of male bias in the whole transcriptome. Proteomic data from testes show no traces of any predicted peptides from the candidate mlncRNAs. Our results suggest that these mlncRNAs may be important in male-specific processes related to sexual dimorphism and species divergence in this species group.
mlncRNA; noncoding RNA; Drosophila pseudoobscura; species divergence; sex-bias
MicroRNAs are a recently discovered class of small noncoding functional RNAs. These molecules mediate post-transcriptional regulation of gene expression in a sequence-specific manner. MicroRNAs are now known to be key players in a variety of biological processes and have been shown to be deregulated in a number of cancers. The discovery of virus-encoded microRNAs, especially from a family of oncogenic viruses, has attracted immense attention to the possibility that microRNAs are critical modulators of viral oncogenesis. The host-virus crosstalk mediated by microRNAs, messenger RNAs and proteins is complex and involves different cellular regulatory layers. In this commentary, we describe models of microRNA-mediated viral oncogenesis.