We compare the sets of experimentally validated long intergenic non-coding (linc)RNAs from human and mouse and apply a maximum likelihood approach to estimate the total number of lincRNA genes as well as the size of the conserved part of the lincRNome. Under the assumption that the sets of experimentally validated lincRNAs are random samples of the lincRNomes of the corresponding species, we estimate the total lincRNome size at approximately 40,000 to 50,000 species, at least twice the number of protein-coding genes. We further estimate that the fraction of the human and mouse euchromatic genomes encoding lincRNAs is more than twofold greater than the fraction of protein-coding sequences. Although the sequences of most lincRNAs are much less strongly conserved than protein sequences, the extent of orthology between the lincRNomes is unexpectedly high, with 60 to 70% of the lincRNA genes shared between human and mouse. The orthologous mammalian lincRNAs can be predicted to perform equivalent functions; accordingly, it appears likely that thousands of evolutionarily conserved functional roles of lincRNAs remain to be characterized.
Genome analysis of humans and other mammals reveals a surprisingly small number of protein-coding genes, only slightly over 20,000 (although the diversity of actual proteins is substantially augmented by alternative transcription and alternative splicing). Recent analysis of mammalian genomes and transcriptomes, in particular using RNAseq technology, shows that, in addition to protein-coding genes, mammalian genomes encode many long non-coding RNAs. For some of these transcripts, various regulatory functions have been demonstrated, but on the whole the repertoire of long non-coding RNAs remains poorly characterized. We compared the identified long intergenic non-coding (linc)RNAs from human and mouse, and employed a specially developed statistical technique to estimate the size and evolutionary conservation of the human and mouse lincRNomes. The estimates show that there are at least twice as many human and mouse lincRNAs as there are protein-coding genes. Moreover, about two thirds of the lincRNA genes appear to be conserved between human and mouse, implying thousands of conserved but still uncharacterized functions.
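As a rough intuition for how the overlap between random samples constrains the size of the full set, the sketch below works through a textbook capture-recapture likelihood in Python. It is not the statistical technique developed in the study, which treats two species and is not specified here. The single-population setup, the function names, and the counts in the usage line are all illustrative assumptions.

```python
from math import lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def log_likelihood(total, n1, n2, overlap):
    """Hypergeometric log-probability of `overlap` shared items when a random
    sample of size n2 is drawn from `total` items, n1 of which were already
    seen in a first random sample."""
    return (log_choose(n1, overlap)
            + log_choose(total - n1, n2 - overlap)
            - log_choose(total, n2))


def mle_total(n1, n2, overlap, max_total=100_000):
    """Grid-search maximum likelihood estimate of the unknown total."""
    if overlap <= 0:
        raise ValueError("an empty overlap gives no finite estimate")
    lowest = n1 + n2 - overlap  # fewest items consistent with the data
    return max(range(lowest, max_total + 1),
               key=lambda total: log_likelihood(total, n1, n2, overlap))


# Made-up counts: two samples of 100 and 80 items sharing 30.
estimate = mle_total(100, 80, 30, max_total=2000)
print(estimate)
```

The maximizer sits just below n1 * n2 / overlap, so a small overlap between two large samples points to a large underlying set. This is the qualitative logic behind inferring a large lincRNome from validated subsets.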
Messenger RNA is a key component of an intricate regulatory network of its own. It accommodates numerous nucleotide signals that overlap protein coding sequences and are responsible for multiple levels of regulation and generation of biological complexity. The wealth of structural and regulatory information that mRNA carries in addition to the encoded amino acid sequence raises the question of how these signals and overlapping codes are delineated along non-synonymous and synonymous positions in protein coding regions, especially in eukaryotes. Silent or synonymous codon positions, which do not determine amino acid sequences of the encoded proteins, define mRNA secondary structure and stability and affect the rate of translation, folding and post-translational modifications of nascent polypeptides. RNA-level selection acts on synonymous sites in both prokaryotes and eukaryotes and is more common than previously thought. Selection pressure on the coding gene regions follows the three-nucleotide periodic pattern of nucleotide base-pairing in mRNA, which is imposed by the genetic code. Synonymous positions of the coding regions have a higher level of hybridization potential relative to non-synonymous positions, and are multifunctional in their regulatory and structural roles. Recent experimental evidence and analysis of mRNA structure and interspecies conservation suggest that there is an evolutionary tradeoff between selective pressure acting at the RNA and protein levels. Here we provide a comprehensive overview of the studies that define the role of silent positions in regulating RNA structure and processing that exert downstream effects on proteins and their functions.
The 1,021,348 base pair genome sequence of the Acanthamoeba polyphaga moumouvirus, a new member of the Mimiviridae family infecting Acanthamoeba polyphaga, is reported. The moumouvirus represents a third lineage beside mimivirus and megavirus. Thereby, it is a new member of the recently proposed Megavirales order. This giant virus was isolated from cooling tower water in southeastern France but is most closely related to Megavirus chiliensis, which was isolated from ocean water off the coast of Chile. The moumouvirus is predicted to encode 930 proteins, of which 879 have detectable homologs. For 702 of these predicted proteins, the closest homolog was detected in Megavirus chiliensis, with a median amino acid sequence identity of 62%. The evolutionary affinity of moumouvirus and megavirus was further supported by phylogenetic tree analysis of conserved genes. The moumouvirus and megavirus genomes share near perfect orthologous gene collinearity in the central part of the genome, with the variations concentrated in the terminal regions. In addition, genomic comparisons of the Mimiviridae reveal substantial gene loss in the moumouvirus lineage. The majority of the remaining moumouvirus proteins are most similar to homologs from other Mimiviridae members, and for 27 genes the closest homolog was found in bacteria. Phylogenetic analysis of these genes supported gene acquisition from diverse bacteria after the separation of the moumouvirus and megavirus lineages. Comparative genome analysis of the three lineages of the Mimiviridae revealed significant mobility of Group I self-splicing introns, with the highest intron content observed in the moumouvirus genome.
moumouvirus; mimivirus; giant virus; megavirus; Mimiviridae; Megavirales; horizontal gene transfer; viral genome; nucleo-cytoplasmic large DNA viruses
Small hairpin RNAs (shRNAs) have become an important research tool in cell biology. Reliable design of these molecules is essential for the needs of large functional genomics projects. To optimize the design of efficient shRNAs, we performed comparative, thermodynamic, and correlation analyses of ~18,000 miR30-based shRNAs with known functional efficiencies, derived from the Sensor Assay project (Fellmann et al., 2011). We identified features of the shRNA guide strand that significantly correlate with the silencing efficiency and performed multiple regression analysis, using 4/5 of the data for training purposes and 1/5 for cross validation. A model that included the position-dependent nucleotide preferences was predictive in the cross-validation data subset (R = 0.39). However, a model that, in addition to the nucleotide preferences, included thermodynamic shRNA features such as thermodynamic duplex stability and a position-dependent thermodynamic profile (dinucleotide free energy) performed better (R = 0.53). The “miR_Scan” software was developed based on the optimized models. Calculated mRNA target secondary structure stability showed correlation with shRNA silencing efficiency but failed to improve the model. Correlation analysis demonstrates that our algorithm for identification of efficient miR30-based shRNA molecules performs better than approaches that were developed for design of chemically synthesized siRNAs (Rmax = 0.36).
shRNA design; computational models; thermodynamic parameters; miR30-based shRNA
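The 4/5 training and 1/5 cross-validation scheme described in the shRNA abstract above can be sketched as follows. This is a minimal illustration, not the miR_Scan model: it replaces multiple regression with a simple additive table of position-dependent nucleotide preferences, and the data are synthetic. All function names and the planted first-nucleotide effect are assumptions made for the example.

```python
import random
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle, then hold out `test_fraction` of the records for validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_position_preferences(train):
    """Mean efficiency offset for every (position, nucleotide) in the training set."""
    overall = sum(eff for _, eff in train) / len(train)
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, eff in train:
        for pos, nt in enumerate(seq):
            sums[pos, nt] += eff - overall
            counts[pos, nt] += 1
    return overall, {key: sums[key] / counts[key] for key in sums}


def predict(model, seq):
    """Additive score: overall mean plus the offsets of the observed nucleotides."""
    overall, prefs = model
    return overall + sum(prefs.get((pos, nt), 0.0) for pos, nt in enumerate(seq))


def make_synthetic_records(n, length=22, seed=1):
    """Hypothetical data: efficiency depends on the first nucleotide plus noise."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        seq = "".join(rng.choice("ACGU") for _ in range(length))
        eff = (1.0 if seq[0] == "U" else 0.0) + rng.gauss(0.0, 0.3)
        records.append((seq, eff))
    return records


records = make_synthetic_records(500)
train, test = split_train_test(records)
model = fit_position_preferences(train)
r = pearson_r([predict(model, s) for s, _ in test], [e for _, e in test])
print(f"held-out R = {r:.2f}")
```

Reporting R only on the held-out fifth, as the abstract does, guards against a model that merely memorizes the training sequences.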
The majority of mammalian genes produce multiple transcripts resulting from alternative splicing (AS) and/or alternative transcription initiation (ATI) and alternative transcription termination (ATT). Comparative analysis of the number of alternative nucleotides, isoforms, and introns per locus in genes with different types of alternative events suggests that ATI and ATT contribute to the diversity of human and mouse transcriptome even more than AS. There is a strong negative correlation between AS and ATI in 5′ untranslated regions (UTRs) and AS in coding sequences (CDSs) but an even stronger positive correlation between AS in CDSs and ATT in 3′ UTRs. These observations could reflect preferential regulation of distinct, large groups of genes by different mechanisms: 1) regulation at the level of transcription initiation and initiation of translation resulting from ATI and AS in 5′ UTRs and 2) posttranslational regulation by different protein isoforms. The tight linkage between AS in CDSs and ATT in 3′ UTRs suggests that variability of 3′ UTRs mediates differential translational regulation of alternative protein forms. Together, the results imply coordinate evolution of AS and alternative transcription, processes that occur concomitantly within gene expression factories.
alternative splicing; alternative transcription initiation; alternative transcription termination; gene expression factories
Prediction of efficient oligonucleotides for RNA interference presents a serious challenge, especially for the development of genome-wide RNAi libraries which encounter difficulties and limitations due to ambiguities in the results and the requirement for significant computational resources. Here we present a fast and practical algorithm for shRNA design based on the thermodynamic parameters. In order to identify shRNA and siRNA features universally associated with high silencing efficiency, we analyzed structure-activity relationships in thousands of individual RNAi experiments from publicly available databases (ftp://ftp.ncbi.nlm.nih.gov/pub/shabalin/siRNA/si_shRNA_selector/). Using this statistical analysis, we found free energy ranges for the terminal duplex asymmetry and for fully paired duplex stability, such that shRNAs or siRNAs falling in both ranges have a high probability of being efficient. When combined, these two parameters yield a ∼72% success rate on shRNAs from the siRecords database, with the target RNA levels reduced to below 20% of the control. Two other parameters correlate well with silencing efficiency: the stability of target RNA and the antisense strand secondary structure. Both parameters also correlate with the short RNA duplex stability; as a consequence, adding these parameters to our prediction scheme did not substantially improve classification accuracy. To test the validity of our predictions, we designed 83 shRNAs with optimal terminal asymmetry, and experimentally verified that small shifts in duplex stability strongly affected silencing efficiency. We showed that shRNAs with short fully paired stems could be successfully selected by optimizing only two parameters: terminal duplex asymmetry and duplex stability of the hypothetical cleavage product, which also relates to the specificity of mRNA target recognition. 
Our approach performs at the level of the best currently utilized algorithms that take into account prediction of the secondary structure of the target and antisense RNAs, but at significantly lower computational costs. Based on this study, we created the si-shRNA Selector program that predicts both highly efficient shRNAs and functional siRNAs (ftp://ftp.ncbi.nlm.nih.gov/pub/shabalin/siRNA/si_shRNA_selector/).
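A two-parameter thermodynamic filter of the kind described above might look like the following sketch. It is not the si-shRNA Selector implementation. The stacking free energies are illustrative values patterned on published RNA nearest-neighbor tables and should be replaced with a vetted parameter set; the three-stack terminal window, the omission of initiation and end penalties, and the numeric ranges in the usage lines are all assumptions, since the abstract does not state the actual ranges.

```python
# Illustrative stacking free energies (kcal/mol), keyed by the 5'->3'
# dinucleotide on one strand of a fully paired RNA duplex.
# ASSUMPTION: approximate values; substitute a vetted parameter set.
STACK_DG = {
    "AA": -0.93, "UU": -0.93, "AU": -1.10, "UA": -1.33,
    "CU": -2.08, "AG": -2.08, "CA": -2.11, "UG": -2.11,
    "GU": -2.24, "AC": -2.24, "GA": -2.35, "UC": -2.35,
    "CG": -2.36, "GG": -3.26, "CC": -3.26, "GC": -3.42,
}


def duplex_stability(seq):
    """Sum of stacking energies over the duplex (no initiation or end terms)."""
    return sum(STACK_DG[seq[i:i + 2]] for i in range(len(seq) - 1))


def terminal_asymmetry(antisense, n_stacks=3):
    """Stability of the antisense 5' end minus that of its 3' end.

    A positive value means the 5' end is the less stably paired one.
    """
    five_prime = duplex_stability(antisense[:n_stacks + 1])
    three_prime = duplex_stability(antisense[-(n_stacks + 1):])
    return five_prime - three_prime


def passes_filter(antisense, asymmetry_range, stability_range):
    """True when both parameters fall inside the supplied (low, high) ranges."""
    asym = terminal_asymmetry(antisense)
    stab = duplex_stability(antisense)
    return (asymmetry_range[0] <= asym <= asymmetry_range[1]
            and stability_range[0] <= stab <= stability_range[1])


# Hypothetical ranges and a toy 8-nt sequence, for illustration only.
print(passes_filter("AAUUGCGC", (0.0, 10.0), (-20.0, -10.0)))
print(passes_filter("GCGCAAUU", (0.0, 10.0), (-20.0, -10.0)))
```

Both quantities need only a table lookup per dinucleotide, which is why such a filter is far cheaper than folding the target mRNA.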
Comparison of expression levels and breadth and evolutionary rates of intronless and intron-containing mammalian genes shows that intronless genes are expressed at lower levels, tend to be tissue specific, and evolve significantly faster than spliced genes. By contrast, monomorphic spliced genes that are not subject to detectable alternative splicing and polymorphic alternatively spliced genes show similar statistically indistinguishable patterns of expression and evolution. Alternative splicing is most common in ancient genes, whereas intronless genes appear to have relatively recent origins. These results imply tight coupling between different stages of gene expression, in particular, transcription, splicing, and nucleocytosolic transport of transcripts, and suggest that formation of intronless genes is an important route of evolution of novel tissue-specific functions in animals.
alternative splicing; intronless genes; monomorphic genes; polymorphic genes; mammalian gene evolution
Small, hydrophobic proteins whose synthesis is repressed by small RNAs (sRNAs), denoted type I toxin–antitoxin modules, were first discovered on plasmids where they regulate plasmid stability, but were subsequently found on a few bacterial chromosomes. We used exhaustive PSI-BLAST and TBLASTN searches across 774 bacterial genomes to identify homologs of known type I toxins. These searches substantially expanded the collection of predicted type I toxins, revealed homology of the Ldr and Fst toxins, and suggested that type I toxin–antitoxin loci are not spread by horizontal gene transfer. To discover novel type I toxin–antitoxin systems, we developed a set of search parameters based on characteristics of known loci including the presence of tandem repeats and clusters of charged and bulky amino acids at the C-termini of short proteins containing predicted transmembrane regions. We detected sRNAs for three predicted toxins from enterohemorrhagic Escherichia coli and Bacillus subtilis, and showed that two of the respective proteins indeed are toxic when overexpressed. We also demonstrated that the local free-energy minima of RNA folding can be used to detect the positions of the sRNA genes. Our results suggest that type I toxin–antitoxin modules are much more widely distributed among bacteria than previously appreciated.
Burkholderia mallei (Bm), the causative agent of the predominantly equine disease glanders, is a genetically uniform species that is very closely related to the much more diverse species Burkholderia pseudomallei (Bp), an opportunistic human pathogen and the primary cause of melioidosis. To gain insight into the relative lack of genetic diversity within Bm, we performed whole-genome comparative analysis of seven Bm strains and contrasted these with eight Bp strains. The Bm core genome (shared by all seven strains) is smaller in size than that of Bp, but the inverse is true for the variable gene sets that are distributed across strains. Interestingly, the biological roles of the Bm variable gene sets are much more homogeneous than those of Bp. The Bm variable genes are found mostly in contiguous regions flanked by insertion sequence (IS) elements, which appear to mediate excision and subsequent elimination of groups of genes that are under reduced selection in the mammalian host. The analysis suggests that the Bm genome continues to evolve through random IS-mediated recombination events, and differences in gene content may contribute to differences in virulence observed among Bm strains. The results are consistent with the view that Bm recently evolved from a single strain of Bp upon introduction into an animal host followed by expansion of IS elements, prophage elimination, and genome rearrangements and reduction mediated by homologous recombination across IS elements.
bacterial evolution; comparative genomics; genome erosion; bacterial virulence
Small interfering RNAs (siRNAs) and genome-encoded microRNAs (miRNAs) silence genes via complementary interactions with mRNAs. With thousands of miRNA genes identified and genome sequences of diverse eukaryotes available for comparison, the opportunity emerges for insights into the origin and evolution of RNA interference (RNAi). The miRNA repertoires of plants and animals appear to have evolved independently. However, conservation of the key proteins involved in RNAi suggests that the last common ancestor of modern eukaryotes possessed siRNA-based mechanisms. Prokaryotes have an RNAi-like defense system that is functionally analogous but not homologous to eukaryotic RNAi. The protein machinery of eukaryotic RNAi seems to have been pieced together from ancestral proteins of archaeal, bacterial and phage origins that are involved in DNA repair and RNA-processing pathways.
Alternative splicing (AS) in protein-coding sequences has emerged as an important mechanism of regulation and diversification of animal gene function. By contrast, the extent and roles of alternative events including AS and alternative transcription initiation (ATI) within the 5'-untranslated regions (5'UTRs) of mammalian genes are not well characterized.
We evaluated the abundance, conservation and evolution of putative regulatory control elements, namely, upstream start codons (uAUGs) and open reading frames (uORFs), in the 5'UTRs of human and mouse genes impacted by alternative events. For genes with alternative 5'UTRs, the fraction of alternative sequences (those present in a subset of the transcripts) is much greater than that in the corresponding coding sequence, conceivably, because 5'UTRs are not bound by constraints on protein structure that limit AS in coding regions. Alternative regions of mammalian 5'UTRs evolve faster and are subject to a weaker purifying selection than constitutive portions. This relatively weak selection results in over-abundance of uAUGs and uORFs in the alternative regions of 5'UTRs compared to constitutive regions. Nevertheless, even in alternative regions, uORFs evolve under a stronger selection than the rest of the sequences, indicating that some of the uORFs are conserved regulatory elements; some of the non-conserved uORFs could be involved in species-specific regulation.
The findings on the evolution and selection in alternative and constitutive regions presented here are consistent with the hypothesis that alternative events, namely, AS and ATI, in 5'UTRs of mammalian genes are likely to contribute to the regulation of translation.
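To make the uAUG and uORF terms concrete, the sketch below scans a 5'UTR sequence for upstream start codons and for open reading frames that terminate inside the UTR. It is an illustrative helper, not the pipeline used in the study. The function names are hypothetical, and uORFs that run past the end of the UTR into the coding sequence are deliberately left out of the uORF list.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def normalize(seq):
    """Upper-case the sequence and write it in the DNA alphabet."""
    return seq.upper().replace("U", "T")


def find_uaugs(utr):
    """0-based positions of every ATG/AUG in the 5'UTR."""
    utr = normalize(utr)
    return [i for i in range(len(utr) - 2) if utr[i:i + 3] == "ATG"]


def find_uorfs(utr):
    """(start, end) of each uAUG followed by an in-frame stop within the UTR.

    `end` is exclusive and includes the stop codon.
    """
    utr = normalize(utr)
    uorfs = []
    for start in find_uaugs(utr):
        # Walk codon by codon from the start codon to the first stop.
        for pos in range(start + 3, len(utr) - 2, 3):
            if utr[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


print(find_uaugs("CCATGAAATAGCC"))
print(find_uorfs("CCATGAAATAGCC"))
```

Running the two functions separately on constitutive and alternative 5'UTR segments would give the kind of per-region counts that the comparison above relies on.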
Catechol-O-methyltransferase (COMT) is an enzyme that plays a key role in the modulation of catechol-dependent functions such as cognition, cardiovascular function, and pain processing. Three common haplotypes of the human COMT gene, divergent in two synonymous and one nonsynonymous (val158met) position, designated as low (LPS), average (APS), and high pain sensitive (HPS), are associated with experimental pain sensitivity and risk of developing chronic musculoskeletal pain conditions. APS and HPS haplotypes produce significant functional effects, coding for 3- and 20-fold reductions in COMT enzymatic activity, respectively. In the present study, we investigated whether additional minor single nucleotide polymorphisms (SNPs), occurring in 1 to 5% of the population, situated in the COMT transcript region contribute to haplotype-dependent enzymatic activity. Computer analysis of COMT ESTs showed that one synonymous minor SNP (rs769224) is linked to the APS haplotype and three minor SNPs (two synonymous: rs6267, rs740602 and one nonsynonymous: rs8192488) are linked to the HPS haplotype. Results from in silico and in vitro experiments revealed that inclusion of allelic variants of these minor SNPs in APS or HPS haplotypes did not modify COMT function at the level of mRNA folding, RNA transcription, protein translation, or enzymatic activity. These data suggest that neutral variants are carried with APS and HPS haplotypes, while the high activity LPS haplotype displays less linked variation. Thus, both minor synonymous and nonsynonymous SNPs in the coding region are markers of functional APS and HPS haplotypes rather than independent contributors to COMT activity.
Evolution of protein sequences is largely governed by purifying selection, with a small fraction of proteins evolving under positive selection. The evolution at synonymous positions in protein-coding genes is not nearly as well understood, with the extent and types of selection remaining, largely, unclear. A statistical test to identify purifying and positive selection at synonymous sites in protein-coding genes was developed. The method compares the rate of evolution at synonymous sites (Ks) to that in intron sequences of the same gene after sampling the aligned intron sequences to mimic the statistical properties of coding sequences. We detected purifying selection at synonymous sites in ∼28% of the 1,562 analyzed orthologous genes from mouse and rat, and positive selection in ∼12% of the genes. Thus, the fraction of genes with readily detectable positive selection at synonymous sites is much greater than the fraction of genes with comparable positive selection at nonsynonymous sites, i.e., at the level of the protein sequence. Unlike other genes, the genes with positive selection at synonymous sites showed no correlation between Ks and the rate of evolution in nonsynonymous sites (Ka), indicating that evolution of synonymous sites under positive selection is decoupled from protein evolution. The genes with purifying selection at synonymous sites showed significant anticorrelation between Ks and expression level and breadth, indicating that highly expressed genes evolve slowly. The genes with positive selection at synonymous sites showed the opposite trend, i.e., highly expressed genes had, on average, higher Ks. For the genes with positive selection at synonymous sites, a significantly lower mRNA stability is predicted compared to the genes with negative selection. 
Thus, mRNA destabilization could be an important factor driving positive selection in synonymous sites, probably through regulation of expression at the level of mRNA degradation and, possibly, also translation rate. So, unexpectedly, we found that positive selection at synonymous sites of mammalian genes is substantially more common than positive selection at the level of protein sequences. Positive selection at synonymous sites might act through mRNA destabilization affecting mRNA levels and translation.
synonymous sites; nonsynonymous sites; positive selection; purifying selection; introns
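The idea of testing synonymous-site divergence against resampled intron sequence from the same gene can be sketched as below. This is a simplified stand-in for the published test. It uses the uncorrected fraction of differing sites instead of Ks, and plain resampling with replacement instead of the sampling scheme that mimics the statistical properties of coding sequences. The function names, the default replicate count, and the toy alignment are assumptions.

```python
import random


def p_distance(pairs):
    """Fraction of aligned site pairs that differ between the two species."""
    return sum(a != b for a, b in pairs) / len(pairs)


def intron_resampling_test(intron_pairs, n_sites, observed_divergence,
                           n_rep=1000, seed=0):
    """Compare an observed synonymous-site divergence with intron resamples.

    Returns (p_purifying, p_positive):
      p_purifying: share of intron resamples at or below the observed value
                   (small when synonymous sites evolve slower than introns);
      p_positive:  share of intron resamples at or above the observed value
                   (small when synonymous sites evolve faster than introns).
    """
    rng = random.Random(seed)
    null = []
    for _ in range(n_rep):
        # Draw as many intron columns as there are synonymous sites.
        sample = [rng.choice(intron_pairs) for _ in range(n_sites)]
        null.append(p_distance(sample))
    p_purifying = sum(d <= observed_divergence for d in null) / n_rep
    p_positive = sum(d >= observed_divergence for d in null) / n_rep
    return p_purifying, p_positive


# Toy intron alignment (hypothetical): 20% of the columns differ.
intron = [("A", "A")] * 80 + [("A", "G")] * 20
print(intron_resampling_test(intron, n_sites=150, observed_divergence=0.35))
```

Matching the resample size to the number of synonymous sites keeps the sampling variance of the two divergence estimates comparable, which is the point of sampling the intron alignment before comparing.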
The μ-opioid receptor (OPRM1) is the principal receptor target for both endogenous and exogenous opioid analgesics. There are substantial individual differences in human responses to painful stimuli and to opiate drugs that are attributed to genetic variations in OPRM1. In searching for new functional variants, we employed comparative genome analysis and obtained evidence for the existence of an expanded human OPRM1 gene locus with new promoters, alternative exons and regulatory elements. Examination of polymorphisms within the human OPRM1 gene locus identified a strong association between single nucleotide polymorphism (SNP) rs563649 and individual variations in pain perception. SNP rs563649 is located within a structurally conserved internal ribosome entry site (IRES) in the 5′-UTR of novel exon 13-containing OPRM1 isoforms (MOR-1K) and affects both mRNA levels and translation efficiency of these variants. Furthermore, rs563649 exhibits very strong linkage disequilibrium throughout the entire OPRM1 gene locus and thus affects the functional contribution of the corresponding haplotype that includes other functional OPRM1 SNPs. Our results provide evidence for an essential role for MOR-1K isoforms in nociceptive signaling and suggest that genetic variations in alternative OPRM1 isoforms may contribute to individual differences in opiate responses.
Protein kinase (PK) genes comprise the third largest superfamily, which occupies ∼2% of the human genome. They encode regulatory enzymes that control a vast variety of cellular processes through phosphorylation of their protein substrates. Expression of PK genes is subject to complex transcriptional regulation, which is not fully understood.
Our comparative analysis demonstrates that the genomic organization of regulatory PK genes differs from that of other protein coding genes. PK genes occupy larger genomic loci, have longer introns and spacer regions, and encode larger proteins. The primary transcript length of PK genes, similar to other protein coding genes, inversely correlates with gene expression level and expression breadth, which is likely due to the necessity to reduce metabolic costs of transcription for abundant messages. On average, PK genes evolve slower than other protein coding genes. Breadth of PK expression negatively correlates with the rate of non-synonymous substitutions in protein coding regions. This rate is lower for high expression and ubiquitous PKs, relative to low expression PKs, and correlates with divergence in untranslated regions. Conversely, the rate of silent mutations is uniform in different PK groups, indicating that differing rates of non-synonymous substitutions reflect variations in selective pressure. Brain and testis employ a considerable number of tissue-specific PKs, indicating high complexity of the phosphorylation-dependent regulatory network in these organs. There are considerable differences in genomic organization between PKs up-regulated in the testis and brain. PK genes up-regulated in the highly proliferative testicular tissue are fast evolving and small, with short introns and transcribed regions. In contrast, genes up-regulated in the minimally proliferative nervous tissue carry long introns and extended transcribed regions, and evolve slowly.
PK genomic architecture, the size of gene functional domains and evolutionary rates correlate with the pattern of gene expression. Structure and evolutionary divergence of tissue-specific PK genes are related to the proliferative activity of the tissue where these genes are predominantly expressed. Our data provide evidence that physiological requirements for transcription intensity, ubiquitous expression, and tissue-specific regulation shape gene structure and affect rates of evolution.
Adrenergic receptor β2 (ADRB2) is a primary target for epinephrine. It plays a critical role in mediating physiological and psychological responses to environmental stressors. Thus, functional genetic variants of ADRB2 will be associated with a complex array of psychological and physiological phenotypes. These genetic variants should also interact with environmental factors such as physical or emotional stress to produce a phenotype vulnerable to pathological states. In this study, we determined whether common genetic variants of ADRB2 contribute to the development of a common chronic pain condition that is associated with increased levels of psychological distress and low blood pressure, factors which are strongly influenced by the adrenergic system. We genotyped 202 female subjects and examined the relationships between three major ADRB2 haplotypes and psychological factors, resting blood pressure, and the risk of developing a chronic musculoskeletal pain condition - Temporomandibular Joint Disorder (TMD). We propose that the first haplotype codes for lower levels of ADRB2 expression, the second haplotype codes for higher ADRB2 expression, and the third haplotype codes for higher receptor expression and rapid agonist-induced internalization. Individuals who carried one haplotype coding for high and one coding for low ADRB2 expression displayed the highest positive psychological traits, had higher levels of resting arterial pressure, and were about 10 times less likely to develop TMD. Thus, our data suggest that either positive or negative imbalances in ADRB2 function increase the vulnerability to chronic pain conditions such as TMD through different etiological pathways that imply the need for tailored treatment options.
adrenergic receptor β2; haplotype; SNPs; chronic pain; blood pressure; somatization; depression; anxiety; negative moods
Current literature describes several methods for the design of efficient siRNAs with 19 perfectly matched base pairs and 2 nt overhangs. Using four independent databases totaling 3336 experimentally verified siRNAs, we compared how well several of these methods predict siRNA cleavage efficiency. According to receiver operating characteristics (ROC) and correlation analyses, the best programs were BioPredsi, ThermoComposition and DSIR. We also studied individual parameters that significantly and consistently correlated with siRNA efficacy in different databases. As a result of this work we developed a new method which utilizes linear regression fitting with local duplex stability, nucleotide position-dependent preferences and total G/C content of siRNA duplexes as input parameters. The new method's ability to discriminate between efficient and inefficient siRNAs is comparable with that of the best methods identified, but its parameters are more obviously related to the mechanisms of siRNA action in comparison with BioPredsi. This permits insight into the underlying physical features and relative importance of the parameters. The new method of predicting siRNA efficiency is faster than that of ThermoComposition because it does not employ time-consuming RNA secondary structure calculations, and it has far fewer parameters than DSIR. It is available as a web tool called ‘siRNA scales’.
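For readers unfamiliar with ROC-based comparison of predictors, the area under the ROC curve can be computed directly as the probability that a randomly chosen efficient siRNA receives a higher predicted score than a randomly chosen inefficient one. The sketch below shows that calculation. The function name and the toy scores are illustrative, and how siRNAs are labeled efficient or inefficient is left to the caller.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from predicted scores and binary labels.

    Counts, over all (efficient, inefficient) pairs, how often the efficient
    siRNA has the higher score; ties count as half.
    """
    pos = [s for s, label in zip(scores, labels) if label]
    neg = [s for s, label in zip(scores, labels) if not label]
    if not pos or not neg:
        raise ValueError("need at least one efficient and one inefficient siRNA")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions for four siRNAs; 1 marks an efficient siRNA.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking: 1.0
print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # one pair misordered: 0.75
```

Because the AUC depends only on the ranking of scores, it allows methods with differently scaled outputs to be compared on the same database.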
Single-stranded mRNA molecules form secondary structures through complementary self-interactions. Several hypotheses have been proposed on the relationship between the nucleotide sequence, encoded amino acid sequence and mRNA secondary structure. We performed the first transcriptome-wide in silico analysis of the human and mouse mRNA foldings and found a pronounced periodic pattern of nucleotide involvement in mRNA secondary structure. We show that this pattern is created by the structure of the genetic code, and the dinucleotide relative abundances are important for the maintenance of mRNA secondary structure. Although synonymous codon usage contributes to this pattern, it is intrinsic to the structure of the genetic code and manifests itself even in the absence of synonymous codon usage bias at the 4-fold degenerate sites. While all codon sites are important for the maintenance of mRNA secondary structure, degeneracy of the code allows regulation of stability and periodicity of mRNA secondary structure. We demonstrate that the third degenerate codon sites contribute most strongly to mRNA stability. These results convincingly support the hypothesis that redundancies in the genetic code allow transcripts to satisfy requirements for both protein structure and RNA structure. Our data show that selection may be operating on synonymous codons to maintain a more stable and ordered mRNA secondary structure, which is likely to be important for transcript stability and translation. We also demonstrate that functional domains of the mRNA [5′-untranslated region (5′-UTR), CDS and 3′-UTR] preferentially fold onto themselves, while the start codon and stop codon regions are characterized by relaxed secondary structures, which may facilitate initiation and termination of translation.
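The periodic pattern described above can be measured, given a predicted structure, by tallying how often each codon position is base-paired. The sketch below assumes the structure is available as a dot-bracket string from some folding program and that the coding region starts at a known offset. The function name and both inputs are assumptions for illustration, not the procedure used in the study.

```python
def paired_fraction_by_codon_position(structure, cds_start=0):
    """Fraction of paired bases at codon positions 1, 2 and 3.

    `structure` is a dot-bracket string ('(' or ')' = paired, '.' = unpaired);
    `cds_start` is the 0-based index of the first base of the start codon.
    """
    paired = [0, 0, 0]
    total = [0, 0, 0]
    for i in range(cds_start, len(structure)):
        phase = (i - cds_start) % 3  # 0, 1, 2 -> codon positions 1, 2, 3
        total[phase] += 1
        paired[phase] += structure[i] in "()"
    return [p / t if t else 0.0 for p, t in zip(paired, total)]


# Toy structure in which the middle base of every codon is unpaired.
print(paired_fraction_by_codon_position("(.)(.)"))  # [1.0, 0.0, 1.0]
```

Averaging these three fractions over many transcripts is one simple way to expose a three-nucleotide periodicity in base pairing.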
All archaeal and many bacterial genomes contain Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and variable arrays of the CRISPR-associated (cas) genes that have been previously implicated in a novel form of DNA repair on the basis of comparative analysis of their protein product sequences. However, the proximity of CRISPR and cas genes strongly suggests that they have related functions, which is hard to reconcile with the repair hypothesis.
The protein sequences of the numerous cas gene products were classified into ~25 distinct protein families; several new functional and structural predictions are described. Comparative-genomic analysis of CRISPR and cas genes leads to the hypothesis that the CRISPR-Cas system (CASS) is a mechanism of defense against invading phages and plasmids that functions analogously to the eukaryotic RNA interference (RNAi) systems. Specific functional analogies are drawn between several components of CASS and proteins involved in eukaryotic RNAi, including the double-stranded RNA-specific helicase-nuclease (dicer), the endonuclease cleaving target mRNAs (slicer), and the RNA-dependent RNA polymerase. However, none of the CASS components is orthologous to its apparent eukaryotic functional counterpart. It is proposed that unique inserts of CRISPR, some of which are homologous to fragments of bacteriophage and plasmid genes, function as prokaryotic siRNAs (psiRNA), by base-pairing with the target mRNAs and promoting their degradation or translation shutdown. Specific hypothetical schemes are developed for the functioning of the predicted prokaryotic siRNA system and for the formation of new CRISPR units with unique inserts encoding psiRNA conferring immunity to the respective newly encountered phages or plasmids. The unique inserts in CRISPR show virtually no similarity even between closely related bacterial strains, which suggests their rapid turnover on an evolutionary scale. Corollaries of this finding are that, even among closely related prokaryotes, the most commonly encountered phages and plasmids are different and/or that the dominant phages and plasmids turn over rapidly.
We proposed previously that Cas proteins comprise a novel DNA repair system. The association of the cas genes with CRISPR and, especially, the presence, in CRISPR units, of unique inserts homologous to phage and plasmid genes make us abandon this hypothesis. It appears most likely that CASS is a prokaryotic system of defense against phages and plasmids that functions via the RNAi mechanism. The functioning of this system seems to involve integration of fragments of foreign genes into archaeal and bacterial chromosomes yielding heritable immunity to the respective agents. However, it appears that this inheritance is extremely unstable on the evolutionary scale such that the repertoires of unique psiRNAs are completely replaced even in closely related prokaryotes, presumably, in response to rapidly changing repertoires of dominant phages and plasmids.
This article was reviewed by: Eric Bapteste, Patrick Forterre, and Martijn Huynen.
Small interfering RNAs (siRNAs) have become an important tool in cell and molecular biology. Reliable design of siRNA molecules is essential for the needs of large functional genomics projects.
To improve the design of efficient siRNA molecules, we performed a comparative, thermodynamic and correlation analysis on a heterogeneous set of 653 siRNAs collected from the literature. We used this training set to select siRNA features and optimize computational models. We identified 18 parameters that correlate significantly with silencing efficiency. Some of these parameters characterize only the siRNA sequence, while others involve the whole mRNA. Most importantly, we derived an siRNA position-dependent consensus, and optimized the free-energy difference of the 5' and 3' terminal dinucleotides of the siRNA antisense strand. The position-dependent consensus is based on correlation and t-test analyses of the training set, and accounts for both significantly preferred and avoided nucleotides in all sequence positions. On the training set, the two parameters' correlation with silencing efficiency was 0.5 and 0.36, respectively. Among other features, a dinucleotide content index and the frequency of potential targets for siRNA in the mRNA added predictive power to our model (R = 0.55). We showed that our model is effective for predicting the efficiency of siRNAs at different concentrations.
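The correlation analysis described above can be illustrated with a minimal sketch. It computes a Pearson correlation between one simple sequence feature and silencing efficiency. The feature used here (`five_prime_au_count`, the number of A/U bases at the 5' end of the antisense strand) is a hypothetical stand-in chosen for illustration and is not claimed to be one of the 18 parameters of the study; the sequences and efficiencies are made-up toy values, not data from the training set.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


def five_prime_au_count(antisense, window=5):
    """Hypothetical feature: A/U count in the first `window` nt of the antisense strand."""
    return sum(1 for base in antisense[:window].upper() if base in "AU")


# Toy (antisense sequence, silencing efficiency) pairs; values are invented.
toy_sirnas = [
    ("UUAUACGGCAUGCCGGACG", 0.90),
    ("UAUUGCGGCAUGCCGGACG", 0.80),
    ("GCAUACGGCAUGCCGGAUU", 0.55),
    ("GCGCACGGCAUGCCGGAUA", 0.20),
    ("CGGCGCGGCAUGCCGGAUU", 0.10),
]

features = [five_prime_au_count(seq) for seq, _ in toy_sirnas]
efficiencies = [eff for _, eff in toy_sirnas]
r = pearson_r(features, efficiencies)
print(f"feature values: {features}, Pearson r = {r:.2f}")
```

In a real analysis each candidate feature would be scored this way against measured efficiencies, and only features with a statistically significant correlation would be kept for model building.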
We optimized a neural network model on our training set using three parameters characterizing the siRNA sequence, and predicted efficiencies for the test siRNA dataset recently published by Novartis. On this validation set, the correlation coefficient between predicted and observed efficiency was 0.75. Using the same model, we performed a transcriptome-wide analysis of optimal siRNA targets for 22,600 human mRNAs.
We demonstrated that the properties of the siRNAs themselves are essential for efficient RNA interference. The 5' ends of antisense strands of efficient siRNAs are U-rich and resemble in composition the pyrimidine-rich oligonucleotides interacting with the polypurine RNA tracts that are recognized by RNase H. The advantage of our method over similar methods is the small number of parameters. As a result, our method requires a much smaller training set to produce consistent results. Other mRNA features, though expensive to compute, can slightly improve our model.
The progress in genome sequencing has led to a rapid accumulation in GenBank submissions of uncharacterized ‘hypothetical’ genes. These genes, which have not been experimentally characterized and whose functions cannot be deduced from simple sequence comparisons alone, now comprise a significant fraction of the public databases. Expression analyses of Haemophilus influenzae cells using a combination of transcriptomic and proteomic approaches resulted in confident identification of 54 ‘hypothetical’ genes that were expressed in cells under normal growth conditions. In an attempt to understand the functions of these proteins, we used a variety of publicly available analysis tools. Close homologs in other species were detected for each of the 54 ‘hypothetical’ genes. For 16 of them, exact functional assignments could be found in one or more public databases. Additionally, we were able to suggest general functional characterization for 27 more genes (43 of the 54 genes, or ∼80%, in total). Findings from this analysis include the identification of a pyruvate-formate lyase-like operon, likely to be expressed not only in H. influenzae but also in several other bacteria. Further, we also observed three genes that are likely to participate in the transport and/or metabolism of sialic acid, an important component of the H. influenzae lipo-oligosaccharide. Accurate functional annotation of uncharacterized genes calls for an integrative approach, combining expression studies with extensive computational analysis and curation, followed by eventual experimental verification of the computational predictions.
Computer programs for the generation of multiple sequence alignments such as "Clustal W" allow detection of regions that are most conserved among many sequence variants. However, even for regions that are equally conserved, their potential utility as hybridization targets varies. Mismatches in sequence variants are more disruptive in some duplexes than in others. Additionally, the propensity for self-interactions amongst oligonucleotides targeting conserved regions differs and the structure of target regions themselves can also influence hybridization efficiency. There is a need to develop software that will employ thermodynamic selection criteria for finding optimal hybridization targets in related sequences.
A new scheme and new software for optimal detection of oligonucleotide hybridization targets common to families of aligned sequences are proposed and applied to aligned sequence variants of the complete HIV-1 genome. The scheme employs sequential filtering procedures with experimentally determined thermodynamic cutoff points: 1) creation of a consensus sequence of RNA or DNA from aligned sequence variants with specification of the lengths of fragments to be used as oligonucleotide targets in the analyses; 2) selection of DNA oligonucleotides that have pairing potential, greater than a defined threshold, with all variants of aligned RNA sequences; 3) elimination of DNA oligonucleotides that have self-pairing potentials for intra- and inter-molecular interactions greater than defined thresholds. This scheme has been applied to the HIV-1 genome with experimentally determined thermodynamic cutoff points. Theoretically optimal RNA target regions for consensus oligonucleotides were found. They can be further used for improvement of oligo-probe based HIV detection techniques.
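The three filtering steps listed above can be sketched as follows. This is an illustration under strong simplifying assumptions and is not the software described in the study: the thermodynamic pairing potential is replaced by a count of Watson-Crick pairs, and the self-pairing potential by the length of the longest self-complementary stretch. All function names, thresholds, and the toy alignment are hypothetical.

```python
from collections import Counter

# Complement of an RNA or DNA base, written as a DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def consensus(aligned):
    """Step 1: most common symbol in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def antisense_dna(rna_window):
    """DNA oligonucleotide reverse-complementary to an RNA target window."""
    return "".join(COMPLEMENT[b] for b in reversed(rna_window))


def pairing_score(oligo, rna_window):
    """Toy pairing potential: number of Watson-Crick pairs (not a free energy)."""
    return sum(1 for o, t in zip(oligo, reversed(rna_window))
               if COMPLEMENT.get(t) == o)


def self_pairing_score(oligo):
    """Toy self-pairing potential: longest stretch whose reverse complement
    also occurs in the oligo (crude proxy for hairpins and dimers)."""
    best = 0
    n = len(oligo)
    for length in range(1, n + 1):
        found = any(antisense_dna(oligo[i:i + length]) in oligo
                    for i in range(n - length + 1))
        if not found:
            break
        best = length
    return best


def select_targets(aligned_rna, oligo_len, min_pairing, max_self):
    """Apply the three sequential filters; return (start, oligo) pairs."""
    cons = consensus(aligned_rna)
    selected = []
    for start in range(len(cons) - oligo_len + 1):
        window = cons[start:start + oligo_len]
        if any(b not in "ACGU" for b in window):
            continue  # skip windows with gaps or ambiguous symbols
        oligo = antisense_dna(window)
        # Step 2: the oligo must pair well with every aligned variant.
        if any(pairing_score(oligo, v[start:start + oligo_len]) < min_pairing
               for v in aligned_rna):
            continue
        # Step 3: discard oligos with strong self-pairing.
        if self_pairing_score(oligo) > max_self:
            continue
        selected.append((start, oligo))
    return selected


# Toy alignment of three RNA variants (invented sequences).
variants = ["CAACAAGAUC", "CAACAAGAUC", "CAACAACUUG"]
print(select_targets(variants, oligo_len=6, min_pairing=5, max_self=1))
```

A realistic implementation would replace both toy scores with nearest-neighbor free-energy calculations and experimentally calibrated cutoffs, as the scheme itself specifies.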
A selection scheme with thermodynamic thresholds and software is presented in this study. The package can be used for any purpose where there is a need to design optimal consensus oligonucleotides capable of interacting efficiently with hybridization targets common to families of aligned RNA or DNA sequences. Our thermodynamic approach can be helpful in designing consensus oligonucleotides with consistently high affinity to target variants in evolutionarily related genes or genomes.
Many non-coding sequences transcribed from the mammalian genome are proving to have important regulatory roles, but the functions of the majority remain mysterious.
For decades, researchers have focused most of their attention on protein-coding genes and proteins. With the completion of the human and mouse genomes and the accumulation of data on the mammalian transcriptome, the focus now shifts to non-coding DNA sequences, RNA-coding genes and their transcripts. Many non-coding transcribed sequences are proving to have important regulatory roles, but the functions of the majority remain mysterious.
Sequencing of multiple, nearly complete eukaryotic genomes creates opportunities for detecting previously unnoticed, subtle functional signals in non-coding regions. A genome-wide comparative analysis of orthologous sets of mammalian and yeast mRNAs revealed distinct patterns of evolutionary conservation at the boundaries of the untranslated regions (UTRs) and the coding region (CDS). Elevated sequence conservation was detected in ∼30 nt regions around the start codon. There seems to be a complementary relationship between sequence conservation in the ∼30 nt regions of the 5′-UTR immediately upstream of the start codon and that in the synonymous positions of the 5′-terminal 30 nt of the CDS: in mammalian mRNAs, the 5′-UTR shows a greater conservation than the CDS, whereas the opposite trend holds for yeast mRNAs. Unexpectedly, a ∼30 nt region downstream of the stop codon shows a substantially lower level of sequence conservation than the downstream portions of the 3′-UTRs. However, the sequence in this poorly conserved 30 nt portion of the 3′-UTR is non-random in that it has a higher GC content than the rest of the UTR. It is hypothesized that the elevated sequence conservation in the region immediately upstream of the start codon is related to the requirement for initiation factor binding during pre-initiation ribosomal scanning. In contrast, the poorly conserved region downstream of the stop codon could be involved in the post-termination scanning and dissociation of the ribosomes from the mRNA, which requires only the mRNA–ribosome interaction. Additionally, it was found that the choice of the stop codon in mammals, but not in yeasts, and the context in the immediate vicinity of the stop codons in both mammals and yeasts are subject to strong selection.
Thus, genome-wide analysis of orthologous gene sets allows detection of previously unrecognized patterns of sequence conservation, which are likely to reflect hidden functional signals, such as ribosomal filters that could regulate translation by modulating the interaction between the mRNA and ribosomes.
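The comparison of conservation on the two sides of the start codon can be sketched for a single aligned pair of orthologous mRNAs. This is a simplified illustration, not the procedure used in the study: third codon positions serve as a crude proxy for synonymous positions, gapped columns are skipped, and the function names, the `start` coordinate convention (alignment index of the first base of the start codon), and the toy sequences are all assumptions.

```python
def fraction_identical(seq_a, seq_b, positions):
    """Fraction of ungapped alignment positions at which two sequences agree."""
    compared = 0
    same = 0
    for i in positions:
        a, b = seq_a[i], seq_b[i]
        if a == "-" or b == "-":
            continue  # ignore alignment gaps
        compared += 1
        same += a == b
    return same / compared if compared else float("nan")


def start_codon_conservation(seq_a, seq_b, start, window=30):
    """Return (5'-UTR identity, third-codon-position identity) near the start codon.

    `start` is the alignment index of the first base of the start codon.
    """
    upstream = range(max(0, start - window), start)
    # Third codon positions in the first `window` nt of the CDS,
    # skipping the start codon itself (its third base is not synonymous).
    third_positions = range(start + 5, min(start + window, len(seq_a)), 3)
    return (fraction_identical(seq_a, seq_b, upstream),
            fraction_identical(seq_a, seq_b, third_positions))


# Toy aligned pair (invented): 6 nt of 5'-UTR followed by 9 nt of CDS.
mrna_a = "GGCACC" + "AUGGCUAAA"
mrna_b = "GGCUCC" + "AUGGCCAAA"
utr_identity, synonymous_identity = start_codon_conservation(
    mrna_a, mrna_b, start=6, window=9)
print(utr_identity, synonymous_identity)
```

Averaging these two values over many orthologous pairs would give the kind of side-by-side comparison of 5′-UTR and synonymous-site conservation described above.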
Post-transcriptional regulation and the formation of mRNA 3′ ends are crucial for gene expression in eukaryotes. Interspecies conservation of many sequences within 3′UTRs reveals selective constraint due to similar function. To study the pattern of conservation within 3′UTRs, we compiled and aligned 50 sets of complete orthologous 3′UTRs from four orders of mammals. We observed a mosaic pattern of conservation, with alternating regions of high (phylogenetic footprints) and low similarity. Conservation in 3′UTRs correlates with their base composition and also with the synonymous substitution rate in corresponding coding regions. The non-uniform distribution of conservation is more pronounced for 3′UTRs with a moderate or low level of overall conservation, where invariant nucleotides are more numerous, and their runs of lengths 4–7 occur more frequently than if conservation were random. Many runs of invariant nucleotides are AU-rich or pyrimidine-rich. Some of these runs coincide with known functional cis-elements of eukaryotic mRNAs, such as the U-rich upstream element, polyadenylation signal and DICE regulatory signal. More divergent regions of multiple alignments of 3′UTRs are often more G- and/or C-rich. Our results provide evidence on the importance of moderately conserved regions in 3′UTRs and suggest that regulatory functions of 3′UTRs might utilize gene-specific information in these regions.
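Finding runs of invariant nucleotides in a multiple alignment can be sketched as below. The function names and the toy four-sequence alignment are hypothetical, and the comparison with the run frequency expected under random conservation, which the study relies on, is not shown.

```python
def invariant_columns(aligned):
    """True for each alignment column where all sequences share one ungapped base."""
    return [len(set(col)) == 1 and col[0] != "-" for col in zip(*aligned)]


def invariant_runs(aligned, min_len=4, max_len=7):
    """Return (start, sequence) for maximal invariant runs of min_len..max_len columns."""
    flags = invariant_columns(aligned)
    runs = []
    run_start = None
    # A trailing False closes any run that reaches the end of the alignment.
    for i, is_invariant in enumerate(flags + [False]):
        if is_invariant and run_start is None:
            run_start = i
        elif not is_invariant and run_start is not None:
            length = i - run_start
            if min_len <= length <= max_len:
                runs.append((run_start, aligned[0][run_start:i]))
            run_start = None
    return runs


# Toy alignment of four orthologous 3'UTR fragments (invented sequences).
toy_alignment = [
    "AUUUAGCCAAUAAAGC",
    "AUUUACCGAAUAAAUC",
    "AUUUAGCUAAUAAACC",
    "AUUUAACCAAUAAAGC",
]
print(invariant_runs(toy_alignment))
```

In this toy alignment the second run returned is AAUAAA, the canonical polyadenylation signal, which mirrors the observation that some invariant runs coincide with known cis-elements.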