|Home | About | Journals | Submit | Contact Us | Français|
Mammalian genomes contain numerous genes for long noncoding RNAs (lncRNAs). The functions of the lncRNAs remain largely unknown but their evolution appears to be constrained by purifying selection, albeit relatively weakly. To gain insights into the mode of evolution and the functional range of the lncRNA, they can be compared with much better characterized protein-coding genes. The evolutionary rate of the protein-coding genes shows a universal negative correlation with expression: highly expressed genes are on average more conserved during evolution than the genes with lower expression levels. This correlation was conceptualized in the misfolding-driven protein evolution hypothesis according to which misfolding is the principal cost incurred by protein expression. We sought to determine whether long intergenic ncRNAs (lincRNAs) follow the same evolutionary trend and indeed detected a moderate but statistically significant negative correlation between the evolutionary rate and expression level of human and mouse lincRNA genes. The magnitude of the correlation for the lincRNAs is similar to that for equal-sized sets of protein-coding genes with similar levels of sequence conservation. Additionally, the expression level of the lincRNAs is significantly and positively correlated with the predicted extent of lincRNA molecule folding (base-pairing), however, the contributions of evolutionary rates and folding to the expression level are independent. Thus, the anticorrelation between evolutionary rate and expression level appears to be a general feature of gene evolution that might be caused by similar deleterious effects of protein and RNA misfolding and/or other factors, for example, the number of interacting partners of the gene product.
Traditionally, genomes have been perceived mostly as repositories of protein-coding genes. Although this might be largely true in the case of prokaryotes and unicellular eukaryotes, numerous recent studies on the genomes of multicellular eukaryotes, particularly animals, have revealed a vast RNome, that is, a collection of genes for noncoding RNAs (ncRNAs) (Carninci et al. 2005; Mattick and Makunin 2006; Ponting et al. 2009). Strikingly, the total number of genes for ncRNAs that are expressed from a mammalian genome seems to exceed the number of protein-coding genes severalfold (Mattick and Makunin 2006). The classification of ncRNAs, whether these loci should be considered genes or not, and the validation of their functionality remain matters of intensive investigation and debate (van Bakel and Hughes 2009; Ponting and Belgard 2010).
Among many distinct classes of ncRNAs, the long ncRNA (lncRNA) are probably the most enigmatic group. The definition of lncRNA is based solely on the transcript size: lncRNAs are defined as non-coding RNAs longer than 200 nucleotides (Mattick and Makunin 2006; Ponting et al. 2009). Many lncRNAs are spliced, 5′ capped, and polyadenylated (Okazaki et al. 2002; Carninci et al. 2005; Kapranov et al. 2007; Ponjavic et al. 2007). Based on the localization in the genome, lncRNAs can be divided into two distinct classes: 1) transcripts that overlap protein-coding genes and 2) long intergenic noncoding (linc) RNAs that are transcribed from genome regions separating protein-coding genes (Ponting et al. 2009). Many lncRNAs that overlap protein-coding genes are likely to be involved in a sense–antisense regulation (Chen et al. 2005). The current knowledge on the functions of lincRNAs is scarce because very few of the lincRNAs have been assigned a function experimentally, but their functional range is believed to be broad on the basis of indirect evidence (Bertone et al. 2004; Ponjavic et al. 2007; Mercer et al. 2009; Ponting and Belgard 2010). It has been suggested that lincRNAs could be involved in the regulation of many cellular processes (Mattick and Makunin 2006). For example, they can affect transcription locally at the gene level (Martens et al. 2004; Martianov et al. 2007; Osato et al. 2007; Hirota et al. 2008) as well as target transcription regulators and thus affect transcription of many genes (Feng et al. 2006; Goodrich and Kugel 2006; Huarte et al. 2010). They can also target RNA polymerase II in human and mouse (Espinoza et al. 2007; Mariner et al. 2008) and thus act on an even broader range of genes. Furthermore, lincRNAs participate in the regulation of splicing (Munroe and Lazar 1991; Beltran et al. 2008) and translation (Wang et al. 2005; Centonze et al. 2007). Well-characterized examples of lincRNAs involved in epigenetic processes are Xist (Clemson et al. 1996), Kcnq1ot1 (Umlauf et al. 2004; Pandey et al. 2008) and Air (Nagano et al. 2008) (also see the review by Ponting et al. 2009).
Compared with protein-coding sequences and small RNAs (e.g., miRNA and snoRNA), lncRNAs are weakly conserved: only approximately 5% of the bases have been estimated to be evolutionarily constrained (Marques and Ponting 2009). Accordingly, early studies called these RNAs “transcriptional dark matter” that was considered to be largely nonfunctional although a low level or even lack of appreciable conservation do not necessarily imply lack of function (Pang et al. 2006). A well-studied example is the Xist RNA, which is weakly conserved but is essential for the X chromosome dosage compensation in mammals (Nesterova et al. 2001). Moreover, the limited overall conservation notwithstanding, many of the lncRNAs still contain strongly conserved regions (Siepel et al. 2005). In general, lncRNAs show reduced substitution and insertion/deletion rates compared with random expectation, which has been interpreted as a signature of purifying selection (Ponjavic et al. 2007; Guttman et al. 2009). The recent study by Chodroff et al. (2010) reports the first attempt to characterize lncRNA orthologs present in eutheria, marsupials, and birds. Several lncRNAs analyzed in this work have been shown to possess conserved transcript structures and expression patterns.
In general, the expression levels of lncRNAs tend to be lower than those of protein-coding genes (Mattick and Makunin 2006). A comparative analysis of the expression patterns of intergenic transcripts in brain, heart, testis, and lymphoblastoid cell lines of human and chimpanzee has revealed a tissue-specific conservation pattern, which is similar to that of protein-coding genes. Altogether, approximately half of the transcripts that showed differences in expression between the two species come from the intergenic regions. Thus, lincRNAs might have played an important role in the phenotypic differentiation between these two primates (Khaitovich et al. 2006).
Some lncRNA genes might have evolved from protein-coding genes as exemplified by the thoroughly characterized Xist RNA (Duret et al. 2006; Elisaphenko et al. 2008), suggesting the possibility that some properties of lncRNAs and their genes might be similar to those of protein-coding genes. Protein-coding genes that are expressed highly and broadly across tissues on average are more evolutionarily conserved than genes that have lower expression level and breadth; a significant negative correlation between the expression and evolutionary rate of protein-coding genes has been revealed for all model organisms for which extensive expression data are available (Duret and Mouchiroud 2000; Pal et al. 2001; Krylov et al. 2003; Zhang and Li 2004). This negative correlation extends also to 3′UTRs of protein-coding genes although not to 5′UTRs (Jordan et al. 2004). The universal anticorrelation between the evolutionary rate and the expression level observed for the protein-coding genes has been explained within the framework of the misfolding-driven concept of protein evolution according to which the selective pressure to minimize the misfolding is the strongest for highly expressed proteins (Drummond and Wilke 2008, 2009; Wolf et al. 2010; Yang et al. 2010). Here, we show that the universal dependency between the evolution and the expression holds also for the lincRNA genes and is comparable in magnitude to the anticorrelation detected for protein-coding genes.
Complete mouse and human probe sets were downloaded from NRED database (Dinger et al. 2009) in the tab-delimited and BED (Browser Extensible Data, containing genomic coordinates) formats. The probe sets from platform GNF Atlas 2 (Mouse and Human), with target classification “Noncoding Only” were used for further analysis. This resulted in 5444 mouse and 917 human probe sets. Only the probe sets that mapped to intergenic regions of the mouse and human genomes (i.e., between two adjacent protein-coding genes) were used and analyzed. This was achieved by selecting only the probe sets with the value “Intergenic” in both ProbeGenomicContextSense and ProbeGenomicContextAntisense fields of the tab-delimited file.
As a next step, one-to-many list of probe sets and their corresponding Target Accession IDs (NCBI GenBank IDs of RNAs; TargetAccessionID column of NRED file) was transformed into a one-to-one list, where accession ID corresponded to the single probe set, preferentially the one not with _s_at or _x_at suffix which, according to Affymetrix, non-uniquely map to the genomes in several locations. This list of lincRNAs was further filtered: genes shorter than 200 nucleotides were removed. This procedure yielded the final set of NCBI GenBank Accession IDs of 2390 mouse and 589 human RNAs and their corresponding microarray expression probe sets. There were alternatively spliced isoforms in these data sets, the fraction of such isoforms was <10%.
To control for the possibility of contamination of the lincRNA data set with protein-coding genes, two tests were performed: 1) mouse lincRNA sequences were searched against protein sequence databases using the BlastX program, 2) the coding potential of lincRNAs was predicted using the SYNCODE program (Rogozin et al. 1999).
To analyze the transcriptome of normal tissues without bias from cancerous tissues, only data for normal (non-cancerous) tissues (73 human and 61 mouse tissues) were used. Log2-normalized expression levels (A-values) <7.0 (threshold representing a conservative expression level above background, according to the NRED database) were ignored, and median values of expression for each probe set across the tissues were calculated and designated as probe set’s Median Expression Level. Expression Breadth, the number of tissues where the probe set has been expressed above the A-value threshold (7.0), was also calculated. The probe sets with _s_at and _x_at suffixes were discarded as promiscuous because they map to several regions in the genome, and it would be incorrect to associate their expression levels to one particular lincRNA ID.
As the final step, median expression level and breadth values of probe sets were associated with their corresponding lincRNA IDs and used throughout the entire work. For mouse, this data set contained 2013 lincRNAs; human data set contained 519 lincRNAs.
As an alternative method of measuring expression levels, counting of the Expressed Sequence Tags (ESTs) was employed. The sequences of lincRNAs were extracted from the UCSC Table Browser (see Sequences, Alignments, and Evolutionary Distances in the Materials and Methods section). These sequences were searched against the human and mouse subsets of EST database (archives est_human.tar.gz and est_mouse.tar.gz, available from the NCBI FTP site at ftp://ftp.ncbi.nih.gov/blast/db/) using the program BlastN 2.2.24+ (Camacho et al. 2009) from the Blast package. The number of ESTs with >97% identity and alignment length of at least 200 nucleotides or longer was counted. The procedure was repeated twice, with and without masking the human/mouse repeat sequences; median value was calculated and assigned to lincRNAs as an expression level based on EST count.
The mouse RNA-seq data for eight tissues (the ENCODE project; modENCODE Consortium, 2009) was downloaded from the UCSC genome browser Web site (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeLicrRnaSeq/) and pooled together. The number of RNA-seq hits (M) was calculated for each mouse lincRNA. The overall expression of each lincRNA was estimated using a log2 normalization: Exp-RNA-seq = log2[(M + 1)/L] + 10, where L is the length of the lincRNA. This normalization has been suggested as a robust estimator of expression using RNA-seq data (Lee et al. 2011).
The genomic coordinates and sequences of exons and introns of Target RNA Accession IDs (NCBI GenBank IDs of RNAs; TargetAccessionID column of NRED file) were downloaded from the UCSC Table Browser (Karolchik et al. 2004), from all_mrna tables of mouse mm8 and human hg18 assemblies. Multiple alignments of these regions were fetched from Galaxy (Blankenberg et al. 2010; Goecks et al. 2010). Two different 17-way multiZ alignments—with human (hg18) and mouse (mm8) reference genomes—were used; only mouse (mm8), human (hg18), chimp (panTro1), macaque (rheMac2), rat (rn4), and dog (canFam2) alignments were downloaded. Alignments of exons and introns <100 nt were discarded. The exon and intron alignments for each lincRNA were concatenated, and two alignments were produced per lincRNA: “stitched” exons and “stitched” introns.
Calculation of percentage of insertions/deletions (indels) in the alignments was performed using an in-house tool written in C++. The program employed the following algorithm: the number of indel positions of pairwise alignments and the alignment length were computed. Finally, the ratio/percentage of indels and alignment length were calculated. In order to eliminate unreliable alignments containing an excess of indels, only alignments with the total length of indels below a threshold were used for subsequent analysis; three indel thresholds (15%, 30%, and 45%) were applied. Pairwise evolutionary distance matrices for concatenated alignments of exons or introns from human and mouse linc RNA genes were calculated using the DNADIST program from the PHYLIP package (Felsenstein 1996), with the Kimura nucleotide substitution model.
The C.A.MAM program (Bohning et al. 1992, 1998) was used to reveal outliers in the distributions of evolutionary distances and expression. This program attempts to decompose each distribution into two or more normal or log-normal distributions. If the decomposition procedure produced one distribution, no outliers were removed. If the decomposition produced several distributions, only one distribution with the largest number of data points was used for further correlation analysis. The Pearson (r), Spearman (rho), and Kendall (tau) correlation coefficients and their corresponding P values were calculated using an ad hoc R-language script. To eliminate the potential effect of contamination by the protein-coding genes on the observed correlations, all the lincRNAs with a similarity to the protein-coding genes were removed from the lincRNA set and the correlation coefficients were recalculated. To obtain the list of lincRNAs similar to proteins, exons (separate as well as concatenated) were compared with the mouse RefSeq proteins using the BlastX program. Hits with the E-value <10−4 and alignment length >20 amino acids were considered suspect. The two lists of suspect lincRNAs originated from the analyses of concatenated and separate exons (853 and 860 lincRNAs, respectively) were merged into the final set of 907 lincRNAs potentially containing protein-coding regions, which were removed from the mouse lincRNA set, and the correlation coefficients were recalculated for the remaining lincRNAs.
To compare the results obtained for lincRNA genes with the trends observed for protein-coding genes, a control gene set was compiled consisting of 7,711 well-annotated orthologous human and mouse protein-coding genes that yielded high-quality genome alignments. For each human and mouse pair from the UCSC list of the human-mouse orthologs (http://genome.ucsc.edu/cgi-bin/hgTables), we identified the best Blast hit and estimated the overlap of the protein-coding sequences. Only full-length protein-coding transcripts with links to the RefSeq database and with up to 75% of protein-coding region aligned were included in the control group and used for the subsequent analysis. Mouse genome sequences were downloaded from ftp://hgdownload.cse.ucsc.edu/goldenPath/mm8/chromosomes/. Genomic coordinates of extended human gene loci were transferred to the mouse genome sequence using the UCSC Lift Genome Annotations tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver). Mammalian genomic repeats were masked, and extended genomic loci of orthologous human–mouse genes were aligned using the OWEN program (Ogurtsov et al. 2002) and annotated. In case of alternatively spliced forms, the longest CDSs and UTRs were considered. For the protein-coding regions, the alignment of nucleotide sequences was guided by the amino acid sequence alignment. Core hits with E-values <10−3 produced by OWEN program were extracted for analysis as described previously (Ogurtsov et al. 2008). Synonymous and nonsynonymous divergence (Ks and Ka, respectively) were calculated using the PAML program (ftp://abacus.gene.ucl.ac.uk/pub/paml) with default parameters and the yn00 estimation method (Yang 1997).
RNA secondary structures were predicted using two methods, which are based on the global and local free energy estimations, respectively. The lincRNAs were computationally “folded” and the predicted minimum free energy of the secondary structure was calculated, using our implementation of the algorithm that employs nearest neighbor parameters for evaluation of free energy (Zuker 2003). Energy minimization was performed by the dynamic programming method that finds the secondary structure with the minimum free energy with sums contributing from stacking loop length using an improved algorithm for evaluation of internal loops; this program “folds” sequences up to the 28,000 nucleotide long (Ogurtsov et al. 2006). Local free energy was estimated for the pairs of highly similar slow evolvind sequences, extracted from the human–mouse alignments (Kondrashov and Shabalina 2002). The secondary structure of the expressed lincRNAs was inferred by intersecting their chromosomal positions with the positions of the RNAz structural predictions made across the entire mouse genome, as previously described in the NRED database (Washietl et al. 2005; Mercer et al. 2008; Dinger et al. 2009). Conserved RNA secondary structures were considered significant at the confidence threshold level of P > 0.5, where P is the significance of the classification, which is quantified as “RNA-class probability” (Gruber et al. 2007).
To avoid potential complications caused by the coordinated expression of protein-coding genes and lncRNAs, we chose to analyze only the sets of mammalian lincRNAs. The data of 5,444 “Noncoding Only” mouse probe sets were downloaded from NRED database (Dinger et al. 2009). After discarding the probe sets that did not map to intergenic space and establishing one-to-one relationship between RNA IDs and their corresponding probe set IDs (see Materials and Methods), we obtained the final set of 2,390 mouse lincRNAs (NCBI GenBank Accession IDs of RNAs) of which 977 contained introns. After discarding the probe sets with very low median expression levels and those with equivocal genome mapping, the final set of 2,013 mouse lincRNAs, including 918 intron-containing ones, was obtained (for details, see Materials and Methods). For humans, the data for 917 probe sets were downloaded, and the same procedure of removing low-expressed or equivocally mapped lincRNAs yielded the final set of 519 lincRNAs, including 211 intron-containing genes.
Thus, the current set of experimentally verified human lincRNAs was several times smaller than the corresponding mouse set, in agreement with previous observations (e.g., Rearick et al. 2010). Most likely, this difference reflects the different states of lncRNA annotations rather than a genuine excess of lincRNAs in rodents.
Table 1 summarizes the characteristics of the sets of mouse and human lincRNAs analyzed here.
The traditional gauge of selection in protein-coding genes is the ratio of nonsynonymous (Ka) over synonymous (KS) substitutions. Ka/KS < 1 is thought to indicate purifying selection, whereas Ka/KS > 1 is construed as the signature of positive selection (Hurst 2002). In the case of lncRNA genes, the substitution rate of exons (Ke) may be considered analogous to Ka, whereas the substitution rate in intronic sequences (Ki) is a logical choice of the proxy for neutral evolution, analogously to the traditional use of KS. Indeed, apart from pseudogenes, introns are among the best candidates for neutrally evolving sequences (Louie et al. 2003; Hoffman and Birney 2007; Resch et al. 2007). Purifying selection in lncRNA exons potentially can be defined by Ke/Ki < 1.
Substitution rates for exons and introns in intron-containing lincRNA genes (~41% human and mouse lincRNAs contain introns; see table 1) are shown in table 2 and supplementary table S1, Supplementary Material online. The indel cutoff (we used alignments with the total length of indels below a threshold of 15%, 30%, or 45%) employed to vary alignment stringency does not qualitatively influence the results although higher statistical significance was observed under the more stringent criteria (15% and 30% indels) (table 2 and supplementary table S1, Supplementary Material online). The substitution rate of mouse lincRNA exons was found to be significantly lower compared with mouse introns (Ke/Ki < 1; table 2 and fig. 1). These results suggest that purifying selection acts on exons of lincRNA genes and are consistent with earlier observations (Ponjavic et al. 2007; Guttman et al. 2009). Additionally, the distribution of substitution rates was notably wider for concatenated exons of mouse lincRNAs than it was for concatenated introns (fig. 1 and supplementary figure S1, Supplementary Material online), indicative of the variance in the intensity of the purifying selection on mouse lincRNA genes. The lower substitution rates of the exons compared with the introns were observed for human lincRNA genes as well (supplementary table S1, Supplementary Material online) although the statistical support was weaker due to the smaller size of the human lincRNA set. The significantly reduced substitution rate in the mouse and human lincRNA exons was reproduced when all lincRNAs (intron containing and intronless) were used for the calculation of exon substitution rates (supplementary tables S2 and S3, Supplementary Material online). The reduced substitution rate of the lincRNA exons compared with the adopted neutral baseline (in this case, the substitution rate for introns) and the broad distribution of Ke values qualitatively resemble the case of protein-coding exons, which are almost universally subject to purifying selection of widely varying strengths (Ka/KS <1) (Koonin and Wolf 2010). However, the purifying selection on the exons in the lincRNAs is much weaker than on nonsynonymous positions in the protein-coding genes (supplementary figure S1, Supplementary Material online). Both the strength and the shape of the distribution of the substitution rates in lincRNA exons more closely resemble synonymous than nonsynonymous substitutions in protein-coding genes (supplementary figure S1, Supplementary Material online) although the differences in methods used to estimate substitution rates in noncoding and coding sequences preclude a direct quantitative comparison (Resch et al. 2007). As is the case with KS, neutral evolution of intron sequences is only an approximation because some introns contain embedded genes (e.g., coding for microRNAs) and can be constrained for other reasons as well (Haddrill et al. 2005; Gazave et al. 2007). Therefore, the estimates obtained here represent the low bound of the selective pressure affecting exons of lincRNAs.
The range of expression levels of lincRNAs measured using microarrays is narrow, with the vast majority of lincRNAs having median expression <9 (log2-normalized A-values, see Materials and Methods) (supplementary figure S2, Supplementary Material online). This distribution is quite different from the distribution of the expression levels of protein-coding genes, which includes a much greater fraction of highly expressed genes (supplementary figure S3, Supplementary Material online). In addition to microarrays, we also used the raw number of ESTs hits as an alternative measure of expression level (supplementary figure S4 and tables S4 and S5, Supplementary Material online). There was a strong highly significant correlation between the expression levels extracted from the microarray data and those obtained using EST (Pearson CC = 0.39, P < 10−62) (supplementary figure S5, Supplementary Material online).
Table 3 summarizes the Pearson, Spearman, and Kendall correlation coefficients between evolutionary distances and the median expression levels of lincRNAs estimated using microarrays. For the exons of mouse lincRNAs, statistically significant negative correlation was consistently observed between the sequence evolution rate and the expression level, with correlation coefficients mostly in the range of 0.1–0.16 (table 3, figs. 2A–C, and supplementary figure S6, Supplementary Material online). In contrast, for introns, the correlation coefficients were very low and statistically not significant, that is, there was negligible or no connection between evolutionary rate and expression (table 3). The results for the human lincRNAs corroborated the mouse data and also revealed negative correlations for exons but not for introns, although the statistical significance of the results was inevitably lower due to the smaller size of the data set (fig. 2D, supplementary figure S7 and table S6, Supplementary Material online). As an alternative measure of expression level, we employed EST counts and showed that the rate-expression correlation pattern obtained using this approach was similar to the microarray results, that is, there was a significant negative correlation for human and mouse lincRNA exons but not for introns (supplementary tables S7 and S8, Supplementary Material online). Using maximum expression levels or the 75% quantile of the distribution of the expression levels instead of the median produced similar results, confirming that the observed negative correlation between expression and evolutionary rate is a robust property of lincRNAs (supplementary tables S9 and S10, Supplementary Material online). Analysis of the expression breadth across the tissues also showed a significant negative correlation between the breadth and the lincRNA evolutionary rates although the typical correlation coefficients (0.04–0.12) were smaller compared with the median of expression and EST counts (results not shown).
Further analysis using the mouse RNA-seq data strongly supported the consistent negative correlation between the expression of lincRNAs and the rate of evolution of lincRNA exons (fig. 3 and table 4). In this case, the observed correlations were a uniformly highly statistically significant support: all P values were <0.000001. The correlation coefficients obtained when the RNA-seq data were used as the measure of expression (table 4) were larger than the corresponding correlation coefficients obtained with the microarray data (table 3). This difference could be due to the truncation of microarray data (the threshold 7.0 was applied, see Materials and Methods) because of which the distributions of expression data were asymmetrical (fig. 2), causing problems for correlation analysis. The RNA-seq data showed no such asymmetry (fig. 3).
The authors of the NRED database have employed different techniques to remove contaminating protein-coding genes from the lincRNA data set (Dinger et al. 2009). Nevertheless, to control for the possibility that the observed correlations were caused by a contamination of the set of lincRNAs with protein-coding genes, we removed all sequences with significant similarity to protein-coding genes from the mouse lincRNA data set and recalculated the correlation coefficients (see Materials and Methods). The correlation between the evolutionary rate and expression level remained negative and statistically significant in all cases (supplementary table S11, Supplementary Material online). We then performed an additional experiment to control for a possible admixture of protein-coding genes in the analyzed lincRNA data sets: the coding potential of lincRNAs was predicted using the SYNCOD program (Rogozin et al. 1999). This method has been shown to produce a relatively low rate of overpredicted protein-coding regions (Rogozin et al. 1999). The SYNCOD analysis identified 94 potential protein-coding regions in 2390 lincRNAs (exons only) and 527 potential protein-coding regions in 2462 introns of lincRNA genes. The mean density of protein-coding regions in lincRNAs (one potential protein-coding region per 47 Kb) was close to that in introns (one per 45 Kb). The frequency of potential protein-coding regions was similar in the direct and complementary strands for both sets (~50%, all differences are not statistically significant). The false positive rate for the SYNCOD method has been estimated at ~0.06–0.07 (Rogozin et al. 1999), which is similar to the fraction of potential protein-coding regions in the exons (94/2390 = 0.04). Thus, taken together, the results of these analyses appear to effectively rule out a significant contamination of the analyzed lincRNA data set with protein-coding genes.
To control for the possibility that the observed negative correlation could be (at least, partially) due to regional substitution biases across the genome (Resch et al. 2007), we analyzed the substitution rate of exons divided by the substitution rate of introns within the same gene (Ke/Ki, supplementary table S12, Supplementary Material online). This ratio is analogous to Ka/KS and is expected to reflect the strength of purifying selection that affects the exons of lincRNAs. Negative correlation between the Ke/Ki ratio and the expression level was consistently observed although some values were not statistically significant due to small sample sizes (supplementary table S12, Supplementary Material online). Thus, a moderate but highly significant negative correlation between the evolutionary rates (or selection strengths) and the expression levels of human and mouse lincRNA exons is a consistent feature of the evolution of lincRNAs.
We further sought to compare the magnitude of the negative correlation between the evolutionary rate and the expression level for lincRNAs and for protein-coding genes. Rates of nonsynonymous substitutions and synonymous substitutions were calculated using human–mouse pairwise alignments of protein-coding genes. For the purpose of this comparison, we used a sampling procedure that was repeated 1,000 times (for details, see table 5). Each sample of the orthologous protein-coding genes from the human and mouse had the size and the mean evolutionary distance approximately equal (according to the Student's t-test, table 5) to those for lincRNA pairwise comparisons (table 2). For all genes used to draw these control samples, the Ka/KS values were below unity (supplementary figure S8, Supplementary Material online) indicating that these were bona fide protein-coding genes. Analysis of correlations between the evolutionary rate and the expression level for these protein-coding genes showed that Pearson correlation coefficient values for lincRNAs were well within the distributions of correlation coefficients for nonsynonymous substitution rates (Ka) across the samples of protein-coding genes, and in some cases, on the negative tail of these distributions (table 5 and fig. 4). Thus, the negative correlation between the evolutionary rate and expression level of lincRNAs is at least as strong as that for nonsynonymous substitution rates in the mammalian protein-coding genes at the same level of divergence. When compared with the synonymous rates in the same samples of protein-coding genes, lincRNAs showed a stronger correlation (table 5 and fig. 4).
The demonstration that the magnitude of the correlation between the evolutionary divergence and the expression level is similar for lincRNAs and for protein-coding genes raises the key question about the biological factors underpinning such correlations. It has been shown that RNA folding is crucial for mRNA stability and functionality and is correlated with the expression level and breadth (Nackley et al. 2006; Shabalina et al. 2006; Parmley and Hurst 2007; Zhang et al. 2010). Here we analyzed folding characteristics of lincRNAs and the connections between predicted lincRNA secondary structure stability, expression level, and the rate of evolution. We found an abundance of predicted stable folding (calculated as the fraction of paired nucleotides in the optimal folding of the full-length transcript) in the lincRNA data set. The distributions of the fraction of base-paired nucleotides were similar for lincRNAs and the mRNA control set (supplementary figure S9, Supplementary Material online), which was compiled taking into account the nucleotide content, length, and gene structure of the lincRNAs (Shabalina et al. 2006). A significant positive correlation was detected between the fraction of paired nucleotides in the predicted optimal folding of mouse lincRNAs and their expression level, which was calculated from the EST counts (fig. 5A) or from GenAtlas 2 database of the microarray data (fig. 5B). A similar connection between the folding and the expression level was observed for the human lincRNAs (supplementary table S13, Supplementary Material online). Free energy (ΔG) normalized against the transcript length in the optimal folding showed the same trend (data not shown).
To disentangle the relationships between the structural features, evolution and expression of lincRNAs, we constructed a linear regression model to predict gene expression patterns based on RNA folding and/or evolutionary rates. Taking into account the connection between the expression level and the evolutionary rate of lincRNA genes, linear regression analysis showed that the evolutionary variable (Ke for the mouse–rat comparison) was predictive with respect to the expression level of the mouse lincRNA genes (EST abundance), independent of RNA structural features (R = 0.122 on the validation set and R = 0.105 on the training set). Conversely, a model that used the RNA structural parameter (fraction of paired nucleotides in the mouse lincRNA folding, PRF) alone yielded R = 0.167 on the validation set and R = 0.14 on the training set. The two variables had orthogonal predictive power, that is, R2 values for cumulative structural and evolutionary predictions (R2 = 0.033 = 0.1822) were close to the sum of R2 (0.033–0.036 = 0.023 + 0.013) values for independent structural (R2 = 0.023 = 0.1522) and evolutionary (R2 = 0.013 = 0.1122) predictions. Thus, the multiple regression models indicate that the two variables, Ke and PRF, independently correlate with lincRNA gene expression (F-test, P < 0.01). These findings are in agreement with the corresponding observations for mRNAs (SAS, unpublished data). Consistent with these observations, we did not find significant correlations between the evolutionary rates of lincRNAs and the predicted folding.
The lincRNAs comprise a substantial part of the mammalian RNome but very little is currently known about their functions and evolution. Together with previous observations, the results described here suggest (even if indirectly) that many lincRNAs are indeed functional molecules that are subject to relatively weak but significant purifying selection as determined from the Ke/Ki ratio. As such, lincRNAs genes provide evolutionary biologists with a unique data set to investigate the general and more idiosyncratic features of evolution by comparing their evolutionary patterns with those of protein-coding genes. Unlike the highly conserved structural RNAs (rRNAs and tRNAs) or small microRNAs, the lincRNA genes closely resemble protein-coding genes in terms of diversity, size, and gene architecture. The fundamental difference is that the transcripts of these genes are not translated into proteins but rather function directly as RNA molecules. Evolution of protein-coding genes shows correlations of varying strengths with several molecular phenomic variables (Koonin and Wolf 2006; Wolf et al. 2006). The most consistent and typically strongest is the negative correlation between the rate of sequence evolution and expression level of protein-coding genes or protein abundance (Drummond and Wilke 2008, 2009; Wolf et al. 2010). This relationship between evolution and expression of protein-coding genes inspired the hypothesis that evolution of proteins is driven primarily by selection for robustness to misfolding, which is partly caused by the errors of translation (Drummond and Wilke 2008, 2009). Evolutionary models built on the assumption that the deleterious effect of misfolding is the primary fitness cost associated with mutations in the protein-coding genes have been shown to be compatible both with the dependency between the evolutionary rate and expression and with the universal distribution of the evolutionary rates of protein evolution (Drummond and Wilke 2009; Lobkovsky et al. 2010). In view of this unifying hypothesis of protein evolution, we were interested to determine whether the evolution of lincRNAs is similarly connected with expression.
The results presented here reveal the existence of a relatively weak but consistent and highly significant negative correlation between the evolutionary rate and expression level of lincRNAs. Introns of lincRNA genes provide an internal control: the absence of correlation for the intronic sequences indicates that the observed connection between evolution and expression has to do with structure and function (or robustness to malfunction) of the mature lincRNA molecules. We further showed that the level of correlation between evolutionary distances and expression is similar for lincRNAs and protein-coding genes evolving under comparable constraints. The connection between expression and evolution in mammals is relatively weak for both lincRNA and protein-coding genes, with only 1–2% of the variance in evolutionary rates accounted for by expression. These findings are compatible with the previous observations that the negative correlation between the sequence evolutionary rate and the expression level is the weakest in mammals among all tested model organisms (Drummond and Wilke 2008). It seems most likely that this limited dependency is caused by the general weakness of purifying selection in mammals due to their characteristic low effective population sizes (Lynch and Conery 2003; Lynch 2006). Accordingly, mammals might not be the best choice of the model to study the causes of the dependency between evolution and expression for protein-coding gene. However, by the same token, this seems to be the only model on which a comparison of the evolutionary regimes of protein-coding genes and “protein-like” lincRNAs is possible because large diverse repertoires of long ncRNAs apparently could not evolve in organisms subject to strong selective constraints (Lynch 2007; Koonin and Wolf 2010).
We then examined potential connections between the predicted stability of lincRNA folding, their expression, and the rate of evolution. A limited in magnitude but significant positive correlation was detected between the predicted folding and expression: lincRNA molecules with greater folding potential show a tendency to be highly expressed. A positive correlation between the (predicted) RNA stability and expression level has been described previously for mammalian mRNAs (Shabalina et al. 2006). However, we found no significant link between folding and the rate of evolution of lincRNAs and further observed that RNA folding and sequence evolution rate contributed to the expression level of lincRNAs independently.
The findings reported here show that the link between evolution and expression is a fundamental dependency that is not limited to protein-coding genes. Whether or not the deleterious effects of misfolding, leading to the formation of nonfunctional protein or RNA molecules, represent the principal factor behind this universal link remains to be determined. Certainly, the process of RNA folding is fundamentally different from protein folding as the two processes are based on different types of molecular interactions. Nevertheless, there is also undeniable general similarity between the folding processes of these two classes of biomolecules. Indeed, both proteins and RNAs are heteropolymers that fold to form well-defined secondary structure elements through local interactions followed by the formation of a unique 3D conformation through nonlocal interactions. Moreover, RNA misfolding is common if not thoroughly understood, and the increasingly apparent prevalence of RNA chaperones attests to its biological relevance (Cristofari and Darlix 2002; Bhaskaran and Russell 2007; Rajkowitsch et al. 2007; Russell 2008; Semrad 2011). At face value, the observations reported here on the lack of connection between predicted RNA folding and evolutionary rate and the independence of the contributions of predicted folding and evolutionary rate to lincRNA expression can be taken as argument against a causal connection between lincRNA misfolding and the evolution–expression coupling. However, these observations should be interpreted with much caution. Prediction of the base-pairing potential is a blunt instrument that certainly does not reveal the true complexity of the RNA folding process and might not be able to distinguish well between correctly folded and misfolded RNA molecules. However, according to our estimations, about 60% of nucleotides are paired in lincRNAs and mRNAs (Shabalina et al. 2006), which is comparable with the base pairing values for some experimentally characterized mRNAs (Kertesz et al. 2010). Also, some of the local predicted structures for lincRNAs are in agreement with the structures predicted by biochemical probing, for example, for the A region of Xist RNA (Maenner et al. 2010).
The distinct possibility remains that misfolded lincRNAs are deleterious similar to misfolded proteins, and this effect might explain the connection between their evolutionary rate and expression. Certainly, alternative explanations for this universal link could be relevant as well, for example, the potentially greater number of both functional and nonfunctional interactions in highly expressed proteins and RNAs constraining their evolution. Furthermore, it is impossible to rule out that, although the correlations between expression and evolution are of the same sign and similar in magnitude for proteins and lincRNAs, the underlying causes are substantially different (even if this possibility is less than parsimonious).
The functions of the numerous lncRNAs remain largely unknown but the results presented here support previous findings that many of these RNAs are subject to purifying selection, albeit relatively weak, and so are predicted to be functional. We found that lincRNAs recapitulate the universal negative correlation between the evolutionary rate and expression that has been reported for protein-coding genes from diverse model organisms. Moreover, the magnitude of the correlation for the lincRNAs was comparable to the magnitude of the correlation that we identified in equal-sized control sets of protein-coding genes with levels of sequence conservation similar to those observed for lincRNAs. The expression level of the lincRNAs also was significantly and positively correlated with the predicted extent of lincRNA molecule folding (base paring). However, there was no significant correlation between lincRNA folding and evolutionary rate, and the contributions of the evolutionary rate and folding to the expression level were found to be independent. The results of this work indicate that the anticorrelation between evolutionary rate and expression level is a general feature of gene evolution. The causative factors behind this fundamental dependency that might include similar fitness effects of protein and RNA misfolding remain to be elucidated.
We thank Liran Carmel, Joshua Cherry, Jean Thierri-Mieg, Mikhail Galperin, Alexander Lobkovsky, Kira Makarova, and Yuri Wolf for useful discussions. This work was supported by the Intramural Research Program of the National Library of Medicine at National Institutes of Health (US Department Health and Human Services).