The discovery of many RNA sequences that do not encode proteins (non-coding RNAs or ncRNA) and have biological functions beyond those of tRNA and rRNA, has significantly expanded the known role of RNA in diverse cellular processes. Consequently, there is a growing effort to systematically identify ncRNAs utilizing both experimental and computational techniques. Experimental approaches are typically used to identify non-coding portions of an organism's genome that are actively being transcribed. These approaches are not dependent on the identification of conserved RNA sequences or secondary structures, and therefore are well-suited for the discovery of unstructured or poorly-conserved ncRNAs. However, experimental limitations can cause some RNAs to be missed, and the false-positive rate may be high due to "transcriptional noise" [1
]. Alternatively, computational methods seek to identify evidence of conserved RNA sequences and secondary structures through comparative genomics [3
]. However, such methods usually cannot be used to identify RNA motifs that may not have conserved secondary structure, are small with few base-pairing elements, or are not well-represented in genomic sequence databases.
Marine metagenomic sequence data are a proven resource for the discovery of novel protein diversity and have provided additional examples for thousands of previously identified open reading frames (ORFs) with no known homologs [5
]. While there have been surveys conducted with the marine metagenome to discover additional examples of known ncRNAs [6
], there have been no studies explicitly examining these data for novel RNA motifs, in part due to unique computational challenges inherent to metagenomic datasets. Specifically, the exceedingly large amount of sequence data available (~7 billion base pairs), relatively poor annotation of protein coding regions due to a high frequency of fragmentary genes that result from short sequence reads, and comparatively high sequencing error rates make metagenomic data analysis difficult [8
To circumvent many of the challenges associated with analyzing metagenomic sequence data, we have used the genome of 'Cand
. P. ubique' HTCC 1062 as a starting point to discover new RNA motifs within the marine metagenome. Bacteria of the SAR11 clade, of which 'Cand
. P. ubique' is a representative, are found throughout the world's oceans and are the dominant aerobic heterotrophs in marine surface waters [11
]. Given its numeric advantage, genes from members of the SAR11 clade are well-represented in marine metagenomic libraries with nearly 20% of sequence reads from the Global Oceanographic Survey (GOS) matching most closely to genes present in the 'Cand
. P. ubique' genome [12
]. Only ~30% of the GOS reads could be aligned well to the 584 available reference genomes. The other predominant genera represented in the GOS data are Prochlorococcus, Synechococcus, Burkholderia
, and Shewanella
, none of which are closely related to 'Cand
. P. ubique'. While, alignments to every reference genome were identified, typically they showed identity to regions corresponding to large, highly conserved genes [13
At 1.3 million base pairs, the genome of 'Cand
. P. ubique' is the smallest known for a free-living organism, but it appears to encode for nearly all the basic functions of Alphaproteobacteria cells [14
]. The genome contains very little non-coding DNA, with a median intergenic region (IGR) length of 3 nucleotides. In addition, the organism has remarkably low GC content (29%). While evaluating nucleotide composition is usually not a viable method for identifying ncRNAs [15
], in genomes with a strong AT bias or hyperthermophilic environment, the higher GC content necessary to maintain a stable RNA structure may be used to identify candidate ncRNAs [16
. P. ubique' offers an ideal opportunity to utilize nucleotide composition as its genome has very few long IGRs, which are generally low GC (23% on average).
In the current study we combine nucleotide composition with comparative genomics approaches to identify novel structured RNA motifs in 'Cand. P. ubique' and the marine metagenomic data. First, we demonstrate that longer, higher GC 'Cand. P. ubique' IGRs are much more likely to contain structured RNAs (rRNAs, tRNAs, etc.). Subsequently, we utilized the IGRs in 'Cand. P. ubique' with similar properties that lack assigned ncRNAs as the starting point for a comparative sequence analysis strategy that takes advantage of marine metagenomic sequences. We discovered four likely structured ncRNAs including a new riboswitch class, and three other candidate cis-regulatory motifs. In addition we describe several other conserved IGRs that encode potential structured RNA elements.