|Home | About | Journals | Submit | Contact Us | Français|
Selenocysteine (Sec) is known as the 21st amino acid, a cysteine analogue with selenium replacing sulphur. Sec is inserted co-translationally in a small fraction of proteins called selenoproteins. In selenoprotein genes, the Sec specific tRNA (tRNASec) drives the recoding of highly specific UGA codons from stop signals to Sec. Although found in organisms from the three domains of life, Sec is not universal. Many species are completely devoid of selenoprotein genes and lack the ability to synthesize Sec. Since tRNASec is a key component in selenoprotein biosynthesis, its efficient identification in genomes is instrumental to characterize the utilization of Sec across lineages. Available tRNA prediction methods fail to accurately predict tRNASec, due to its unusual structural fold. Here, we present Secmarker, a method based on manually curated covariance models capturing the specific tRNASec structure in archaea, bacteria and eukaryotes. We exploited the non-universality of Sec to build a proper benchmark set for tRNASec predictions, which is not possible for the predictions of other tRNAs. We show that Secmarker greatly improves the accuracy of previously existing methods constituting a valuable tool to identify tRNASec genes, and to efficiently determine whether a genome contains selenoproteins. We used Secmarker to analyze a large set of fully sequenced genomes, and the results revealed new insights in the biology of tRNASec, led to the discovery of a novel bacterial selenoprotein family, and shed additional light on the phylogenetic distribution of selenoprotein containing genomes. Secmarker is freely accessible for download, or online analysis through a web server at http://secmarker.crg.cat.
Most proteins are made of twenty amino acids. However, there is a small group of proteins that incorporate a 21st amino acid, Selenocysteine (Sec). These proteins are called selenoproteins and are present in some, but not all, species from the three domains of life. Sec is inserted in selenoproteins in response to the UGA codon, normally a stop codon. A Sec specific tRNA (tRNASec), which only exists in the organisms that synthesize selenoproteins recognizes the UGA codon. tRNASec is not only indispensable for Sec incorporation into selenoproteins, but also for Sec synthesis, since Sec is synthesized on its own tRNA. The structure of tRNASec differs from that of canonical tRNAs, and general tRNA detection methods fail to accurately predict it. We developed Secmarker, a tRNASec specific identification tool based on the characteristic structural features of the tRNASec. Our benchmark shows that Secmarker produces nearly flawless tRNASec predictions. We used Secmarker to scan all currently available genome sequences. The analysis of the highly accurate predictions obtained revealed new insights into the biology of tRNASec.
Selenoproteins contain the non-universal amino acid selenocysteine (Sec), a selenium-containing cysteine analogue. Selenoproteins are present in the three domains of life [1–3]. An estimated ~20% of the sequenced prokaryotic genomes encode selenoproteins [2, 4–6]. Among eukaryotes, selenoproteins are present across most metazoan lineages , although complete loss of selenoproteins has been reported in some insects [8–11] and nematodes . Selenoproteins are missing in all fungi and land plant genomes . Protist lineages show a scattered distribution of the Sec trait (i.e., the usage of Sec in selenoproteins) . Although they constitute a very small fraction of the proteome of a given organism, selenoproteins cover important roles in antioxidant defense, redox regulation, thyroid hormone activation and others . Many of them have been shown to be encoded by essential genes in mammals (e.g., [14–16]).
Selenoprotein biosynthesis requires a molecular system of cis- and trans-acting factors dedicated to the synthesis of Sec and to its insertion in the nascent polypeptide chain during translation . Central to this system is the tRNA carrying Sec, tRNASec, which plays a key role in both Sec biosynthesis and insertion. Sec is unique for it is the only known amino acid in eukaryotes whose synthesis occurs on its tRNA, lacking its own tRNA synthetase. [18–21]. The tRNASec is first misacylated with serine by seryl-tRNA synthetase (SerRS) to give Ser-tRNASec. In eukaryotes and archaea, serine is phosphorylated by O-phosphoseryl-tRNA kinase (PSTK), then the phosphoseryl moiety is converted to selenocysteine by Sec synthase (SecS, SepSecS). In bacteria, instead, Ser-tRNASec is directly converted to Sec-tRNASec by the bacterial Sec synthase (SelA). Both in prokaryotes and eukaryotes, the selenium donor for the synthesis of Sec is selenophosphate, which is, in turn, synthesized from selenide by selenophosphate synthetase (SPS/SelD). Sec is inserted in response to the UGA codon–normally a stop codon. During the translation of selenoprotein transcripts, the Sec-specific translation elongation factor (EF-Sec in eukaryotes and archaea, SelB in bacteria) brings Sec-tRNASec to the ribosome  at the Sec encoding UGA codon upon recognition of a secondary structure in the mRNA, the Sec insertion sequence (SECIS), by the SECIS binding protein (SBP2 in eukaryotes, SelB in bacteria).
Due to the non canonical usage of the UGA codon, prediction of selenoprotein genes in genomes is a difficult task, ignored by virtually all widely used computational annotation pipelines. As a result, selenoprotein genes are usually mispredicted, being generally truncated at the 3’ (when UGA is assumed to be the stop codon) or 5’ end (when a AUG downstream of the Sec-encoding UGA is preferred as the site of translation initiation to an upstream AUG that would lead to an in-frame UGA codon). Methods dedicated specifically to the prediction of selenoprotein genes have been developed [23–25], but they still require some non-negligible human curation resources. The efficient identification of a genome marker for Sec utilization would be, in this regard, beneficial since it will help to allocate dedicated selenoprotein annotation resources only when needed. tRNASec is one such marker. Unlike other components of the selenoprotein biosynthesis system, which participate also in other pathways and may thus be found in selenoproteinless genomes, tRNASec is specific to selenoprotein-containing genomes [6, 8, 9, 12].
Prediction of tRNASec is usually carried out with general purpose tRNA detection programs, namely tRNAscan-SE  and aragorn  (e.g., in [8, 28–30]). Even though the two programs have been thoroughly benchmarked for canonical tRNAs, they fail to accurately predict tRNASec genes, often predicting them in selenoproteinless genomes, and failing to predict them in selenoprotein containing genomes .
Here we describe Secmarker, a computational pipeline to predict tRNASec in genomes. Secmarker uses Infernal , and has two main components, first three manually curated covariance models (CMs), corresponding to tRNASec in bacteria, archaea and eukaryotes. Second, a set of filters that reduce substantially the number of false positive produced by Infernal when using these methods. The non-universality of Sec utilization and the absence of tRNASec in organisms without such trait allowed us to design a proper benchmark for tRNASec predictions. Such a benchmark is impossible for the rest of tRNAs, all of which occur practically in all living organisms. Our results show that with the appropriate post-processing filters, Secmarker produces almost flawless tRNASec predictions. Secmarker can quickly scan entire genomes. We ran it on about 10,000 eukaryotic and prokaryotic genomes currently available, and identified highly reliable tRNASec gene candidates in 2,884 of them. Analysis of the results revealed a number of novel insights into the biology and evolution of tRNASec, including the identification of an unusual fold for the tRNA in bacteria, an eukaryotic intron-containing tRNASec, the discovery of a number of genomes containing multiple tRNASec genes likely to be functional, and the tracing of the duplicated copy of human tRNASec, likely a pseudogene, to the root of hominids. Moreover, the analysis of the genomes with predicted tRNASec genes led to the discovery of a novel bacterial selenoprotein family, allowed to refine the phylogenetic distribution of selenoprotein containing genomes within insects, and resulted in the identification of the first non-insect arthropod species lacking selenoproteins.
Secmarker is a tRNASec computational detection pipeline that runs Infernal  with three manually curated tRNASec CMs for archaea, bacteria and eukaryotes. The program scans a nucleotide sequence with the three models using cmsearch from the Infernal package, filters the results, and assigns the cmsearch score to the predicted candidates (Materials and Methods).
The three models incorporate the structural features characteristic of tRNASec in each of the three domains of life (Fig 1). The structure of tRNASec comprises an aminoacyl acceptor arm (A-stem), a dihydrouridine arm (D-stem and D-loop), an anticodon arm (C-stem and C-loop, carrying the UCA anticodon complementary to UGA), a variable arm (V-stem and V-loop) and a TΨC arm (T-stem and T-loop). It is the longest tRNA, with 90–101 nucleotides, rather than the conventional ~75 nucleotides in canonical tRNAs . It has an unusual structure, different from the canonical 7/5 fold in other tRNAs (where 7 and 5 are the number of base pairs (bp) in the A and T stems, respectively). The tRNASec adopts a 9/4 fold in eukaryotes [32, 33] and archaea , and a 8/5 fold in bacteria . The acceptor and T arms form the AT-stem, which has 13 bp in tRNASec, compared to 12 bp in the usual 7/5 structure in other tRNAs. It has an exceptionally long variable arm, even longer than those of type-2 tRNAs (e.g., tRNASer) . The D arm of tRNASec has a long D-stem, with 6 bp in eukaryotes and bacteria, and 7 bp in archaea [36, 37], and a 4 bp D-loop, in contrast to the 3–4 bp D-stem and 7–12 nt D-loop in the canonical tRNAs . Although SerRS recognizes both tRNASer and tRNASec, the unique structure of tRNASec is responsible for its specific interactions with PSTK [38, 39], SecS [33, 38] and EF-Sec in eukaryotes/archaea, and SelA  and SelB  in bacteria, discriminating tRNASec from tRNASer.
The residue 73 in tRNAs, referred to as the discriminator base, is essential for aminoacylation by the corresponding aminoacyl-tRNA synthetase . A guanine at this position (G73) is highly favored by SerRS . Although tRNASer possessing U73 have been observed in certain yeasts , tRNASec carries a G73 in the three domains of life, which plays a critical role for the serylation by SerRS [19, 46, 47]. In fact, any mutation at this position prevents the aminoacylation of tRNASec with serine . Structure-based studies in both archaea and human showed that the residue G73 is also involved in latter steps of Sec formation. In archaea, during tRNASec phosphorylation, G73 forms base-specific hydrogen bonds with conserved residues of PSTK . Those residues are essential for PSTK activity in vitro and in vivo [34, 49]. In human, the interaction of SecS with the acceptor arm of tRNASec involves base-specific hydrogen bonds between G73 and Arg398 . Those interactions would be prevented by the substitution of G73 for any other nucleotide(A, C or U) . In bacteria, the residues G1 and G73 in tRNASec interact with the C-terminal region of SelA. Deletion of SelA residues 423 and 424, localized in the region that contacts G73, produces inactive enzymes . The workflow of Secmarker includes the identification of the residue at position 73 in the tRNASec candidates, but this residue is not included in the models or used to score candidates.
Secmarker is available for online analysis at http://secmarker.crg.cat, and it can also be downloaded and run locally. Secmarker requires a local installation of the Infernal package  and the ViennaRNA package . The program analyzed ~4MB/s in a single CPU (Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz) with 12GB of memory. See Materials and Methods for details.
Unlike for the rest of tRNAs, it is possible to design a proper set for benchmarking predictions of tRNASec. This is because of the non-universality of Sec utilization trait and the absence of tRNASec in organisms without such trait. Thus, tRNASec predictions in selenoproteinless genomes are necessarily false positives, while lack of predictions in selenoprotein containing genomes correspond to false negatives. Analogous criteria cannot be employed for any other tRNA, since they are almost invariably present in the genomes of all organisms.
To design the benchmarking data set we used previous work in , where the presence of both selenoproteins and selenoprotein machinery factors was used to classify eukaryotic and bacterial genomes as either selenoprotein containing or selenoproteinless. This resulted in a set of 217 bacterial genomes (42 of which encode selenoproteins) and 212 eukaryotic genomes (105 of which encode selenoproteins). In addition, since archaea were not well represented in , we used Selenoprofiles  to scan 213 archaeal genomes for the presence of selD and EF-Sec, as well as selenoproteins. After manual curation of the results, we identified 14 genomes (6%) that use Sec. In total, therefore, our benchmark set included 642 sequenced genomes, of which 161 (25%) encoded selenoproteins (positive set) and 481 (75%) did not (negative set).
To evaluate the accuracy of tRNASec predictions at the genome level, we computed sensitivity, as the fraction of genomes from the positive set in which at least one tRNASec gene was predicted, and specificity as the fraction of genomes in the negative set in which no tRNASec was predicted. This benchmark, however, is not perfect at the individual prediction level; since the correct tRNASec loci are generally not known, true positives are overestimated and false positives underestimated, leading to overestimations of both sensitivity and specificity. Indeed the prediction of a wrong tRNASec locus in a selenoprotein encoding genome will be considered in our approach to be a true positive, when actually it is a false positive. This is partially alleviated by the fact that selenoprotein encoding genomes possess normally a single tRNASec locus . Thus, the number of tRNASec predicted genes in a given genome is also an indirect measure of specificity.
We ran aragorn (v1.2) , tRNAscan-SE (v1.23)  and Secmarker in our benchmark data set. Results are reported in Table 1. Using  as reference for selenoprotein containing genomes, Secmarker achieved globally both higher sensitivity than aragorn and tRNAscan-SE (99% vs 96% and 68%, respectively) and higher specificity (99% vs 83% and 89%, respectively), as well as within each domain/taxa considered (Table 1). Moreover, Secmarker predicted much fewer tRNASec candidates (1.7 on average in selenoprotein containing genomes) than tRNAscan-SE (47.1) and aragorn (20.0). Since most multiple tRNASec predictions in a given genome are likely to be false positives (see below), our measures of sensitivity and specificity actually underestimate the gap in performance between Secmarker and the other programs.
In addition to tRNAscan-SE and aragorn, we also used RF01852 (Rfam tRNA-Sec) with Infernal 1.1 . RF01852 achieved similar sensitivity than Secmarker, although the specificity was lower in prokaryotes and eukaryotes (Table 1 and S1 Text). It predicted 68% more tRNA-Sec genes than Secmarker, very likely to be false positives. In addition to having a superior performance, Secmarker has the advantage of identifying the domain to which the tRNASec encoding genome belongs (bacteria, archaea or eukaryota). This can be particularly useful in the analysis of metagenomic data, where generally there is no previous knowledge of the sequenced genomes.
Figs Figs2,2, ,33 and and44 summarize the tRNASec predictions obtained by the three programs in eukaryotes, bacteria and archaea (see S1 Text for details). At the genome level Secmarker produced only one apparently false negative prediction, and three apparent false postive predictions. Secmarker failed to predict tRNASec candidates in the genome of the selenoprotein containing protist Phytophthora capsici (Fig 2). Using Secmarker, however, on the raw sequence reads available for this genome, we identified a full length tRNASec gene (section 5 in S1 Text). Secmarker, thus, failed to predict it because the gene sequence is missing from the genome assembly analyzed here.
On the other hand, Secmarker predicted tRNASec genes in three genomes annotated in  as lacking selenoproteins: the eukaryote Phytophthora ramorum, and two bacteria from the genus Burkholderia. In all these cases, analysis of more recent assemblies indicated that these genomes encode selenoproteins, since we identified key genes for selenoprotein biosynthesis as well as selenoproteins themselves (S1 Text). Secmarker therefore correctly predicted tRNASec genes in these genomes. Evaluated at the genome level, therefore, Secmarker produces flawless predictions, and these are a perfect marker for selenoprotein containing genomes.
While there was good overall overlap between Secmarker, aragorn and tRNAscan-SE predictions in bacteria (Fig 3) and archaea (Fig 4), there were large discrepancies in eukaryotes (Fig 2). Both aragorn and tRNAscan-SE produced numerous false positive predictions in fungi and land plants, both known to lack selenoproteins . On the other hand, there was substantial overlap between gene predictions from aragorn and tRNAscan-SE in genomes with the Sec trait, that were not predicted by Secmarker (1,079 genes, Fig 2B). Even though these predictions were obtained from selenoprotein encoding genomes, we considered them very unlikely to be correct, because nearly all of them (99%) were predicted in just four genomes, those of Bos taurus (487), Ornithorhynchus anatinus (478), Loxodonta africana (50) and Danio rerio (21), and selenoprotein containing eukaryotic genomes are known to normally encode only one or very few tRNASec genes (see below).
In addition to the benchmark set, we ran Secmarker on the genome sequences available for 9,780 organisms. We initially predicted 3,341 tRNASec genes in 2,899 genomes (Table 2). The analysis of the Secmarker results revealed a number of insights on the biology, structure and evolution of tRNASec.
To assess the quality of the predictions at the individual level, we investigated the nucleotide present at the residue 73 of tRNASec candidates in the extended set of genomes. Across all analyzed genomes, the great majority of the tRNASec candidates predicted by Secmarker, 3,162 out of 3,341 (94.6%) contained the canonical guanine at position 73 (G73), as reflected in the multiple alignment of all the highest scoring Secmarker predicition in each genomes (Fig 5). In bacteria, following the G73, we observed a conserved CCA triplet, the universal 3’ end of mature tRNAs . The triplet is generally encoded in the genome in bacteria (93% of the tRNASec genes), but not in archaea (5%) and eukaryotes (3%), as previously observed for canonical tRNAs .
There were 178 tRNASec candidates in 125 genomes with a nucleotide different than a G in position 73. In 61 genomes, the non G73 candidate was either the sole prediction or the top scoring one. Nine such predictions were in vertebrate genomes (Monodelphis domestica, Haliaeetus albicilla, Opisthocomus hoazin, Fulmarus glacialis, Egretta garzetta, Tinamus guttatus, Cariama cristata, Struthio camelus and Phalacrocorax carbo). The remaining 52 were all in bacteria, and the analysis of the sequences led us to identify an unusual tRNASec structure (see next section).
The total length of the tRNASec acceptor stem plus T-stem is 13 bp (8+5 in bacteria  or 9+4 in archaea and eukaryotes ). Deviations from the bacterial 8+5 structure have been recently reported in  and . The former described tRNASec genes from Epsilonproteobacteria with 12 bp AT-stem plus one bulged nucleotide, and the latter described the Cloacimonetes type tRNASec, which has 12 bp (7+5) and lacks one nucleotide in the linker region between the acceptor stem and D-stem.
Among the 52 non G73 bacterial tRNASec identified in this study, detailed analysis revealed that 47 had a 12 bp AT-stem. Similar to the Cloacimonetes type , they had a 7 bp acceptor stem (7 residues between the T-stem and the discriminator base G73). Secmarker initially failed to correctly identify the G73 residue since it relies on the assumption of a 13 bp AT-stem, but their structural alignment actually revealed a conserved residue G73, and the CCA tail in some of them (S1 Fig). These tRNASec sequences were found in several genomes from Gammaproteobacteria, Clostridiales, Spirochaetes, in two species of Alphaproteobacteria, and in Rubrobacter xylanophilus DSM 9941 (Actinobacteria) and Dehalogenimonas lykanthroporepellens BL-DC-9 (Dehalococcoidetes), although not all tRNASec genes in these lineages exhibited the 7/5 fold. Most of these tRNASec had a bulged nucleotide in the acceptor stem, based on the inferred secondary structure (S1 and S2 Figs). The bulged nucleotide was observed in different positions (S2 Fig; columns A, B and C). Several tRNASec from Alphaproteobacteria and Gammaproteobacteria had an extra nucleotide in the linker region between the acceptor stem and D-stem (position 7a) while lacking the bulged nucleotide in the acceptor stem (S2 Fig; column D). R. xylanophilus DSM 9941 tRNASec lacked one nucleotide in the linker region between the acceptor stem and D-stem (S1 Fig). A common feature amongst most of the 12 bp AT-stem tRNASec was a bulged nucleotide in the anticodon stem (position 43a). Also, specific to Clostridiales, a bulged nucleotide in the D-stem (position 13a) was observed. The tRNA residues numbering was based in . The remaining five non G73 tRNASec bacterial top scoring candidates are shown in S1 Text.
In the genomes where these unusual tRNASec candidates were identified, we also predicted Sec-containing genes and the genes encoding the protein factors of the Sec machinery: selA, selB and selD. With few exceptions, tRNASec (selC) was found very close to selA and selB genes, forming a selABC operon (S2 Fig). Some of the genomes had two non-identical copies of tRNASec, which were located adjacent to each other in the same operon, in the case of four Clostridiales genomes, or in two different complete operons, in the case of Photobacterium profundum 3TCK (S2 Fig). Despite their unusual structure, these observations suggest that these tRNASec are indeed involved in Sec synthesis and incorporation.
After taking into account the 7/5 bacterial fold, only 14 tRNASec candidates among the 2,898 that were the sole or the top ranking prediction lacked a G at the position 73: the nine in vertebrates and the five in bacteria mentioned above. As these candidates had one or more disrupted pairs in their inferred secondary structure, they are most likely not functional–the functional tRNASec genes likely missing from the genome assemblies of these species–and they should, therefore, be considered Secmarker false positives. Furthermore, among the genomes (193) in which Secmarker predicted multiple tRNASec genes, there were 117 non G73 predictions. They scored much lower than the G73 predictions, irrespective of whether they were or not the top ranking prediction (S3 Fig).
From the analysis above, we conclude that G at position 73 is essential for tRNASec function. In total, Secmarker predicted 3213 G73 tRNASec genes in 2884 genomes.
Non G73 Secmarker predictions could partially reflect pseudogenization events. tRNASec pseudogenes have been previously described in rabbits, Chinese hamsters and humans [54, 57]. Here we investigated in detail the origin and evolutionary fate of the human tRNASec pseudogene. The human tRNASec was first identified as an opal suppressor gene  located on the chromosome 19 . Along with the gene, known as TRNAU1 (here named tRNASec1), a second copy was also identified  on chromosome 22 . The second human tRNASec, known as TRNAU2 (here named tRNASec2) presents features that suggest pseudogenization : it has a cytosine discriminator base, and several pairs in the acceptor stem are disrupted. Secmarker identified the two genes in their expected genomic locations in the genomes of hominids, but only tRNASec1 in the genome of other primates (Fig 6E). The homology between the two tRNASec is limited to the sequence of the mature tRNA, and several repetitive elements belonging to the ALU family are present surrounding tRNASec2 . These observations suggest that tRNASec2 originated by retrotransposition of tRNASec1 in the lineage of Apes, after the split with Nomascus and before the split of Pongo. As in human, tRNASec2 has a discriminator base cytosine in all analyzed hominids, whereas tRNASec1 has always guanine (Fig 6E). We searched for evidences of transcription of human tRNASec1 and tRNASec2. tRNA genes are transcribed by RNA polymerase (Pol) III . Pol III occupancy at tRNA loci, and importantly to their unique flanking regions, has been used to measure tRNA genes usage . We processed Pol III Chip-seq data performed on human liver samples from , and we found evidence of Pol III bound to tRNASec1 (Fig 6A), but not to tRNASec2 (Fig 6B), in two different samples. From all these observations, it would seem that tRNASec2 was “dead on arrival” after its origin by retrotransposition.
In general, it is assumed that tRNASec occurs as a single copy functional gene . Consistently, tRNASec knockout mice showed an early embryonic lethal phenotype . So far, only in one genome, that of zebrafish, two tRNASec genes have been reported , which are completely identical in sequence. However, even ignoring the non G73 predictions, we still found 151 genomes with two or more G73 tRNASec genes predicted by Secmarker (478 predictions in total). The score of the additional G73 copies was higher than the non G73 copies (S3 Fig; note that the residue at position 73 does not contribute to the Secmarker score of the predictions). This suggests that some of these could be functional. Analysis of these predictions, however, revealed that many of them are 100% identical in sequence, even when including the 100 bp flanking regions, suggesting artefacts in genome assembling. After discarding identical predictions, 376 candidates in 124 genomes remained. Detailed analysis on the 252 G73 copies (not including the top scoring candidate in these genomes) revealed that 145 predictions (80 genomes) had mutations, when compared to the top scoring one, that would disrupt the tRNA structure, and they are thus likely to be Secmarker false positives. Among the remaining predictions, one is likely to be a contaminant (a protist tRNA in a bird genome), and 69 predictions (46 genomes) did not have mutations or the mutations did not affect the pairing potential of the sequence. Interestingly, 27 predictions (21 genomes) had compensatory mutations when aligned to the top scoring candidate (Table 3). Many of these are likely to be“bona fide” tRNASec genes. While some genomes with multiple tRNASec genes have many selenoproteins, others have very few. Eighteen genomes (eight bacterial and ten eukaryotes) had two tRNASec genes (i.e., the top scoring one and an additional copy with compensating mutations), and three eukaryotes had four tRNASec genes. In the genomes of the common spider Parasteatoda tepidariorum and the lancelet Branchiostoma floridae, the compensating mutations in the duplicated copies were identical, suggesting that they occurred in the first duplicated copy before subsequent duplications. Strikingly, the genome of the diatom Fragilariopsis cylindrus had eleven non-identical predictions (taking into account the 100 nt flanking sequence). Two of them had a mutation that would disrupt the tRNA structure. Among the remaining nine, three showed compensatory mutations when compared to the top scoring one. Two of the duplicated copies showed the same compensatory mutations (S4 Fig).
We did not find any correlation between the number of tRNASec in genomes with multiple candidates and the number of selenoproteins.
Aiming to gain insights into the evolution of the Sec encoding trait, we searched for the rest of the Sec machinery and selenoprotein genes in all genomes using Selenoprofiles . Our results in prokaryotes were overall consistent with previous reports. Across genomes, the presence of tRNASec correlated well with the presence of selenoproteins and selenoprotein machinery. The most common exceptions were genomes with a SelD gene (selenophosphate synthetase) but no tRNASec or selenoproteins, consistent with SelD supporting Se utilization traits other than Sec [5, 6]. Our search also identified four selenoprotein genes in two archaeal assemblies (the Euryarchaeota strains SCGC AAA261-G15 and SCGC AAA288-E19) without predicted tRNASec; however all four genes had a bacterial SECIS (identified using ), thus very likely reflecting bacterial contamination in the assemblies.
We also detected three bacterial genomes with SelD and tRNASec, but without selenoprotein predictions from any known family. Although this may be caused by incomplete assemblies, it may suggest that these organisms use yet undiscovered selenoproteins. The three genomes (Paenibacillus vortex V453 and the two strains Brachyspira hampsonii 30446 and 30599) were analyzed with a custom procedure to identify TGA-containing open reading frames (ORF) (Materials and Methods). The analysis revealed a putative novel selenoprotein in the B. Hampsonii genomes. The candidate selenoprotein is a small protein that has a thioredoxin domain (PF13192; “Thioredoxin 3”) with a short 5’ extension that contains a conserved Cys/Sec residue (Fig 7A). The Cys-containing homologues identified are annotated as “Redox-active disulfide protein 2”. We found this novel selenoprotein in all other Brachyspira genomes analyzed, which, in contrast to B. Hampsonii, we identified other selenoprotein families. All genomes had three genes from this protein family: a Cys-containing homologue and two selenoproteins. The three genes were always found forming a gene cluster (Fig 7B). The two putative selenoproteins had good candidate bacterial SECIS downstream their TGA codon (Fig 7C). One of the two selenoproteins (“Sec.1” in Fig 7A) lacked the redox-active motif (CXXC) in the thioredoxin domain (columns 61–64 in Fig 7A). Proteins from the “Redox-active disulfide protein 2” family are classified as oxidoreductases acting on a sulfur group of donors. A search in STRING database  revealed that the genes from this protein family commonly neighbour genes from other selenoprotein families such as thioredoxin reductases, alkyl hydrogen peroxide reductase, peroxiredoxins, and other oxidoreductases.
Results in eukaryotes are summarized in S8 Fig: in the genomes analyzed, tRNASec correlated almost perfectly with the presence of Sec machinery factor EF-Sec and selenoprotein genes.
Most known metazoans encode selenoproteins with the exception of parasitic plant nematodes , and several insect orders, in which multiple Sec loss events have been described [6, 8, 9]. The analysis of the Secmarker predictions, however, provided a picture of much increased resolution of the distribution and evolution of the Sec trait in insects, and arthropods in general. Selenoproteins have been reported to be lost in Lepidoptera and Hymenoptera (i.e., no known species in these orders encode selenoproteins), and consistently, we did not find any other species from these orders encoding selenoproteins. Coleoptera were also assumed to entirely lack selenoproteins; however, we did find two coleopterans that encode selenoproteins. Selenoprotein losses have also been reported in some, but not all, Diptera and Paraneoptera species. Here we also found selenoproteinless species in Trichoptera and Strepsiptera. Finally, no arthropod outside insects have so far been reported to lack selenoproteins. Here, we report the genomes of two arachnids that lack selenoproteins. We next describe in additional detail these results (summarized in S9 Fig).
We did not find tRNASec, nor other Sec machinery factors, nor selenoproteins in the genome of the trichopteran Limnephilus lunatus (S9 Fig). Since Trichoptera is a sister group to Lepidoptera , our data suggest that selenoproteins could have been lost in the common ancestor of Trichoptera and Lepidoptera. Similarly, we did not find selenoproteins nor Sec machinery factors in the genome of Mengenilla moldrzyki (order Strepsiptera). Since all coleopterans analyzed to date lacked selenoproteins, it was assumed that a Sec loss event occurred at the root of the lineage [6, 8, 9]. However, we identified here two coleopterans with tRNASec, selenoproteins and a complete Sec machinery (S9 Fig). The genome of Onthophagus taurus contained two selenoprotein genes (SPS2 and SelK), and Nicrophorus vespilloides contained a SPS2 selenoprotein gene. All three genes have good candidate SECIS. From the phylogenetic topology of the available genomes from Coleoptera, based on , and from the phylogenetic location of the selenoprotein containing genomes, we infer that multiple independent Sec extinctions occurred in Coleoptera: in Cucujiformia (previously reported [6, 8, 9]), in the lineage leading to Agrilus planipennis (Elateriformia), and the lineage leading to Priacma serrata (Archostemata).
Outside insects, the genomes of the arachnids Dermatophagoides farinae and Sarcoptes scabiei also lacked selenoproteins and the Sec machinery factors (S9 Fig). These two species belong to Acari, a taxon of non-insect arthropods that include bulbs and mites, and they are the only two sequenced representatives from the order Astigmata (mites). Unlike selenoproteinless insects, these two genomes do not have a SPS1 gene, the non-selenoprotein paralogue of SPS2. SPS1 was predicted to emerge by gene duplication at the root of insects, as well as in other lineages independently . In Astigmata it appears that SPS2 was lost without prior duplication to generate SPS1, analogously to the situation in selenoproteinless nematodes . These are the two first non-insect arthropod genomes reported to have lost selenoproteins.
Among the genomes with more than one bona fide tRNASec predictions is that of the crustacean Daphnia pulex (common water flea), in which we identified two copies. Strikingly, the two copies contain introns. Although introns are not rare in canonical tRNAs, only a single case has been reported for tRNASec. This was recently found in Lokiarchaeota , using Secmarker. Eukaryotic tRNA introns are generally short (14–60 nucleotides), and invariably interrupt the C-loop one base 3’ to the anticodon . The introns in the two D. pulex tRNASec genes are 25 and 16 nucleotides long, and are located in the expected position (S5 Fig). Both genes have a G in position 73. The sequences of the mature tRNAs differ only in two positions. Notably, these positions map to the T arm, and are predicted to form pairs in both genes. The presence of two mutations in the residues that form a pair suggest that a compensatory mutation occurred to maintain the integrity of the structure of the tRNA. However unusual, this strongly suggests that D. pulex possesses two functional copies of tRNASec, and that both have an intron.
In spite of the low number of archaeal selenoprotein containing genomes analyzed, our results strongly support that tRNASec in archaea has generally a 7 bp D-stem, one base pair longer than eukaryotes and bacteria, as reported by  after analyzing a smaller set of genomes. We observed the 7 bp D-stem in the 19 Methanococcales analyzed here. The only exception, with a canonical 6 bp D-stem, was Methanopyrus kandleri (S6 Fig) as already noted in . The selenocysteine machinery in Lokiarchaeota, the most recently identified Sec-containing lineage in archaea, includes a tRNASec with a 7 bp D-stem and an intron in the T arm .
We evaluated the conservation of the tRNASec structure across eukaryotes. We used the program R-chie  to analyze the structural alignment containing the top scoring predictions in the benchmark set. The alignment largely supports the eukaryotic tRNASec structural model [32, 33], showing covariation of nucleotide pairs (i.e., variation of the two nucleotides that form a pair keeping the canonical base pairing) in all tRNA arms. The V arm showed the highest level of variability, and the anticodon arm, the lowest (Fig 8). Based on a larger alignment including the 553 eukaryotic top scoring G73 tRNASec candidates, there were only six positions, besides the anticodon triplet and the residue 73, 100% conserved across all species: G18 and G19 in the D-loop, U33 in the anticodon loop, U55 in the T-loop, C61 in the T-stem and C66 in the acceptor stem. Overall conservation, measured as the average of the conservation at each position, was higher in unpaired residues in loops and the linker region between acceptor and D arms (92%) than in paired residues in the stems (82%).
A remarkable finding was recently reported in , where the authors described bacterial organisms that code for Sec with codons other than UGA. In these species, tRNASec has an anticodon different than UCA, and accordingly, there are selenoprotein genes carrying a matching codon at the Sec site. We identified three such tRNAs in our set of prokaryotic genomes. The genomes belonged to the Geodermatophilaceae family, and, as reported in , their tRNASec had the anticodon CUA. Secmarker correctly identified these tRNASec variants. We used Selenoprofiles  to predict selenoprotein genes in those three genomes, and in addition to the formate dehydrogenases (FDHs) and UGSC-motif selenoproteins reported in , we identified a gene encoding an alkyl hydroperoxide reductase (AhpC) selenoprotein with a Sec-TAG codon in the genome of Blastococcus saxobsidens DD2 (S7 Fig).
Prediction of tRNASec has never received wide attention, possibly because of the low number of selenoprotein genes. Thus, while general purpose tRNA detection methods, such as tRNAscan-SE and aragorn have been thoroughly benchmarked for canonical tRNAs, this is not the case for tRNASec predictions–the tRNAscan-SE authors explicitly citing as a reason the low number of tRNASec sequences available . Indeed, among the more than 12,000 tRNA genes in tRNAdb , only 46 correspond to tRNASec.
Here, we built on the unique structural features of tRNASec to create covariance models that allow Secmarker to identify tRNASec genes with great accuracy. In addition to the intrinsic biological interest of refining the tRNASec structural features and improving tRNASec predictions, thus contributing to better genome annotations, accurate prediction of tRNASec genes has the additional benefit of serving as marker of Sec utilization and selenoprotein encoding capacity in genomes. Since annotation of selenoprotein genes requires dedicated effort, pre-scanning the genome with Secmarker, which is reasonably fast (~4 Mb/s), helps to allocate this effort only when needed.
Because, unlike the rest of amino acids, which are present in virtually all living species, Sec is only present in species encoding selenoproteins (to date about one quarter of all species with sequenced genomes), we were able to design a reliable benchmark for tRNASec predictions. Indeed, tRNASec predictions in selenoproteinless genomes are necessarily false positives, while lack of predictions in selenoprotein containing genomes denote false negatives. No equivalent benchmark can be implemented to evaluate predictions of tRNAs for other amino acids. As a marker of Sec utilization, Secmarker performs flawlessly; in our benchmark set, it predicted tRNASec genes in all genomes encoding selenoproteins, and it did not produce predictions in any of the genomes lacking them. In contrast, tRNAscan-SE and aragorn failed to produce predictions in genomes known to encode selenoproteins, while producing predictions in genomes known to lack them.
This accuracy at the “genome level” is only an approximation, however, to the real accuracy of tRNASec prediction programs. Indeed, a tRNASec prediction in a selenoprotein containing genomes, while accurate as a marker of Sec utilization, could actually be a false positive if the wrong locus (or loci) are predicted, leading also to a false negative if, in addition, the correct tRNASec is not predicted. This is often the case for aragorn and tRNAscan-SE. For instance, Secmarker failed to predict tRNASec in the selenoprotein containing genome of P. capsici because the tRNASec gene is missing from the current assembly, as revealed by the analysis of the raw reads available for this genome. However, aragorn predicted tRNASec candidates, and, as markers of Sec utilization, they would be considered correct in our benchmark. However, manual inspection of the candidates revealed that these predictions do not possess the features of bona fide tRNASec. In fact, the secondary structure of the two candidates predicted by aragorn in P. capsici did not fit the tRNASec model (S1 Text).
Evaluating the accuracy of the programs at the gene level is, however, challenging, since for most genomes we do not know the functional tRNASec genes. Nevertheless, our results strongly suggest that Secmarker has a much lower false positive rate than tRNAscan-SE and aragorn. First, the average tRNAscan-SE genes predicted per genome is 1.7 for Secmarker, 20 for aragorn and 47 for tRNAscan-SE. Since, with a few exceptions, genomes encode at the most one single tRNASec gene, the majority of tRNASec aragorn and tRNAscan-SE predictions are actually false positives. Secmarker can also produce false positive predictions. We can attempt to estimate their ratio from the analysis of the Secmarker results in the full set of genomes. Ignoring non G73 predictions, that can be trivially filtered out, Secmarker predicted 154 tRNASec candidates in 80 genomes (the 145 mentioned in Results plus 9 identical copies reported by Secmarker in those 80 genomes), with mutations destabilizing the tRNASec structure when compared to the top scoring prediction in the same genome. Thus, we estimated the lower boundary for the Secmarker false positive ratio to be less than 5% (154 out 3213 total G73 predictions). We do not believe this lower boundary to depart too much from the actual false positive ratio, since Secmarker most often predicts a single tRNASec gene in selenoprotein containing genomes. We believe the false negative ratio (i.e., the failure of Secmarker to predict the actual tRNASec gene) to be negligible, since analysis of the selenoprotein containing genomes from the benchmarking set in which Secmarker failed to predict a tRNASec gene revealed in all cases that the gene was missing from the analyzed genome assembly.
The accurate predictions of tRNASec by Secmarker allowed us to reclassify a number of genomes thought to lack selenoproteins, as selenoprotein containing instead, as well as to re-evaluate the phylogenetic distribution of selenoprotein encoding genomes within insects. Thus, we identified two novel selenoproteinless insect orders, Trichoptera and Strepsiptera. Conversely, we found selenoproteins in two coleopterans, which were previously assumed to lack selenoproteins. We also found two selenoproteinless arachnid species, revealing the first selenoprotein extinction observed in non-insect arthropods. Secmarker predictions also led to the identification of a novel bacterial selenoprotein family. Finally, they allowed us to consolidate recent findings, as well as to produce novel insights, about tRNASec. Thus, our results support the tRNASec archaeal fold, initially proposed based on a few sequences , and help to refine the novel bacterial fold recently reported . In addition, we have traced the evolutionary history of the duplication and pseudogenization of tRNASec occurred at the root of hominids, and report two intron containing tRNASec genes occurring in Daphnia–the first eukaryotic intron-containing tRNASec reported. Finally, in contrast to previous reports, we have identified a number of genomes that contain multiple tRNASec copies likely to be functional, since they exhibit compensating mutations. Notably, we identified three eukaryotic genomes with four non-identical tRNASec copies with compensating mutations. Since these genomes are phylogenetically diverse (the common house spider, a diatom and a lancelet), the duplicated tRNASec are likely to have independent origins. Their biological significance is unclear, since the genomes of these organisms do not encode particularly large numbers of selenoproteins compared to the genomes of organisms from the same taxa.
tRNAs with a non-canonical structure can be responsible for alterations in the universal genetic code (e.g., selenocysteine  and pyrrolysine ), but they are likely to be missed or misannotated. Recent studies have identified novel uncommon tRNA structures [72, 73], revealing additional complexity in the genetic code. The use of dedicated tools, as we shown here, can be useful for the proper identification and annotation of non-canonical tRNAs.
In summary, we described here the development and validation of Secmarker, a tool to predict tRNASec. The analysis of its predictions across thousands of genomes revealed a number of insights, ultimately contributing to our understanding of tRNASec and selenoproteins–one of the most fascinating class of proteins.
Secmarker is a novel tRNASec detection pipeline based on covariance models (CM). It includes three manually curated CMs for tRNASec. Each model corresponds to a domain of life (archaea, bacteria and eukaryotes) and incorporates its characteristic structural features. The program scans a nucleotide sequence with the three models using cmsearch from the Infernal package (v1.1.1) . After processing and filtering the hits by Infernal, the program produces a graphical output showing the tRNASec secondary structure (Fig 1).
Secmarker is available for online analysis at http://secmarker.crg.cat (Fig 9). The web server accepts sequences up to 100Mb, and runs at a search speed of ~4 Mb/s. After processing and filtering the candidates produced by Infernal, the program identifies their discriminator base and produces a graphical output showing the tRNASec cloverleaf secondary structure. The program can also be downloaded, installed, and run locally. Secmarker is written in python and requires a local installation of the Infernal package  (version 1.1.1, available at http://eddylab.org/infernal/) and the ViennaRNA package  (tested on version 2.1, available at http://www.tbi.univie.ac.at/RNA). Secmarker has been tested on python 2.6.6 and 2.7.10.
CMs are ‘a specialized type of stochastic context-free grammar’ . Infernal  can be used to build a CM from a multiple nucleotide sequence alignment with structural annotation. The sequences used to build the three models were obtained from the Rfam database  (RF01852, tRNASec). Here, it is important to mention that Rfam provides a single model for tRNASec. However, given the structural differences of tRNASec between the three domains of life, we built three independent, domain-specific models. In order to build the models, first, the Rfam tRNASec sequences were downloaded and clustered according to their taxonomic domain, using the species identifier. Then, tRNAscan-SE  was used to filter out sequences that did not match the eukaryotic or prokaryotic models, according to tRNAscan-SE labels “SeC(e)” and “SeC(p)”, respectively. With the remaining sequences, a recursive procedure using RNAfold from the Vienna package , and cmalign and cmbuild from the Infernal package, was designed to iteratively align the sequences based on their predicted structure. Finally, sequences with an anticodon different than UCA were discarded. The alignments used to build the models with cmbuild contained 10, 140 and 251 sequences for archaea, eukaryota and bacteria, respectively. The alignments and covariance models used by Secmarker are provided in S1 File.
The target nucleotide sequence is scanned with the three CMs using cmsearch , as first step. The default bit score cut-off for cmsearch is 40, but this can be set by the user using the -T option. This threshold was set upon confirmation that cmsearch did not miss any true positive in the benchmark set. Often, the same locus is identified by more than one model. Overlapping hits are thus removed, keeping for each locus only the hit with the highest bit score. The resulting hits are processed to identify the anticodon triplet, the boundaries of each tRNA arm and the position 73 (see next section). By default any anticodon is accepted, although hits with a anticodon different than UCA are filtered through a more stringent bit score threshold (55). The final candidates are filtered through a custom procedure designed to identify the most common false positives: hits with shorter or missing arms. The tRNASec candidates in the output are labeled according to the model (eukaryotic, archaeal or bacterial) with the highest bit score by cmsearch.
Secmarker runs a procedure to identify the position 73, the discriminator base, in tRNASec, exploiting the length of 13 nt of the AT-stem in this family of tRNAs. This position is not included in our models, so it is not considered in the search phase. In order to identify the position 73, the program first identifies the position 61 (numbering based on ), and then retrieves the 14th base 3’ from that position, if that nucleotide is present in the input sequence. Since the total length of the sequence predicted by Infernal at the search phase, could vary according to the number of pairs in the acceptor arm, this procedure is independent of the number of pairs in the acceptor stem.
Secmarker produces a graphical output representing the secondary structure of the predicted tRNASec genes (Fig 1). The tRNA structure is represented in its cloverleaf form, with the different nucleotide pairs colored according to the arm. Wobble pairs (GU and UG) are indicated with a faint color. The nucleotides in the anticodon triplet (normally UCA) are circled. The discriminator base, is also circled, if detected. The graphical output can be activated using the flag -plot, which by default is off. In order to produce the graphical output, Secmarker requires a local installation of the program RNAplot from the Vienna package (tested on version 2.1.1) .
In order to test the prediction of tRNASec, we used a set of 641 sequenced genomes (212 eukaryotes, 217 bacteria and 212 archaea). We had previously analyzed the bacterial and eukaryotic organisms in this set for the presence of the Sec utilization trait and selenoproteins . The set of archaeal genomes not analyzed in , was obtained from NCBI. Sec utilization was predicted in these species based on the presence of the genes for selD and EF-sec, which were annotated using Selenoprofiles .
Secmarker, aragorn v1.2 , tRNAscan-SE v1.23  and Infernal 1.1  with RF01852 (Rfam tRNA-Sec) were used to predict tRNASec in the genomic sequences. Aragorn was executed using the -t flag (predict tRNA only). For prokaryotic sequences, tRNAscan-SE was executed with -B flag. The single tRNA-Sec model RF01852 was used using the parameters recommended in the curation page (http://rfam.xfam.org/family/RF01852#tabview=tab9), ‘cmsearch –nohmmonly -T 25.39’. The results were then parsed to exclude those hits with score lower than 47.0 (“gathering threshold”). Bacterial tRNASec genes predicted in eukaryotic genomes were assumed to originate from bacterial contamination in the eukaryotic genome assemblies. We could filter out such cases from the output of tRNAscan-SE and Secmarker. All programs were executed in a SGE distributed cluster using a single cpu with 12Gb of memory available.
We implemented a procedure to identify TGA-containing ORFs in prokaryotic genomes. The procedure was based on the modification of an existing annotation of protein coding genes. The genes included in the annotation were extended at both ends, using the same frame of translation, up to a stop codon different than TGA. All in-frame TGA codons were included in the extensions. The amino acid sequence of the TGA-containing ORFs were analyzed for sequence conservation using Blastp  against the protein database UniRef90 from UniProtKB. Since all selenoprotein families have Cys-containing homologues (non-selenoprotein genes with a Cys residue at the homologous Sec position), we expected any selenoprotein gene to show TGA/Cys pairs in the Blast alignments. We parsed the Blast outputs and selected those ORFs that produced three or more hits with a TGA aligned to a Cys residue. The selected ORFs were analyzed further. For each ORF, a profile alignment, containing the TGA-containing sequence and the Cys homologues identified by Blast, was build and used to scan a set of bacterial genomes with Selenoprofiles .
Pol III chip-seq data analyzed in this work was produced in . We downloaded the fastq files corresponding to human liver samples (ERR039133 and ERR039141) from ArrayExpress (https://www.ebi.ac.uk/arrayexpress/E-MTAB-958; accession: E-MTAB-958;). The fastq files were processed using our in-house chip-seq pipeline (https://github.com/guigolab/chip-nf). Phytophthora capsici Genome Sequencing Illumina HiSeq 2000 reads were downloaded from NCBI SRA (http://www.ncbi.nlm.nih.gov/sra), accessions: SRR943799 and SRR945695, and analyzed with Secmarker. The following reads contained the full sequence of a eukaryotic tRNASec gene: SRR943799.568178, SRR943799.262468, SRR943799.84635, SRR945695.19108665, SRR945695.14526540, SRR945695.2975118.
The alignment contains 52 tRNASec sequences identified in this study, including the 47 top scoring candidates plus five gene copies (indicated with a star), with an unusually short 7 bp acceptor stem. The acceptor stem is delimited by the T-stem (brown) and the residue G73 (the 4th residue from the right), and has 7 pairs (grey) in all sequences. Positions where bulged nucleotides can be observed are numbered in red on top of the alignment. The nucleotides numbering is based in . The sequences were aligned using Infernal  and visualized with RALEE . RALEE highlights the nucleotides that are paired according to the consensus secondary structure (second line from the bottom, SS_cons) of the alignment, and that also respect the standard pairing rules.
Inferred secondary structure of bacterial tRNASec candidates. The structures have a 7 bp acceptor stem (one pair shorter than the canonical bacterial tRNASec) and show a bulged nucleotide in different positions in the acceptor stem. They are classified in four types (columns A-D) according to the bulged nucleotide in the acceptor arm: (A) position 3a, (B) 4a, and (C) 5a; (D) has an extra nucleotide in position 7a, in the linker region between the acceptor stem and D-stem. Other bulged nucleotides are also indicated with red numbers. Numbering based on . Genes selA, selB and selD were often found in proximity to tRNASec, and are shown above the corresponding structure.
Distribution of scores obtained in non-identical tRNASec predictions (3,226) for the top scoring candidates (“top”) and for the multiple copies (“copies”). The predictions were split according to the residue in position 73 into the following categories: G73, non-G73 and G73CM (copies with G73 and with compensatory mutations when compared to the top scoring one).
(A) Multiple sequence alignment of bacterial AhpC proteins. The selenocysteine residue (red) in B. Saxobsidens DD2 (top) corresponds to a UAG codon in the genome sequence. (B) The AhpC UAG-Sec codon (underlined in red) followed by a bSECIS secondary structure, predicted with RNAfold . (C)The tRNASec in B. Saxobsidens has a CUA anticodon, complementary to the UAG codon. Protein identifiers: Sphaerobacter thermophilus D1CAV3_SPHTD, Xanthobacter autotrophicus A7IJH6_XANP2, Ktedonobacter racemifer D6TT72_9CHLR, Rhodopirellula sallentina M5U546_9PLAN, Hirschia baltica C6XML7_HIRBI.
Structural alignment of the two intron-containing tRNASec genes identified in this study, and the cloverleaf structure (including the longest intron). The boundaries of the introns are indicated by the dashed lines. The rightmost position of the alignment corresponds to the discriminator base. The sequences were aligned using Infernal  and visualized with RALEE . See S1 Fig caption for RALEE coloring scheme.
The 20 archaeal tRNASec sequences identified in this study are included. Note the 7 bp D-stem (light blue) in all sequences, with the exception M. kandleri. The sequences were aligned using Infernal , and visualized with RALEE . See S1 Fig caption for RALEE coloring scheme.
The eleven tRNASec candidate sequences in the F. cylindrus genome, including the 100 nt in the flanking regions, are shown. The tRNA boundaries correspond to the positions 101–187. The secondary structure is represented below the tRNA region. Five of the sequences (6, 7, 10, 11 and 9) exhibit compensatory mutations (green) compared to the top scoring candidate (1, top), although two of them (11 and 9) have a mutation that produces a mismatch in one of the pairs (red). The remaining mutations (white) would not affect the pairing potential of the sequence.
Sunburst diagram showing the eukaryotic genomes in our set. The presence of tRNASec (black dot) and EF-Sec (white dot) genes is indicated in the terminal nodes, and the number of selenoproteins is indicated by a black bar. The length of the bar is proportional to the number of selenoproteins. Some nodes were collapsed based on the presence of tRNASec. Those nodes include a number in parentheses, indicating the number of species collapsed. In the collapsed nodes, the average number of selenoproteins was computed, and a white dot indicates that all species contain EF-Sec genes. The phylogeny was obtained from NCBI taxonomy.
Species tree including a subset of the arthropod genomes analyzed in this work. The shaded boxes indicate known (dark grey) and novel (red) Sec extinctions. Each species is annotated with the presence of tRNASec, the protein factors of the Sec machinery (including the selenoprotein SPS2-Sec), and SPS1 genes. SPS1-Arg corresponds to SPS genes with an arginine codon at the homologous Sec position, and SPS1-rt corresponds to SPS genes with a UGA codon, in which a readthrough event occurs but the inserted amino acid is not known (see ). The black horizontal bar indicates the number of selenoproteins. The topology of the Coleoptera lineage was adapted according to .
The file S1_File.tar contains the three alignments (eukaryota.stk, bacteria.stk and archaea.stk), the tRNAsec_db.stk file (the concatenation of the three alignments), and the tRNAsec_db.cm file (the infernal cmbuild model). tRNAsec_db.cm was generated with infernal 1.1.1 using the following parameters: “cmbuild --hand tRNAsec_db.cm tRNAsec_db.stk” and “cmcalibrate tRNAsec_db.cm”.
We thank Emilio Palumbo for technical support with the chip-nf pipeline. We also thank Romina Garrido for administrative support.
This work was funded by the Ministry of Economy and Competitiveness (MINECO) under the grant number BIO2011-26205. We acknowledge support of the Spanish Ministry of Economy and Competitiveness, ‘Centro de Excelencia Severo Ochoa 2013–2017’ SEV-2012-0208 and also the support of the Agency for the Research Centres of Catalonia CERCA Programme / Generalitat de Catalunya. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
All relevant data are within the paper and its Supporting Information files.