Nested genes represent an intriguing form of non-random genomic organization in which the boundaries of one gene are fully contained within another, longer host gene. The C. elegans genome contains over 10,000 nested genes, 92% of which are ncRNAs, which occur inside 16% of the protein coding gene complement. Host genes are longer than non-host coding genes, owing to their longer and more numerous introns. Indel alleles are available for nearly all of these host genes that simultaneously alter the nested gene, raising the possibility of nested gene disruption contributing to phenotypes that might be attributed to the host gene. Such dual-knockouts could represent a source of misinterpretation about host gene function. Dual-knockouts might also provide a novel source of synthetic phenotypes that reveal the functional effects of ncRNA genes, whereby the host gene disruption acts as a perturbed genetic background to help unmask ncRNA phenotypes.
Although many individual labs now routinely perform whole-genome sequencing, understanding how genes interact within genetic pathways and networks remains an important challenge and often represents a limiting factor in genomics. A straightforward way to learn about gene function is to examine the phenotypic effect of ablating a given gene. Forward genetic screens, in which random mutations are selected for a given phenotype, and reverse genetic screens, in which phenotypes are scored for mutations generated in genes of interest, are powerful but laborious approaches. Fifteen years after the completion of the Caenorhabditis elegans genome sequence, only ~15% of the protein coding genes have an allele with an associated phenotype.1 Several large-scale projects provide the worm community with genetic resources to facilitate the investigation of gene function by identifying or generating mutations which can be subsequently introduced in a given genetic background. The C. elegans Deletion Mutant Consortium has generated deletion strains for 6,013 genes.2 The Million Mutation Project has created ~2,000 mutant strains carrying over 800,000 genetic alterations.3 In addition, whole-genome sequencing of wild isolates provides a rich catalog of naturally occurring allelic variants in C. elegans 3,4 and in its relative C. briggsae.5
Despite these great efforts, the non-random organization of genes in genomes creates practical complications for mapping phenotypes to single genes. Examples of non-random gene organization include clustering of co-expressed genes, a high proportion of genes within operons, and differential gene densities along chromosomes coinciding with variation in recombination rates.6,7 An especially intriguing and common gene arrangement is when a “nested” gene is located within another “host” gene.8 Interestingly, in these gene structures, nested and host genes often display weak or negative expression correlation, perhaps because of selection against transcriptional interference.9-11 Most nested genes are small non-coding RNAs (ncRNAs)12,13 for which persistence inside host genes appears to be inversely proportional to the ncRNA family size.14 The most widely studied ncRNAs, the microRNAs (miRNAs), are particularly abundant in nested arrangements with ~30% of plant and animal miRNAs located in introns of protein coding genes.15
Nested structures pose a challenge for gene knockout studies in which the nested gene may be mistakenly altered along with the focal host gene (or vice versa), making it difficult to ascribe an eventual phenotype to a single gene product. This idea was put into sharp relief by the analysis of gene traps in mouse and the finding that ~200 miRNAs may have concomitantly been misregulated.16 The functional analysis of miRNAs themselves is complicated by the generation of multiple regulatory small RNAs from a single miRNA gene, and because the different miRNA forms have the potential to bind to the same target genes.17 Here, we extend the notion that nested genes may be inadvertently disrupted along with their host gene in the worm genome and identify such potential variants. We also identify phenotype-causing alleles for which the coding sequence of host genes has been altered along with the sequence of their nested gene, raising caution for the interpretation of these phenotypes.
Using the genomic coordinates of 46,734 C. elegans genes annotated in WormBase WS248, we identified 10,638 genes that are fully contained in 3,252 host protein coding genes (15.95% of the 20,391 protein coding genes). Of these nested genes, 9,076 genes are intronic, 636 genes are located within a coding exon and 926 genes are within the boundaries of the host protein coding gene but are neither fully contained within an intron nor within a coding exon. The majority of host genes (56%) have only 1 nested gene, and up to 146 genes are nested within a host. Host genes tend to be longer than non-host genes and have both more introns and longer introns (Fig. 1), with 19,854 protein coding genes representing candidate hosts by virtue of having at least one intron. Only 608 protein coding genes are nested within a host. In a few instances, the nesting relationships span multiple layers in a way analogous to Russian matryoshka dolls, with 28 protein coding genes that are both nested and host, and 46 nested genes that are embedded in more than one host gene (Fig. 2). Nested protein coding genes are relatively short, on average 1.6 kb long, with a median of 3 introns per gene. A greater proportion of nested genes are intronless (ratio of intronless genes over genes with introns = 0.086) relative to host genes (ratio = 0.003, χ² = 186.74, P < 0.0001) and non-host genes (ratio = 0.030, χ² = 46.03, P < 0.0001).
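The containment test behind these counts can be illustrated with a minimal sketch. It assumes genes are reduced to simple records with 1-based inclusive coordinates, ignores strand, and treats a gene with coordinates identical to its host as nested; none of these choices is stated in the text, and the `Gene` record and `find_nested` function are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    start: int    # 1-based, inclusive (assumed convention)
    end: int      # 1-based, inclusive
    biotype: str  # e.g. "protein_coding", "piRNA", "ncRNA"


def find_nested(genes):
    """Map each protein coding host to the genes lying fully within it.

    Strand is ignored (an assumption). The pairwise scan is quadratic
    per chromosome, which keeps the sketch short; a sorted sweep or an
    interval tree would be preferable for a whole genome.
    """
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[gene.chrom].append(gene)

    nested = defaultdict(list)
    for chrom_genes in by_chrom.values():
        hosts = [g for g in chrom_genes if g.biotype == "protein_coding"]
        for host in hosts:
            for gene in chrom_genes:
                if gene.gene_id == host.gene_id:
                    continue
                # Full containment: both boundaries fall inside the host.
                if host.start <= gene.start and gene.end <= host.end:
                    nested[host.gene_id].append(gene.gene_id)
    return dict(nested)
```

Because each host is tested independently, a gene embedded in two overlapping or stacked hosts is reported under both, which is how the multi-layer arrangements described above would surface in such a scan.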
Most of the nested genes (92.25%) are non-coding RNAs, with piRNAs being the most abundant class (Table 1). The majority of nested genes are located on chromosome IV (60%), with chromosomes I, II and III each carrying 6–7% of nested genes, and chromosomes V and X each carrying ~10% of nested gene structures. piRNAs cluster in 2 regions of chromosome IV comprising ~7 Mb of sequence,18 which corresponds to the high density of nested piRNA genes in the genome (Fig. 3). Protein coding genes in the mitochondrial genome are intronless and do not have nested genes.
We then compared the collection of curated variants in WormBase to gene nestedness. We analyzed 106,511 insertions and deletions (indels) that include lab-generated mutations and natural polymorphisms in wild isolates from the Million Mutation Project and the Gene Knockout Consortium. We focused on indels because the majority of nested genes are non-coding RNAs located in introns of protein-coding genes, and so single point mutations and single nucleotide polymorphisms (SNPs) are unlikely to influence the functions of both partners in the nested arrangement. In contrast, we hypothesized that indels could alter the expression of the nested genes through complete or partial duplication or deletion. We cross-referenced the positions of host genes and indels and identified 3,400 variants overlapping with the coding sequence of 3,227 host genes and with the sequence of 10,596 nested genes. Thus, only 25 host genes and 42 nested genes lacked variants affecting both members of the nested-host pair. However, we excluded 2,844 large variants with boundaries falling beyond the positions of the host genes and potentially affecting neighboring genes either directly or indirectly through regulatory sequences, because we are interested in mutations that seemingly alter a focal protein coding gene but that also unintentionally disrupt its nested gene. This procedure yielded a total of 556 variants affecting the coding sequence of 436 host genes along with the sequence of 794 nested genes (Table S1). Consequently, use of these variants to investigate the function of the host protein coding genes presents the risk that any phenotype may be confounded by the alteration of the nested genes. To provide a list of alleles as experimental alternatives, we identified 408 indels that affect only the coding sequence of 199 of these 436 host genes such that they do not simultaneously alter their nested genes (Table S2).
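The filtering logic described above can be sketched as a small classifier for a single indel against a single host gene. This is an illustration under stated assumptions rather than the pipeline actually used: coordinates are taken as 1-based and inclusive, "overlap" means sharing at least one base, the order of the checks is a choice made here, and the names `Interval`, `classify_indel` and the category labels are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int  # 1-based, inclusive (assumed convention)
    end: int


def overlaps(a, b):
    """True if the two intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def classify_indel(indel, host, host_cds, nested_genes):
    """Classify one indel relative to one host gene.

    Labels (hypothetical):
      "no_cds_hit"          - does not touch the host coding sequence
      "extends_beyond_host" - touches the CDS but crosses a host boundary
      "host_only"           - touches the CDS, spares every nested gene
      "dual"                - touches the CDS and at least one nested gene
    """
    if not any(overlaps(indel, cds) for cds in host_cds):
        return "no_cds_hit"
    if not contains(host, indel):
        # Large variants reaching past the host are set aside, since they
        # may also affect neighboring genes or regulatory sequence.
        return "extends_beyond_host"
    if any(overlaps(indel, nested) for nested in nested_genes):
        return "dual"
    return "host_only"
```

In this sketch, "dual" corresponds to the kind of variant that could confound a host gene phenotype, while "host_only" corresponds to the alternative alleles that leave the nested genes intact. Whether the boundary check was also applied to the alternative alleles in the original analysis is not stated, so applying it to both here is an assumption.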
Next we sought to determine how many protein coding genes have an allele with a known phenotype that also disrupts a nested gene. By comparing the list of 3,162 protein coding genes with a phenotype to our list of 436 host genes, we identified 94 alleles with a phenotype that compromise both the coding sequence of 89 host genes and the sequence of their 133 nested genes (Table S3). Again, most nested genes are non-coding RNAs, with the 2 most abundant classes being annotated as ncRNAs and piRNAs (Fig. 4A). Of the 94 alleles, 71 cause a deletion within the host gene and 23 alleles are “complex substitutions” (Fig. 4B), with median variant lengths of 806 and 672 bp, respectively (Fig. 4C), and altering between 1.4% and 100% of the nested gene sequence (Fig. 4D).
As a concrete example of nested gene disruption, consider the gene unc-59 located on chromosome I. Worms carrying the 518 bp deletion allele unc-59(tm1928), which removes most of the second exon, have egg-laying and locomotion defects. This allele also entirely ablates the miRNA mir-8205 located in the second intron of unc-59. In this case, a 329 bp-long deletion (tm1939) partially removing unc-59 exons 1 and 2 has a similar phenotype to the deletion that also removes mir-8205, giving confidence that the tm1928 phenotype is not driven solely by the knockout of mir-8205. However, miRNA mutants generally display subtle phenotypes, if any, with additional environmental or genetic perturbations often facilitating the expression of their functional effects.19-23 Consequently, might the joint disruption of a nested miRNA and its host gene yield the genetically perturbed conditions that could assist in producing synthetic miRNA phenotypes? In this example, it is unclear how or whether the mir-8205 deletion could interact directly with unc-59 to produce additional phenotypic consequences. As mir-8205 could regulate as many as 162 target genes, its deletion potentially misregulates many transcripts because its unique seed sequence predicts little redundancy with other miRNAs (miRBase.org). More generally, however, it may well be that the disruption of the nested genes unpredictably interferes with the host gene's phenotype through shared genetic network architecture, perhaps by reinforcing or contributing to it in ways unrelated to the host gene's primary function.
In conclusion, we concur with Osokine et al.16 that the complexity of genome organization, and nested gene structures in particular, increases the risk that the causality of gene function may sometimes be mistakenly interpreted. A further extension of this phenomenon is the possibility of inadvertent disruption of downstream operon gene expression owing to knockout of an upstream gene within a given operon. This may be particularly relevant in worm genetics. As next-generation sequencing costs continue to drop, there may be a renewed interest in forward genetic screens coupled with whole-genome sequencing to dissect gene function instead of relying on the faster and large-scale RNAi screens.24,25 Fortunately, most nested genes are non-coding RNAs, for which deletion often results in few developmental defects, and the C. elegans genome is exceptionally well-annotated.26 Consequently, the risk of misinterpreting the genetic basis of knockout phenotypes may be limited to conditions in which the disruption of the ncRNAs has potent effects. On the other hand, an unexpected consequence of mutations that disrupt both a host and its nested genes may be to increase the likelihood of observing a phenotype from the joint knockout. Such synthetic mutants might reveal interesting biological processes, even if potentially more difficult to interpret than single gene knockouts.
We extracted the genomic positions of all 46,734 protein-coding and non-coding genes using the genome annotation of C. elegans WS248. We also extracted the genomic coordinates of 106,511 insertion/deletion (indel) variants using the same GFF annotation file. We first searched the C. elegans genome for genes that are fully contained within the boundaries of a protein coding gene using the genes' genomic coordinates. We then sorted the nested genes into 3 categories: genes entirely nested within an intron, genes entirely contained within a coding exon, and genes that are within the boundaries of the host but are neither fully intronic nor fully exonic. We then generated a list of variants potentially altering the function of both genes in each host-nested gene pair by identifying indels with genomic coordinates overlapping with both the coding sequence of the host gene and with the sequence of the nested gene. Using WormMine WS238, we downloaded the list of alleles and their associated phenotypes for 3,162 genes. We then cross-referenced our list of variants with this list of alleles to identify host genes with phenotypes that potentially result from the joint disruption of the nested gene and the disruption of the host's coding sequence. We used TargetScan 27 to predict the target genes of mir-8205. We first extracted the sequences of the annotated 3′-untranslated regions (UTRs) of the C. elegans genes, keeping the UTR of a single transcript, and we predicted targets using the seed sequences of the 5′-arm and 3′-arm of mir-8205.
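The three-way sorting of nested genes described in these methods can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the host is represented by the exon intervals of a single transcript given as 1-based inclusive `(start, end)` tuples, derives introns as the gaps between consecutive exons (so introns lying outside the supplied exon set are not considered), and uses hypothetical function names.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons as (start, end) tuples."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start - prev_end > 1  # skip abutting exons with no gap
    ]


def classify_nested(nested, coding_exons):
    """Sort a nested gene into "intronic", "exonic" or "other".

    "other" covers genes inside the host boundaries that are neither
    fully within one intron nor fully within one coding exon, for
    example a gene spanning an exon-intron junction.
    """
    start, end = nested

    def inside(intervals):
        return any(s <= start and end <= e for s, e in intervals)

    if inside(introns_from_exons(coding_exons)):
        return "intronic"
    if inside(coding_exons):
        return "exonic"
    return "other"
```

For a host with several transcripts, the choice of which exon set to pass in would change the result, so that choice would need to be made explicitly.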
No potential conflicts of interest were disclosed.
This work is supported by a grant from the National Institutes of Health (GM096008) to A.D.C.