We present a new method for identifying and prioritizing predicted cancer genes for future analysis. This method generates a gene-specific score called the “S-score” by incorporating data from different types of analysis, including mutation screening, methylation status, copy-number variation and expression profiling. The method was applied to data from The Cancer Genome Atlas and allowed the identification of known and potentially new oncogenes and tumor suppressors associated with different clinical features, including shortest survival in ovarian cancer patients and hormonal subtypes in breast cancer patients. Furthermore, for the first time, a genome-wide search for genes that behave as oncogenes and tumor suppressors in different tumor types was performed. We envisage that the S-score can be used as a standard method for the identification and prioritization of cancer genes for follow-up studies.
Sense transgene-induced post-transcriptional gene silencing (S-PTGS) is thought to be a type of RNA silencing in which ARGONAUTE1 directs the small interfering RNA (siRNA)-mediated cleavage of a target mRNA in the cytoplasm. Here, we report that the altered splicing of endogenous counterpart genes is a main cause for the reduction of their mature mRNA levels. After the S-PTGS of a tobacco endoplasmic reticulum ω-3 fatty acid desaturase (NtFAD3) gene, 3′-truncated, polyadenylated endo-NtFAD3 transcripts and 5′-truncated, intron-containing endo-NtFAD3 transcripts were detected in the total RNA fraction. Although transcription proceeded until the last exon of the endogenous NtFAD3 gene, intron-containing NtFAD3 transcripts accumulated in the nucleus of the S-PTGS plants. Several intron-containing NtFAD3 transcripts harboring most of the exon sequences were generated when an endogenous silencing suppressor gene, rgs-CaM, was overexpressed in the S-PTGS plants. These intron-containing NtFAD3 splice variants were generated in the presence of NtFAD3 siRNAs that are homologous to the nucleotide sequences of these splice variants. The results of this study indicate that the inhibition of endo-NtFAD3 gene expression is primarily directed via the alteration of splicing and not by cytoplasmic slicer activity. Our results suggest that the transgene and intron-containing endogenous counterpart genes are differentially suppressed in S-PTGS plants.
Although recurrent somatic mutations in the splicing factor U2AF1 (also known as U2AF35) have been identified in multiple cancer types, the effects of these mutations on the cancer transcriptome have yet to be fully elucidated. Here, we identified splicing alterations associated with U2AF1 mutations across distinct cancers using DNA and RNA sequencing data from The Cancer Genome Atlas (TCGA). Using RNA-Seq data from 182 lung adenocarcinomas and 167 acute myeloid leukemias (AML), in which U2AF1 is somatically mutated in 3–4% of cases, we identified 131 and 369 splicing alterations, respectively, that were significantly associated with U2AF1 mutation. Of these, 30 splicing alterations were statistically significant in both lung adenocarcinoma and AML, including three genes in the Cancer Gene Census, CTNNB1, CHCHD7, and PICALM. Cell line experiments expressing U2AF1 S34F in HeLa cells and in 293T cells provide further support that these altered splicing events are caused by U2AF1 mutation. Consistent with the function of U2AF1 in 3′ splice site recognition, we found that S34F/Y mutations cause preferences for CAG over UAG 3′ splice site sequences. This report demonstrates consistent effects of U2AF1 mutation on splicing in distinct cancer cell types.
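The CAG-versus-UAG preference described above comes down to tallying the base immediately upstream of the AG dinucleotide at each 3′ splice site. The following is a minimal sketch of that tally, not the authors' pipeline: the function name `minus3_counts`, the input layout (a list of intron 3′-end sequences) and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def minus3_counts(intron_ends):
    """Tally the base immediately upstream of the terminal AG of each intron.

    intron_ends: iterable of intron 3' end sequences (DNA or RNA alphabet,
    any case). Sequences that do not end in AG are skipped. This input
    layout is an assumption for illustration only.
    """
    counts = Counter()
    for seq in intron_ends:
        seq = seq.upper().replace("T", "U")  # normalise to the RNA alphabet
        if len(seq) >= 3 and seq.endswith("AG"):
            counts[seq[-3]] += 1
    return counts


# Invented toy sequences, not data from the study.
toy_sites = ["uuucag", "cctcag", "TTTTAG", "ccacag"]
counts = minus3_counts(toy_sites)
```

Comparing such tallies between splice sites favoured in mutant samples and those favoured in wild-type samples is one simple way to expose a shift towards CAG.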
At present we know that phenotypic differences between organisms arise from a variety of sources, such as protein sequence divergence, regulatory sequence divergence and alternative splicing. However, we do not yet have a complete view of how these sources are related. Here we address this problem by studying the relationship between protein divergence and the ability of genes to express multiple isoforms. We used three genome-wide datasets of human-mouse orthologs to study the relationship between isoform multiplicity co-occurrence between orthologs (the fact that two orthologs have more than one isoform) and protein divergence. In all cases our results showed that there was a monotonic dependence between these two properties. We could explain this relationship in terms of a more fundamental one, between exon number of the largest isoform and protein divergence. We found that this last relationship was present, although with variations, in other species (chimpanzee, cow, rat, chicken, zebrafish and fruit fly). In summary, we have identified a relationship between protein divergence and isoform multiplicity co-occurrence and explained its origin in terms of a simple gene-level property. Finally, we discuss the biological implications of these findings for our understanding of inter-species phenotypic differences.
The human retrotransposon with the highest copy number is the Alu element. The human genome contains over one million Alu elements that collectively account for over ten percent of our DNA. Full-length Alu elements are randomly distributed throughout the genome in both forward and reverse orientations. However, full-length widely spaced Alu pairs having two Alus in the same (direct) orientation are statistically more prevalent than Alu pairs having two Alus in the opposite (inverted) orientation. The cause of this phenomenon is unknown. It has been hypothesized that this imbalance is the consequence of anomalous inverted Alu pair interactions. One proposed mechanism suggests that inverted Alu pairs can ectopically interact, exposing both ends of each Alu element making up the pair to a potential double-strand break, or “hit”. This hypothesized “two-hit” (two double-strand breaks) potential per Alu element was used to develop a model for comparing the relative instabilities of human genes. The model incorporates both 1) the two-hit double-strand break potential of Alu elements and 2) the probability of exon-damaging deletions extending from these double-strand breaks. This model was used to compare the relative instabilities of 50 deletion-prone cancer genes and 50 randomly selected genes from the human genome. The output of the Alu element-based genomic instability model developed here is shown to coincide with the observed instability of deletion-prone cancer genes. The 50 cancer genes are collectively estimated to be 58% more unstable than the randomly chosen genes using this model. Seven of the deletion-prone cancer genes, ATM, BRCA1, FANCA, FANCD2, MSH2, NCOR1 and PBRM1, were among the most unstable 10% of the 100 genes analyzed. This algorithm may lay the foundation for comparing genetic risks posed by structural variations that are unique to specific individuals, families and people groups.
The insertion/deletion (I/D) polymorphism of the angiotensin converting enzyme (ACE), commonly associated with many diseases, is believed to have affected human adaptation to environmental changes during the out-of-Africa expansion. APOBEC3B (A3B), a member of the cytidine deaminase family APOBEC3s, also exhibits a variable gene insertion/deletion polymorphism across world populations. Using data available from published reports, we examined the global geographic distribution of ACE and A3B genotypes. In tracking the modern human dispersal routes of these two genes, we found that the variation trends of the two I/D polymorphisms were directly correlated. We observed that the frequencies of ACE insertion and A3B deletion rose in parallel along the expansion route. To investigate the presence of a correlation between the two polymorphisms and the effect of their interaction on human health, we analyzed 1199 unrelated Chinese adults to determine their genotypes and other important clinical characteristics. We discovered a significant difference between the ACE genotype/allele distribution in the A3B DD and A3B II/ID groups (P = 0.045 and 0.015, respectively), indicating that the ACE Alu I allele frequency in the former group was higher than in the latter group. No specific clinical phenotype could be associated with the interaction between the ACE and A3B I/D polymorphisms. A3B has been identified as a powerful inhibitor of Alu retrotransposition, and primate A3 genes have undergone strong positive selection (and expansion) for restricting the mobility of endogenous retrotransposons during evolution. Based on these findings, we suggest that the ACE Alu insertion was enabled (facilitated) by the A3B deletion and that functional loss of A3B provided an opportunity for enhanced human adaptability and survival in response to the environmental and climate challenges arising during the migration from Africa.
Epithelial ovarian cancer (EOC) is usually discovered after extensive metastases have developed in the peritoneal cavity. The ovarian surface is exposed to peritoneal fluid pressures and shear forces due to the continuous peristaltic motions of the gastro-intestinal system, creating a mechanical micro-environment for the cells. An in vitro experimental model was developed to expose EOC cells to steady fluid flow-induced wall shear stresses (WSS). The EOC cells were cultured from the OVCAR-3 cell line on denuded amniotic membranes in special wells. Wall shear stresses of 0.5, 1.0 and 1.5 dyne/cm² were applied on the surface of the cells under conditions that mimic the physiological environment, followed by fluorescent staining of actin and β-tubulin fibers. The cytoskeleton response to WSS included cell elongation, stress fiber formation and generation of microtubules. More cytoskeletal components were produced by the cells and arranged in a denser and more organized structure within the cytoplasm. This suggests that WSS may have a significant role in the mechanical regulation of EOC peritoneal spreading.
Chromatin organization affects alternative splicing and previous studies have shown that exons have increased nucleosome occupancy compared with their flanking introns. To determine whether alternative splicing affects chromatin organization we developed a system in which the alternative splicing pattern switched from inclusion to skipping as a function of time. Changes in nucleosome occupancy were correlated with the change in the splicing pattern. Surprisingly, strengthening of the 5′ splice site or strengthening the base pairing of U1 snRNA with an internal exon abrogated the skipping of the internal exons and also affected chromatin organization. Over-expression of splicing regulatory proteins also affected the splicing pattern and changed nucleosome occupancy. A specific splicing inhibitor was used to show that splicing impacts nucleosome organization endogenously. The effect of splicing on the chromatin required a functional U1 snRNA base pairing with the 5′ splice site, but U1 pairing was not essential for U1 snRNA enhancement of transcription. Overall, these results suggest that splicing can affect chromatin organization.
P-SSP7 is a T7-like phage that infects the cyanobacterium Prochlorococcus MED4. MED4 is a member of the high-light-adapted Prochlorococcus ecotypes that are abundant in the surface oceans and contribute significantly to primary production. P-SSP7 has become a model system for the investigation of T7-like phages that infect Prochlorococcus. It was classified as T7-like based on genome content and organization. However, because its genome assembled as a circular molecule, it was thought to be circularly permuted and to lack the direct terminal repeats found in other T7-like phages. Here we sequenced the ends of the P-SSP7 genome and found that the genome map is linear and contains a 206 bp repeat at both genome ends. Furthermore, we found that a 728 bp region of the genome originally placed downstream of the last ORF is actually located upstream of the first ORF on the genome map. These findings suggest that P-SSP7 is likely to use the direct terminal repeats for genome replication and packaging in a similar manner to other T7-like phages. Moreover, these results highlight the importance of experimentally verifying the ends of phage genomes, and will facilitate the use of P-SSP7 as a model for the correct assembly and end determination of the many T7-like phages isolated from the marine environment that are currently being sequenced.
Accumulation of the complex set of alternatively processed mRNAs from the adenovirus major late transcription unit (MLTU) is subject to temporal regulation involving both changes in poly(A) site choice and alternative 3′ splice site usage. We have previously shown that the adenovirus L4-33K protein functions as an alternative splicing factor involved in activating the shift from L1-52,55K to L1-IIIa mRNA. Here we show that L4-33K specifically associates with the catalytic subunit of the DNA-dependent protein kinase (DNA-PK) in uninfected and adenovirus-infected nuclear extracts. Further, we show that L4-33K is highly phosphorylated by DNA-PK in vitro in a double-stranded DNA-independent manner. Importantly, DNA-PK-deficient cells show an enhanced production of the L1-IIIa mRNA, suggesting an inhibitory role of DNA-PK on the temporal switch in L1 alternative RNA splicing. Moreover, we show that L4-33K is also phosphorylated by protein kinase A (PKA), and that PKA has an enhancer effect on L4-33K-stimulated L1-IIIa splicing. Hence, we demonstrate that these kinases have opposite effects on L4-33K function: DNA-PK as an inhibitor and PKA as an activator of L1-IIIa mRNA splicing. Taken together, this is the first report identifying protein kinases that phosphorylate L4-33K and suggesting novel regulatory roles for DNA-PK and PKA in adenovirus alternative RNA splicing.
Accurate and efficient genome-wide detection of copy number variants (CNVs) is essential for understanding human genomic variation, genome-wide CNV association studies, cytogenetics research and diagnostics, and independent validation of CNVs identified from sequencing-based technologies. Numerous array-based platforms for CNV detection exist, utilizing array Comparative Genome Hybridization (aCGH), Single Nucleotide Polymorphism (SNP) genotyping or both. We have quantitatively assessed the abilities of twelve leading genome-wide CNV detection platforms to accurately detect Gold Standard sets of CNVs in the genome of HapMap CEU sample NA12878, and found significant differences in performance. The technologies analyzed were the NimbleGen 4.2 M, 2.1 M and 3×720 K Whole Genome and CNV-focused arrays, the Agilent 1×1 M CGH and High Resolution and 2×400 K CNV and SNP+CGH arrays, the Illumina Human Omni1Quad array and the Affymetrix SNP 6.0 array. The Gold Standards used were a 1000 Genomes Project sequencing-based set of 3997 validated CNVs and an ultra high-resolution aCGH-based set of 756 validated CNVs. We found that sensitivity, total number, size range and breakpoint resolution of CNV calls were highest for CNV-focused arrays. Our results are important for cost-effective CNV detection and validation for both basic and clinical applications.
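Sensitivity against a Gold Standard set, as used above, can be computed as the fraction of Gold Standard CNVs matched by at least one platform call. The sketch below is illustrative only: the abstract does not state the matching criterion, so the 50% reciprocal-overlap rule, the function names and the toy intervals are assumptions rather than the authors' procedure.

```python
def reciprocal_overlap(a, b):
    """Overlap length divided by the length of the longer interval.

    a, b: (chrom, start, end) tuples with half-open coordinates [start, end).
    A value >= t means the overlap covers at least a fraction t of BOTH
    intervals, which is the usual meaning of "reciprocal overlap".
    """
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[2] - a[1], b[2] - b[1])


def sensitivity(calls, gold, min_overlap=0.5):
    """Fraction of gold-standard CNVs matched by at least one call.

    min_overlap=0.5 is an assumed threshold, not a value from the study.
    Returns None for an empty gold-standard set.
    """
    if not gold:
        return None
    detected = sum(
        1 for g in gold
        if any(reciprocal_overlap(g, c) >= min_overlap for c in calls)
    )
    return detected / len(gold)


# Invented toy intervals: only the first gold-standard CNV is recovered.
gold = [("chr1", 1000, 2000), ("chr1", 5000, 9000), ("chr2", 100, 400)]
calls = [("chr1", 1100, 2100), ("chr1", 5000, 6000), ("chr3", 100, 400)]
sens = sensitivity(calls, gold)
```

The nested loop is quadratic in the number of intervals; for sets the size of the Gold Standards mentioned above, an interval index or a sorted sweep would be the more practical choice.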
The ability to generate whole genome data is rapidly becoming commoditized. For example, a mammalian-sized genome (∼3 Gb) can now be sequenced using approximately ten lanes on an Illumina HiSeq 2000. Since lanes from different runs are often combined, verifying that each lane in a genome's build is from the same sample is an important quality control. We sought to address this issue in a post hoc bioinformatic manner, instead of using upstream sample or “barcode” modifications. We rely on the inherent small differences between any two individuals to show that genotype concordance rates can be effectively used to test if any two lanes of HiSeq 2000 data are from the same sample. As proof of principle, we use recent data from three different human samples generated on this platform. We show that the distributions of concordance rates are non-overlapping when comparing lanes from the same sample versus lanes from different samples. Our method proves to be robust even when different numbers of reads are analyzed. Finally, we provide a straightforward method for determining the sex of any given sample. Our results suggest that examining the concordance of detected genotypes from lanes purported to be from the same sample is a relatively simple approach for confirming that combined lanes of data are of the same identity and quality.
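The lane-comparison idea above can be illustrated with a minimal sketch. It assumes each lane has already been reduced to a mapping from genomic site to genotype call; the function name `concordance_rate`, the dictionary layout and the toy calls are hypothetical and are not taken from the study.

```python
def concordance_rate(lane_a, lane_b):
    """Fraction of sites called in both lanes that carry the same genotype.

    lane_a, lane_b: dicts mapping (chromosome, position) -> genotype string,
    e.g. {("chr1", 12345): "A/G"}. This layout is assumed for illustration.
    Returns None when the two lanes share no called sites.
    """
    shared = lane_a.keys() & lane_b.keys()
    if not shared:
        return None
    matches = sum(1 for site in shared if lane_a[site] == lane_b[site])
    return matches / len(shared)


# Invented toy calls: lanes 1 and 2 mimic one sample, lane 3 another.
lane1 = {("chr1", 100): "A/A", ("chr1", 200): "A/G",
         ("chr2", 50): "C/C", ("chr2", 90): "T/T"}
lane2 = {("chr1", 100): "A/A", ("chr1", 200): "A/G",
         ("chr2", 50): "C/C", ("chr3", 10): "G/G"}
lane3 = {("chr1", 100): "A/G", ("chr1", 200): "G/G",
         ("chr2", 50): "C/C", ("chr2", 90): "T/T"}

same_sample = concordance_rate(lane1, lane2)   # 3 shared sites, 3 agree
diff_sample = concordance_rate(lane1, lane3)   # 4 shared sites, 2 agree
```

Restricting the comparison to sites called in both lanes keeps the rate comparable when lanes differ in read count, though a real analysis would presumably also filter on coverage and call quality.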
LINE-1 (L1) retroelements emerged in mammalian genomes over 80 million years ago with a few dominant subfamilies amplifying over discrete time periods that led to distinct human and mouse L1 lineages. We evaluated the functional conservation of L1 sequences by comparing retrotransposition rates of chimeric human-rodent L1 constructs to their parental L1 counterparts. Although amino acid conservation varies from ∼35% to 63% for the L1 ORF1p and ORF2p, most human and mouse L1 sequences can be functionally exchanged. Replacing either ORF1 or ORF2 to create chimeric human-mouse L1 elements did not adversely affect retrotransposition. The mouse ORF2p retains retrotransposition-competency to support both Alu and L1 mobilization when any of the domain sequences we evaluated were substituted with human counterparts. However, the substitution of portions of the mouse cys-domain into the human ORF2p reduces both L1 retrotransposition and Alu trans-mobilization by 200–1000 fold. The observed loss of ORF2p function is independent of the endonuclease or reverse transcriptase activities of ORF2p and RNA interaction required for reverse transcription. In addition, the loss of function is physically separate from the cysteine-rich motif sequence previously shown to be required for RNP formation. Our data suggest an additional role of the less characterized carboxy-terminus of the L1 ORF2 protein by demonstrating that this domain, in addition to mediating RNP interaction(s), provides an independent and required function for the retroelement amplification process. Our experiments show a functional modularity of most of the LINE sequences. However, divergent evolution of interactions within L1 has led to non-reciprocal incompatibilities between human and mouse ORF2 cys-domain sequences.
Despite the ever-increasing throughput and steadily decreasing cost of next
generation sequencing (NGS), whole genome sequencing of humans is still not a
viable option for the majority of genetics laboratories. This is particularly
true in the case of complex disease studies, where large sample sets are often
required to achieve adequate statistical power. To fully leverage the potential
of NGS technology on large sample sets, several methods have been developed to
selectively enrich for regions of interest. Enrichment reduces both monetary and
computational costs compared to whole genome sequencing, while allowing
researchers to take advantage of NGS throughput. Several targeted enrichment
approaches are currently available, including molecular inversion probe ligation
sequencing (MIPS), oligonucleotide hybridization based approaches, and PCR-based
strategies. To assess how these methods performed when used in conjunction with
the ABI SOLiD 3+, we investigated three enrichment techniques: Nimblegen
oligonucleotide hybridization array-based capture; Agilent SureSelect
oligonucleotide hybridization solution-based capture; and Raindance
Technologies' multiplexed PCR-based approach. Target regions were selected
from exons and evolutionarily conserved areas throughout the human genome. Probe
and primer pair design was carried out for all three methods using their
respective informatics pipelines. In all, approximately 0.8 Mb of target space
was identical for all 3 methods. SOLiD sequencing results were analyzed for
several metrics, including consistency of coverage depth across samples,
on-target versus off-target efficiency, allelic bias, and genotype concordance
with array-based genotyping data. Agilent SureSelect exhibited superior
on-target efficiency and correlation of read depths across samples. Nimblegen
performance was similar at read depths of 20× and below. Both Raindance
and Nimblegen SeqCap exhibited tighter distributions of read depth around the
mean, but both suffered from lower on-target efficiency in our experiments.
Raindance demonstrated the highest versatility in assay design.
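On-target efficiency, one of the metrics listed above, can be sketched as the fraction of aligned reads that overlap a target region. The example below is a simplified illustration and not the analysis used in the study: the one-base overlap criterion, the function name `on_target_fraction` and the toy coordinates are assumptions.

```python
def on_target_fraction(reads, targets):
    """Fraction of reads overlapping at least one target interval.

    reads, targets: lists of (chrom, start, end) tuples with half-open
    coordinates [start, end). Counting a read as on-target when it overlaps
    a target by a single base is an assumption made for this sketch.
    Returns None when there are no reads.
    """
    if not reads:
        return None

    def overlaps(read, target):
        return (read[0] == target[0]
                and read[1] < target[2]
                and target[1] < read[2])

    on_target = sum(1 for r in reads if any(overlaps(r, t) for t in targets))
    return on_target / len(reads)


# Invented toy coordinates: two of the four reads touch a target.
targets = [("chr1", 100, 200), ("chr2", 500, 650)]
reads = [("chr1", 90, 140), ("chr1", 200, 250),
         ("chr2", 600, 650), ("chr3", 100, 150)]
frac = on_target_fraction(reads, targets)
```

Counting bases rather than reads, or requiring a minimum overlap, would give a stricter estimate; which definition a given comparison uses should be stated alongside the number.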
Mammalian telomeres are specialized chromatin structures that require the telomere binding protein, TRF2, for maintaining chromosome stability. In addition to its ability to modulate DNA repair activities, TRF2 also has direct effects on DNA structure and topology. Given that mammalian telomeric chromatin includes nucleosomes, we investigated the effect of this protein on chromatin structure. TRF2 bound to reconstituted telomeric nucleosomal fibers through both its basic N-terminus and its C-terminal DNA binding domain. Analytical agarose gel electrophoresis (AAGE) studies showed that TRF2 promoted the folding of nucleosomal arrays into more compact structures by neutralizing negative surface charge. A construct containing the N-terminal and TRFH domains together altered the charge and radius of nucleosomal arrays similarly to full-length TRF2, suggesting that TRF2-driven changes in global chromatin structure were largely due to these regions. However, the most compact chromatin structures were induced by the isolated basic N-terminal region, as judged by both AAGE and atomic force microscopy. Although the N-terminal region condensed nucleosomal array fibers, the TRFH domain, known to alter DNA topology, was required for stimulation of a strand invasion-like reaction with nucleosomal arrays. Optimal strand invasion also required the C-terminal DNA binding domain. Furthermore, the reaction was not stimulated on linear histone-free DNA. Our data suggest that nucleosomal chromatin has the ability to facilitate this activity of TRF2, which is thought to be involved in stabilizing looped telomere structures.
To gain global insights into the role of the well-known repressive splicing regulator PTB we analyzed the consequences of PTB knockdown in HeLa cells using high-density oligonucleotide splice-sensitive microarrays. The major class of identified PTB-regulated splicing event was PTB-repressed cassette exons, but there was also a substantial number of PTB-activated splicing events. PTB repressed and activated exons showed a distinct arrangement of motifs with pyrimidine-rich motif enrichment within and upstream of repressed exons, but downstream of activated exons. The N-terminal half of PTB was sufficient to activate splicing when recruited downstream of a PTB-activated exon. Moreover, insertion of an upstream pyrimidine tract was sufficient to convert a PTB-activated to a PTB-repressed exon. Our results demonstrate that PTB, an archetypal splicing repressor, has variable splicing activity that predictably depends upon its binding location with respect to target exons.
Since the emergence of next-generation sequencing (NGS) technologies, great effort has been put into the development of tools for analysis of the short reads. In parallel, knowledge is increasing regarding biases inherent in these technologies. Here we discuss four different biases we encountered while analyzing various Illumina datasets. These biases are due to both biological and statistical effects that in particular affect comparisons between different genomic regions. Specifically, we encountered biases pertaining to the distributions of nucleotides across sequencing cycles, to mappability, to contamination of pre-mRNA with mRNA, and to non-uniform hydrolysis of RNA. Most of these biases are not specific to one analyzed dataset, but are present across a variety of datasets and within a variety of genomic contexts. Importantly, some of these biases correlated in a highly significant manner with biological features, including transcript length, gene expression levels, conservation levels, and exon-intron architecture, misleadingly increasing the apparent credibility of results that arise from the biases themselves. We also demonstrate the relevance of these biases when analyzing an NGS dataset that maps transcriptionally engaged RNA polymerase II (RNAPII) in the context of exon-intron architecture, and show that elimination of these biases is crucial for avoiding erroneous interpretation of the data. Collectively, our results highlight several important pitfalls, challenges and approaches in the analysis of NGS reads.
Familial Dysautonomia (FD) is an autosomal recessive congenital neuropathy that results from abnormal development and progressive degeneration of the sensory and autonomic nervous system. The mutation observed in almost all FD patients is a point mutation at position 6 of intron 20 of the IKBKAP gene; this gene encodes the IκB kinase complex-associated protein (IKAP). The mutation results in a tissue-specific splicing defect: exon 20 is skipped, leading to reduced IKAP protein expression. Here we show that phosphatidylserine (PS), an FDA-approved food supplement, increased IKAP mRNA levels in cells derived from FD patients. Long-term treatment with PS led to a significant increase in IKAP protein levels in these cells. A conjugate of PS and an omega-3 fatty acid also increased IKAP mRNA levels. Furthermore, PS treatment released FD cells from cell cycle arrest and up-regulated a significant number of genes involved in cell cycle regulation. Our results suggest that PS has potential for use as a therapeutic agent for FD. Understanding its mechanism of action may reveal the mechanism underlying FD.
Transposable elements (TEs) have played an important role in the diversification and enrichment of mammalian transcriptomes through various mechanisms such as exonization and intronization (the birth of new exons/introns from previously intronic/exonic sequences, respectively), and insertion into first and last exons. However, no extensive analysis has compared the effects of TEs on the transcriptomes of mammals, non-mammalian vertebrates and invertebrates.
We analyzed the influence of TEs on the transcriptomes of five species, three invertebrates and two non-mammalian vertebrates. Compared to previously analyzed mammals, there were lower levels of TE introduction into introns, significantly lower numbers of exonizations originating from TEs and a lower percentage of TE insertion within the first and last exons. Although the transcriptomes of vertebrates exhibit significant levels of exonization of TEs, only anecdotal cases were found in invertebrates. In vertebrates, as in mammals, the exonized TEs are mostly alternatively spliced, indicating that selective pressure maintains the original mRNA product generated from such genes.
Exonization of TEs is widespread in mammals, less so in non-mammalian vertebrates, and very low in invertebrates. We assume that the exonization process depends on the length of introns. Vertebrates, unlike invertebrates, are characterized by long introns and short internal exons. Our results suggest that there is a direct link between the length of introns and exonization of TEs and that this process became more prevalent following the appearance of mammals.
Insertion of transposed elements within mammalian genes is thought to be an important contributor to mammalian evolution and speciation. Insertion of transposed elements into introns can lead to their activation as alternatively spliced cassette exons, an event called exonization. Elucidation of the evolutionary constraints that have shaped fixation of transposed elements within human and mouse protein coding genes and subsequent exonization is important for understanding of how the exonization process has affected transcriptome and proteome complexities. Here we show that exonization of transposed elements is biased towards the beginning of the coding sequence in both human and mouse genes. Analysis of single nucleotide polymorphisms (SNPs) revealed that exonization of transposed elements can be population-specific, implying that exonizations may enhance divergence and lead to speciation. SNP density analysis revealed differences between Alu and other transposed elements. Finally, we identified cases of primate-specific Alu elements that depend on RNA editing for their exonization. These results shed light on TE fixation and the exonization process within human and mouse genes.
Transposable elements (TEs) have contributed a wide range of functional sequences to their host genomes. A recent paper in BMC Molecular Biology discusses the creation of new transcripts by transposable element insertion upstream of retrocopies and the involvement of such insertions in tissue-specific post-transcriptional regulation.
Regulation of splicing in eukaryotes occurs through the coordinated action of multiple splicing factors. Exons and introns contain numerous putative binding sites for splicing regulatory proteins. Regulation of splicing is presumably achieved by the combinatorial output of the binding of splicing factors to the corresponding binding sites. Although putative regulatory sites often overlap, no extensive study has examined whether overlapping regulatory sequences provide yet another dimension to splicing regulation. Here we analyzed experimentally identified splicing regulatory sequences using a computational method based on the natural distribution of nucleotides and splicing regulatory sequences. We uncovered positive and negative interplay between overlapping regulatory sequences. Examination of these overlapping motifs revealed a unique spatial distribution, especially near splice donor sites of exons with weak splice donor sites. The positively selected overlapping splicing regulatory motifs were highly conserved among different species, implying functionality. Overall, these results suggest that overlap of two splicing regulatory binding sites is an evolutionarily conserved, widespread mechanism of splicing regulation. Finally, over-abundant motif overlaps were experimentally tested in a reporter minigene, revealing that an overlap may facilitate a mode of splicing that does not occur in the presence of only one of the two regulatory sequences that comprise it.
Throughout evolution, eukaryotic genomes have been invaded by transposable elements (TEs). Little is known about the factors leading to genomic proliferation of TEs, their preferred integration sites and the molecular mechanisms underlying their insertion. We analyzed hundreds of thousands of nested TEs in the human genome, i.e. insertions of TEs into existing ones. We first discovered that most TEs insert within specific ‘hotspots’ along the targeted TE. In particular, retrotransposed Alu elements contain a non-canonical single nucleotide hotspot for insertion of other Alu sequences. We next devised a method for identification of integration sequence motifs of inserted TEs that are conserved within the targeted TEs. This method revealed novel sequence motifs characterizing insertions of various important TE families: Alu, hAT, ERV1 and MaLR. Finally, we performed a global assessment to determine the extent to which young TEs tend to nest within older transposed elements and identified a 4-fold higher tendency of TEs to insert into existing TEs than to insert within non-TE intergenic regions. Our analysis demonstrates that TEs are highly biased to insert within certain TEs, in specific orientations and within specific targeted TE positions. TE nesting events also reveal new characteristics of the molecular mechanisms underlying transposition.
More than 5% of alternatively spliced internal exons in the human genome are derived from Alu elements in a process termed exonization. Alus are comprised of two homologous arms separated by an internal polypyrimidine tract (PPT). In most exonizations, splice sites are selected from within the same arm. We hypothesized that the internal PPT may prevent selection of a splice site further downstream. Here, we demonstrate that this PPT enhanced the selection of an upstream 5′ splice site (5′ss), even in the presence of a stronger 5′ss downstream. Deletion of this PPT shifted selection to the stronger downstream 5′ss. This enhancing effect depended on the strength of the downstream 5′ss, on the efficiency of base-pairing to U1 snRNA, and on the length of the PPT. This effect of the PPT was mediated by the binding of TIA proteins and was dependent on the distance between the PPT and the upstream 5′ss. A wide-scale evolutionary analysis of introns across 22 eukaryotes revealed an enrichment in PPTs within ∼20 nt downstream of the 5′ss. For most metazoans, the strength of the 5′ss inversely correlated with the presence of a downstream PPT, indicative of the functional role of the PPT. Finally, we found that the proteins that mediate this effect, TIA and U1C, and in particular their functional domains, are highly conserved across evolution. Overall, these findings expand our understanding of the role of TIA1/TIAR proteins in enhancing recognition of exons, in general, and Alu exons, in particular.
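The enrichment analysis above rests on deciding whether a polypyrimidine tract lies within roughly 20 nt downstream of a 5′ss. A minimal sketch of such a check follows. The abstract does not define a PPT operationally, so the minimum run length of seven pyrimidines, the function name `has_downstream_ppt` and the toy sequences are assumptions chosen for illustration.

```python
import re


def has_downstream_ppt(intron_seq, window=20, min_run=7):
    """True if the first `window` intronic bases contain a pyrimidine run.

    intron_seq: intron sequence starting at the first base after the 5'ss
    exon-intron junction (DNA or RNA alphabet, any case).
    min_run: minimum number of consecutive pyrimidines (C, T/U) treated as
    a tract; 7 is an arbitrary value assumed for this sketch.
    """
    region = intron_seq[:window].upper().replace("U", "T")
    return re.search(r"[CT]{%d,}" % min_run, region) is not None


# Invented toy introns.
with_ppt = has_downstream_ppt("GTAAGTCTTTTCTCCAGGAAGCA")
without_ppt = has_downstream_ppt("GTAAGTAGCATGCAAGCTAGGATCC")
# A tract that starts beyond the 20-nt window is not counted.
beyond_window = has_downstream_ppt("GTAAGTAGCATGCAAGCTAGGACTTTTCTTT")
```

Applying the check to every annotated intron of a species and comparing the hit rate with that of windows taken elsewhere in the intron would give a rough enrichment estimate.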
Human genes are composed of functional regions, termed exons, separated by non-functional regions, termed introns. Intronic sequences may gradually accumulate mutations and subsequently become recognized by the splicing machinery as exons, a process termed exonization. Alu elements are prone to undergo exonization: more than 5% of alternatively spliced internal exons in the human genome originate from Alu elements. A typical Alu element is ∼300 nucleotides long, consisting of two arms separated by a polypyrimidine tract (PPT). Interestingly, in most cases, exonization occurs almost exclusively within either the right arm or the left, not both. Here we found that the PPT between the two arms serves as a binding site for TIA proteins and prevents the exon selection process from expanding into downstream regions. To obtain a wider overview of TIA function, we performed a cross-evolutionary analysis within 22 eukaryotes of this protein and of U1C, a protein known to interact with it, and found that functional regions of both these proteins were highly conserved. These findings highlight the pivotal role of TIA proteins in 5′ splice-site selection of Alu exons and exon recognition in general.