Salmonella Typhi and Typhimurium diverged only ∼50 000 years ago, yet have very different host ranges and pathogenicity. Despite the availability of multiple whole-genome sequences, the genetic differences that have driven these changes in phenotype are only beginning to be understood. In this study, we use transposon-directed insertion-site sequencing to probe differences in gene requirements for competitive growth in rich media between these two closely related serovars. We identify a conserved core of 281 genes that are required for growth in both serovars, 228 of which are essential in Escherichia coli. We are able to identify active prophage elements through the requirement for their repressors. We also find distinct differences in requirements for genes involved in cell surface structure biogenesis and iron utilization. Finally, we demonstrate that transposon-directed insertion-site sequencing is not only applicable to the protein-coding content of the cell but also has sufficient resolution to generate hypotheses regarding the functions of non-coding RNAs (ncRNAs) as well. We are able to assign probable functions to a number of cis-regulatory ncRNA elements, as well as to infer likely differences in trans-acting ncRNA regulatory networks.
In this unit, we describe a set of improvements we have made to the standard Illumina protocols to make the sequencing process more reliable in a high-throughput environment, reduce amplification bias, narrow the distribution of insert sizes, and reliably obtain high yields of data.
Illumina; Next-Generation; Sequencer; Protocols
Malaria elimination strategies require surveillance of the parasite population for genetic changes that demand a public health response, such as new forms of drug resistance. 1,2 Here we describe methods for large-scale analysis of genetic variation in Plasmodium falciparum by deep sequencing of parasite DNA obtained from the blood of patients with malaria, either directly or after short term culture. Analysis of 86,158 exonic SNPs that passed genotyping quality control in 227 samples from Africa, Asia and Oceania provides genome-wide estimates of allele frequency distribution, population structure and linkage disequilibrium. By comparing the genetic diversity of individual infections with that of the local parasite population, we derive a metric of within-host diversity that is related to the level of inbreeding in the population. An open-access web application has been established for exploration of regional differences in allele frequency and of highly differentiated loci in the P. falciparum genome.
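The comparison of within-host and population diversity described above can be illustrated with a minimal sketch. It assumes biallelic SNPs, uses expected heterozygosity 2p(1-p) as the diversity measure, and combines the two as 1 - H_w/H_s; these choices and the function names are illustrative assumptions, not necessarily the exact estimator used in the study.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def within_host_index(sample_freqs, population_freqs):
    """Return 1 - H_w / H_s (assumed, simplified formulation).

    sample_freqs: within-sample allele frequencies, one per SNP
    population_freqs: allele frequencies in the local population, same SNPs
    """
    if len(sample_freqs) != len(population_freqs) or not sample_freqs:
        raise ValueError("need two non-empty lists of equal length")
    # Mean heterozygosity within the infection and in the population.
    h_w = sum(heterozygosity(p) for p in sample_freqs) / len(sample_freqs)
    h_s = sum(heterozygosity(p) for p in population_freqs) / len(population_freqs)
    if h_s == 0:
        raise ValueError("population shows no diversity at these SNPs")
    return 1.0 - h_w / h_s
```

Under these assumptions, a value near 1 corresponds to an infection with far less diversity than the local population, and a value near 0 to an infection as diverse as the population.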
CpG islands (CGIs) are prominent in the mammalian genome owing to their GC-rich base composition and high density of CpG dinucleotides1,2. Most human gene promoters are embedded within CGIs that lack DNA methylation and coincide with sites of histone H3 lysine 4 trimethylation (H3K4me3), irrespective of transcriptional activity3,4. In spite of these intriguing correlations, the functional significance of non-methylated CGI sequences with respect to chromatin structure and transcription is unknown. By performing a search for proteins that are common to all CGIs, here we show high enrichment for Cfp1, which selectively binds to non-methylated CpGs in vitro5,6. Chromatin immunoprecipitation of a mono-allelically methylated CGI confirmed that Cfp1 specifically associates with non-methylated CpG sites in vivo. High throughput sequencing of Cfp1-bound chromatin identified a notable concordance with non-methylated CGIs and sites of H3K4me3 in the mouse brain. Levels of H3K4me3 at CGIs were markedly reduced in Cfp1-depleted cells, consistent with the finding that Cfp1 associates with the H3K4 methyltransferase Setd1 (refs 7, 8). To test whether non-methylated CpG-dense sequences are sufficient to establish domains of H3K4me3, we analysed artificial CpG clusters that were integrated into the mouse genome. Despite the absence of promoters, the insertions recruited Cfp1 and created new peaks of H3K4me3. The data indicate that a primary function of non-methylated CGIs is to genetically influence the local chromatin modification state by interaction with Cfp1 and perhaps other CpG-binding proteins.
Armillaria mellea is a major plant pathogen. Yet, no large-scale “-omics” data are available to enable new studies, and limited experimental models are available to investigate basidiomycete pathogenicity. Here we reveal that the A. mellea genome comprises 58.35 Mb and contains 14473 gene models of average length 1575 bp (4.72 introns/gene). Tandem mass spectrometry identified 921 mycelial (n = 629 unique) and secreted (n = 183 unique) proteins. Almost 100 mycelial proteins were either species-specific or previously unidentified at the protein level. A number of proteins (n = 111) were detected in both mycelia and culture supernatant extracts. Signal sequence occurrence was 4-fold greater for secreted (50.2%) compared to mycelial (12%) proteins. Analyses revealed a rich reservoir of carbohydrate degrading enzymes, laccases, and lignin peroxidases in the A. mellea proteome, reminiscent of both basidiomycete and ascomycete glycodegradative arsenals. We discovered that A. mellea exhibits a specific killing effect against Candida albicans during coculture. Proteomic investigation of this interaction revealed the unique expression of defensive and potentially offensive A. mellea proteins (n = 30). Overall, our data reveal new insights into the origin of basidiomycete virulence and we present a new model system for further studies aimed at deciphering fungal pathogenic mechanisms.
Armillaria mellea; basidiomycete proteomics; basidiomycete genomics; Next Generation Sequencing; fungal proteomics; glycosidases; Candida
Chickens, pigs, and cattle are key reservoirs of Salmonella enterica, a foodborne pathogen of worldwide importance. Though a decade has elapsed since publication of the first Salmonella genome, thousands of genes remain of hypothetical or unknown function, and the basis of colonization of reservoir hosts is ill-defined. Moreover, previous surveys of the role of Salmonella genes in vivo have focused on systemic virulence in murine typhoid models, and the genetic basis of intestinal persistence and thus zoonotic transmission has received little study. We therefore screened pools of random insertion mutants of S. enterica serovar Typhimurium in chickens, pigs, and cattle by transposon-directed insertion-site sequencing (TraDIS). The identity and relative fitness in each host of 7,702 mutants was simultaneously assigned by massively parallel sequencing of transposon-flanking regions. Phenotypes were assigned to 2,715 different genes, providing a phenotype–genotype map of unprecedented resolution. The data are self-consistent in that multiple independent mutations in a given gene or pathway were observed to exert a similar fitness cost. Phenotypes were further validated by screening defined null mutants in chickens. Our data indicate that a core set of genes is required for infection of all three host species, and smaller sets of genes may mediate persistence in specific hosts. By assigning roles to thousands of Salmonella genes in key reservoir hosts, our data facilitate systems approaches to understand pathogenesis and the rational design of novel cross-protective vaccines and inhibitors. Moreover, by simultaneously assigning the genotype and phenotype of over 90% of mutants screened in complex pools, our data establish TraDIS as a powerful tool to apply rich functional annotation to microbial genomes with minimal animal use.
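As an illustration of how relative fitness might be assigned from sequencing of transposon-flanking regions, the sketch below compares each mutant's share of reads in the inoculum with its share in the pool recovered from a host. The log2-ratio score, the pseudocount, and all names are assumptions made for the example; the abstract does not specify the scoring used.

```python
import math


def relative_fitness(input_counts, output_counts, pseudocount=1.0):
    """Return a log2 output/input read-fraction ratio per mutant (assumed score).

    input_counts: reads per insertion mutant in the inoculum
    output_counts: reads per insertion mutant recovered from the host
    """
    total_in = sum(input_counts.values())
    total_out = sum(output_counts.values())
    if total_in == 0 or total_out == 0:
        raise ValueError("both pools need at least one read")
    scores = {}
    for mutant, n_in in input_counts.items():
        n_out = output_counts.get(mutant, 0)
        # The pseudocount keeps mutants lost from the output pool finite.
        frac_in = (n_in + pseudocount) / total_in
        frac_out = (n_out + pseudocount) / total_out
        scores[mutant] = math.log2(frac_out / frac_in)
    return scores
```

With this scoring, a strongly negative value marks a mutant that was depleted during infection, and values near zero mark mutants whose abundance was unchanged.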
Salmonella Typhimurium is a major cause of human diarrhoeal infections, usually acquired from chickens, pigs, cattle, or their products. To understand the basis of persistence and pathogenesis in these reservoir hosts, and to inform the design of novel vaccines and treatments, we generated a library of 7,702 S. Typhimurium mutants, each bearing an insertion at a random position in the genome. Using DNA sequencing, we identified the disrupted gene in each mutant and determined its relative abundance in a laboratory culture and after experimental infection of mice, chickens, pigs, and cattle. The method allowed large numbers of mutants to be investigated simultaneously, drastically reducing the number of animals required to perform a comprehensive screen. We identified mutants that grow in culture but do not survive in one or more of the animals. The genes disrupted in these mutants are inferred to be important for the infection process. Most of these genes were required in all three food-producing animals, but smaller subsets of genes may mediate persistence in a specific host species. The data provide the most comprehensive map of virulence-associated genes for any bacterial pathogen in natural hosts and are highly relevant for the design of control strategies.
When combined with Haplotype Fusion PCR (HF-PCR), Ligation Haplotyping is a robust, high-throughput method for empirical determination of haplotypes, which can be applied to assaying both sequence and structural variation over long distances. Unlike alternative approaches to haplotype determination, such as allele-specific PCR and long PCR, HF-PCR and Ligation Haplotyping do not suffer from mispriming or template switching errors. In this method, HF-PCR is used to juxtapose DNA sequences from single-molecule templates that contain single nucleotide polymorphisms (SNPs) or paralogous sequence variants (PSVs) separated by several kilobases. HF-PCR employs an emulsion-based fusion PCR reaction, which can be performed rapidly and in a 96-well format. Subsequently, a ligation-based assay is performed on the HF-PCR products to determine haplotypes. Products are resolved by capillary electrophoresis. Once optimized, the method is rapid to perform, taking a day and a half to generate phased haplotypes from genomic DNA.
Gorillas are humans’ closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago (Mya). In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
Functional impairment of DNA damage response pathways leads to increased genomic instability. Here we describe the centrosomal protein CEP152 as a new regulator of genomic integrity and cellular response to DNA damage. Using homozygosity mapping and exome sequencing, we identified CEP152 mutations in Seckel syndrome and showed that impaired CEP152 function leads to accumulation of genomic defects resulting from replicative stress through enhanced activation of ATM signaling and increased H2AX phosphorylation.
Precisely characterizing the breakpoints of copy number variants (CNVs) is crucial for assessing their functional impact. However, fewer than 10% of known germline CNVs have been mapped to the single-nucleotide level. We characterized the sequence breakpoints from a dataset of all CNVs detected in three unrelated individuals in previous array-based CNV discovery experiments. We used targeted hybridization-based DNA capture and 454 sequencing to sequence 324 CNV breakpoints, including 315 deletions. We observed two major breakpoint signatures: 70% of the deletion breakpoints have 1–30 bp of microhomology, whereas 33% of deletion breakpoints contain 1–367 bp of inserted sequence. The co-occurrence of microhomology and inserted sequence is low (10%), suggesting that there are at least two different mutational mechanisms. Approximately 5% of the breakpoints represent more complex rearrangements, including local microinversions, suggesting a replication-based strand switching mechanism. Despite a rich literature on DNA repair processes, reconstruction of the molecular events generating each of these mutations is not yet possible.
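To make the notion of breakpoint microhomology concrete, the following sketch measures it for a simple deletion. It assumes that microhomology is the number of bases by which the deletion can be slid left or right along the reference without changing the resulting sequence; the function name and the 0-based, half-open coordinates are choices made for the example rather than details taken from the study.

```python
def microhomology_length(ref, start, end):
    """Length of microhomology at a deletion of ref[start:end].

    Counts how far the deleted interval can be shifted in either
    direction while producing the same post-deletion sequence.
    """
    if not (0 <= start < end <= len(ref)):
        raise ValueError("deletion must lie within the reference")
    size = end - start
    # Shift right: the first deleted base matches the first base after the deletion.
    right = 0
    while right < size and end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    # Shift left: the last deleted base matches the last base before the deletion.
    left = 0
    while left < size and start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1
    return left + right
```

For the reference GGGACTTTTACCCC, deleting positions 5 to 11 or positions 3 to 9 gives the same product, and the function reports 2 bp of microhomology (the shared AC) for either description of the event.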
The concept of specific chemotherapy was developed a century ago by Paul Ehrlich and others. Dyes and arsenical compounds that displayed selectivity against trypanosomes were central to this work 1,2, and the drugs that emerged remain in use for treating Human African Trypanosomiasis (HAT) 3. Ehrlich recognised the importance of understanding the mechanisms underlying selective drug action and resistance for the development of improved HAT therapies, but these mechanisms have remained largely mysterious. Here, we use all five current HAT drugs for genome-scale RNA interference (RNAi) target sequencing (RIT-seq) screens in Trypanosoma brucei, revealing the transporters, organelles, enzymes and metabolic pathways that function to facilitate anti-trypanosomal drug action. RIT-seq profiling identifies both known drug importers 4,5 and the only known pro-drug activator 6, and links more than fifty additional genes to drug action. A specific bloodstream stage invariant surface glycoprotein (ISG75) family mediates suramin uptake while the AP-1 adaptin complex, lysosomal proteases and major lysosomal transmembrane protein, as well as spermidine and N-acetylglucosamine biosynthesis all contribute to suramin action. Further screens link ubiquinone availability to nitro-drug action, plasma membrane P-type H+-ATPases to pentamidine action, and trypanothione and multiple putative kinases to melarsoprol action. We also demonstrate a major role for aquaglyceroporins in pentamidine and melarsoprol cross-resistance. These advances in our understanding of mechanisms of anti-trypanosomal drug efficacy and resistance will aid the rational design of new therapies and help to combat drug resistance, and provide unprecedented levels of molecular insight into the mode of action of anti-trypanosomal drugs.
DFMO; eflornithine; ISG75; nifurtimox; RNAi
Functional studies will facilitate characterization of role and essentiality of newly available genome sequences of the human schistosomes, Schistosoma mansoni, S. japonicum and S. haematobium. To develop transgenesis as a functional approach for these pathogens, we previously demonstrated that pseudotyped murine leukemia virus (MLV) can transduce schistosomes leading to chromosomal integration of reporter transgenes and short hairpin RNA cassettes. Here we investigated vertical transmission of transgenes through the developmental cycle of S. mansoni after introducing transgenes into eggs. Although MLV infection of schistosome eggs from mouse livers was efficient in terms of snail infectivity, >10-fold higher transgene copy numbers were detected in cercariae derived from in vitro laid eggs (IVLE). After infecting snails with miracidia from eggs transduced by MLV, sequencing of genomic DNA from cercariae released from the snails also revealed the presence of transgenes, demonstrating that transgenes had been transmitted through the asexual developmental cycle, and thereby confirming germline transgenesis. High-throughput sequencing of genomic DNA from schistosome populations exposed to MLV mapped widespread and random insertion of transgenes throughout the genome, along each of the autosomes and sex chromosomes, validating the utility of this approach for insertional mutagenesis. In addition, the germline-transmitted transgene encoding neomycin phosphotransferase rescued cultured schistosomules from toxicity of the antibiotic G418, and PCR analysis of eggs resulting from sexual reproduction of the transgenic worms in mice confirmed that retroviral transgenes were transmitted to the next (F1) generation. These findings provide the first description of wide-scale, random insertional mutagenesis of chromosomes and of germline transmission of a transgene in schistosomes. 
Transgenic lines of schistosomes expressing antibiotic resistance could advance functional genomics for these significant human pathogens.
Sequence data from this study have been submitted to the European Nucleotide Archive (http://www.ebi.ac.uk/embl) under accession number ERP000379.
Schistosomes, or blood flukes, are responsible for the major neglected tropical disease called schistosomiasis, which afflicts over 200 million people in impoverished regions of the developing world. The genome sequence of these parasites has been decoded. Integration sites of retroviral transgenes into the chromosomes of schistosomes were investigated by high-throughput sequencing. Transgene integrations were mapped to the genome sequence of Schistosoma mansoni. Integrations were distributed apparently randomly across each of the eight chromosomes, including the seven autosomes and the sex chromosomes Z and W. Integration events of transgenes were characterized in chromosomes of cercariae that were progeny of schistosome eggs infected with pseudotyped virions. Also, transgenic cercariae were employed to infect mice and transgenes were detected in the F1 eggs. Together these findings confirmed vertical transmission of transgenes through the schistosome germline, through both the asexual and the sexual reproductive phases of the developmental cycle. Moreover, germline-transmitted retroviral transgenes encoding drug resistance to the aminoglycoside antibiotics allowed schistosomes to survive toxic concentrations of the antibiotic G418. These findings represent the first reports of wide-scale insertional mutagenesis of schistosome chromosomes and vertical transmission of a transgene through the schistosome germline.
TSPY1 is a tandemly-repeated gene on the human Y chromosome forming an array of approximately 21–35 copies. The testicular expression pattern and the inferred function of the TSPY1 protein suggest possible involvement in spermatogenesis. However, data are scarce on TSPY1 copy number variation in different Y lineages and its role in spermatogenesis.
We sought to define: 1) the extent of TSPY1 copy number variation within and among Y chromosome haplogroups; and 2) the role of TSPY1 dosage in spermatogenic efficiency.
Materials and Methods
A total of 154 idiopathic infertile men and 130 normozoospermic controls from Central Italy were analyzed. We used a quantitative PCR assay to measure TSPY1 copy number and also defined Y haplogroups in all subjects.
We provide evidence that TSPY1 copy number shows substantial variation among Y haplogroups and thus that population stratification does represent a potential bias in case-control association studies. We also found: 1) a significant positive correlation between TSPY1 copy number and sperm count (P < 0.001); 2) a significant difference in mean TSPY1 copy number between patients and controls (28.4 ± 8.3 vs. 33.9 ± 10.7; P < 0.001); and 3) a 1.5-fold increased risk of abnormal sperm parameters in men with fewer than 33 copies (P < 0.001).
TSPY1 copy number variation significantly influences spermatogenic efficiency. Low TSPY1 copy number is a new risk factor for male infertility with potential clinical consequences.
The insertion sites of the conjugative transposon Tn916 in the anaerobic pathogen Clostridium difficile were determined using Illumina Solexa high-throughput DNA sequencing of Tn916 insertion libraries in two different clinical isolates: 630ΔE, an erythromycin-sensitive derivative of 630 (ribotype 012), and the ribotype 027 isolate R20291, which was responsible for a severe outbreak of C. difficile disease. A consensus 15-bp Tn916 insertion sequence was identified which was similar in both strains, although an extended consensus sequence was observed in R20291. A search of the C. difficile 630 genome showed that the Tn916 insertion motif was present 100,987 times, with approximately 63,000 of these motifs located in genes and 35,000 in intergenic regions. To test the usefulness of Tn916 as a mutagen, a functional screen allowed the isolation of a mutant. This mutant contained Tn916 inserted into a gene involved in flagellar biosynthesis.
Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) has rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage particularly across AT and GC rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content sequences.
We have used our optimized conditions in parallel with standard methods to prepare Illumina sequencing libraries from a non-clinical and a clinical isolate (containing ~53% host contamination). By analyzing and comparing the quality of sequence data generated, we show that our optimized conditions, which involve a PCR additive (TMAC), produce amplified libraries with improved coverage of extremely AT-rich regions and reduced bias toward GC neutral templates.
We have developed a robust and optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes. The new amplification conditions significantly reduce bias and retain the complexity of both extremes of base composition. This development will greatly benefit sequencing of clinical samples that often require amplification due to the low mass of DNA starting material.
Next-Generation Sequencing; Illumina; Library; Plasmodium falciparum; AT-rich; Malaria; Clinical isolate; PCR; Tetramethylammonium chloride; PCR-free; Isothermal; Linear; Exponential
We have investigated whether regions of the genome showing signs of positive selection in scans based on haplotype structure also show evidence of positive selection when sequence-based tests are applied, whether the target of selection can be localized more precisely, and whether such extra evidence can lead to increased biological insights. We used two tools: simulations under neutrality or selection, and experimental investigation of two regions identified by the HapMap2 project as putatively selected in human populations. Simulations suggested that neutral and selected regions should be readily distinguished and that it should be possible to localize the selected variant to within 40 kb at least half of the time. Re-sequencing of two ~300 kb regions (chr4:158Mb and chr10:22Mb) lacking known targets of selection in HapMap CHB individuals provided strong evidence for positive selection within each and suggested the micro-RNA gene hsa-miR-548c as the best candidate target in one region, and changes in regulation of the sperm protein gene SPAG6 in the other.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1111-9) contains supplementary material, which is available to authorized users.
Dynamic modification of histone proteins plays a key role in regulating gene expression. However, histones themselves can also be dynamic, which potentially affects the stability of histone modifications. To determine the molecular mechanisms of histone turnover, we developed a parallel screening method for epigenetic regulators by analyzing chromatin states on DNA barcodes. Histone turnover was quantified by employing a genetic pulse-chase technique called RITE, which was combined with chromatin immunoprecipitation and high-throughput sequencing. In this screen, the NuB4/HAT-B complex, containing the conserved type B histone acetyltransferase Hat1, was found to promote histone turnover. Unexpectedly, the three members of this complex could be functionally separated from each other as well as from the known interacting factor and histone chaperone Asf1. Thus, systematic and direct interrogation of chromatin structure on DNA barcodes can lead to the discovery of genes and pathways involved in chromatin modification and dynamics.
Packaging of eukaryotic genomes by the histone proteins influences many processes that use the DNA, such as transcription, repair, and replication. One well-known mechanism of regulation of histone function is the covalent modification of histone proteins. Replacement of modified histones by new histones has recently emerged as an additional layer of regulation (hereafter referred to as histone turnover). Although histone replacement can affect substantial parts of eukaryotic genomes, the mechanisms that control histone exchange are largely unknown. Here, we report a screening method for epigenetic regulators that we applied to search for histone exchange factors. The screening method is based on our finding that global chromatin changes in mutant cells can be inferred from chromatin states on short DNA barcodes. By analyzing the chromatin status of DNA barcodes of many yeast mutants in parallel, we identified positive and negative regulators of histone exchange. In particular, we find that the HAT-B complex promotes histone turnover. HAT-B is known to acetylate the tails of newly synthesized histones, but its role in chromatin assembly has been unclear. Hif1, the nuclear binding partner of HAT-B in the NuB4 complex, also promotes histone exchange but by non-overlapping mechanisms. These results provide a new perspective on pathways of histone exchange.
Massively parallel sequencing of transposon-flanking regions assigned the genotype and fitness score to 91% of Escherichia coli O157:H7 mutants previously screened in cattle by signature-tagged mutagenesis (STM). The method obviates the limitations of STM and markedly extends the functional annotation of the prototype E. coli O157:H7 genome without further animal use.
The development of technologies that allow the stable delivery of large genomic DNA fragments in mammalian systems is important for genetic studies as well as for applications in gene therapy. DNA transposons have emerged as flexible and efficient molecular vehicles to mediate stable cargo transfer. However, the ability to carry DNA fragments >10 kb is limited in most DNA transposons. Here, we show that the DNA transposon piggyBac can mobilize 100-kb DNA fragments in mouse embryonic stem (ES) cells, making it the only known transposon with such a large cargo capacity. The integrity of the cargo is maintained during transposition, the copy number can be controlled and the inserted giant transposons express the genomic cargo. Furthermore, these 100-kb transposons can also be excised from the genome without leaving a footprint. The development of piggyBac as a large cargo vector will facilitate a wider range of genetic and genomic applications.
We have surveyed 15 high-altitude adaptation candidate genes for signals of positive selection in North Caucasian highlanders using targeted re-sequencing. A total of 49 unrelated Daghestani from three ethnic groups (Avars, Kubachians, and Laks) living in ancient villages located at around 2,000 m above sea level were chosen as the study population. Caucasian (Adygei living at sea level, N = 20) and CEU (CEPH Utah residents with ancestry from northern and western Europe; N = 20) were used as controls. Candidate genes were compared with 20 putatively neutral control regions resequenced in the same individuals. The regions of interest were amplified by long-PCR, pooled according to individual, indexed by adding an eight-nucleotide tag, and sequenced using the Illumina GAII platform. 1,066 SNPs were called using false discovery and false negative thresholds of ~6%. The neutral regions provided an empirical null distribution to compare with the candidate genes for signals of selection. Two genes stood out. In Laks, a non-synonymous variant within HIF1A already known to be associated with improvement in oxygen metabolism was rediscovered, and in Kubachians a cluster of 13 SNPs located in a conserved intronic region within EGLN1 showing high population differentiation was found. These variants illustrate both the common pathways of adaptation to high altitude in different populations and features specific to the Daghestani populations, showing how even a mildly hypoxic environment can lead to genetic adaptation.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1084-8) contains supplementary material, which is available to authorized users.
Sequencing the coding regions, the exome, of the human genome is one of the major current strategies to identify low frequency and rare variants associated with human disease traits. So far, the most widely used commercial exome capture reagents have mainly targeted the consensus coding sequence (CCDS) database. We report the design of an extended set of targets for capturing the complete human exome, based on annotation from the GENCODE consortium. The extended set covers an additional 5594 genes and 10.3 Mb compared with the current CCDS-based sets. The additional regions include potential disease genes previously inaccessible to exome resequencing studies, such as 43 genes linked to ion channel activity and 70 genes linked to protein kinase activity. In total, the new GENCODE exome set developed here covers 47.9 Mb and performed well in sequence capture experiments. In the sample set used in this study, we identified over 5000 more SNP variants (24%) with the GENCODE exome target than with CCDS-based exome sequencing.
human exome; resequencing; GENCODE
The unique composition and spatial arrangement of RNA-binding proteins (RBPs) on a transcript guide the diverse aspects of post-transcriptional regulation1. Therefore, an essential step towards understanding transcript regulation at the molecular level is to gain positional information on the binding sites of RBPs2.
Protein-RNA interactions can be studied using biochemical methods, but these approaches do not address RNA binding in its native cellular context. Initial attempts to study protein-RNA complexes in their cellular environment employed affinity purification or immunoprecipitation combined with differential display or microarray analysis (RIP-CHIP)3-5. These approaches were prone to identifying indirect or non-physiological interactions6. In order to increase the specificity and positional resolution, a strategy referred to as CLIP (UV cross-linking and immunoprecipitation) was introduced7,8. CLIP combines UV cross-linking of proteins and RNA molecules with rigorous purification schemes including denaturing polyacrylamide gel electrophoresis. In combination with high-throughput sequencing technologies, CLIP has proven to be a powerful tool to study protein-RNA interactions on a genome-wide scale (referred to as HITS-CLIP or CLIP-seq)9,10. Recently, PAR-CLIP, which uses photoreactive ribonucleoside analogs for cross-linking, was introduced11,12.
Despite the high specificity of the obtained data, CLIP experiments often generate cDNA libraries of limited sequence complexity. This is partly due to the restricted amount of co-purified RNA and the two inefficient RNA ligation reactions required for library preparation. In addition, primer extension assays indicated that many cDNAs truncate prematurely at the cross-linked nucleotide13. Such truncated cDNAs are lost during the standard CLIP library preparation protocol. We recently developed iCLIP (individual-nucleotide resolution CLIP), which captures the truncated cDNAs by replacing one of the inefficient intermolecular RNA ligation steps with a more efficient intramolecular cDNA circularization (Figure 1)14. Importantly, sequencing the truncated cDNAs provides insights into the position of the cross-link site at nucleotide resolution. We successfully applied iCLIP to study hnRNP C particle organization on a genome-wide scale and to assess its role in splicing regulation14.
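As an illustration of how truncated cDNAs can report the cross-link position, the following Python sketch assigns each mapped read to the nucleotide immediately upstream of its 5' end. It assumes that truncation occurs directly at the cross-linked nucleotide, that reads are supplied as 0-based, half-open genomic intervals with a strand, and that PCR duplicates have already been collapsed. The function names and the input format are hypothetical and are not part of the published protocol.

```python
from collections import Counter


def crosslink_position(start, end, strand):
    """Return the 0-based position just upstream of the read's 5' end.

    Assumption: the read occupies the half-open interval [start, end)
    and the cDNA truncated directly at the cross-linked nucleotide.
    """
    if strand == "+":
        # 5' end is at `start`; the upstream nucleotide is one position before it.
        return start - 1
    if strand == "-":
        # 5' end is at `end - 1`; upstream on the minus strand is `end`.
        return end
    raise ValueError("strand must be '+' or '-'")


def count_crosslinks(reads):
    """Count reads per inferred cross-link site, keyed by (chrom, strand, position)."""
    counts = Counter()
    for chrom, start, end, strand in reads:
        counts[(chrom, strand, crosslink_position(start, end, strand))] += 1
    return counts


# Hypothetical reads: (chromosome, start, end, strand)
reads = [
    ("chr1", 100, 130, "+"),
    ("chr1", 100, 125, "+"),
    ("chr1", 200, 240, "-"),
]
crosslink_counts = count_crosslinks(reads)
```

In this toy input, the two plus-strand reads differ in length but share a start, so both are assigned to the same inferred site. This is the sense in which the truncation point, rather than the full read, carries the positional information.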
In the nucleus of eukaryotic cells, nascent transcripts are associated with heterogeneous nuclear ribonucleoprotein (hnRNP) particles that are nucleated by hnRNP C. Despite their abundance, however, it remained unclear whether these particles control pre-mRNA processing. Here, we developed individual-nucleotide resolution UV-cross-linking and immunoprecipitation (iCLIP) to study the role of hnRNP C in splicing regulation. iCLIP data demonstrate that hnRNP C recognizes uridine tracts with a defined long-range spacing consistent with hnRNP particle organization. hnRNP particles assemble on both introns and exons, but remain generally excluded from splice sites. Integration of transcriptome-wide iCLIP data and alternative splicing profiles into an ‘RNA map’ indicates how the positioning of hnRNP particles determines their effect on inclusion of alternative exons. The ability of high-resolution iCLIP data to provide insights into the mechanism of this regulation holds promise for studies of other higher-order ribonucleoprotein complexes.
The combination of chromatin immunoprecipitation with next-generation sequencing technology (ChIP-seq) is a powerful and increasingly popular method for mapping protein–DNA interactions in a genome-wide fashion. The conventional way of analyzing these data is to identify sequencing peaks along the chromosomes that are significantly higher than the read background. For histone modifications and other epigenetic marks, it is often preferable to find a characteristic region of enrichment in sequencing reads relative to gene annotations. For instance, many histone modifications are typically enriched around transcription start sites. Calculating the optimal window that describes this enrichment allows one to quantify modification levels for each individual gene. Using data sets for the H3K9/14ac histone modification in Th cells and an accompanying IgG control, we present an analysis strategy that alternates between single-gene and global data distribution levels and allows a clear distinction between experimental background and signal. Curve fitting permits false discovery rate-based classification of genes as modified versus unmodified. We have developed a software package called EpiChIP that carries out this type of analysis, including integration with and visualization of gene expression data.
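The idea of quantifying a modification per gene from a window around the transcription start site can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the EpiChIP implementation: the window bounds (50 bp upstream, 300 bp downstream) are arbitrary placeholders rather than an optimized window, each read is reduced to a single genomic position, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def window_count(sorted_positions, tss, strand, upstream, downstream):
    """Count read positions inside a strand-aware window around a TSS.

    sorted_positions: ascending list of read positions on one chromosome.
    The window is inclusive at both ends and extends `upstream` bases
    5' of the TSS and `downstream` bases 3' of it, relative to the gene.
    """
    if strand == "+":
        lo, hi = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher genomic coordinates.
        lo, hi = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Binary search keeps each query logarithmic in the number of reads.
    return bisect_right(sorted_positions, hi) - bisect_left(sorted_positions, lo)


# Hypothetical read positions for one chromosome, already sorted
positions = [90, 100, 150, 400, 1200]

plus_gene_signal = window_count(positions, tss=100, strand="+", upstream=50, downstream=300)
minus_gene_signal = window_count(positions, tss=100, strand="-", upstream=50, downstream=300)
```

Running the same function on the control sample would give a per-gene background count for comparison. How the window is optimized and how the background is modeled are left open here.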
Defining the mutational landscape when individuals of a species grow separately and diverge over many generations can provide insights into trait evolution. A specific example of this involves studying changes associated with domestication where different lines of the same wild stock have been cultivated independently in different standard environments. Whole genome sequence comparison of such lines permits estimation of mutation rates, inference of genes' ancestral states and ancestry of existing strains, and correction of sequencing errors in genome databases. Here we study domestication of the C. elegans Bristol strain as a model, and report the genome sequence of LSJ1 (Bristol), a sibling of the standard C. elegans reference wild type N2 (Bristol). The LSJ1 and N2 lines were cultivated separately from shortly after the Bristol strain was isolated until methods to freeze C. elegans were developed. We find that during this time the two strains have accumulated 1208 genetic differences. We describe phenotypic variation between N2 and LSJ1 in the rate at which embryos develop, the rate of production of eggs, the maturity of eggs at laying, and feeding behavior, all the result of post-isolation changes. We infer the ancestral alleles in the original Bristol isolate and highlight 2038 likely sequencing errors in the original N2 reference genome sequence. Many of these changes modify genome annotation. Our study provides a starting point to further investigate genotype-phenotype association and offers insights into the process of selection as a result of laboratory domestication.