In this unit, we describe a set of improvements we have made to the standard Illumina protocols to make the sequencing process more reliable in a high-throughput environment, reduce amplification bias, narrow the distribution of insert sizes, and reliably obtain high yields of data.
Illumina; Next-Generation; Sequencer; Protocols
CpG islands (CGIs) are prominent in the mammalian genome owing to their GC-rich base composition and high density of CpG dinucleotides1,2. Most human gene promoters are embedded within CGIs that lack DNA methylation and coincide with sites of histone H3 lysine 4 trimethylation (H3K4me3), irrespective of transcriptional activity3,4. In spite of these intriguing correlations, the functional significance of non-methylated CGI sequences with respect to chromatin structure and transcription is unknown. By performing a search for proteins that are common to all CGIs, here we show high enrichment for Cfp1, which selectively binds to non-methylated CpGs in vitro5,6. Chromatin immunoprecipitation of a mono-allelically methylated CGI confirmed that Cfp1 specifically associates with non-methylated CpG sites in vivo. High throughput sequencing of Cfp1-bound chromatin identified a notable concordance with non-methylated CGIs and sites of H3K4me3 in the mouse brain. Levels of H3K4me3 at CGIs were markedly reduced in Cfp1-depleted cells, consistent with the finding that Cfp1 associates with the H3K4 methyltransferase Setd1 (refs 7, 8). To test whether non-methylated CpG-dense sequences are sufficient to establish domains of H3K4me3, we analysed artificial CpG clusters that were integrated into the mouse genome. Despite the absence of promoters, the insertions recruited Cfp1 and created new peaks of H3K4me3. The data indicate that a primary function of non-methylated CGIs is to genetically influence the local chromatin modification state by interaction with Cfp1 and perhaps other CpG-binding proteins.
Salmonella Typhi and Typhimurium diverged only ∼50 000 years ago, yet have very different host ranges and pathogenicity. Despite the availability of multiple whole-genome sequences, the genetic differences that have driven these changes in phenotype are only beginning to be understood. In this study, we use transposon-directed insertion-site sequencing to probe differences in gene requirements for competitive growth in rich media between these two closely related serovars. We identify a conserved core of 281 genes that are required for growth in both serovars, 228 of which are essential in Escherichia coli. We are able to identify active prophage elements through the requirement for their repressors. We also find distinct differences in requirements for genes involved in cell surface structure biogenesis and iron utilization. Finally, we demonstrate that transposon-directed insertion-site sequencing is not only applicable to the protein-coding content of the cell but also has sufficient resolution to generate hypotheses regarding the functions of non-coding RNAs (ncRNAs) as well. We are able to assign probable functions to a number of cis-regulatory ncRNA elements, as well as to infer likely differences in trans-acting ncRNA regulatory networks.
When combined with Haplotype Fusion PCR (HF-PCR), Ligation Haplotyping is a robust, high-throughput method for empirical determination of haplotypes, which can be applied to assaying both sequence and structural variation over long distances. Unlike alternative approaches to haplotype determination, such as allele-specific PCR and long PCR, HF-PCR and Ligation Haplotyping do not suffer from mispriming or template switching errors. In this method, HF-PCR is used to juxtapose DNA sequences from single-molecule templates that contain single nucleotide polymorphisms (SNPs) or paralogous sequence variants (PSVs) separated by several kilobases. HF-PCR employs an emulsion-based fusion PCR reaction, which can be performed rapidly and in a 96-well format. Subsequently, a ligation-based assay is performed on the HF-PCR products to determine haplotypes. Products are resolved by capillary electrophoresis. Once optimized, the method is rapid to perform, taking a day and a half to generate phased haplotypes from genomic DNA.
Functional impairment of DNA damage response pathways leads to increased genomic instability. Here we describe the centrosomal protein CEP152 as a new regulator of genomic integrity and cellular response to DNA damage. Using homozygosity mapping and exome sequencing, we identified CEP152 mutations in Seckel syndrome and showed that impaired CEP152 function leads to accumulation of genomic defects resulting from replicative stress through enhanced activation of ATM signaling and increased H2AX phosphorylation.
Precisely characterizing the breakpoints of copy number variants (CNVs) is crucial for assessing their functional impact. However, fewer than 10% of known germline CNVs have been mapped to the single-nucleotide level. We characterized the sequence breakpoints from a dataset of all CNVs detected in three unrelated individuals in previous array-based CNV discovery experiments. We used targeted hybridization-based DNA capture and 454 sequencing to sequence 324 CNV breakpoints, including 315 deletions. We observed two major breakpoint signatures: 70% of the deletion breakpoints have 1–30 bp of microhomology, whereas 33% of deletion breakpoints contain 1–367 bp of inserted sequence. The co-occurrence of microhomology and inserted sequence is low (10%), suggesting that there are at least two different mutational mechanisms. Approximately 5% of the breakpoints represent more complex rearrangements, including local microinversions, suggesting a replication-based strand switching mechanism. Despite a rich literature on DNA repair processes, reconstruction of the molecular events generating each of these mutations is not yet possible.
The concept of specific chemotherapy was developed a century ago by Paul Ehrlich and others. Dyes and arsenical compounds that displayed selectivity against trypanosomes were central to this work 1,2, and the drugs that emerged remain in use for treating Human African Trypanosomiasis (HAT) 3. Ehrlich recognised the importance of understanding the mechanisms underlying selective drug action and resistance for the development of improved HAT therapies, but these mechanisms have remained largely mysterious. Here, we use all five current HAT drugs for genome-scale RNA interference (RNAi) target sequencing (RIT-seq) screens in Trypanosoma brucei, revealing the transporters, organelles, enzymes and metabolic pathways that function to facilitate anti-trypanosomal drug action. RIT-seq profiling identifies both known drug importers 4,5 and the only known pro-drug activator 6, and links more than fifty additional genes to drug action. A specific bloodstream stage invariant surface glycoprotein (ISG75) family mediates suramin uptake while the AP-1 adaptin complex, lysosomal proteases and major lysosomal transmembrane protein, as well as spermidine and N-acetylglucosamine biosynthesis all contribute to suramin action. Further screens link ubiquinone availability to nitro-drug action, plasma membrane P-type H+-ATPases to pentamidine action, and trypanothione and multiple putative kinases to melarsoprol action. We also demonstrate a major role for aquaglyceroporins in pentamidine and melarsoprol cross-resistance. These advances in our understanding of mechanisms of anti-trypanosomal drug efficacy and resistance will aid the rational design of new therapies and help to combat drug resistance, and provide unprecedented levels of molecular insight into the mode of action of anti-trypanosomal drugs.
DFMO; eflornithine; ISG75; nifurtimox; RNAi
Functional studies will facilitate characterization of role and essentiality of newly available genome sequences of the human schistosomes, Schistosoma mansoni, S. japonicum and S. haematobium. To develop transgenesis as a functional approach for these pathogens, we previously demonstrated that pseudotyped murine leukemia virus (MLV) can transduce schistosomes leading to chromosomal integration of reporter transgenes and short hairpin RNA cassettes. Here we investigated vertical transmission of transgenes through the developmental cycle of S. mansoni after introducing transgenes into eggs. Although MLV infection of schistosome eggs from mouse livers was efficient in terms of snail infectivity, >10-fold higher transgene copy numbers were detected in cercariae derived from in vitro laid eggs (IVLE). After infecting snails with miracidia from eggs transduced by MLV, sequencing of genomic DNA from cercariae released from the snails also revealed the presence of transgenes, demonstrating that transgenes had been transmitted through the asexual developmental cycle, and thereby confirming germline transgenesis. High-throughput sequencing of genomic DNA from schistosome populations exposed to MLV mapped widespread and random insertion of transgenes throughout the genome, along each of the autosomes and sex chromosomes, validating the utility of this approach for insertional mutagenesis. In addition, the germline-transmitted transgene encoding neomycin phosphotransferase rescued cultured schistosomules from toxicity of the antibiotic G418, and PCR analysis of eggs resulting from sexual reproduction of the transgenic worms in mice confirmed that retroviral transgenes were transmitted to the next (F1) generation. These findings provide the first description of wide-scale, random insertional mutagenesis of chromosomes and of germline transmission of a transgene in schistosomes. 
Transgenic lines of schistosomes expressing antibiotic resistance could advance functional genomics for these significant human pathogens.
Sequence data from this study have been submitted to the European Nucleotide Archive (http://www.ebi.ac.uk/embl) under accession number ERP000379.
Schistosomes, or blood flukes, are responsible for the major neglected tropical disease called schistosomiasis, which afflicts over 200 million people in impoverished regions of the developing world. The genome sequence of these parasites has been decoded. Integration sites of retroviral transgenes into the chromosomes of schistosomes were investigated by high-throughput sequencing. Transgene integrations were mapped to the genome sequence of Schistosoma mansoni. Integrations were distributed apparently randomly across each of the eight chromosomes, including the seven autosomes and the sex chromosomes Z and W. Integration events of transgenes were characterized in chromosomes of cercariae that were progeny of schistosome eggs infected with pseudotyped virions. Also, transgenic cercariae were employed to infect mice and transgenes were detected in the F1 eggs. Together these findings confirmed vertical transmission of transgenes through the schistosome germline, through both the asexual and the sexual reproductive phases of the developmental cycle. Moreover, germline-transmitted retroviral transgenes encoding drug resistance to the aminoglycoside antibiotics allowed schistosomes to survive toxic concentrations of the antibiotic G418. These findings represent the first reports of wide-scale insertional mutagenesis of schistosome chromosomes and vertical transmission of a transgene through the schistosome germline.
TSPY1 is a tandemly-repeated gene on the human Y chromosome forming an array of approximately 21–35 copies. The testicular expression pattern and the inferred function of the TSPY1 protein suggest possible involvement in spermatogenesis. However, data are scarce on TSPY1 copy number variation in different Y lineages and its role in spermatogenesis.
We sought to define: 1) the extent of TSPY1 copy number variation within and among Y chromosome haplogroups; and 2) the role of TSPY1 dosage in spermatogenic efficiency.
Materials and Methods
A total of 154 idiopathic infertile men and 130 normozoospermic controls from Central Italy were analyzed. We used a quantitative PCR assay to measure TSPY1 copy number and also defined Y haplogroups in all subjects.
We provide evidence that TSPY1 copy number shows substantial variation among Y haplogroups and thus that population stratification does represent a potential bias in case-control association studies. We also found: 1) a significant positive correlation between TSPY1 copy number and sperm count (P < 0.001); 2) a significant difference in mean TSPY1 copy number between patients and controls (28.4 ± 8.3 vs. 33.9 ± 10.7; P < 0.001); and 3) a 1.5-fold increased risk of abnormal sperm parameters in men with fewer than 33 copies (P < 0.001).
TSPY1 copy number variation significantly influences spermatogenic efficiency. Low TSPY1 copy number is a new risk factor for male infertility with potential clinical consequences.
The insertion sites of the conjugative transposon Tn916 in the anaerobic pathogen Clostridium difficile were determined using Illumina Solexa high-throughput DNA sequencing of Tn916 insertion libraries in two different clinical isolates: 630ΔE, an erythromycin-sensitive derivative of 630 (ribotype 012), and the ribotype 027 isolate R20291, which was responsible for a severe outbreak of C. difficile disease. A consensus 15-bp Tn916 insertion sequence was identified which was similar in both strains, although an extended consensus sequence was observed in R20291. A search of the C. difficile 630 genome showed that the Tn916 insertion motif was present 100,987 times, with approximately 63,000 of these motifs located in genes and 35,000 in intergenic regions. To test the usefulness of Tn916 as a mutagen, a functional screen allowed the isolation of a mutant. This mutant contained Tn916 inserted into a gene involved in flagellar biosynthesis.
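The genome-wide motif search described above can be illustrated with a minimal Python sketch that counts matches of a degenerate (IUPAC) motif on both strands of a sequence. The function names and the toy inputs are hypothetical, and the actual 15-bp Tn916 consensus is not reproduced here; this is not the search procedure used in the study.

```python
import re

# IUPAC degenerate-base codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_motif(seq, motif):
    """Count overlapping matches of an IUPAC motif on both strands.

    Note: a motif that is its own reverse complement is counted
    once per strand, i.e. twice per site.
    """
    # A zero-width lookahead lets overlapping matches be counted.
    pattern = re.compile("(?=" + "".join(IUPAC[b] for b in motif) + ")")
    forward = len(pattern.findall(seq))
    reverse = len(pattern.findall(reverse_complement(seq)))
    return forward + reverse
```

Classifying each hit as genic or intergenic, as in the abstract, would additionally require gene coordinates from an annotation file, which this sketch does not handle.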
Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) has rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage particularly across AT and GC rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content sequences.
We have used our optimized conditions in parallel with standard methods to prepare Illumina sequencing libraries from a non-clinical and a clinical isolate (containing ~53% host contamination). By analyzing and comparing the quality of the sequence data generated, we show that our optimized conditions, which involve a PCR additive (TMAC), produce amplified libraries with improved coverage of extremely AT-rich regions and reduced bias toward GC-neutral templates.
We have developed a robust and optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes. The new amplification conditions significantly reduce bias and retain the complexity at either extreme of base composition. This development will greatly benefit sequencing of clinical samples, which often require amplification due to the low mass of DNA starting material.
Next-Generation Sequencing; Illumina; Library; Plasmodium falciparum; AT-rich; Malaria; Clinical isolate; PCR; Tetramethylammonium chloride; PCR-free; Isothermal; Linear; Exponential
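One simple way to examine the kind of base-composition bias discussed above is to compare mean read depth across windows of differing GC content. The Python sketch below is a hypothetical illustration, not the analysis used in the study: the function names, window size, and binning scheme are all assumptions, and per-base depth is assumed to be available as a list aligned with the reference sequence.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def coverage_by_gc(reference, depth, window=100, n_bins=10):
    """Mean depth per GC-content bin over non-overlapping windows.

    reference: reference sequence string
    depth:     per-base read depth, same length as reference
    Returns {bin_index: mean depth}; bin 0 is the most AT-rich.
    """
    if len(reference) != len(depth):
        raise ValueError("reference and depth must have equal length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for start in range(0, len(reference) - window + 1, window):
        gc = gc_fraction(reference[start:start + window])
        # Clamp so that a window with 100% GC falls in the last bin.
        gc_bin = min(int(gc * n_bins), n_bins - 1)
        sums[gc_bin] += sum(depth[start:start + window]) / window
        counts[gc_bin] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}
```

A library with little composition bias would be expected to show roughly similar mean depth across bins, whereas a biased library would show depleted depth in the extreme bins.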
We have investigated whether regions of the genome showing signs of positive selection in scans based on haplotype structure also show evidence of positive selection when sequence-based tests are applied, whether the target of selection can be localized more precisely, and whether such extra evidence can lead to increased biological insights. We used two tools: simulations under neutrality or selection, and experimental investigation of two regions identified by the HapMap2 project as putatively selected in human populations. Simulations suggested that neutral and selected regions should be readily distinguished and that it should be possible to localize the selected variant to within 40 kb at least half of the time. Re-sequencing of two ~300 kb regions (chr4:158Mb and chr10:22Mb) lacking known targets of selection in HapMap CHB individuals provided strong evidence for positive selection within each and suggested the micro-RNA gene hsa-miR-548c as the best candidate target in one region, and changes in regulation of the sperm protein gene SPAG6 in the other.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1111-9) contains supplementary material, which is available to authorized users.
Massively parallel sequencing of transposon-flanking regions assigned the genotype and fitness score to 91% of Escherichia coli O157:H7 mutants previously screened in cattle by signature-tagged mutagenesis (STM). The method obviates the limitations of STM and markedly extended the functional annotation of the prototype E. coli O157:H7 genome without further animal use.
The development of technologies that allow the stable delivery of large genomic DNA fragments in mammalian systems is important for genetic studies as well as for applications in gene therapy. DNA transposons have emerged as flexible and efficient molecular vehicles to mediate stable cargo transfer. However, the ability to carry DNA fragments >10 kb is limited in most DNA transposons. Here, we show that the DNA transposon piggyBac can mobilize 100-kb DNA fragments in mouse embryonic stem (ES) cells, making it the only known transposon with such a large cargo capacity. The integrity of the cargo is maintained during transposition, the copy number can be controlled and the inserted giant transposons express the genomic cargo. Furthermore, these 100-kb transposons can also be excised from the genome without leaving a footprint. The development of piggyBac as a large cargo vector will facilitate a wider range of genetic and genomic applications.
We have surveyed 15 high-altitude adaptation candidate genes for signals of positive selection in North Caucasian highlanders using targeted re-sequencing. A total of 49 unrelated Daghestani from three ethnic groups (Avars, Kubachians, and Laks) living in ancient villages located at around 2,000 m above sea level were chosen as the study population. Caucasian (Adygei living at sea level, N = 20) and CEU (CEPH Utah residents with ancestry from northern and western Europe; N = 20) were used as controls. Candidate genes were compared with 20 putatively neutral control regions resequenced in the same individuals. The regions of interest were amplified by long-PCR, pooled according to individual, indexed by adding an eight-nucleotide tag, and sequenced using the Illumina GAII platform. 1,066 SNPs were called using false discovery and false negative thresholds of ~6%. The neutral regions provided an empirical null distribution to compare with the candidate genes for signals of selection. Two genes stood out. In Laks, a non-synonymous variant within HIF1A already known to be associated with improvement in oxygen metabolism was rediscovered, and in Kubachians a cluster of 13 SNPs located in a conserved intronic region within EGLN1 showing high population differentiation was found. These variants illustrate both the common pathways of adaptation to high altitude in different populations and features specific to the Daghestani populations, showing how even a mildly hypoxic environment can lead to genetic adaptation.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1084-8) contains supplementary material, which is available to authorized users.
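The per-individual indexing with an eight-nucleotide tag described in the preceding abstract implies a demultiplexing step before variant calling. The following Python sketch shows one simple form of that step under stated assumptions: the tag is assumed to be the first eight bases of each read, matching is exact, and the tag sequences and sample names are invented for illustration. The study's actual tag design and read layout are not given in the text.

```python
from collections import defaultdict

TAG_LENGTH = 8  # eight-nucleotide index, as stated in the abstract


def demultiplex(reads, tag_to_sample):
    """Group reads by an exact match of their leading tag.

    reads:         iterable of read sequences (tag assumed at the 5' end)
    tag_to_sample: dict mapping tag sequence -> sample name (hypothetical)
    Returns {sample: [read with tag removed, ...]}; reads whose tag is
    not recognised are collected under "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:TAG_LENGTH], read[TAG_LENGTH:]
        sample = tag_to_sample.get(tag, "unassigned")
        bins[sample].append(insert)
    return dict(bins)
```

Exact matching discards reads with a sequencing error in the tag; tolerating one mismatch is a common refinement but requires tags that differ from each other at several positions.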
Sequencing the coding regions, the exome, of the human genome is one of the major current strategies to identify low frequency and rare variants associated with human disease traits. So far, the most widely used commercial exome capture reagents have mainly targeted the consensus coding sequence (CCDS) database. We report the design of an extended set of targets for capturing the complete human exome, based on annotation from the GENCODE consortium. The extended set covers an additional 5594 genes and 10.3 Mb compared with the current CCDS-based sets. The additional regions include potential disease genes previously inaccessible to exome resequencing studies, such as 43 genes linked to ion channel activity and 70 genes linked to protein kinase activity. In total, the new GENCODE exome set developed here covers 47.9 Mb and performed well in sequence capture experiments. In the sample set used in this study, we identified over 5000 SNP variants more in the GENCODE exome target (24%) than in the CCDS-based exome sequencing.
human exome; resequencing; GENCODE
The unique composition and spatial arrangement of RNA-binding proteins (RBPs) on a transcript guide the diverse aspects of post-transcriptional regulation1. Therefore, an essential step towards understanding transcript regulation at the molecular level is to gain positional information on the binding sites of RBPs2.
Protein-RNA interactions can be studied using biochemical methods, but these approaches do not address RNA binding in its native cellular context. Initial attempts to study protein-RNA complexes in their cellular environment employed affinity purification or immunoprecipitation combined with differential display or microarray analysis (RIP-CHIP)3-5. These approaches were prone to identifying indirect or non-physiological interactions6. In order to increase the specificity and positional resolution, a strategy referred to as CLIP (UV cross-linking and immunoprecipitation) was introduced7,8. CLIP combines UV cross-linking of proteins and RNA molecules with rigorous purification schemes including denaturing polyacrylamide gel electrophoresis. In combination with high-throughput sequencing technologies, CLIP has proven to be a powerful tool to study protein-RNA interactions on a genome-wide scale (referred to as HITS-CLIP or CLIP-seq)9,10. Recently, PAR-CLIP was introduced, which uses photoreactive ribonucleoside analogs for cross-linking11,12.
Despite the high specificity of the obtained data, CLIP experiments often generate cDNA libraries of limited sequence complexity. This is partly due to the restricted amount of co-purified RNA and the two inefficient RNA ligation reactions required for library preparation. In addition, primer extension assays indicated that many cDNAs truncate prematurely at the crosslinked nucleotide13. Such truncated cDNAs are lost during the standard CLIP library preparation protocol. We recently developed iCLIP (individual-nucleotide resolution CLIP), which captures the truncated cDNAs by replacing one of the inefficient intermolecular RNA ligation steps with a more efficient intramolecular cDNA circularization (Figure 1)14. Importantly, sequencing the truncated cDNAs provides insights into the position of the cross-link site at nucleotide resolution. We successfully applied iCLIP to study hnRNP C particle organization on a genome-wide scale and assess its role in splicing regulation14.
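Because the truncated cDNAs end at the cross-linked nucleotide, the cross-link position can be inferred from where each read begins. The Python sketch below illustrates this under explicit assumptions: alignments use 0-based, half-open coordinates, and the cross-link site is taken to be the nucleotide immediately upstream of the read's 5' end on the transcribed strand. The function name and coordinate convention are hypothetical and not taken from the protocol text.

```python
def crosslink_site(start, end, strand):
    """Infer the cross-link position from one aligned truncated cDNA read.

    start, end: 0-based, half-open alignment coordinates [start, end)
    strand:     "+" or "-" (strand of the transcript)
    Assumes the cross-linked nucleotide lies one position upstream of
    the read's 5' end.
    """
    if strand == "+":
        # 5' end is at `start`; upstream is the preceding genomic base.
        return start - 1
    if strand == "-":
        # 5' end is at `end - 1`; upstream on the minus strand is `end`.
        return end
    raise ValueError("strand must be '+' or '-'")
```

Counting how many reads map to each inferred position then gives a per-nucleotide cross-link profile.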
In the nucleus of eukaryotic cells, nascent transcripts are associated with heterogeneous nuclear ribonucleoprotein (hnRNP) particles that are nucleated by hnRNP C. Despite their abundance, however, it remained unclear whether these particles control pre-mRNA processing. Here, we developed individual-nucleotide resolution UV-cross-linking and immunoprecipitation (iCLIP) to study the role of hnRNP C in splicing regulation. iCLIP data demonstrate that hnRNP C recognizes uridine tracts with a defined long-range spacing consistent with hnRNP particle organization. hnRNP particles assemble on both introns and exons, but remain generally excluded from splice sites. Integration of transcriptome-wide iCLIP data and alternative splicing profiles into an ‘RNA map’ indicates how the positioning of hnRNP particles determines their effect on inclusion of alternative exons. The ability of high-resolution iCLIP data to provide insights into the mechanism of this regulation holds promise for studies of other higher-order ribonucleoprotein complexes.
The combination of chromatin immunoprecipitation with next-generation sequencing technology (ChIP-seq) is a powerful and increasingly popular method for mapping protein–DNA interactions in a genome-wide fashion. The conventional way of analyzing this data is to identify sequencing peaks along the chromosomes that are significantly higher than the read background. For histone modifications and other epigenetic marks, it is often preferable to find a characteristic region of enrichment in sequencing reads relative to gene annotations. For instance, many histone modifications are typically enriched around transcription start sites. Calculating the optimal window that describes this enrichment allows one to quantify modification levels for each individual gene. Using data sets for the H3K9/14ac histone modification in Th cells and an accompanying IgG control, we present an analysis strategy that alternates between single gene and global data distribution levels and allows a clear distinction between experimental background and signal. Curve fitting permits false discovery rate-based classification of genes as modified versus unmodified. We have developed a software package called EpiChIP that carries out this type of analysis, including integration with and visualization of gene expression data.
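The idea of choosing a window of enrichment relative to gene annotations can be sketched as follows. This is a hypothetical illustration and not the EpiChIP algorithm: it assumes read counts have already been averaged over all genes into bins positioned relative to the transcription start site, for both the ChIP sample and the control, and it selects the contiguous run of bins with the largest summed ChIP-minus-control signal using a maximum-subarray scan (Kadane's algorithm).

```python
def best_window(chip, control):
    """Find the contiguous bin range with the largest summed enrichment.

    chip, control: equal-length sequences of binned read counts,
                   ordered by position relative to the TSS
    Returns (start, end, score) for the half-open bin range [start, end);
    (0, 0, 0.0) means no bin shows net enrichment.
    """
    if len(chip) != len(control):
        raise ValueError("profiles must have equal length")
    best = (0, 0, 0.0)            # empty window scores zero
    run_start, run_sum = 0, 0.0
    for i, (c, b) in enumerate(zip(chip, control)):
        if run_sum <= 0:
            # A non-positive running sum cannot help; restart here.
            run_start, run_sum = i, 0.0
        run_sum += c - b
        if run_sum > best[2]:
            best = (run_start, i + 1, run_sum)
    return best


def gene_signal(gene_profile, window):
    """Sum one gene's binned counts inside the chosen window."""
    start, end, _ = window
    return sum(gene_profile[start:end])
```

Once a global window is fixed, `gene_signal` gives a per-gene modification level that could be compared against the same quantity computed from the control.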
Defining the mutational landscape when individuals of a species grow separately and diverge over many generations can provide insights into trait evolution. A specific example of this involves studying changes associated with domestication where different lines of the same wild stock have been cultivated independently in different standard environments. Whole genome sequence comparison of such lines permits estimation of mutation rates, inference of genes' ancestral states and ancestry of existing strains, and correction of sequencing errors in genome databases. Here we study domestication of the C. elegans Bristol strain as a model, and report the genome sequence of LSJ1 (Bristol), a sibling of the standard C. elegans reference wild type N2 (Bristol). The LSJ1 and N2 lines were cultivated separately from shortly after the Bristol strain was isolated until methods to freeze C. elegans were developed. We find that during this time the two strains have accumulated 1208 genetic differences. We describe phenotypic variation between N2 and LSJ1 in the rate at which embryos develop, the rate of production of eggs, the maturity of eggs at laying, and feeding behavior, all the result of post-isolation changes. We infer the ancestral alleles in the original Bristol isolate and highlight 2038 likely sequencing errors in the original N2 reference genome sequence. Many of these changes modify genome annotation. Our study provides a starting point to further investigate genotype-phenotype association and offers insights into the process of selection as a result of laboratory domestication.
CpG islands (CGIs) are vertebrate genomic landmarks that encompass the promoters of most genes and often lack DNA methylation. Querying their apparent importance, the number of CGIs is reported to vary widely in different species and many do not co-localise with annotated promoters. We set out to quantify the number of CGIs in mouse and human genomes using CXXC Affinity Purification plus deep sequencing (CAP-seq). We also asked whether CGIs not associated with annotated transcripts share properties with those at known promoters. We found that, contrary to previous estimates, CGI abundance in humans and mice is very similar and many are at conserved locations relative to genes. In each species CpG density correlates positively with the degree of H3K4 trimethylation, supporting the hypothesis that these two properties are mechanistically interdependent. Approximately half of mammalian CGIs (>10,000) are “orphans” that are not associated with annotated promoters. Many orphan CGIs show evidence of transcriptional initiation and dynamic expression during development. Unlike CGIs at known promoters, orphan CGIs are frequently subject to DNA methylation during development, and this is accompanied by loss of their active promoter features. In colorectal tumors, however, orphan CGIs are not preferentially methylated, suggesting that cancer does not recapitulate a developmental program. Human and mouse genomes have similar numbers of CGIs, over half of which are remote from known promoters. Orphan CGIs nevertheless have the characteristics of functional promoters, though they are much more likely than promoter CGIs to become methylated during development and hence lose these properties. The data indicate that orphan CGIs correspond to previously undetected promoters whose transcriptional activity may play a functional role during development.
In the decade since the sequence of the human genome was announced, efforts have been made to annotate all genes with their regulatory sequences. CpG islands are short regions containing the sequence CG at high density that map to regions controlling the expression of most human genes (known as promoters). Using a biochemical method, we have identified and mapped all CpG islands in the human and mouse genomes and find that over half are remote from known gene promoters—so-called “orphans.” Mice, which were thought to possess far fewer CpG islands than humans, turn out to have a very similar number. Surprisingly, orphan CpG islands in both species often mark hitherto unknown promoters. The activity of these novel promoters is particularly dynamic during normal development, as they are often silenced by DNA methylation. In colorectal cancers, however, aberrant DNA methylation affects all CpG islands equally.
We report an alternative approach to transcriptome sequencing for the Illumina Genome Analyzer, in which the reverse transcription reaction takes place on the flowcell. No amplification is performed during the library preparation, so PCR biases and duplicates are avoided. Since the template is poly A+ RNA rather than cDNA, the resulting sequences are necessarily strand-specific. The method is compatible with paired- or single-ended sequencing.
High-throughput sequencing of cDNA has been used to study eukaryotic transcription on a genome-wide scale to single base pair resolution. In order to compensate for the high ribonuclease activity in bacterial cells, we have devised an equivalent technique optimized for studying complete prokaryotic transcriptomes that minimizes the manipulation of the RNA sample. This new approach uses Illumina technology to sequence single-stranded (ss) cDNA, generating information on both the direction and level of transcription throughout the genome. The protocol, and associated data analysis programs, are freely available from http://www.sanger.ac.uk/Projects/Pathogens/Transcriptome/. We have successfully applied this method to the bacterial pathogens Salmonella bongori and Streptococcus pneumoniae and the yeast Schizosaccharomyces pombe. This method enables experimental validation of genetic features predicted in silico and allows the easy identification of novel transcripts throughout the genome. We also show that there is a high correlation between the level of gene expression calculated from ss-cDNA and double-stranded-cDNA sequencing, indicating that ss-cDNA sequencing is both robust and appropriate for use in quantitative studies of transcription. Hence, this simple method should prove a useful tool in aiding genome annotation and gene expression studies in both prokaryotes and eukaryotes.
Amplification artifacts introduced during library preparation for the Illumina Genome Analyzer increase the likelihood that an appreciable proportion of these sequences will be duplicates, and cause an uneven distribution of read coverage across the targeted sequencing regions. As a consequence, these unfavorable features result in difficulties in genome assembly and variation analysis from the short reads, particularly when the sequences are from genomes with base compositions at the extremes of high or low GC content. Here we present an amplification-free method of library preparation, in which the cluster amplification step, rather than the polymerase chain reaction, enriches for fully ligated template strands, reducing the incidence of duplicate sequences, improving read mapping and SNP calling and aiding de novo assembly. We illustrate this by generating and analysing DNA sequences from extremely GC-poor (Plasmodium falciparum), GC-neutral (Escherichia coli) and high GC (Bordetella pertussis) genomes.
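The duplicate sequences referred to above are commonly estimated from alignment coordinates. The Python sketch below shows one simple approach under stated assumptions: each read is represented by a hypothetical (chromosome, start, strand) tuple, and reads sharing all three values are treated as duplicates of a single original fragment. This is an illustration only, not the analysis performed in the study, and for paired-end data the mate's position would normally be included in the key.

```python
from collections import Counter


def duplicate_fraction(alignments):
    """Estimate the fraction of reads that are positional duplicates.

    alignments: iterable of (chrom, start, strand) tuples, one per read
    For each distinct tuple, the first read is kept and every further
    read is counted as a duplicate.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)
    return duplicates / total
```

At very high depth on a small genome, independent fragments can share a start position by chance, so this estimate is an upper bound on amplification-derived duplicates.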
Next-generation sequencing technologies are revolutionizing biology by allowing for genome-wide transcription factor binding-site profiling, transcriptome sequencing, and more recently, whole-genome resequencing. While it is currently not possible to generate complete de novo assemblies of higher-vertebrate genomes using next-generation sequencing, improvements in sequence read lengths and throughput, coupled with new assembly algorithms for large data sets, will soon make this a reality. These developments will in turn spawn a revolution in how genomic data are used to understand genetics and how model organisms are used for disease gene discovery. This review provides an overview of the current next-generation sequencing platforms and the newest computational tools for the analysis of next-generation sequencing data. We also describe how next-generation sequencing may be applied in the context of vertebrate model organism genetics.