PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bioinfoLink to Publisher's site
 
Bioinformatics. 2013 April 15; 29(8): 1076–1077.
Published online 2013 February 14. doi:  10.1093/bioinformatics/btt074
PMCID: PMC3624799

Wessim: a whole-exome sequencing simulator based on in silico exome capture

Abstract

Summary: We propose a targeted re-sequencing simulator Wessim that generates synthetic exome sequencing reads from a given sample genome. Wessim emulates conventional exome capture technologies, including Agilent’s SureSelect and NimbleGen’s SeqCap, to generate DNA fragments from genomic target regions. The target regions can be either specified by genomic coordinates or inferred from in silico probe hybridization. Coupled with existing next-generation sequencing simulators, Wessim generates a realistic artificial exome sequencing data, which is essential for developing and evaluating exome-targeted variant callers.

Availability: Source code and the packaged version of Wessim with manuals are available at http://sak042.github.com/Wessim/.

Contact: sak042/at/cs.ucsd.edu or vbafna/at/cs.ucsd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Generation of simulated next-generation sequencing (NGS) reads serves the essential gold standard in developing and evaluating variant detection algorithms. Many tools, including wgsim (Li et al., 2009), have been developed for artificial NGS read generation. Recent studies, such as ART (Huang et al., 2012), pIRS (Hu et al., 2012) and GemSim (McElroy et al., 2012), have accomplished more realistic simulation by reproducing known biases coming from sequence context and empirical platform-dependent errors.

Although the current simulators mainly focus on realistic NGS read generation, we address another important problem on specifying target regions. Whole-exome sequencing (WES) is currently being regarded as a superior option for disease variant finding studies; it is cost-effective, much smaller in size and easier to interpret results. As typical statistics of whole-exome sequencing data (e.g. coverage, read distribution and bias) are distinct from that of whole-genome sequencing data, variant detection tools are usually required to calibrate their algorithms for practical use of exome data (Krumm et al., 2012; Sathirapongsasuti et al., 2011). This naturally demands more realistic simulation of exome sequencing to promote accuracy of variant calling tools by providing a statistical base for performance evaluation.

Wessim emulates the exome capture procedure that forms the basis of major commercial solutions (such as Agilent’s SureSelect and Roche/NimbleGen’s SeqCap) by implementing (i) DNA shearing to generate random fragments, (ii) probe capture by hybridization and (iii) single- or paired-end sequencing of the selected fragments. Other important features including fragment length and GC-content were rigorously considered to reproduce more accurate coverage biases. We compared our synthetic data with real WES data to confirm the similarity of major statistics. The exome capture process is highly optimized so that the overall running time is only bounded by the NGS read generation step.

2 METHODS

Wessim takes (i) a FASTA file of sample genome sequence such as the human reference assembly and (ii) genomic target regions. Wessim first generates random DNA fragments from designated target regions, which is specified either by a BED file that contains target regions’ coordinates or a set of probe sequences that are used for capturing fragments. Each fragment is further filtered by length and GC-content to reproduce potential biases. Finally, NGS reads are generated from selected DNA fragments using major emulated platforms.

2.1 Generation of DNA fragments

We define a DNA fragment An external file that holds a picture, illustration, etc.
Object name is btt074i1.jpg of length L(f) and sequence S(f), where cf, sf and ef are the chromosome, start and end position of f’s genomic origin, respectively. Two distinct approaches for the fragment generation are described later in the text.

Ideal target approach: Each exome capture platform manifests its own ideal target regions. For example, Agilent’s SureSelect All Exon V4 targets ~186 kb exonic regions. All BED files are freely available online from vendor’s websites. In this approach, each DNA fragment f is generated within a randomly selected target region An external file that holds a picture, illustration, etc.
Object name is btt074i2.jpg, where c is the chromosome, s and e are start and end position, respectively. The probability of selecting each target region R is proportional to its length An external file that holds a picture, illustration, etc.
Object name is btt074i3.jpg. A slack boundary variable ξ allows a fragment to be generated from an extended interval An external file that holds a picture, illustration, etc.
Object name is btt074i4.jpg.

Probe hybridization approach: This approach implements a probe-level capture procedure for more realistic DNA fragment generation. Agilent’s probe sequences are available at the SureDesign website (https://earray.chem.agilent.com/suredesign/); probe sequence of NimbleGen is not publicly available at this time.

Given an oligonucleotide probe p of sequence S(p), we first retrieve p’s hybridizable regions An external file that holds a picture, illustration, etc.
Object name is btt074i5.jpg, where each sequence An external file that holds a picture, illustration, etc.
Object name is btt074i6.jpg matches S(p) with a good score (e.g. sequence identity An external file that holds a picture, illustration, etc.
Object name is btt074i7.jpg%). The probability of selecting An external file that holds a picture, illustration, etc.
Object name is btt074i8.jpg is inversely proportional to the number of mismatches between An external file that holds a picture, illustration, etc.
Object name is btt074i9.jpg and S(p). To generate a fragment, we first choose a random probe px from the entire probe set and select a random hybridizable region h from An external file that holds a picture, illustration, etc.
Object name is btt074i10.jpg. A fragment f can only be generated when more than a certain fraction of f overlaps the selected hybridizable region, which we defined as a minimum overlap ratio b0. The probability of generating a fragment f from h can be calculated by:

equation image
(1)

where An external file that holds a picture, illustration, etc.
Object name is btt074i11.jpg is the probability of selecting a probe px, An external file that holds a picture, illustration, etc.
Object name is btt074i12.jpg is the conditional probability of selecting h from An external file that holds a picture, illustration, etc.
Object name is btt074i13.jpg given px and b is the fraction of hybridizable region in f.

2.2 Reproducing bias

To emulate the fragment bias on GC-content and fragment length, we implemented a filter as follows. Denote the fractional GC-content as G(f) and the joint distribution of L(f) and G(f) as An external file that holds a picture, illustration, etc.
Object name is btt074i14.jpg. Given f, such that An external file that holds a picture, illustration, etc.
Object name is btt074i15.jpg and An external file that holds a picture, illustration, etc.
Object name is btt074i16.jpg, we retain f with the probability An external file that holds a picture, illustration, etc.
Object name is btt074i17.jpg where An external file that holds a picture, illustration, etc.
Object name is btt074i18.jpg and An external file that holds a picture, illustration, etc.
Object name is btt074i19.jpg is the observed distribution as computed by Benjamini and Speed (2012).

2.3 Sequencing fragments

Generating and sequencing DNA fragments are two separate processes. This ideally enables various existing NGS simulators to be incorporated with Wessim as an independent module. Here, we used an advanced NGS simulator GemSim, which widely supports benchmarked empirical error models of major sequencing platforms (e.g. Solexa, SOLiD and 454). We downloaded and modified the source code of GemSim to convert input unit from genome to fragment while maintaining its core error models.

3 RESULT ANALYSIS

We generated simulated exome data using three different platforms (SureSelect V2, 50M and V4) in two different modes (ideal target and probe hybridization) to compare with real data (Fig. 1A). The read coverage of our simulated data over 288 826 consensus coding exons showed a highly correlated An external file that holds a picture, illustration, etc.
Object name is btt074i21.jpg) pattern with that of the real data. We also confirmed that the probe hybridization approach can reproduce a realistic per exon read distribution, whereas ideal target approach did not (refer Fig. 1B and Supplementary Figs S5 and S6 for full results). In a performance test, Wessim could generate An external file that holds a picture, illustration, etc.
Object name is btt074i22.jpg kb of reads per second including fragment generation and filtering in an 4 core Intel i7-2600K system. This corresponds to ~5.45 h of running time for generating 66 million reads (~two GAIIx lanes).

Fig. 1.
Simulated exome sequencing by Wessim. (A) Read coverage of simulated (upper 3 rows) and real (lower 3 rows) exome sequencing data over 288 826 consensus coding exons are depicted. Wessim’s probe-based simulation reproduced the coverage distribution ...

Supplementary Material

Supplementary Data:

ACKNOWLEDGEMENT

The authors thank Dr Terry Speed and Yuval Benjamini for providing their GC-content bias data.

Funding: National Institute of Health (U54-HL108460, 5R01-H6004962); National Science Foundation (CCF-1115206); National Institute of Child Health and Human Development (1P01HD070494-01).

Conflict of Interest: none declared.

REFERENCES

  • Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012;40:e72. [PMC free article] [PubMed]
  • Hu X, et al. pIRS: profile-based illumina pair-end reads simulator. Bioinformatics. 2012;28:1533–1535. [PubMed]
  • Huang W, et al. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28:593–594. [PMC free article] [PubMed]
  • Krumm N, et al. Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012;22:1525–1532. [PubMed]
  • Li H, et al. The sequence alignment/map format and samtools. Bioinformatics. 2009;25:2078–2079. [PMC free article] [PubMed]
  • McElroy K, et al. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012;13:74. [PMC free article] [PubMed]
  • Sathirapongsasuti JF, et al. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics. 2011;27:2648–2654. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press