Recent advances in high-throughput sequencing have revolutionized genomics, making it possible for even single research groups to generate large amounts of sequence data very rapidly and at substantially lower cost than traditional Sanger sequencing. This puts deep transcriptome sequencing and transcript quantification, as well as whole genome sequencing and resequencing, into the hands of many more researchers. However, while cost and time have been greatly reduced, the error profiles of next-generation platforms differ significantly from those of previous approaches. Addressing these platform-specific errors can increase both the number of usable sequences and the quality of the data.
The Illumina Genome Analyzer is based on parallel, fluorescence-based readout of millions of immobilized sequences that are iteratively sequenced using reversible terminator chemistry [1]. In brief, up to eight DNA libraries are hybridized to an eight-lane flow cell. In each of the lanes, single-stranded library molecules hybridize to complementary oligos that are covalently bound to the flow cell surface. Using the double-stranded duplex, the reverse strand of each library molecule is synthesized and the now covalently bound molecule is then further amplified in a process called bridge amplification. This generates clusters each containing more than 1,000 copies of the starting molecule. One strand is then selectively removed, free ends are subsequently blocked and a sequencing primer is annealed onto the adapter sequences of the cluster molecules.
Starting from the sequencing primers, 3' terminated and fluorescence-labeled nucleotides are incorporated using a modified polymerase. Base incorporation ceases after the addition of a single base due to the 3' termination of the incorporated nucleotides. The fluorophores attached to the nucleotides are illuminated using a red and a green laser, and imaged through different filters, yielding four images per tile. The number of tiles varies; for Genome Analyzer I it is typically 300 tiles per lane, for Genome Analyzer II it is 100 tiles per lane. After an imaging cycle, the fluorescent labels as well as the 3' terminators are chemically removed and the next incorporation cycle is started. Incorporation and imaging cycles are repeated up to a designated number of cycles, defining the read length for all clusters.
After sequencing, images are analyzed and intensities extracted for each cluster. The Illumina base caller, Bustard, has to handle two effects in the four intensity values extracted for each cycle and cluster: first, a strong correlation of the A and C intensities as well as of the G and T intensities due to similar emission spectra of the fluorophores and limited separation by the filters used; and second, dependence of the signal for a specific cycle on the signal of the cycles before and after, known as phasing and pre-phasing, respectively. Phasing and pre-phasing are caused by incomplete removal of the 3' terminators and fluorophores, by sequences in the cluster missing an incorporation cycle, and by the incorporation of nucleotides without effective 3' terminators. As a result, the extracted intensities for a specific cycle consist of the signal of the current cycle as well as noise from the preceding and following cycles. As the number of cycles increases, the fraction of sequences per cluster affected by phasing increases, hampering the identification of the correct base.
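The first effect, the correlation between the A and C channels, can be pictured as a linear mixing of two true signals. The sketch below is a toy illustration under that assumption and is not claimed to be how Bustard works: the mixing coefficients in `MIX_AC` are invented, and `mix` and `unmix_pair` are hypothetical helpers. It shows that a cluster emitting only an A signal still produces a substantial C reading, and that inverting a known 2x2 mixing matrix recovers the original pair.

```python
def mix(true_pair, matrix):
    """Apply a 2x2 crosstalk matrix to the true (A, C) signals."""
    (a, b), (c, d) = matrix
    x, y = true_pair
    return (a * x + b * y, c * x + d * y)


def unmix_pair(observed_pair, matrix):
    """Invert the 2x2 crosstalk matrix to estimate the true signals."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("crosstalk matrix is singular; channels cannot be separated")
    x, y = observed_pair
    return ((d * x - b * y) / det, (a * y - c * x) / det)


# Invented coefficients: 60% of the C signal leaks into the A channel,
# 40% of the A signal leaks into the C channel.
MIX_AC = ((1.0, 0.6), (0.4, 1.0))

observed = mix((1000.0, 0.0), MIX_AC)   # a pure A signal
recovered = unmix_pair(observed, MIX_AC)
print(observed, recovered)
```

The closer the off-diagonal coefficients are to the diagonal ones, the smaller the determinant and the more any measurement noise is amplified by the inversion, which is one way to see why better filter separation helps.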
Technical improvements in the filters and camera of the Genome Analyzer II have helped with distinguishing the A and C as well as the G and T fluorophores. Phasing and pre-phasing were addressed by an improvement of the sequencing chemistry kit that became publicly available in the late summer of 2008. This new sequencing chemistry preparation (order numbers FC-204-20xx) reduced the phasing rates determined by Bustard from, on average, 0.8% per cycle to 0.5%, and pre-phasing from 0.6% to 0.4% per cycle. In 2009, Illumina introduced a new chemistry (FC-103-300x) and further updates are expected within the year. Both improvements reduced the overall error rate and allowed more sequencing cycles. Here, we present an improvement for the base calling on the Illumina Genome Analyzer platform that can be used for all versions of the Genome Analyzer platforms and chemistries to further decrease the overall error rate.
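To give a feeling for what these per-cycle rates mean over a whole read, the following sketch tracks how far the molecules of one cluster drift from the nominal cycle. It is a deliberately simple model, not Bustard's: it assumes that in every cycle each molecule independently falls one base behind with the phasing probability, runs one base ahead with the pre-phasing probability, and otherwise stays in step, and that these probabilities equal the reported rates. The function names and the read lengths of 36 and 76 cycles are assumptions.

```python
def offset_distribution(cycles, phasing, prephasing):
    """Fraction of molecules at each offset (in bases) from the nominal cycle.

    Negative offsets lag behind (phasing), positive offsets run ahead
    (pre-phasing). Simplified model with constant, independent rates.
    """
    dist = {0: 1.0}
    steps = ((-1, phasing), (0, 1.0 - phasing - prephasing), (1, prephasing))
    for _ in range(cycles):
        updated = {}
        for offset, fraction in dist.items():
            for step, prob in steps:
                updated[offset + step] = updated.get(offset + step, 0.0) + fraction * prob
        dist = updated
    return dist


def in_phase_fraction(cycles, phasing, prephasing):
    """Fraction of molecules whose offset is zero after the given cycles."""
    return offset_distribution(cycles, phasing, prephasing).get(0, 0.0)


# Average rates quoted in the text (per cycle).
OLD_KIT = (0.008, 0.006)
NEW_KIT = (0.005, 0.004)

for cycles in (36, 76):  # assumed read lengths
    print(cycles,
          round(in_phase_fraction(cycles, *OLD_KIT), 3),
          round(in_phase_fraction(cycles, *NEW_KIT), 3))
```

Under these assumptions the in-phase fraction shrinks with every cycle for both kits, but more slowly for the lower rates, which is consistent with the statement that the improved chemistry permits longer reads.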
Two publications [2, 3] addressed the base calling of the Illumina platform, both using statistical learners trained on sequences called by the standard base caller, Bustard. Statistical learners, also called machine-learning approaches, describe a wide range of mathematical models and algorithms used to extract patterns and rules from huge data sets. In general, statistical learning can facilitate a better understanding of the basics underlying data or can be applied for predicting both qualitative (that is, discrete labels) and quantitative descriptors (that is, values out of a continuous range) from data. In this context, base calling can be seen as predicting discrete labels, finding the correct nucleotide label given the intensity values observed for a specific cycle (that is, a four-class classification problem).
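The four-class view can be made concrete with a very small learner. The sketch below uses a nearest-centroid classifier purely as an illustration; it is not the method of any of the base callers discussed here, and all intensity values, labels and function names are invented. Each training example pairs the four channel intensities of one cluster and cycle with a known base.

```python
BASES = "ACGT"


def train_centroids(examples):
    """Average the intensity vectors of each base label.

    examples: iterable of ((i_a, i_c, i_g, i_t), base) pairs.
    """
    sums = {base: [0.0] * 4 for base in BASES}
    counts = {base: 0 for base in BASES}
    for intensities, base in examples:
        counts[base] += 1
        for channel, value in enumerate(intensities):
            sums[base][channel] += value
    return {base: [s / counts[base] for s in sums[base]]
            for base in BASES if counts[base]}


def call_base(intensities, centroids):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((x - y) ** 2 for x, y in zip(intensities, centroid))
    return min(centroids, key=lambda base: squared_distance(centroids[base]))


# Invented training data that mimics A/C and G/T crosstalk.
TRAINING = [
    ((900, 350, 20, 10), "A"), ((1000, 420, 30, 15), "A"),
    ((500, 800, 25, 20), "C"), ((450, 760, 10, 30), "C"),
    ((20, 15, 850, 400), "G"), ((30, 10, 900, 380), "G"),
    ((10, 25, 450, 900), "T"), ((15, 20, 500, 950), "T"),
]

centroids = train_centroids(TRAINING)
query = (700, 650, 20, 20)
print(call_base(query, centroids), BASES[query.index(max(query))])
```

With these made-up numbers the learned call and the brightest-channel call disagree for the query, because the classifier has learned that a true C also produces a strong signal in the A channel.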
Erlich et al. [2] published AltaCyclic, the first machine-learning based approach to base calling for the Genome Analyzer. Their approach applies support vector machines (SVMs) trained for each individual cycle. Rolexa [3], a base caller for the statistical software package R [4], applies Gaussian mixture models, similar to the approach used by Cokus et al. for the analysis of bisulphite sequencing data. The two base callers differ further in that Rolexa generates ambiguity codes for potentially erroneous base calls, while AltaCyclic produces unambiguous bases with quality scores.
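The two output styles can be contrasted on the same set of per-base probabilities. The following sketch is illustrative only and does not reproduce the internals of Rolexa or AltaCyclic: the probability threshold, the quality cap and the function names are assumptions, while the IUPAC ambiguity codes and the Phred definition Q = -10 log10(P_error) are standard.

```python
import math

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H", frozenset("AGT"): "D",
    frozenset("CGT"): "B", frozenset("ACGT"): "N",
}


def call_with_quality(probs, max_quality=40):
    """Return the most probable base and a Phred-scaled quality score."""
    base = max(probs, key=probs.get)
    p_error = 1.0 - probs[base]
    if p_error <= 0.0:
        return base, max_quality
    return base, min(max_quality, round(-10.0 * math.log10(p_error)))


def call_with_ambiguity(probs, threshold=0.2):
    """Return an IUPAC code covering every base at or above the threshold."""
    kept = frozenset(b for b, p in probs.items() if p >= threshold)
    if not kept:  # fall back to the single most probable base
        kept = frozenset(max(probs, key=probs.get))
    return IUPAC[kept]


# Invented posterior probabilities for one confident and one uncertain call.
confident = {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.0}
uncertain = {"A": 0.5, "C": 0.45, "G": 0.03, "T": 0.02}
for probs in (confident, uncertain):
    print(call_with_quality(probs), call_with_ambiguity(probs))
```

An ambiguity code keeps the alternative bases visible in the sequence itself, whereas a quality score keeps a single base and leaves it to downstream tools to weight or filter it.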
We present Ibis (Improved base identification system), an accurate, fast and easy-to-use base caller for the Illumina sequencing system, which aims to significantly reduce the error rate and increase the output of usable reads. Our goal is to provide sequences with fewer base calling errors and a better quality score for each base. This will facilitate quality filtering of the data, sequence read mapping, de novo assembly and further data analysis such as single nucleotide polymorphism (SNP) calling.