In the previous Section we saw that sequencing errors, even at low overall error rates, that are systematically biased can result in incorrect downstream analysis. Our goal in this paper is to define a framework that captures uncertainty in the sequencing pipeline in a way that allows downstream analysis to guard against these systematic errors. The core of our methodology is to model this uncertainty starting from measurements of fluorescence intensity all the way through read mapping. In this Section we motivate and develop our model to capture uncertainty from fluorescence intensity. The purpose of this model is to generate a short-read nucleotide probability profile for each read by drawing information not only from the intensities of each individual read as sequencing cycles progress, but also from other reads, making use of the large numbers of samples provided by these technologies. In the next Section, we describe quality metrics derived from these intensity model estimates that capture base-calling uncertainty.
For each read i we observe over N sequencing cycles, a 4-by-N matrix of measured fluorescence intensities. We denote this matrix as yi, and specify measurements further as, for example yijA, indicating the measured intensity for read i, cycle j in the A channel. The intensity quadruplet of all four channels for read i and cycle j is denoted as yij. We plot in Figure W2 an example of intensity data. In an optimal setting, one would expect that each panel in Figure W2 shows three clusters of points: one along the x-axis, corresponding to reads where the nucleotide at the current cycle corresponds to that axis, another group for the y-axis respectively, and another group around the origin, corresponding to clusters where the nucleotide at the current cycle corresponds to neither of the two bases plotted. The actual data, however, showed some salient features that deviate from this expectation: (a) scatter, instead of high and low clusters, the intensity measurements for each channel are scattered across the intensity spectrum, (b) signal loss and corruption across cycles, where overall intensity decays and there is more overlap between intensity clusters as cycles progress and, (c) cross-talk, where intensity clusters for pairs of channels are highly correlated, e.g. A and C measurements.
The started log transform
We saw previously that fluorescence intensity in each fragment cluster is measured in cycles, as fragments complementary to those in the cluster are synthesized. However, at each cycle nucleotide incorporation might fail for a strand, terminating synthesis for subsequent cycles. Therefore, as incorporation fails, one would expect an exponential loss of measured signal from each cluster. To model this type of exponential process in the signal, and to control the variance in measurements, we preprocess the intensity measurements using a started log2 transform. Since the intensity measurements given by Illumina have been background-corrected during image processing—unfortunately there is no way to recover the uncorrected signal without going back to the image data — some of the measured intensity values are negative, and thus a logarithmic transform is not defined. In the following, we show transformed observed intensities, denoted as h(y) = log2(y + c). Constant c is chosen depending on the smallest intensity value s in the first cycle of a random sample of fragment clusters; if this value is positive we set c = 0, otherwise, we set c = |s|. If any of the unsampled intensities is lower than c, it is set to c before taking the log transform.
4.1 Effects observed in intensity data
The read-cycle effect
We can explain scatter in the intensity data by what we denote a read effect. Each fragment undergoes an in situ PCR reaction to amplify the corresponding sequence fragment and create a fragment cluster. The amount of amplification varies drastically between fragments. Since the fluorescence intensity of each cluster depends on the amount of amplification, the overall intensity, that is, measured across channels and cycles, varies significantly between clusters. We can clearly see this from the data, e.g. in Web Figure W3 where boxplots of all the intensity measurements for a random sample of reads show the wide variation in intensity across reads.
Similarly, there is a cycle effect in the measured intensities of each read, where intensity in the upper range drops as sequencing progresses, while it increases in the lower range. shows this effect for a few reads. Each panel in this figure shows two groups of points for each read: the upper group is the maximum intensity over the four channels at the corresponding cycle — this is the signal we are trying to capture — while the lower group is the median of the other three channels — this is noise. Due to the started log transform described above, these two groups can be modeled as linear functions of cycle. The intercepts of the two least-square lines model the overall intensity measurements of the cluster. The slope of the upper line models signal loss, while the slope of the lower line models signal corruption.
Figure 3 The read-cycle effect. Each panel corresponds to measured intensities for a single read. The upper (blue) points are the maximum intensity at each sequencing cycle, the lower (pink) points are the median of the remaining intensities. We can observe the (more ...) The base-cycle effect
Finally, we describe a global effect observed across reads. In this case, we noticed that overall intensity measurements vary between channels differently as sequencing cycles progress. Web Figure W4 illustrates this. We see how the median of intensity measurements change differently across channels as sequencing cycles progress. For instance, the median intensity in the G channel drops much faster than the other channels as sequencing progresses.
4.2 The intensity model
Our proposed intensity model for read i
) = Muij
is a matrix that models the cross-talk effect and h
is the started log transform discussed above. We assume M
is a matrix with unit diagonal so that the measured log transformed intensities are linear combinations of the underlying log intensities uij
. This linear combination assumption is consistent with the cross-talk model of Li and Speed (1999)
. In fact, we estimate M
using their procedures as is also used in the Illumina platform.
We model uijc
, the underlying log intensity of read i
, cycle j
, channel c
, as follows:
for all i, j, c
, and Δijc
is an indicator of the true, unobserved, nucleotide being measured for read i
in cycle j
. That is,
= 1 for all i, j
This model is motivated by Figure W2 and 3 and is best understood by referring to those two plots. Parameters αi = [α0i α1i] and β = [β0i β1i] correspond to the intercepts and slopes of the two read-specific linear models shown in : αi to the upper linear model, or the “signal loss model”, and βi to the lower linear model, i.e., the “signal-corruption” model. Parameters μcjα and μcjα are used to capture base-cycle effects, that is, how the median intensity in each channel varies as sequencing cycles occur (Web Figure W4). These two parameters are shared by reads in each tile. Thus, given N reads and M cycles, there are N sets of αi and βi parameters, each estimated from 4 *M observations, and a set of base-cycle parameters μcjα and μcjβ, each estimated from N *M observations. In the Illumina data we use in this paper, N ≈ 30, 000 and M ≈ 40.
Parameter estimates are computed using the EM framework (Dempster et al., 1977
), with details on how we derive them given in Web Appendix A. A result of estimating using the EM framework is an estimate of zijc
} = P
), e.g., the posterior probability that the true nucleotide for read i
in cycle j
We reiterate that the effects included in the model are there as a result of an exploratory analysis of Illumina intensity data. They capture features of the data explainable by the chemistry of the sequencing process. For example, signal loss is due to the decreasing ability of nucleotide incorporation to occur as sequencing progresses, while signal corruption is due to residual fluorescence measured from previous sequencing cycles. Along with the cross-talk effect, these are inherent in the sequencing-by synthesis process. While better chemistry can diminish the strength of these effects, one might not expect their disappearance, especially, as more sequencing cycles are carried out in newer technologies. The model proposed here can be easily modified to accommodate changes in these effects or effects that arise as technology changes.