Genomic translational research faces a scarcity of properly stored and annotated clinical samples. Archived formalin-fixed tissues in paraffin blocks offer a unique opportunity to study thousands of samples with extensive clinical records and follow-up information. In our study, we show that it is possible to obtain enough DNA from a single 5 µm FFPE slide (~1–2 cm²) to perform whole genome sequencing at sufficient coverage depth to identify potentially important mutations. The FFPE process combined with long storage times is known to result in DNA fragmentation. We show that, for the two breast tumor samples analyzed, DNA fragmentation did not produce large biases in coverage depth distribution (Supplementary Figure S2). However, we observed a higher global nucleotide mismatch rate within aligned reads from FFPE tumor samples than from matched germline (A) and a higher base substitution rate across all six substitution types (C). Consistent with damage due to formalin fixation, this increase was biased towards C·G>T·A mismatches. Interestingly, the two samples studied were differentially affected by the formalin fixation: tumor 02542 showed a 1.8-fold increase in the global nucleotide mismatch rate and a greater C·G>T·A bias compared to tumor 06408. This discrepancy may be explained by the absence of strict standards for the formalin fixation step: tissue samples are routinely fixed for between 24 and 48 h (11) but can sometimes be fixed for considerably longer. The duration of formalin fixation is not known for the studied samples and is not generally included in pathology reports. Another possible explanation is the size or density of the tumor tissue, which also affects the fixation procedure. As formalin fixation-induced DNA damage could potentially be so great as to prevent analysis of an FFPE sample by next generation sequencing, we have established a relatively simple test to assess the integrity of FFPE samples. By sequencing just 500,000 to 1 million raw reads from a single FFPE tumor, one can determine the extent of DNA damage and identify the best preserved samples for larger, more expensive whole genome sequencing (A and Supplementary Figure S4).
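The integrity test described above comes down to two summary statistics: the global nucleotide mismatch rate and the distribution of mismatches across the six substitution types. The following Python sketch illustrates how these could be computed from a shallow sequencing run. It is not the pipeline used in this study. It assumes that each read has already been aligned and paired with its gap-free reference sequence by an upstream tool, and the function names `substitution_class` and `mismatch_profile` are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref, alt):
    """Collapse a mismatch onto one of six strand-symmetric classes, e.g. 'C·G>T·A'."""
    if ref in "AG":  # report the change from the pyrimidine strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}·{COMPLEMENT[ref]}>{alt}·{COMPLEMENT[alt]}"


def mismatch_profile(aligned_pairs):
    """Return (global mismatch rate, Counter of substitution classes).

    aligned_pairs: iterable of (reference_seq, read_seq) strings of equal
    length with no gaps (an assumption of this sketch).
    """
    total = 0
    mismatches = Counter()
    for ref_seq, read_seq in aligned_pairs:
        for r, b in zip(ref_seq.upper(), read_seq.upper()):
            if r not in COMPLEMENT or b not in COMPLEMENT:
                continue  # skip N and other ambiguity codes
            total += 1
            if r != b:
                mismatches[substitution_class(r, b)] += 1
    rate = sum(mismatches.values()) / total if total else 0.0
    return rate, mismatches


# Toy example: one C>T and one G>A mismatch, both counted as C·G>T·A.
rate, spectrum = mismatch_profile([("ACGT", "ATGT"), ("GGGG", "GAGG")])
```

Comparing `rate` and the share of C·G>T·A in `spectrum` between candidate FFPE samples would then give a rough ranking of how well each sample is preserved.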
Using a set of innovative filters (Filter 2.4–2.6), we establish a successful method for filtering false positive somatic variants caused by FFPE damage to the tumor DNA, thus increasing our confidence in the final set of called somatic mutations. It is important to compare our novel filters to existing post-alignment filtering methods such as GATK (20). Existing methods filter for poor base quality with a stringent threshold, because incorrectly called variants are typically caused by low quality sequence data. Because FFPE damage occurs randomly in the DNA itself, the resulting ‘errors’ do not have poor base quality. Our method filters on the AAF without applying a single fixed threshold to all substitution types; instead, it uses the mismatch error rate measured across the genome of the given sample. This is important as the amount of FFPE DNA damage varies from sample to sample. To achieve the same goal as our novel post-alignment filters, one could propose applying more stringent criteria when aligning the reads. Aligners that trim reads when their mismatch rate becomes too high have been implemented (36, 37). As a result, the global nucleotide mismatch rate would improve, but at the cost of a reduced effective sequencing coverage depth. Such strategies could also remove bona fide somatic mutations surrounded by extensive DNA damage, thereby limiting the sensitivity of variant calling. A second potential alternative approach for achieving a set of high-confidence somatic mutations in FFPE samples would be to sequence to greater coverage depth. Since formalin fixation is performed on the resected tumor sample and will generally affect different DNA locations at random in different cells, elevated global nucleotide mismatch rates in DNA sequencing reads should still lead to accurate variant calls at sufficiently high sequencing coverage depth. In our study, the global nucleotide mismatch rate was indeed higher than the variant calling rate, especially in FFPE tumors (18–32 × 10⁻³ versus 10–11 × 10⁻⁴). In a recent study of whole-exome sequencing of FFPE tumors, 40-fold coverage was insufficient to filter false positives due to formalin fixation DNA damage, which were identified by their substitution profile and discordance with matched frozen tissue (15). Indeed, the authors estimate that 80× coverage is required to obtain accurate variant calling in the presence of formalin fixation DNA damage. However, for samples such as 02542 in our study, with substantial amounts of formalin fixation-induced DNA damage, the coverage depth required to overcome the global nucleotide mismatch rate in the sequencing reads and achieve accurate variant calls could be even greater. Thus, applying our series of standard and novel filters will likely have utility for identifying high-confidence somatic mutations in FFPE tumor samples even when sequence coverage depth is relatively low.
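The idea of scaling an AAF filter to each sample's own mismatch rate, rather than using one fixed cutoff, can be illustrated with a short Python sketch. This is an assumption-laden illustration, not a reproduction of Filters 2.4–2.6. It assumes that damage-induced mismatches at a site behave like independent draws at the sample-specific error rate for that substitution type, and it keeps a candidate only when its alternate read count would be improbable under that model. The function names and the `alpha` cutoff are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly over the upper tail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def passes_damage_filter(alt_reads, depth, error_rate, alpha=1e-3):
    """Keep a candidate variant if its alternate-allele support is unlikely
    to arise from damage alone.

    alt_reads / depth is the AAF at the site; error_rate is the mismatch rate
    estimated for this sample (and, ideally, this substitution type).
    alpha is an illustrative cutoff, not a value taken from the study.
    """
    return binomial_tail(alt_reads, depth, error_rate) < alpha


# With a mismatch rate of 0.03 and 40x depth, 2 supporting reads are
# compatible with damage alone, whereas 12 supporting reads are not.
weak = passes_damage_filter(alt_reads=2, depth=40, error_rate=0.03)
strong = passes_damage_filter(alt_reads=12, depth=40, error_rate=0.03)
```

Under this model, a more heavily damaged sample (higher `error_rate`) needs more supporting reads at the same depth before a variant is retained, which mirrors the argument that the required coverage grows with the extent of fixation damage.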
In our study, we have not analyzed the tumors for somatic events such as chromosomal translocations or large copy number alterations (CNA). Methods developed for this purpose (38–40) rely more on the correct mapping of read pairs than on accurate sequence. We have only sequenced single reads, and were thus not able to perform this analysis. We believe that the vast majority of the reads mapped in our FFPE tumor samples are mapped at the correct location. However, it is possible that the sensitivity of translocation or CNA detection would be affected, as a greater number of reads might have ambiguous mappings due to the mismatches introduced by the FFPE damage. Varying the distribution of insert sizes in read pairs, especially using large inserts (1–10 kb) obtained through mate-pair libraries, can also improve the sensitivity of the detection of large deletions. However, because the FFPE process fragments the DNA, FFPE samples would not be adequate for such studies.
Overall, our study demonstrates that methodical characterization and analysis of the sequencing data can reduce the noise resulting from formalin fixation-induced DNA damage and lead to a high-confidence set of somatic mutation calls. This opens up the possibility of sequencing the large archives of stored clinical FFPE samples from a variety of cancers. Furthermore, we demonstrate that a limited amount of DNA can be used for genome-wide deep sequencing analysis, which enables studies on small clusters of tumor cells such as residual cancer after treatment or dormant metastases.