PCR amplification introduces artifacts into sequencing libraries 9-12. In addition to nucleotide misincorporation, amplification tends to be uneven, so that some sequence species become overrepresented in the resulting library. This situation is exacerbated by templates with GC-biased compositions.
By ligating adapters that contain all the sequences required for sequencing primer annealing and attachment to the flowcell surface, we can avoid the requirement for a PCR step. The quantity of template DNA generated in this way is lower than when PCR is employed, but library quantification by qPCR 13 showed that 5 μg of starting DNA yields sufficient 200 bp no-PCR library for > 400 high-density GA lanes, more than enough for most sequencing purposes. Starting with lower quantities, e.g. 500 ng of genomic DNA, we can obtain sufficient library with a 200 bp insert size for ~12 lanes. Inserts of 500 bp result in a lower yield than shorter fragment libraries, presumably because the same mass of DNA contains fewer fragments.
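The fragment-number argument can be checked with a rough calculation. The sketch below assumes an average mass of about 650 g/mol per double-stranded base pair (a common approximation, not a value from this study) and ignores adapter sequence; the function name is illustrative only.

```python
AVOGADRO = 6.022e23           # molecules per mole
GRAMS_PER_MOLE_PER_BP = 650   # assumed average mass of one double-stranded base pair

def fragments_in_mass(mass_ng: float, fragment_bp: int) -> float:
    """Approximate number of double-stranded fragments in a given mass of DNA."""
    grams = mass_ng * 1e-9
    moles = grams / (fragment_bp * GRAMS_PER_MOLE_PER_BP)
    return moles * AVOGADRO

# Same starting mass (500 ng), two insert sizes.
short_count = fragments_in_mass(500, 200)
long_count = fragments_in_mass(500, 500)
ratio = short_count / long_count   # depends only on the ratio of fragment lengths
print(f"200 bp: {short_count:.2e} fragments, 500 bp: {long_count:.2e} fragments")
print(f"ratio: {ratio:.1f}")
```

Under these assumptions the same mass holds 2.5 times as many 200 bp fragments as 500 bp fragments, whatever per-base-pair mass is chosen. Differences in ligation or cluster efficiency between fragment sizes are not modelled.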
As with standard Illumina adapters, the structure of no-PCR adapters ensures that all fully ligated template strands receive the unique adapter sequence complementary to the flow-cell adapters at their 5′ and 3′ ends. Because the efficiency of ligation is not 100 %, many template strands will receive no adapters, or will only be partially ligated. However, Illumina cluster amplification can only amplify template strands that have a different adapter at either end, and thus the cluster amplification step performs the enrichment that is otherwise provided by the PCR.
We have demonstrated that for genomes of extreme GC composition, the no-PCR approach provides more even sequence coverage than the standard, PCR-based Illumina library preparation, yields very few duplicates, aids mapping and SNP calling, and makes assembly more straightforward. This is best illustrated by the P. falciparum genome, which until now has resisted attempts at de novo assembly from short read data. The differences between the short- and long-read malaria assemblies are not large, as the average fragment size for the no-PCR 3D7 library is only 170 bp, close to the long paired read length of 152 bp (2 × 76 bases).
It is important to note that P. falciparum genomes are extremely difficult to assemble even using 600-700 base Sanger sequence reads: assembly of clinical isolates from 6-fold Sanger coverage yielded a contig N50 of only 7 kb (data not shown). Although it seems unlikely that assemblies from short read data alone will ever generate N50 values in the 7 kb range, we believe that we will be able to increase our malaria N50 beyond this by combining short read data with Sanger reads.
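For readers unfamiliar with the statistic, contig N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch follows; the contig lengths are invented for illustration and are not data from this study.

```python
def n50(contig_lengths):
    """Return the contig N50 for a collection of contig lengths (0 if empty)."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First contig at which the cumulative length reaches half the assembly.
        if running >= half_total:
            return length
    return 0

# Hypothetical contig lengths in kb.
example_contigs = [10, 8, 7, 5, 3, 2]
print(n50(example_contigs))
```

Because N50 is weighted by length, a few long contigs can raise it substantially even when most contigs remain short.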
Approximately 2 % of the P. falciparum 3D7 reference sequence is not covered by the NP-3D7-S sequence data, since reads were only placed to their best location, while repetitive reads were not placed. In contrast, between 4.8 and 19.9 % of bases were not covered by mapping for the standard P. falciparum libraries. Using an alternative alignment tool, MAQ 17, which places repetitive reads to a random location, the uncovered regions were reduced to just 5585 bp for NP-3D7-S, indicating that 99.98 % of the 3D7 genome is represented in the sequence data.
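The relationship between uncovered bases and the percentage of the genome represented can be sketched as follows. The reference length of roughly 23.3 Mb is an assumed approximate size for the 3D7 genome and is not stated in this section; the per-base depth vector is toy data, and both function names are illustrative.

```python
def uncovered_bases(depths):
    """Count reference positions with zero read depth."""
    return sum(1 for depth in depths if depth == 0)

def percent_covered(uncovered_bp, genome_bp):
    """Percentage of reference bases covered by at least one read."""
    return 100.0 * (1 - uncovered_bp / genome_bp)

# Toy per-base depth vector: 3 of 8 positions have no coverage.
toy_depths = [3, 0, 5, 0, 0, 7, 2, 1]
toy_pct = percent_covered(uncovered_bases(toy_depths), len(toy_depths))

# Figure from the text (5585 uncovered bp) with an ASSUMED genome length.
ASSUMED_GENOME_BP = 23_300_000
reported_pct = percent_covered(5585, ASSUMED_GENOME_BP)
print(f"toy: {toy_pct:.1f} %, NP-3D7-S with assumed length: {reported_pct:.2f} %")
```

With that assumed length the result rounds to the 99.98 % quoted above.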
Anecdotally, sequences with a GC content exceeding 80 % are difficult to sequence on a GA. The genome of B. pertussis has a mean GC content of 68 %, and only a small proportion of sequence reads would have > 80 % GC. Nevertheless, both standard and no-PCR B. pertussis libraries revealed GC profiles that were almost identical to simulated data, with no loss towards higher GC contents, indicating that the standard library preparation protocol has no difficulty with GC contents within this range. If there are difficulties in sequencing organisms with a higher GC content than B. pertussis on a GA, our data indicate that these are not the result of PCR artefacts, though it is conceivable that biases are introduced at other stages in the sequencing process, such as cluster growth 7. However, the high GC content of ST24 has hindered the generation of a finished-standard reference sequence by Sanger sequencing: the assembly still contains 115 contigs, of which some are vector contamination, and this prevents a more thorough analysis.
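A GC profile of the kind compared here can be built by binning reads by their GC percentage. The sketch below is one simple way to do this and is not necessarily the procedure used in this study; the function names, the 10 % bin width, and the example reads are all illustrative assumptions.

```python
from collections import Counter

def gc_percent(read):
    """GC percentage of a read, ignoring uncalled bases; None if no called bases."""
    called = [base for base in read.upper() if base in "ACGT"]
    if not called:
        return None
    gc_count = sum(1 for base in called if base in "GC")
    return 100.0 * gc_count / len(called)

def gc_profile(reads, bin_width=10):
    """Map each GC bin (by its lower bound, in %) to the number of reads in it."""
    profile = Counter()
    for read in reads:
        pct = gc_percent(read)
        if pct is None:
            continue
        # Reads with exactly 100 % GC are placed in the top bin.
        bin_start = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        profile[bin_start] += 1
    return dict(sorted(profile.items()))

# Invented reads for illustration only.
example_reads = ["GGCC", "ATAT", "GCAT", "GGGA", "NNNN"]
print(gc_profile(example_reads))
```

Computing the same profile for reads simulated from the reference gives the expected distribution; a shortfall in the high-GC bins of the real data would point to a loss of GC-rich fragments.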
Because of the absence of the PCR step, the method is quicker to perform than the standard Illumina library prep 7, and we feel that it should be employed routinely in the preparation of libraries for Illumina sequencing.