A key feature of these new platforms is their speed. Decreasing run time has clear advantages, particularly within the clinical sequencing arena, but poses challenges of its own. Whilst manufacturers may state library prep times on the order of a couple of hours, these times do not include upfront sample QC or library QC and quantification. Also, the library prep times quoted usually apply to processing a single sample; i.e., pipetting time is largely ignored. Purchasers of sequencing instruments will want to keep them running at full utilization in order to maximize the return on their investment, and will also want to pool multiple samples on single runs for economic reasons. To obtain maximum throughput, users must consider the whole process, potentially investing in ancillary equipment and robotics to create an automated pipeline for the preparation of large numbers of samples. To process large numbers of samples quickly, a facility’s instrument base must be large enough to avoid sample backlogs. With this in mind, manufacturers are seeking to develop more streamlined sample-prep protocols that will facilitate timely sample loading. Here we have tested two such developments: enzymatic fragmentation and the Nextera technique. We conclude that these methods can be very useful, but to avoid bias users must carefully evaluate the methods they use for their particular applications and for genomes of extreme base composition.
Whilst the data generated using the Ion Torrent PGM platform have a higher raw error rate (~1.8%) than Illumina data (<0.4%), provided there is sufficient coverage, the representation and ability to call SNPs are quite closely matched between these technologies, with more true positives being called from PGM data but far fewer false positives from the Illumina data. Detection of SNPs using PacBio data was not as accurate; the use of single-molecule sequencing to detect low-level variants and quasispecies within populations remains unproven. We have found PacBio’s long reads useful for scaffolding de novo assemblies, though our experience suggests that this is currently not fully optimized and extensive method development is still required.
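To give a feel for why sufficient coverage can compensate for a higher raw error rate, the following Python sketch computes a simple binomial tail probability. It is an illustrative model only and not the analysis used in this study: it assumes that sequencing errors are independent between reads and that every error is a substitution to one of the three other bases with equal probability. That is a strong simplification, because real error profiles are platform-specific and include insertions and deletions. The function names and the choice of three supporting reads are hypothetical.

```python
from math import comb


def prob_at_least_k(n: int, k: int, p: float) -> float:
    """Binomial tail: probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def prob_spurious_support(depth: int, min_reads: int, error_rate: float) -> float:
    """Probability that at least `min_reads` of `depth` reads show the SAME wrong base.

    Assumption (not from the study): errors are independent substitutions,
    spread evenly over the three alternative bases, so one specific wrong
    base appears in a read with probability error_rate / 3.
    """
    return prob_at_least_k(depth, min_reads, error_rate / 3.0)


# Raw error rates quoted in the text; 0.004 is the upper bound given for Illumina.
pgm_like = prob_spurious_support(depth=15, min_reads=3, error_rate=0.018)
illumina_like = prob_spurious_support(depth=15, min_reads=3, error_rate=0.004)
```

Under these assumptions both probabilities are small at 15x depth, and the value for the higher error rate is the larger of the two; how this translates into real false-positive rates depends on the variant caller and on the true error profile.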
Interestingly, the mappability did not increase significantly with longer reads, although a beneficial effect was obtained from having mate-pair information. Current PacBio protocols favor the preferential loading of smaller constructs, resulting in average subread lengths that are significantly shorter than the often-quoted average read lengths. Further development is therefore required to avoid excess short fragments and adapter-dimer constructs in the library and to reduce the efficiency with which they load into the ZMWs.
Whilst one would normally use higher coverage than used here for confident SNP detection (i.e., 30-40x depth), we were limited to 15x depth due to the yield of some of the platforms. Nonetheless, at least for the haploid genome, S
, 15x coverage should be a reasonable quantity for SNP detection and even in the human genome, 15x coverage has been shown to be sufficient to accurately call heterozygous SNPs [3
Variant calling is a highly subjective process; the particular software chosen and the specific parameters used to make the predictions will change the results substantially. As such, the rates of true SNP and false-positive calling provided here are purely indicative, and results obtained with each sequencing platform will vary. For any particular application using a specific sequencing method, optimisation of the SNP- and indel-calling algorithm is always recommended.
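When tuning a caller in this way, it helps to score each parameter set against a set of known variants. The Python sketch below shows one minimal way to do so; it is not the evaluation code used in this study. It assumes that SNPs are represented as hypothetical `(chromosome, position, alternative base)` tuples and that a trusted truth set is available.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Snp = Tuple[str, int, str]  # (chromosome, 1-based position, alternative base)


@dataclass(frozen=True)
class CallSetMetrics:
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def sensitivity(self) -> float:
        """Fraction of known SNPs that were called."""
        total = self.true_positives + self.false_negatives
        return self.true_positives / total if total else 0.0

    @property
    def precision(self) -> float:
        """Fraction of called SNPs that are known to be real."""
        total = self.true_positives + self.false_positives
        return self.true_positives / total if total else 0.0


def compare_snp_calls(called: Iterable[Snp], truth: Iterable[Snp]) -> CallSetMetrics:
    """Count calls that match, exceed, or miss the truth set."""
    called_set, truth_set = set(called), set(truth)
    return CallSetMetrics(
        true_positives=len(called_set & truth_set),
        false_positives=len(called_set - truth_set),
        false_negatives=len(truth_set - called_set),
    )
```

Running the comparison once per caller configuration and per platform gives sensitivity and precision values that can be compared directly, although the absolute numbers remain specific to the caller and parameters chosen.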
We sequence many isolates of the malaria parasite P. falciparum as it represents a significant health issue in developing countries; this organism leads to several million deaths per annum. There are several active large sequencing programs (e.g. MalariaGEN [13]) that are currently aiming to sequence thousands of clinical malaria samples. As the malaria genome has a GC content of only 19.4% [14], we use it as one of our test genomes, representing a significant challenge to most sequencing technologies. Based on the present study, use of Illumina sequencing technology with libraries prepared without amplification [4] leads to the least biased coverage across this genome. Ion Torrent semiconductor sequencing is not recommended for sequencing of extremely AT-rich genomes, due to the severe coverage bias observed. This is likely to be an artifact introduced during amplification. Therefore, avoidance of library amplification and/or emPCR, or use of more faithful enzymes during emPCR, may eliminate the bias.
Illumina sequencing has matured to the point where a great many applications [15] have been developed on the platform. Since the PGM is also a massively parallel short-read technology, many of those applications should translate well and be equally practicable. There are a few obvious exceptions: techniques involving manipulations on the flow cell, such as FRT-seq [21] and OS-Seq [22], will be impossible using semiconductor sequencing. Also, the Ion Torrent platform currently employs fragment lengths of 100 or 200 bases; without a mate-pair type library protocol, these insert sizes are perhaps too short to enable accurate de novo assemblies such as that demonstrated using ALLPATHS-LG for mammalian genomes using Illumina data [25]. Conversely, Illumina sequencing on the HiSeq or MiSeq instruments requires heterogeneous base composition across the population of imaged clusters [26]. In order to sequence monotemplates (where most sequenceable fragments have exactly the same sequence), it is often necessary to significantly dilute the sample or mix it with a complex genomic library to enable registration of clusters. Semiconductor sequencing does not suffer from this problem.
The DNA-input requirements of PacBio can be prohibitive. Illumina and PGM library preparation can be performed with far less DNA; the standard PGM IonEXpress library prep requires just 100 ng DNA and Illumina sequencing has been performed from sub-nanogram quantities [27]. The yield, sample-input requirements and amplification-free library prep of PacBio potentially make it unsuitable for counting applications and for applications involving significant prior enrichment such as exome sequencing [15] and ChIP-seq [18]. The PacBio platform, by virtue of its long read lengths, should however have application in de novo sequencing and may also benefit analysis of linkage of alternative splicing and of variants across long amplicons. Furthermore, the potential for direct detection of epigenetic modifications has been demonstrated [28
Finally, it should be noted that this study represents a point in time, utilising kits and reagents available up until the end of 2011. Ion Torrent and Pacific Biosciences are relatively new sequencing technologies that have not had time to mature in the same way that the Illumina technology has. We anticipate that whilst some of the issues identified may be intrinsic, others will be resolved as these platforms evolve.