The ability to accurately sequence long DNA molecules is important across biology, but existing sequencers are limited in read length and accuracy. Here, we demonstrate a method to leverage short-read sequencing to obtain long and accurate reads. Using droplet microfluidics, we isolate, amplify, fragment and barcode single DNA molecules in aqueous picolitre droplets, allowing the full-length molecules to be sequenced with multi-fold coverage using short-read sequencing. We show that this approach can provide accurate sequences of up to 10 kb, allowing us to identify rare mutations below the detection limit of conventional sequencing and directly link them into haplotypes. This barcoding methodology can be a powerful tool in sequencing heterogeneous populations such as viruses.
Next-generation sequencing (NGS) has tremendously impacted biomedical research due to its ability to acquire massive amounts of sequence data1,2. Currently, the most widely adopted sequencing platform produces billions of short (<250bp) reads at a low cost of ~$50 per billion bases. However, short NGS reads pose challenges for many applications. For instance, piecing together short reads into long contiguous sequences can be challenging when assembling new genomes, particularly when repetitive sequences are present3,4. When sequencing metagenomes comprising thousands of species, it is often impossible to assemble the short reads into longer sequences that allow discovery of useful information, such as identification of the species to which a sequence belongs, or detection of gene clusters encoding useful molecules or phenotypes5,6,7. Furthermore, NGS is error-prone, generating an error in every thousand bases; this is often above the rate of biological variation, and consequently, prevents detection of true variants within the cloud of sequencing error8,9. The ability to obtain massive amounts of long and accurate reads would thus be a major step forward in our ability to characterize genomes accurately, and to study the impact of sequence variation in a variety of systems, such as in rapidly evolving virus populations10, rare polymorphisms in human populations11, and diverse and uncultivable species in microbial communities12.
To obtain longer and more accurate reads, one approach is to directly improve the sequencing instrument13,14. In addition to providing accurate reads, the instrument must be widely available, easy to use and cost-competitive. Currently, no platform can match short-read NGS in these aspects and as such, short-read sequencers dominate the market. Rather than inventing a new sequencing instrument, an alternative is to synthetically reconstruct long reads from short-read data, leveraging the widespread popularity of short-read NGS. An elegant approach is using unique molecular barcodes, which were first used to detect duplicated NGS reads for error correction, and digital counting of molecules15,16. To reconstruct long reads using molecular barcodes, long template molecules are broken into short fragments and labelled with ‘barcode' sequences identifying the template from which they originate17,18,19,20. All short fragments can then be pooled and sequenced, and fragments of individual templates grouped by barcode. The reads in each group are then used to reconstruct synthetic long reads. Implementations of this approach rely on intramolecular reactions to attach barcodes to the fragments; however, this reaction becomes inefficient for templates above 3kb. Alternatively, molecules can be physically isolated into wells, followed by fragmentation and barcoding. This approach can theoretically be extended to molecules of any length, but is limited in the number of templates that can be sequenced due to the limitations in throughput of liquid handling in well plates. Throughput can be increased by barcoding multiple templates in each well, but then single-molecule identity is lost19,20. To enable long and accurate DNA sequencing, an optimal approach would combine physical isolation of molecules with ultrahigh-throughput fluid handling.
In this paper, we describe single-molecule droplet barcoding (SMDB), an ultrahigh-throughput method to barcode long molecules for short-read sequencing. Using droplet microfluidics, we isolate and barcode single molecules in aqueous droplets ~1 million times smaller than conventional well plates. To validate the method, we sequence a library of known DNA templates 3–5kb long and reconstruct long reads fully covering the templates. Furthermore, to demonstrate the ability to sequence large DNA molecules, we apply the method to the E. coli genome, obtaining synthetic reads up to 10kb in length. Finally, to illustrate the power of the method for detecting variants below the detection limit of conventional sequencing, we apply it to a library of β-glucosidase genes mutated by PCR. While SMDB detects 457 SNPs in 81 haplotypes in the library, conventional short-read sequencing detects only one SNP and cannot generate haplotypes. The ability to characterize variants and haplotypes below the inherent detection limit of the sequencer should be powerful for studying systems in which rare variants have an important role, such as in microbial community dynamics and viral quasispecies.
Droplet microfluidics has recently been used to barcode the transcriptomes of single cells21,22,23. In SMDB, we use it to barcode fragments of single DNA molecules, performing all steps of template amplification, fragmentation and barcoding in a microfluidic workflow (Fig. 1). DNA barcodes uniquely tag all reads derived from a template, which allows the reads to be unambiguously clustered to generate a long and accurate consensus sequence for the template.
We leverage ultrahigh-throughput droplet microfluidics to amplify, fragment and barcode large numbers of individual DNA templates. The first step is to isolate and amplify the template molecules, accomplished by introducing them into a microfluidic flow focus droplet generator that encapsulates them in ~50μm diameter droplets of PCR reagent (Fig. 2a). The template concentration is controlled so that ~1 in 10 droplets contains a single molecule, in accordance with Poisson statistics24. The droplets are collected into a PCR tube and thermal cycled for amplification, generating within each droplet a clonal population of the single molecules so that, once fragmented and barcoded, we can obtain multi-fold coverage of each template.
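The loading statistics described above can be made concrete with a short calculation. The following Python sketch assumes ideal Poisson loading with a mean occupancy `lam` of 0.1 molecules per droplet, which is one reading of "~1 in 10"; the function names are illustrative and are not part of the SMDB workflow.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a droplet holds exactly k molecules at mean occupancy lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def loading_summary(lam: float) -> dict:
    """Fractions of empty, singly loaded and multiply loaded droplets."""
    if lam <= 0:
        raise ValueError("mean occupancy must be positive")
    p_empty = poisson_pmf(0, lam)
    p_single = poisson_pmf(1, lam)
    p_multi = 1.0 - p_empty - p_single
    return {
        "empty": p_empty,
        "single": p_single,
        "multiple": p_multi,
        # Share of occupied droplets that hold more than one molecule
        "multiple_given_occupied": p_multi / (1.0 - p_empty),
    }


# Assumed mean occupancy of 0.1 molecules per droplet
summary = loading_summary(0.1)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Under this assumption, roughly 5% of occupied droplets hold more than one template, and halving the loading roughly halves that share at the cost of more empty droplets.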
Following amplification, the templates must be fragmented to a length compatible with short-read sequencing. Importantly, fragmentation must be performed while maintaining compartmentalization, to prevent pieces of different templates from mixing before barcodes have been attached. To fragment in the droplets, we use a microfluidic device to add Tn5 transposase into each droplet, which randomly fragments and attaches short sequences to the amplified templates25 (Fig. 2b). Because transposases are single-turnover enzymes, an optimal stoichiometric ratio of transposase to templates must be maintained with a 10-fold dilution of the template droplet into the fragmentation droplet. To address this need, we develop a module combining droplet splitting and merging (Fig. 2b and Supplementary Fig. 1). The incoming droplets pass through a junction sampling ~1/10th of their volume, which is then merged with a new droplet approximately equal to the size of the original droplet. This device accomplishes the necessary tasks of diluting the starting droplet and adding the new reagent, while maintaining the droplet size constant throughout the process. After the transposase is added, the droplets are collected into a syringe and incubated in a water bath at 55°C for the transposase reaction.
After the templates have been fragmented, the barcodes used to tag fragments belonging to the same template are attached by overlap-extension PCR in the droplets (Fig. 2c). In this reaction, barcode sequences attach to the fragments through regions of sequence homology on the adaptor sequences added by the transposase. This step thus requires merging three droplets: template, barcode and PCR reagent. We design a triple merger device for merging three droplets at once. Improving on the designs of conventional mergers26, we concatenate multiple merging junctions, which act independently to achieve robust merging of all three droplets (Fig. 2c and Supplementary Fig. 2). The volumes and reagent concentrations of the droplets are controlled to ensure correct stoichiometry for PCR barcoding. In addition, the channels enable one of each type of droplet to combine in the electro-coalescence junction, shown to the right in Fig. 2c. The resultant droplets are 90μm spherical diameter and can coalesce during thermal cycling (see Supplementary Note 1 for details on coalescing droplets). To make them more robust, we split the merged droplets into four portions using a splitter27. The split droplets are collected into PCR tubes and thermally cycled to attach the barcodes. Even with the small size, ~10–50% of droplets coalesce (Supplementary Fig. 3a), which is undesirable since it can lead to multiple templates or barcodes in a single droplet, and hence improper barcoding. We therefore remove these droplets using a combination of gravity-induced and pinched-flow fractionation28 (Supplementary Fig. 3b and Supplementary Methods). The remaining droplets are chemically ruptured and the DNA contents are purified over a spin column, then size selected to remove free barcodes, resulting in a sequence-ready library.
Uniquely barcoding millions of DNA templates requires tens of millions of ‘barcode droplets', each containing a clonal population of one barcode sequence. To generate these barcode droplets, we individually encapsulate and amplify random barcode molecules using the same technique shown in Fig. 2a (also see Supplementary Fig. 4a). Barcode molecules consisting of random N-mers flanked by constant sequences are chemically synthesized and encapsulated with PCR reagents for amplification. The molecules are loaded at a limiting dilution of ~1 in 10 droplets. The droplets are thermally cycled, generating within each loaded droplet a clonal population of amplified product; these droplets can then be merged with the template droplets for the barcoding step shown in Fig. 2c. Using this approach, we generate ~10 million barcode droplets in <1h for ~$10 of PCR reagent, which is sufficient to barcode ~1 million templates in the SMDB workflow.
Because barcode sequences are random, it is possible for two barcodes of the same sequence to label different templates. In in silico simulations, we find that the likelihood of this undesirable event is extremely low for barcodes of sufficient length (Supplementary Fig. 4b). During PCR amplification and sequencing of the barcodes, errors and mutations generate a cloud of related sequences around the original barcode sequence. By sequencing our barcode library, we find that the original barcode sequences are on average a Hamming distance of three from their nearest neighbour, while the sequences within the ‘cloud’ of mutated barcodes around each original barcode are, on average, only a Hamming distance of one from their nearest neighbour (Supplementary Fig. 4c). The mutated barcodes typically comprise <5% of all reads and thus do not represent a significant source of inefficiency. Nevertheless, to account for them, we develop an algorithm to cluster mutated barcodes and their parent sequences into a single ‘barcode cluster’ (Supplementary Note 2). These barcode clusters represent all fragments that originate from the same template, and thus, are used for template analysis, SNP identification and reassembly.
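The idea of folding mutated barcodes into their parent can be illustrated with a minimal sketch. The algorithm actually used is described in Supplementary Note 2 and is not reproduced here; the Python below substitutes a simple greedy, abundance-ordered scheme, and the `max_dist` threshold of one mismatch is an assumption chosen to match the nearest-neighbour distances reported above.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(barcode_reads, max_dist=1):
    """Greedy clustering (illustrative, not the Supplementary Note 2 method).

    Barcodes are visited from most to least abundant. Each one joins the
    first existing parent within max_dist mismatches; otherwise it becomes
    the parent of a new cluster. Returns {observed barcode: parent barcode}.
    """
    counts = Counter(barcode_reads)
    parents = []
    assignment = {}
    # Sort by descending read count, then alphabetically for determinism
    for barcode, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for parent in parents:
            if hamming(barcode, parent) <= max_dist:
                assignment[barcode] = parent
                break
        else:
            parents.append(barcode)
            assignment[barcode] = barcode
    return assignment


# Toy example with 4-mer barcodes (hypothetical data)
reads = ["AAAA"] * 10 + ["AAAT"] + ["CCCC"] * 8 + ["CCGC"] * 2
clusters = cluster_barcodes(reads)
print(clusters)
```

This pairwise scan is quadratic in the number of distinct barcodes, so it is suitable only as an illustration; a library of millions of barcodes would need an indexed or hashed neighbour search.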
A key property of SMDB is its ability to barcode single molecules, which greatly simplifies bioinformatic analysis since all reads in a given cluster are known to originate from only one template. To validate that SMDB indeed barcodes single molecules, we apply it to a library of eight templates from 3 to 5kb long (for details on known template library, see Supplementary Methods). Because only one-tenth of barcode droplets contain barcodes, we expect only one-tenth of encapsulated templates to be barcoded. Starting with ~1M template droplets encapsulated at one in ten droplets containing templates, we expect a theoretical yield of ~10,000 barcoded templates. Practically, the yield of sequenced templates would be lower due to the sample losses incurred during the start-up of microfluidic devices and during the removal of coalesced droplets. Sequencing the library, we obtain ~10 million reads using a MiSeq 2 × 250 run, yielding 3,563 clusters, which represents ~35% of theoretical yield. For perfect barcoding of single molecules, all reads in all clusters should map to only one template. Aligning reads from each cluster to the eight reference sequences, we calculate for each barcode cluster the fraction of reads mapping to the dominant template, defined as the single (out of eight possible) template to which the majority of reads in a cluster map (Fig. 3a). We find that >90% of clusters contain >90% reads mapping to the dominant template. Nevertheless, we observe a low background of <2% of reads mapping to the non-dominant template in less than half of the barcode clusters, which we attribute to mis-tagging, a phenomenon often observed in barcoded sequence libraries prepared in well plates, and thought to originate from chimeric PCR products generated during library amplification and sequencing29. Since many barcode clusters contain some degree of non-dominant template reads, we define clusters containing >90% dominant template as single-template clusters. 
The overwhelming majority (~90%) of clusters are single-template clusters (Fig. 3a, inset). Instances of multiple templates in the same barcode cluster are infrequent, and consistent with the rate of co-encapsulation expected by Poisson statistics (see Supplementary Note 3 for details). Multiple-encapsulations can be reduced by lowering template concentration, which reduces the instances of multiple templates in the same barcode clusters at the expense of barcoding throughput.
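The single-template criterion defined above can be expressed in a few lines. This Python sketch assumes that each read in a barcode cluster has already been aligned and labelled with the name of the reference template it maps to; the template names and the helper functions are hypothetical.

```python
from collections import Counter


def dominant_fraction(template_hits):
    """Return (dominant template, fraction of reads mapping to it)."""
    if not template_hits:
        raise ValueError("cluster contains no mapped reads")
    counts = Counter(template_hits)
    template, n_reads = counts.most_common(1)[0]
    return template, n_reads / len(template_hits)


def is_single_template(template_hits, threshold=0.9):
    """Apply the >90% dominant-template criterion to one barcode cluster."""
    _, fraction = dominant_fraction(template_hits)
    return fraction > threshold


# Hypothetical clusters: per-read template assignments
clean_cluster = ["T3"] * 95 + ["T5"] * 5
mixed_cluster = ["T1"] * 6 + ["T2"] * 4
print(dominant_fraction(clean_cluster), is_single_template(clean_cluster))
print(dominant_fraction(mixed_cluster), is_single_template(mixed_cluster))
```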
The ideal sequencing data provides full-length, high-accuracy coverage of all templates in the sample. However, bias in sequencing can yield excessive coverage in certain regions and insufficient coverage in others. To investigate whether our approach is susceptible to such bias, we plot the coverage distribution for each template (Fig. 3b and Supplementary Fig. 5). We observe systematic coverage bias for all templates, much of which correlates with local GC content, and hence, is likely the result of the PCR amplification of the libraries for sequencing30. We also observe decreased coverage at the ends of templates, a known bias of transposase fragmentation25. Thus, the primary forms of bias in our data are the same as those observed in standard NGS, and result from the same sources.
To quantify how bias affects coverage, we define the coverage entropy as the informational entropy of the coverage distribution for each barcode cluster (see Supplementary Note 4 for discussion on coverage entropy). Clusters with high-coverage entropy exhibit flat distributions with uniform coverage, while the clusters with low-coverage entropy exhibit ‘peaky’ distributions with non-uniform coverage. Consequently, coverage entropy is a good predictor of whether a cluster contains sufficient information to reassemble a template, and is thus an overall good metric for coverage uniformity (Supplementary Fig. 6a). Plotting the coverage entropy of each barcode cluster against the number of reads contained within it, we observe two populations, one in which entropy saturates rapidly with coverage (upper left) and another in which entropy rises more slowly (Fig. 3c). The clusters where entropy rises slowly with number of reads are more biased, and therefore require more sequencing to obtain the requisite information for assembly. On the basis of our results, an entropy >7 is required for successful assembly (Supplementary Fig. 6a). This corresponds to >100 reads in the barcode cluster (Fig. 3c). Therefore, one measure for the efficient utilization of sequencing reads is the number of barcode clusters with >100 reads obtained for a fixed amount of total sequencing reads used (Supplementary Fig. 7). While more sequencing produces more viable barcode clusters, exhaustively sequencing the library results in inefficient utilization of reads.
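A minimal sketch of such a metric is given below. The exact definition used here is in Supplementary Note 4; this Python example assumes the Shannon entropy, in bits, of the per-position read depth normalized to sum to one. Both the logarithm base and the per-base binning are assumptions, so the values it returns may not be directly comparable to the threshold of 7 quoted above.

```python
import math


def coverage_entropy(coverage):
    """Shannon entropy (bits) of a per-position coverage profile.

    coverage is a sequence of non-negative read depths, one per position.
    Positions with zero depth contribute nothing to the sum.
    """
    total = sum(coverage)
    if total == 0:
        return 0.0
    entropy = 0.0
    for depth in coverage:
        if depth > 0:
            p = depth / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical profiles: uniform versus fully 'peaky' coverage
uniform = [20] * 1024
peaky = [0] * 1023 + [20480]
print(coverage_entropy(uniform))
print(coverage_entropy(peaky))
```

Under this definition, a perfectly uniform profile over 1,024 positions gives 10 bits, while a cluster whose reads all pile onto one position gives 0.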
An important application of NGS is to detect rare single-nucleotide polymorphisms (SNPs) in heterogeneous populations, such as viruses, cells or human beings8,10,31,32. Characterizing that SNPs are physically linked on the same template, called haplotyping, is important for understanding how multiple variants at distant loci can contribute to a given phenotype. However, performing these tasks with conventional NGS is often extremely challenging or impossible due to the inability of the short reads to span multiple SNPs. Moreover, standard NGS is error-prone, generating one error in every ~1,000 bases; this prevents confident detection of rare variants without accepting a large proportion of false-positives8,9,33. To enhance sensitivity, known patterns of error production can be modelled and used to correct data, providing modest improvements8. Molecular techniques can greatly increase sensitivity to detect rare SNPs but reduce read length even further34.
SMDB is able to confidently detect rare SNPs because each molecule is sequenced to great depth, allowing reads to be ‘averaged together’ to obtain an accurate consensus for every base. To demonstrate this, we generate a population of DNA templates via 35-cycle PCR of a bacterial plasmid extracted from a culture grown from a single colony. In this population, every sequence shares significant homology, but rare variants exist. Variants like these can have important biological consequences, such as allowing HIV to evolve drug resistance or the development of rare alleles that increase risk for disease in human populations11,33. We sequence the population using SMDB on a MiSeq 2 × 150 run, obtaining 4.6 million reads in ~6,000 barcode clusters. Because each barcode cluster represents fragments amplified from a single molecule, we expect a fraction of the fragments, and therefore reads, to contain amplification errors. In the worst-case scenario where an error is made in the first round of amplification, we expect ~50% of the reads to be erroneous for any one position in the sequence. Since these cases are reported as di-allelic SNPs by the SNP-caller, we keep only the mono-allelic SNP calls to ensure the highest accuracy of our mutation calls. We identify 457 high-confidence SNPs in ~10% of templates, whereas ~90% of the templates contain no SNPs compared to the reference (Fig. 4a and Supplementary Fig. 8). With the exception of SNP C1067G existing in ~5.5% of templates, all others are present in <0.1% of the templates, far below the limit of detection for standard NGS. To compare our results to standard SNP calling methods, which do not use barcode information, we call SNPs while disregarding the barcode grouping of reads and detect only the C1067G variant. Hence, SMDB amplifies the sensitivity of sequencing and allows capture of biological information invisible to standard methods.
Unlike conventional NGS, the limit of detection of SMDB scales with the number of molecules sequenced and can be easily orders of magnitude more sensitive than conventional NGS (Fig. 4b).
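The per-cluster consensus logic described above can be sketched as a simple pileup. This is not the SNP-caller used in this work; it is a simplified Python illustration that assumes ungapped alignments given as (start, sequence) pairs with 0-based positions, and the `min_depth` and `min_frac` thresholds are hypothetical. Keeping only positions where a single non-reference base dominates mimics the retention of mono-allelic calls.

```python
from collections import Counter


def call_cluster_snps(reference, aligned_reads, min_depth=5, min_frac=0.9):
    """Call mono-allelic SNPs within one barcode cluster (illustrative).

    aligned_reads: iterable of (start, seq) ungapped alignments, 0-based.
    Returns a list of (position, reference base, variant base).
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to form a consensus
        base, n_reads = counts.most_common(1)[0]
        # Keep only positions where one non-reference base dominates
        if base != reference[pos] and n_reads / depth >= min_frac:
            snps.append((pos, reference[pos], base))
    return snps


# Hypothetical cluster: a true G>C variant at position 2 in every read,
# plus a single read carrying a sequencing error at position 5
reference = "ACGTACGT"
variant_cluster = [(0, "ACCTACGT")] * 10 + [(0, "ACCTAGGT")]
# Hypothetical cluster: a 50/50 split at position 5 (early amplification error)
split_cluster = [(0, "ACGTACGT")] * 5 + [(0, "ACGTAAGT")] * 5
print(call_cluster_snps(reference, variant_cluster))
print(call_cluster_snps(reference, split_cluster))
```

In the first cluster, the isolated error at position 5 is outvoted by the other reads, while in the second the evenly split position is rejected rather than reported as a variant.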
In addition to detecting rare SNPs, SMDB naturally generates haplotypes, which are important for characterizing mutations that have synergistic effects and are broadly relevant from virus evolution to human genetics35,36. SMDB provides haplotyping information because SNPs that occur on the same template are grouped into the same barcode cluster, allowing haplotypes to be confidently identified for each template. To demonstrate SMDB haplotyping, we plot the haplotypes determined by SMDB in a phylogenetic tree, allowing us to determine the order of mutations that occurred during replication (Fig. 4c). The mutations in the population are generated by replication, and thus, in the absence of selection, ones that occur early in replication exist in a large subset of the progeny. The phylogenetic tree shows that C1067G was the first mutation that arose in the population, consistent with the fact that C1067G mutation is the most abundant SNP.
De novo assembly, the process of piecing together short reads into long ‘contigs', is necessary to extract useful information from short reads when a reference sequence is not available, such as when sequencing new genomes or metagenomes37,38. Despite years of improvement, de novo assemblers continue to struggle with datasets comprising multiple sets of highly homologous sequences18,37,38. In some cases, de novo assembly is practically impossible because the information needed to uniquely generate a contig spans a length beyond the accessible read length of short-read sequencing. SMDB simplifies de novo assembly by ensuring that all reads in a cluster originate from one template, allowing unambiguous assembly of a contig that was previously impossible when all reads from all templates must be considered concurrently.
To demonstrate de novo assembly with SMDB, we sequence a test library of known templates 3–5kb long with a MiSeq 2 × 250, obtaining ~9 million reads clustering into 2,043 groups. We perform de novo assembly on each barcode cluster independently, yielding 245 contigs >2kb long. The contigs span a range of lengths, and a significant portion of the assembled contigs cover the full length of the templates (Fig. 5a). To account for low-read coverage at the ends of the templates due to biased transposase insertion, we trim the first and last 250bp of the contigs. The resultant sequences are accurate when compared to the known reference sequences, having an overall error rate of 4.3 × 10⁻⁴ per base and no detectable structural variations or chimeras. If the errors in the contigs are artifacts of assembly or sequencing, we expect them to be negatively correlated with the coverage entropy of the barcode groups used to assemble them. However, we find contig accuracy is independent of coverage entropy, and rather, depends slightly on position in the contig (Fig. 5b and inset). This is reminiscent of the pattern of SNPs seen in the previous experiment (Fig. 4a), indicating that these are likely rare SNPs rather than errors in the assembled contigs.
Theoretically, any DNA template can be barcoded by SMDB if it can be encapsulated and amplified. However, PCR amplification becomes inefficient for templates longer than 5kb. To sequence molecules longer than this, we implement multiple displacement amplification (MDA), a non-specific, isothermal method that can amplify whole genomes39. We generate fragments of the E. coli genome 7–10kb in length and sequence the resulting library on a MiSeq 2 × 300 run from which we obtain ~13 million reads clustering into ~1,000 groups after quality filtering. As expected, de novo assembly with barcodes yields significantly longer and more accurate contigs than assembly without barcodes (Fig. 5c and inset). Interestingly, ~26% of these contigs do not map to the E. coli genome, but to other bacterial genomes in the NCBI refseq database, and thus represent contaminating DNA in the library rather than sequencing errors (Supplementary Fig. 6b,c). Thus, SMDB enables sequencing of long templates with arbitrary sequence, but care must be taken to limit contamination.
A challenge when performing molecular biology reactions in droplets is that, often, multiple reagents must be added to the droplets at different times. Since reagent addition always increases the size of the droplets, adding multiple reagents can produce final droplets that are too large to be robustly handled. To perform reagent addition while maintaining droplets at a reasonable size, we have developed a split–merge device that combines droplet splitting with droplet merger26,40. This device has the unique and valuable property of producing final droplets that are equal in size to the initial droplets; hence, this same device can be used to perform multiple additions on an emulsion while maintaining constant droplet size. The degree of dilution can be adjusted by varying the amount sampled from the split droplet, which is adjusted by controlling the flow rate of the splitting outlet. This obviates the need to construct a unique device with increasing dimensions for each round of reagent addition, and maintains the droplets in the size range that is optimized for handling and incubation. The split–merge device should be valuable when multiple reagent additions must be performed on an emulsion—a task that has thus far been a significant challenge for droplet microfluidic workflows.
The random Poisson encapsulation of templates and barcodes is a source of inefficiency in SMDB, but one that is overcome by leveraging the ultrahigh-throughput nature of droplet microfluidics. To ensure that most templates are paired with a single barcode, barcodes and templates are loaded at ~1 in 10 per droplet, yielding a single pairing event for ~1 in 100 droplets. Even with this inefficiency, the throughput of our device enables barcoding of ~3,500 molecules in ~15min. Assuming a modest template length of 5kb, this is sufficient to cover an E. coli genome at ~5 × coverage. With higher-throughput droplet generation and manipulation, such as emulsification under jetting conditions41 and parallelization of channel networks42,43, it should be possible to increase throughput by an order of magnitude. In addition, the template and barcode emulsions can be sorted to discard empty droplets, which should increase efficiency ~10-fold by ensuring that every pairing event comprises one of each component with no wasted droplets.
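The pairing arithmetic above can be checked with a short calculation. This Python sketch treats template and barcode loading as independent events with the approximate rates quoted in the text (0.1 each), and uses the droplet and cluster counts reported for the eight-template validation library; the function names are illustrative only.

```python
def expected_pairings(n_template_droplets, template_loading=0.1, barcode_loading=0.1):
    """Expected number of droplets pairing one template with one barcode.

    Assumes template and barcode loading are independent events.
    """
    return n_template_droplets * template_loading * barcode_loading


def droplets_needed(target_pairings, template_loading=0.1, barcode_loading=0.1):
    """Template droplets required to reach a target number of pairings."""
    return target_pairings / (template_loading * barcode_loading)


# Values reported for the eight-template validation library
theoretical_yield = expected_pairings(1_000_000)
observed_clusters = 3563
recovery = observed_clusters / theoretical_yield

print(f"theoretical yield: {theoretical_yield:.0f} barcoded templates")
print(f"recovered fraction: {recovery:.1%}")
print(f"droplets for 10,000 pairings: {droplets_needed(10_000):.0f}")
```

Raising either loading rate increases the number of pairings per droplet but also increases the share of droplets holding multiple templates or barcodes, which is the trade-off that sorting out empty droplets is intended to avoid.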
Encapsulation of templates into small volumes reduces amplification bias during PCR but also limits the amount of DNA generated for each barcoded template. Therefore, the number of starting templates is directly correlated with the amount of DNA obtained at the end of the workflow. We have empirically determined that >10,000 productive droplets are required to provide the minimum ~20 nanomoles for sequencing after accounting for sample loss through the workflow. Although it is possible to additionally PCR amplify lower yield libraries, this results in more bias, yielding uneven coverage of templates, and uneven distribution of reads into barcode groups.
Droplet microfluidic workflows have been successfully adapted into non-microfluidic labs through collaboration with labs with microfluidic expertise21,22. For labs interested in adopting SMDB, we suggest collaborating with a droplet microfluidics lab, because although the fabrication and operation of the microfluidic devices are straightforward, the handling of droplets outside of devices is quite nuanced. Dolomite, a company dedicated to providing off-the-shelf and custom designed droplet microfluidic devices for research, is also an excellent resource for implementing droplet microfluidics workflows into the lab.
New technologies for sequencing DNA while retaining long-range information are becoming available20,44. While these technologies share some similarity to ours, there are critical differences that make each approach better or worse for different applications. For example, recent methods that encapsulate many template molecules in each droplet provide very high throughput and are an inexpensive solution for barcoding large amounts of DNA, but the resulting sequence data cannot be deconvoluted back to single molecules since within each barcode cluster (droplet) many templates of different sequences exist. This may be acceptable for applications in which the templates are highly dissimilar or in which single-molecule resolution is not required, but in others it may prove problematic. In particular, for samples in which the molecules share significant homology but small sequence differences are biologically relevant, such as when studying viral diversity and evolution, these technologies are ineffective and the SMDB approach is better suited. A similar technology specifically targeted to sequence human genomes is available and therefore applications of SMDB to human genome sequencing are not investigated45.
We have applied SMDB to the barcoding of single DNA molecules from virus and microbial genomes, but the principle of encapsulating and barcoding nucleic acids in microfluidic droplets is broadly applicable. For example, droplet microfluidics has been used to encapsulate, lyse, and amplify single viruses and cells46,47. The SMDB workflow we describe here could be combined with these methods to barcode the genomes of these organisms, to perform whole-genome single virus or cell sequencing. This could make the barcoding workflow valuable for characterizing genetic reassortment in seasonal influenza. Indeed, while barcoding up to ~10,000 single entities is immediately practical with the methods we describe, if single cells rather than long templates were to be barcoded, the number of individual genomes that can be sequenced is limited by the sequencing throughput of NGS. Even with the massive capacity available with present-day instruments, it is not enough to fully leverage the throughput of our droplet method. However, as sequencing instruments continue to decrease in cost and increase in throughput, sequencing large barcoded populations of cells and viruses should become practical, impacting applications in which genetic diversity is important, such as in microbial communities.
Photoresist masters are created by spinning on a layer of photoresist SU-8 3025 (Microchem) onto a 3-inch silicon wafer (University Wafer) at 3,000rpm, then baking at 95°C for 5min. Then, the photoresist is subjected to 3min ultraviolet exposure over photolithography masks (CAD/Art Services) printed at 12,000 DPI. After ultraviolet exposure, the wafers are baked at 95°C for 10min then developed for 10min in fresh propylene glycol monomethyl ether acetate (Sigma Aldrich) then rinsed with fresh propylene glycol monomethyl ether acetate and baked at 95°C for 5min to remove solvent. To fabricate the triple merger device, a second layer of photoresist is patterned on top of the first layer after the first ultraviolet exposure to generate a two-layered master. The microfluidic devices are fabricated by curing poly(dimethylsiloxane) (10.5:1 polymer-to-crosslinker ratio) over the photoresist master48. The devices are cured in an 80°C oven for 1h, extracted with a scalpel, and inlet ports added using a 0.75mm biopsy core (World Precision Instruments, catalogue no. 504529). The device is bonded to a glass slide using O2 plasma treatment and channels are treated with Aquapel (PPG Industries) to render them hydrophobic. Finally, the devices are baked at 80°C for 10min to dry the Aquapel before they are ready for use.
Chemically synthesized barcode oligonucleotides (GCAGCTGGCGTAATAGCGAGTACAATCTGCTCTGATGCCGCATAGNNNNNNNNNNNNNNNTAAGCCAGCCCCGACACT) (IDT) are added at 0.01pM concentration into a PCR reaction mix containing 1 × NEB Hotstart Phusion polymerase (NEB, catalogue no. M0536L), 2% w/v Tween 20, 2% w/v PEG 6000, 400nM forward and reverse primers (FL128 CTGTCTCTTATACACATCTCCGAGCCCACGAGACGTGTCGGGGCTGGCTTA) (FL129 CAAGCAGAAGACGGCATACGAGATCAGCTGGCGTAATAGCG). The reaction mixture and HFE 7500 fluorinated oil (3M) with 2% (w/w) PEG-PFPE amphiphilic block copolymer surfactant (Ran Biotechnologies) are loaded into separate 1ml syringes and injected at 300 and 500μlh−1, respectively, into a flow-focusing droplet maker using syringe pumps (New Era, catalogue no. NE-501) controlled with a custom Python script (https://github.com/AbateLab/Pump-Control-Program). After collecting the emulsion in PCR tubes, the oil underneath the emulsion is removed using a pipette and replaced with FC-40 fluorinated oil (Sigma Aldrich, catalogue no. 51142-49-5) with 5% (w/w) PEG-PFPE amphiphilic block copolymer surfactant for improved thermal stability (see Supplementary Note 1 for details on thermostability). The emulsion is transferred to a T100 thermocycler (BioRad) and thermally cycled with the following program: 98°C for 3min, followed by 40 cycles with 2°C per second ramp rates of 98°C for 10s, 62°C for 20s and 72°C for 20s, followed by a final hold at 12°C. SYBR staining using 10 × SYBR GREEN I in HFE 7500 oil is used to quantify encapsulation rate under a fluorescent microscope.
For SMDB using PCR, DNA template molecules are encapsulated and amplified in the same manner as described above, except that the primers used are FL178 (CCACTACGCCTCCGCTTTC) and FL179 (CCATCTCATCCCTGCGTGT), and the input DNA is a library of long molecules with universal adaptors on either side. Input DNA concentration is adjusted until one in ten droplets is fluorescent under SYBR staining. To construct the library of seven known templates, eight DNA templates are amplified from 5ng of phage lambda genomic DNA (NEB: N3013S) using 500nM of primer sets (see oligonucleotides listed in Supplementary Table 1) and 1 × NEB Phusion Hotstart Flex mix (NEB: M0536S) with the following cycling conditions: 98°C 3min, 35 cycles of: 98°C 15s, 62°C 30s, 72°C 3min, followed by 72°C 5min and optional holding at 12°C overnight. The PCR products are gel-extracted using a 1% agarose gel and the Zymo gel extraction kit. To attach constant sequence adaptors to all the fragments, 100ng of gel-extracted amplicons are added to an adaptor ligation mix of: 1μM adaptors, 0.20mM dNTPs, 0.5μl (60 units) of Bst 2.0 polymerase warmstart (NEB: M0538M), 2.5μl T4 DNA ligase from the quick ligation kit (NEB: M2200S) and 1 × ligase buffer from the quick ligation kit. The reaction is incubated at 25°C for 15min then 65°C for 10min for heat inactivation, then the DNA is purified using the Zymo DNA concentrator kit. The concentration of the resulting DNA is quantified using the Bioanalyzer high sensitivity kit, and the products are pooled at equal molar concentration to generate the eight-template library.
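The one-in-ten occupancy target can be related to the expected rate of droplets holding more than one template. The Python sketch below assumes that template loading follows Poisson statistics and that a droplet is fluorescent exactly when it contains at least one template; both are simplifying assumptions, and the function name is illustrative rather than part of the published workflow.

```python
import math


def poisson_loading(fraction_occupied: float) -> dict:
    """Loading statistics implied by the fraction of droplets that hold
    at least one template, assuming Poisson-distributed encapsulation."""
    if not 0.0 < fraction_occupied < 1.0:
        raise ValueError("fraction_occupied must lie strictly between 0 and 1")
    # P(k = 0) = exp(-lambda) = 1 - fraction_occupied
    mean_per_droplet = -math.log(1.0 - fraction_occupied)
    p_empty = 1.0 - fraction_occupied
    p_single = mean_per_droplet * math.exp(-mean_per_droplet)
    p_multiple = 1.0 - p_empty - p_single
    return {
        "mean_templates_per_droplet": mean_per_droplet,
        "p_single": p_single,
        "p_multiple": p_multiple,
        # Share of the occupied (fluorescent) droplets with >1 template.
        "multiple_among_occupied": p_multiple / fraction_occupied,
    }


stats = poisson_loading(0.10)  # one in ten droplets fluorescent
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions, an occupancy of one in ten corresponds to roughly 0.105 templates per droplet on average, with about 5% of the occupied droplets containing more than one template.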
For SMDB using MDA, reactions are performed using REPLI-g single cell kit (Qiagen, catalogue no. 150343). E. coli genomic fragments are from E. coli K12(DH10B) cells purchased from New England BioLabs (catalogue no. C3019H), lysed and purified using PureLink Genomic DNA Mini Kit (Life Technologies, catalogue no. K1820-00). Ten kilobase fragments are gel-extracted following a 10-min digestion with NEBNext dsDNA Fragmentase (NEB, catalogue no. M0348S) of 800ng DNA and quantified using a NanoDrop (Thermo Scientific). The fragmented input DNA is incubated with 3μl Buffer D2 and 3μl H2O for 10min at 65°C. After stopping by adding 3μl stop solution, a master mix comprising nuclease-free H2O, REPLI-g reaction buffer, and REPLI-g DNA polymerase is added. The MDA reactions are then emulsified in the manner described above and incubated at 30°C for 3h then 70°C for 20min for heat inactivation.
Droplets containing amplified templates, a Nextera Transposase reaction mixture composed of 1 × TD buffer, 2% w/v Tween 20, 2% w/v PEG 6000, and 1/10 volume of TDE from Nextera Kit (Illumina, catalogue no. FC-121-1031, or purified in lab as described49), deionized water, HFE 7500 with 2% w/v EA surfactant and 2M NaCl are loaded into 1ml syringes (BD scientific) and connected to the split-merge microfluidic device (Supplementary Fig. 1). The electrode is connected by clipping the output of a cold cathode fluorescent inverter connected to a DC power supply (Mastech) to the needle of the electrode syringe using an alligator clip. Setting a voltage of 2.0V at the power supply results in a ~2kV AC at the electrode, which causes droplets close to the electrode to merge. The resulting emulsion is collected in a 1ml syringe and incubated at 55°C for 10min and then 70°C for 20min in large water baths.
Fragmented template droplets, barcode droplets and a PCR mixture composed of 1 × Invitrogen Platinum Multiplex mix (ThermoFisher, catalogue no. 4464268), 400nM primers FL127 (AATGATACGGCGACCACCGAGATCTACACTCGTCGGCAGCGTC) and FL129 (CAAGCAGAAGACGGCATACGAGATCAGCTGGCGTAATAGCG), a 1 in 50 dilution of the NT buffer from the Nextera XT Kit (0.2% SDS) (Illumina, catalogue no. FC-131-1024), 1% w/v Tween 20, 1% w/v PEG 6000 and 2.5Uμl−1 Bst Polymerase 2.0 Warmstart (NEB, catalogue no. M0538S) are loaded into a syringe and injected into the double merger device as shown in Supplementary Fig. 2. The emulsion is collected in a 0.5ml thin-walled PCR tube, and the oil is replaced with FC-40 with 5% w/v EA surfactant before thermal cycling at: 65°C for 5min, 95°C for 2min, then 25 cycles at 2°C/s ramp rates of 95°C for 15s, 60°C for 1min, 72°C for 1min, and then 72°C for 5min followed by an optional 12°C hold overnight. After thermal cycling, the oil is replaced with HFE 7500 with 2% w/v EA surfactant, and the emulsion is loaded into a syringe and injected into a pinched-flow fractionation device to remove large droplets as shown in Supplementary Fig. 3b. After removal of large droplets, the emulsion is broken by adding 20μl of 1H,1H,2H,2H-Perfluoro-1-octanol (Sigma Aldrich, catalogue no. 370533) followed by brief centrifugation in a micro-centrifuge. The aqueous top phase is collected and the DNA is purified using a Zymo DNA concentrator kit.
Overall, 2ng of the barcoded library is added to a PCR mixture containing 1 × Phusion master mix and 400nM of primers FL127/129 and thermally cycled as follows: 98°C 3min, 10 cycles of 98°C 10s, 62°C 20s, 72°C 1min, followed by 72°C 5min. The resulting DNA is loaded into a Blue Pippin (Sage Biosciences) 100–600bp cassette to extract DNA from 300–700bp range to remove free untagged barcodes. The resultant DNA is concentrated using a Zymo DNA concentrator kit, quantified using the Bioanalyzer high sensitivity DNA chip (Agilent) and sequenced on the MiSeq using a custom index primer (FL166).
Barcodes are clustered using a Python program, dfs clustering (Supplementary Note 2), which takes raw MiSeq FASTQ files as input and outputs barcode clusters and the IDs of their associated reads, along with quality control metrics. Barcode clusters containing <500 reads are removed because they contain too few reads for analysis. For SNP calling, reads from each barcode cluster are mapped onto the template sequence using Bowtie 2 (ref. 50) with the --very-sensitive-local setting, output as a SAM file and then converted into a BAM file using Samtools. To call SNPs, Samtools v1.2 mpileup is used with options -d 9999 -u -V -I, and Bcftools call is used with options -c -v to retain only positions that contain SNPs. The SNPs are then filtered for homozygous calls, as there should be only one template per barcode cluster. The phylogenetic tree is constructed from consensus sequences generated by replacing each position of the reference with the SNP called for each barcode cluster. Duplicate sequences are removed, and the list of non-redundant consensus sequences is then used to generate a phylogenetic tree using the maximum likelihood method in Phylip v3.696. For de novo assembly, reads from each barcode cluster are written into one file in FASTA format with bases lower than Q20 replaced with N. Each FASTA file is fed into the IDBA-UD assembler v1.1.1 (ref. 51) with parameters --mink 20 --maxk 120 --step 20 --min_contig 2000 --min_count 1 --max_mismatch 3.
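Two of the pre-processing steps above, discarding small barcode clusters and masking low-quality bases before assembly, can be illustrated with a short Python sketch. This is not the dfs clustering program from Supplementary Note 2. The in-memory data layout, the function names and the Phred+33 quality encoding are assumptions made for illustration; only the 500-read and Q20 thresholds come from the text.

```python
from typing import Dict, List, Tuple

Read = Tuple[str, str, str]  # (read ID, bases, FASTQ quality string)

MIN_READS_PER_CLUSTER = 500  # from the text: smaller clusters are removed
MIN_BASE_QUALITY = 20        # from the text: bases below Q20 become N
PHRED_OFFSET = 33            # assumption: Phred+33 FASTQ encoding


def filter_clusters(clusters: Dict[str, List[Read]],
                    min_reads: int = MIN_READS_PER_CLUSTER) -> Dict[str, List[Read]]:
    """Keep only barcode clusters with at least min_reads reads."""
    return {bc: reads for bc, reads in clusters.items() if len(reads) >= min_reads}


def mask_low_quality(seq: str, qual: str,
                     min_q: int = MIN_BASE_QUALITY,
                     offset: int = PHRED_OFFSET) -> str:
    """Replace every base whose Phred score is below min_q with N."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return "".join(base if ord(q) - offset >= min_q else "N"
                   for base, q in zip(seq, qual))


def cluster_to_fasta(reads: List[Read]) -> str:
    """Render one cluster's reads as FASTA text with low-quality bases masked."""
    lines = []
    for read_id, seq, qual in reads:
        lines.append(f">{read_id}")
        lines.append(mask_low_quality(seq, qual))
    return "\n".join(lines) + "\n"


# Toy example: 'I' encodes Q40, '5' encodes Q20 and '#' encodes Q2.
example_reads = [("read1", "ACGT", "I5#I")]
print(cluster_to_fasta(example_reads), end="")
```

In this sketch a base at exactly Q20 is kept, matching the statement that only bases lower than Q20 are replaced. Each resulting FASTA string would correspond to one per-cluster input file for the assembler.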
Sequencing data generated from SMDB of the seven templates control and E. coli genomic fragments libraries are available at the Sequence Read Archive (SRA) under accession code SRP072529.
How to cite this article: Lan, F. et al. Droplet barcoding for massively parallel single-molecule deep sequencing. Nat. Commun. 7:11784 doi: 10.1038/ncomms11784 (2016).
Supplementary Figures 1-8, Supplementary Table 1, Supplementary Notes 1-4 and Supplementary Methods.
We thank R. Hernandez and R. Andino for helpful scientific discussions. We thank Eric Chow and the Center for Advanced Technologies at UCSF for technical expertise with sequencing. We thank B. Demaree, S. Poust and C.Q. Lan for helpful comments on the manuscript. This work was supported by the National Science Foundation through a CAREER Award (Grant Number DBI-1253293); the National Institutes of Health (NIH) (Grant Numbers HG007233-01, R01-EB019453-01 and DP2-AR068129-01); and the Defense Advanced Research Projects Agency Living Foundries Program (Contract Numbers HR0011-12-C-0065, N66001-12-C-4211 and HR0011-12-C-0066). Funding for open access charge: (NIH grant number DP2-AR068129-01).
Author contributions: F.L. and A.R.A. proposed the concept and prepared the manuscript. J.H. contributed to the conceptualization and design of the droplet barcodes. F.L. performed the experiments and analysis of data. A.Y. designed and implemented the barcode clustering algorithm.