In vitro selection has been an essential tool in the development of recombinant antibodies against various antigen targets. Deep sequencing has recently been gaining ground as an alternative and valuable method to analyze such antibody selections. The analysis provides a novel and extremely detailed view of selected antibody populations, and allows the identification of specific antibodies using only sequencing data, potentially eliminating the need for expensive and laborious low-throughput screening methods such as enzyme-linked immunosorbent assay. The high cost and the need for bioinformatics experts and powerful computer clusters, however, have limited the general use of deep sequencing in antibody selections. Here, we describe the AbMining ToolBox, an open source software package for the straightforward analysis of antibody libraries sequenced by the three main next-generation sequencing platforms (454, Ion Torrent, MiSeq). The ToolBox is able to identify heavy chain CDR3s as effectively as more computationally intense software, and can be easily adapted to analyze other portions of antibody variable genes, as well as the selection outputs of libraries based on different scaffolds. The software runs on all common operating systems (Microsoft Windows, Mac OS X, Linux), on standard personal computers, and sequence analysis of 1–2 million reads can be accomplished in 10–15 min, a fraction of the time of competing software. Use of the ToolBox will allow the average researcher to incorporate deep sequence analysis into routine selections from antibody display libraries.
The selection of antibodies using in vitro methods, including phage,1 yeast2 and ribosome3 display has transformed the generation of therapeutic antibodies,4 and promises to do the same for research-quality antibodies.5,6 In particular, the ability to improve affinity,7,8 and select antibodies lacking cross-reactivity to closely related proteins5,6 can be performed relatively easily using in vitro methods, but requires extensive screening when traditional methods are used to generate monoclonal antibodies.
Until recently, the analysis of such antibody display libraries has been performed in a relatively blind fashion, with a moderately small number (96–384) of randomly picked clones being analyzed by enzyme-linked immunosorbent assay after the selection is complete, to identify binders for the target of interest. In phage and ribosome display, this is the only point at which concrete information on antibody activity can be obtained during a selection, and is the last step of the selection.
Antibodies are best characterized by full sequencing of the VH and VL domains. In the single chain fragment variable (scFv) format, this requires reads of at least 800 base pairs (bp), which is only obtainable with high quality Sanger sequencing.9 The complementarity-determining regions (CDRs) of an antibody are the hypervariable loops responsible for binding to antigen, of which the heavy chain CDR3 (HCDR3) is the most diverse, and widely used as a surrogate for VH and scFv identity.10-12 HCDR3s are generated by the random combination of germline V, D and J genes,13,14 with additional junctional diversity created by nucleotide addition or loss (for a review see refs. 15–17), and subsequent targeted somatic hypermutation.18,19 As opposed to full-length scFv, the identification of specific HCDR3s requires far shorter reads, and provides a minimum assessment of diversity, in that VH domains with the same HCDR3 may contain additional differences elsewhere in the VH, or they may be paired with different light chains. In general, it is the HCDR3 that provides antibodies with their primary specificity.11,20
Deep sequencing21-23 refers to sequencing methods producing orders of magnitude more reads than traditional Sanger sequencing. Until recently, these technologies were dominated by systems that were expensive to purchase and operate, and required extensive preparation time before results could be obtained. They have been widely applied to the sequencing and analysis of genomes, and more recently to the investigation of diverse library selections,24-29 including the analysis of both in vitro antibody libraries24,26 and in vivo antibody repertoires,12,25,30-32 where HCDR3 is usually used as an antibody identifier. The results obtained from the analysis of library selections indicate that when only 96 or 384 clones are screened, many abundant, and potentially valuable clones, are lost,24,27 a result confirmed with peptide libraries,28,33 whereas if deep sequencing is applied to selection outputs, the most abundant clones can be unambiguously identified and isolated using specific primers. This also allows access to a far greater diversity of positive clones than the number obtained by random screening.34
To enable the use of deep sequencing methods more broadly in selections, the cost of sequencing and the downstream processes need to be streamlined. “Bench-top sequencers” (for review see ref. 35), are laser-printer sized, inexpensive to purchase and run and provide results in a matter of hours, rather than days, making them of great potential utility in this field. Sequence analysis is also challenging and generally performed by experts using specialized computer clusters. In this paper, we compare three different sequencing platforms (454, MiSeq and Ion Torrent PGM) and describe their straightforward implementation to both the analysis of a well-characterized naïve antibody library36 and selections from it. We provide the necessary HCDR3 primer sequences and easy-to-use open source informatics tools to make deep sequencing routinely available for antibody selection analysis (http://sourceforge.net/projects/abmining/).
The identification of HCDR3s is inherently difficult because of their extreme diversity: authentic HCDR3s may have features that render them atypical, even when functional. VDJFasta26 is a successful algorithm that uses a Hidden Markov Model to statistically analyze sequences upstream and downstream of putative HCDR3s. Although effective on 454 data, VDJFasta is unsuitable for the shorter MiSeq and Ion Torrent reads because of their length. We developed a new HCDR3 recognition software package based on a regular expression (RegEx) pattern, in which nucleic acid sequences encoding critical amino acids (aa) characteristic of HCDR3s and flanking sequences are used as identifiers. A naïve antibody library36 was sequenced using 454, MiSeq and Ion Torrent: a schematic representation of the primers mapping on the scFv is shown in Figure 1. The primers used are shown in Table 1, with a summary of the complete sequencing results reported in Table 2. The methods used to sequence using MiSeq and Ion Torrent are reported below. HCDR3s were identified in the 454 data set using either RegEx or VDJFasta. RegEx analysis was ~1 000 times faster than VDJFasta, and could be performed on a single personal computer, rather than a computer cluster. RegEx accuracy was shown to be comparable to VDJFasta by comparing the HCDR3s identified by the two algorithms. 84% of HCDR3s were recognized by both algorithms (Fig. 2A and 2B), the cumulative total of identified HCDR3s ranked by the corresponding number of occurrences was identical for both (Fig. 2C), as was the length distribution of HCDR3s identified using RegEx or VDJFasta37 (Fig. 2D). Furthermore, the aa distribution at each position for all HCDR3s was essentially identical for HCDR3s recognized by either, or both, algorithms (Fig. 3A). Finally, we observed that the number of unique HCDR3s identified by RegEx in the 454 data set was ~9% higher than the number identified by VDJFasta (Table 2; Fig. 2B), and that for any specific HCDR3 in this data set, RegEx identified ~10% more clones than VDJFasta. These data indicate that the VDJFasta identification parameters were occasionally too stringent, and appeared to exclude HCDR3s that otherwise appeared to be valid. Although there may be slight differences between the HCDR3s identified by the two algorithms, reflecting the innate difficulty of identifying HCDR3s, the majority are identified by both programs, making RegEx a valid, and extremely rapid, alternative to VDJFasta.
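The idea of recognizing HCDR3s with a regular expression on the nucleotide sequence can be illustrated with a minimal sketch. The pattern below is a deliberately simplified, hypothetical one and is not the pattern distributed with the AbMining ToolBox: it assumes only that the HCDR3 lies in frame between the codon for the conserved cysteine at the end of framework 3 and the codons for the W-G-x-G motif at the start of framework 4. The names `HCDR3_DNA`, `translate` and `find_hcdr3` are illustrative, and where the HCDR3 is taken to begin depends on the numbering scheme used.

```python
import re

# Hypothetical, simplified pattern (an assumption, NOT the ToolBox pattern):
#   TG[TC]                  codon of the conserved Cys preceding the HCDR3
#   ((?:[ACGT]{3}){3,30}?)  3 to 30 in-frame codons, captured lazily
#   TGG GGN NNN GGN         codons of the W-G-x-G motif in framework 4
HCDR3_DNA = re.compile(
    r"TG[TC]((?:[ACGT]{3}){3,30}?)TGGGG[ACGT][ACGT]{3}GG[ACGT]"
)

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string (length a multiple of 3)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def find_hcdr3(read):
    """Return the aa sequence between the Cys and W-G-x-G anchors, or None."""
    match = HCDR3_DNA.search(read.upper())
    if match is None:
        return None
    return translate(match.group(1))


if __name__ == "__main__":
    # Invented read: ...Y Y C | A R D Y F D Y | W G Q G T
    read = "GCCGTGTATTACTGT" "GCGAGAGATTACTTTGACTAC" "TGGGGCCAGGGAACC"
    print(find_hcdr3(read))
```

Because the codons between the anchors are matched in multiples of three, an out-of-frame cysteine codon earlier in the read does not produce a hit. A single compiled pattern scanned once per read is also what makes this kind of search fast compared with a model-based approach.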
As the naïve antibody library described above was used to train the RegEx algorithm, we used an independent data set of human VH antibody sequences,38 to validate its functionality. Both RegEx and VDJFasta were used to identify HCDR3s from the combined data set containing 1 976 330 reads: the sequencing and analysis results are reported in Table 3, where RegEx again consistently identified ~10% more of the common HCDR3 sequences and significantly increased the number of unique HCDR3s recognized compared with VDJFasta (Fig. 2B). This result validates the regular expression as a universal recognition pattern for the analysis of human antibody libraries. The inherent speed of the regular expression search enabled us to create the AbMining ToolBox, a complete HCDR3 analysis package for antibody deep sequencing outputs using the popular next generation platforms. This software package is freely available at http://sourceforge.net/projects/abmining/ with instructions for the installation of the necessary packages for Windows, Mac and Linux operating systems. A detailed user guide for all the scripts is included in the ToolBox. These include frequency determination, barcode analysis, clustering and Hamming distance calculations, among others. We used the AbMining ToolBox to characterize the antibody library itself and selections using different sequencing platforms.
In order to sequence the antibody library by MiSeq and Ion Torrent, the HCDR3s of the antibody library were amplified by a set of 18 primers mapping upstream of HCDR3 in framework 3 and a downstream vector primer (Table 1; Fig. 1) designed to cover the entire VH diversity. The MiSeq and Ion Torrent sequences obtained from these amplifications were analyzed using the AbMining ToolBox, identifying and clustering the HCDR3s. The obtained data were compared with the 454 dataset.
Unlike the previous comparison, where the algorithms were assessed on the same data set, these sequencings represent independent samplings of the same extremely large population. When diversity greatly exceeds the number of sequencing reads, most sequences obtained from two independent samples will be different25,32 and only abundant HCDR3s are expected to be found in both populations. This is observed in Figures 4A-C, where the greatest number of sequences is unique for each data set. Similar results are obtained when two independent Ion Torrent runs are compared (Fig. 4D). Sequence distributions are broadest when 454 HCDR3s are compared with Ion Torrent or MiSeq (Fig. 4A and C) and tightest when comparing MiSeq to Ion Torrent (Fig. 4B), or resequencing (Fig. 4D), probably reflecting the use of similar primers in MiSeq and Ion Torrent, and different primers for 454. This makes it more difficult to compare the different sequencing methods at the individual HCDR3 level. However, aggregate properties, such as the HCDR3 length distribution (Fig. 2D) and the aa distribution at each HCDR3 position for all HCDR3 lengths (Fig. 3B), can be compared, and are essentially identical for the three platforms.
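The comparison of two independent samplings comes down to counting which HCDR3s are shared and which are unique to each data set. A minimal sketch follows; the function name `compare_samples` and the dictionary keys are illustrative and are not part of the ToolBox.

```python
def compare_samples(sample_a, sample_b):
    """Count unique HCDR3s found in only one sample or in both.

    Each argument is an iterable of HCDR3 strings (one per read);
    duplicates within a sample are collapsed before comparison.
    """
    set_a, set_b = set(sample_a), set(sample_b)
    return {
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "shared": len(set_a & set_b),
    }


if __name__ == "__main__":
    run_1 = ["ARDYFDY", "ARDYFDY", "AKGGWFDP", "ARVLDY"]
    run_2 = ["ARDYFDY", "ATSSGYDV"]
    print(compare_samples(run_1, run_2))
```

When the underlying diversity is much larger than the number of reads, the two "only" counts dominate and the shared set consists mostly of the abundant clones.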
One possible concern with these deep sequencing platforms is that their error rates35 will lead to an overestimate of the number of HCDR3s. To assess this, each individual HCDR3 of a defined length (4–21 aa, Kabat numbering) was compared with all other HCDR3s of the same length and the minimal Hamming distance for the closest HCDR3 determined for each. Figure 5A shows the percentage of HCDR3s with the minimum calculated Hamming distance for aa sequences. 8–11% of HCDR3s were at a Hamming distance of 1–2 aa from at least one other HCDR3, with 454 having slightly higher values than MiSeq and Ion Torrent, indicating that, within the context used here, error rates are similar for all platforms.
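The minimal Hamming distance calculation described here can be sketched as follows. This is an illustrative brute-force version, not the ToolBox script: it compares every unique HCDR3 with every other one of the same length, which is quadratic in the size of each length class and would need a more efficient strategy for millions of sequences. The function names are assumptions.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_hamming_distances(cdr3s):
    """Map each unique HCDR3 to the distance of its closest same-length neighbor.

    HCDR3s with no other sequence of the same length map to None.
    """
    by_length = defaultdict(list)
    for seq in set(cdr3s):
        by_length[len(seq)].append(seq)

    result = {}
    for group in by_length.values():
        for seq in group:
            distances = [hamming(seq, other) for other in group if other != seq]
            result[seq] = min(distances) if distances else None
    return result


if __name__ == "__main__":
    example = ["ARDY", "ARDF", "GGGG", "ARDYFDY"]
    for seq, dist in sorted(min_hamming_distances(example).items()):
        print(seq, dist)
```

The fraction of HCDR3s whose minimal distance is 1 or 2 is the quantity that gives an upper bound on how many sequences could be the product of sequencing errors.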
As the total combined number of reads obtained with all three platforms (7.9 × 10^6) exceeds 10% of the maximum potential VH diversity of this library, as measured by the number of transformants (7 × 10^7), we pooled all the HCDR3s identified using the AbMining ToolBox from all the different sequencing platforms and plotted the unique HCDR3s against the total number of reads (Fig. 5B). This provided a plot of unique HCDR3 accumulation vs. number of reads, and reached a total of ~3.3 × 10^6 unique HCDR3s for the 7.9 × 10^6 reads. This number of unique HCDR3s includes those that differ by only one or two aa (Fig. 5A), which may be a consequence of sequencing errors or somatic hypermutation. The presence of these similar clones will tend to overestimate the functional HCDR3 diversity in this library; however, this reduction in functional diversity will be compensated for by additional diversity in HCDR1 and HCDR2, as well as VL recombination,26 which will link each identified HCDR3 with different numbers of VL chains.
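An accumulation plot of this kind can be computed by walking through the pooled HCDR3 reads and recording how many distinct sequences have been seen so far. The sketch below is illustrative only; the function name and the `step` parameter (how often a point is recorded) are assumptions, and the reads are processed in the order given.

```python
def accumulation_curve(cdr3_reads, step=1):
    """Return (reads processed, unique HCDR3s seen) pairs every `step` reads."""
    seen = set()
    curve = []
    for n, cdr3 in enumerate(cdr3_reads, start=1):
        seen.add(cdr3)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve


if __name__ == "__main__":
    pooled = ["ARDYFDY", "AKGGWFDP", "ARDYFDY", "ARVLDY"]
    print(accumulation_curve(pooled))
```

A curve that keeps rising without flattening indicates that the sequencing depth has not yet sampled most of the diversity present.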
In a final set of experiments, we selected antibodies against Ag85, a tuberculosis antigen, using a combination of phage and yeast display,34 and identified the 15 most abundant HCDR3 clones by analyzing Ion Torrent sequencing with the AbMining ToolBox. The frequencies of the most abundant binders identified by deep sequencing within the selected population range from 1.68% for the most abundant clone, to 0.32% for the 15th ranked clone. All clones bound the target specifically (Fig. 6), with no correlation between abundance rank and binding efficacy. In fact, the clone giving the third strongest signal was ranked 14th in abundance. This confirms the utility of deep sequencing and abundance analysis to identify positive clones that may otherwise be missed,24 especially when even the most abundant clones have relatively low frequencies, as observed in this particular selection.
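Ranking HCDR3s by abundance, as done for the Ag85 selection, amounts to counting occurrences and expressing each count as a percentage of all HCDR3-containing reads. A minimal sketch follows; the function name and the returned tuple layout are illustrative, not the ToolBox interface.

```python
from collections import Counter


def rank_by_abundance(cdr3_reads, top=15):
    """Return the `top` most frequent HCDR3s as (sequence, count, percent)."""
    counts = Counter(cdr3_reads)
    total = sum(counts.values())
    return [
        (cdr3, n, 100.0 * n / total)
        for cdr3, n in counts.most_common(top)
    ]


if __name__ == "__main__":
    reads = ["ARDYFDY"] * 5 + ["AKGGWFDP"] * 3 + ["ARVLDY"] * 2
    for cdr3, n, pct in rank_by_abundance(reads, top=2):
        print(f"{cdr3}\t{n}\t{pct:.2f}%")
```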
We have demonstrated here that deep sequencing combined with the AbMining ToolBox package can be extremely effective in the analysis of antibody library diversity and selections. As HCDR3s are well-established antibody diversity surrogates,11,20 this allows the direct assessment of minimum antibody diversity in an antibody population, naïve or selected. Additional diversity in HCDR1 and HCDR2 is double that in HCDR3,26 and recombination pairs most HCDR3s with different VLs, further increasing library diversity estimates. Improvements in deep sequencing capabilities will increase the usable length of sequences, eventually allowing the sequencing of full VH/VL domains, which will also be easily identifiable using modified RegEx patterns.
Compared with other deep sequencing methods, the low cost and sequencing depth of Ion Torrent and MiSeq make them particularly useful in antibody selection, with Ion Torrent having the advantage of greater speed, and MiSeq the advantage of the greater number of reads. The output after a single round of phage antibody selection is usually 10^5 to 10^6 clones, representing the maximum subsequent attainable diversity. This is matched by present Ion Torrent and MiSeq capacities, making the identification of every clone in a selection output, ranked by abundance, now feasible in only five hours after PCR amplification (30 h for MiSeq). Analyses performed on a standard personal computer will allow sequencing information to directly influence selection outcome, and effectively democratize the use of deep sequencing in antibody selections.39
Although to date the application of deep sequencing to the analysis of selections from antibody and other libraries has been limited, it has already been proposed that deep sequencing after a single round of phage peptide library selection is sufficient to identify positive clones.28 We anticipate this will also become possible for antibody selections, as sequencing costs continue their downward trend, and the number, quality and lengths of reads increase. However, we expect the power of deep sequencing to go well beyond the identification of positive clones in early selection rounds. As more experience is obtained, it is likely that classes of antibodies with particular molecular (e.g., stability, biochemical liabilities in CDRs) and binding (e.g., hapten, protein, peptide) properties may be identifiable by their sequences, as will antibodies with undesirable properties (e.g., plastic or biotin binders40) that can be discarded. Furthermore, it may be possible to identify antibodies binding to one target, but not a closely related one, merely on the basis of antibody sequences obtained during selection, or antibodies binding to two different targets (e.g., murine and human versions of the same protein) by identifying common sequences in selections. We expect the deep sequencing of antibody selections to become an essential and integral part of the selection process as systems such as Ion Torrent and MiSeq become more widely available.
Although the methods described here were applied to HCDR3s in antibody libraries, it is clear that with modifications, the approach taken can also be used in the analysis of selections of other CDRs or other binding scaffolds, by simply modifying the RegEx pattern for the recognition of scaffold boundary sequences.
A specific set of primers was designed for the different sequencing platforms (Table 1). For 454 sequencing, 2 primers mapping to the pDAN5 vector upstream and downstream of the VH genes were designed. These contain the 454 specific sequencing adaptors.
For Ion Torrent and MiSeq, a set of 18 forward primers mapping to the VH framework just upstream of the HCDR3 was designed. They maximize the coverage of human framework 3 VH in multiplex reactions with a minimal set of perfect-match primers against germline V-segments. Primers were optimized for a common annealing temperature, GC content, minimal self-annealing or cross-annealing to other primers, and all contained a GC-clamp at the 3′ end. Coverage of a curated subset of the 454 data set showed that ~94% of antibody genes were matched, if up to 4 mismatches were permitted outside the 3′ GC-clamp region.
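The coverage criterion (up to 4 mismatches tolerated outside the 3′ GC-clamp) can be expressed as a short check. The sketch below is an illustration under stated assumptions, not the procedure used to design the primers: the clamp length (`clamp_len`, here 2 bases) is a hypothetical parameter that is not given in the text, the primer is compared with sense-strand sequence without gaps, and the primer and sequences in the example are invented.

```python
def primer_matches(primer, site, max_mismatches=4, clamp_len=2):
    """True if `site` matches `primer` with an exact 3' clamp.

    The last `clamp_len` bases (clamp_len >= 1, an assumed value) must be
    identical; at most `max_mismatches` differences are allowed elsewhere.
    """
    if len(primer) != len(site):
        return False
    if primer[-clamp_len:] != site[-clamp_len:]:
        return False
    body_mismatches = sum(
        p != s for p, s in zip(primer[:-clamp_len], site[:-clamp_len])
    )
    return body_mismatches <= max_mismatches


def coverage(primers, sequences):
    """Fraction of sequences with at least one matching primer site."""
    def covered(seq):
        return any(
            primer_matches(p, seq[i:i + len(p)])
            for p in primers
            for i in range(len(seq) - len(p) + 1)
        )

    if not sequences:
        return 0.0
    return sum(covered(s) for s in sequences) / len(sequences)


if __name__ == "__main__":
    primer = "ACGTACGTGC"  # invented primer, for illustration only
    print(coverage([primer], ["TTACGTACGTGCTT", "TTTTTTTTTTTTTT"]))
```

Requiring an exact match at the 3′ end reflects the fact that polymerase extension is most sensitive to mismatches there, while mismatches further upstream are better tolerated.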
As reverse primer, a primer mapping to the pDAN5 vector just downstream of the VH gene was designed. Sequencing specific adaptors were introduced in both forward and reverse primers.
The scFv library analyzed here has been previously characterized.36 Briefly, a 7 × 10^7 primary library of assembled VL and VH domains was created from cDNA derived from the PBMC of 40 healthy donors and cloned into the pDAN5 phagemid vector. Plasmid DNA from this library was obtained and 0.3 fmol used as a template to prepare the amplicon samples for sequencing.
After PCR amplification, the amplicon was gel purified and quantified (Qubit, HS kit, Invitrogen). The sample was prepared for GS FLX Titanium Series Lib-A Chemistry (Roche) bi-directional amplicon sequencing according to the manufacturer’s instructions and sequenced on a two-region PicoTiterPlate.
For Ion Torrent and MiSeq, the 18 forward primers (Table 1) were mixed in equimolar amounts and used for the PCR with Phusion High-Fidelity DNA polymerase (NEB). The ~240 bp amplicon was purified as previously described. The Ion Xpress Amplicon library protocol was used to prepare the sample for sequencing on the Ion 316 chips (Life Technologies). The MiSeq amplicon was prepared with a MiSeq reagent kit and sequenced in a paired-end 151 (PE151) run.
The quality-trimmed 454 sequencing reads were split into files containing 10 000 sequences and used in VDJFasta as described in Glanville et al.26
The HCDR3 recognizing regular expression (RegEx) pattern used in this article was refined iteratively using the VDJFasta CDR3 data set obtained from the 454 sequences. Once a RegEx pattern was defined, it was used to identify HCDR3s from the 454 data set. The two CDR3 data sets were compared and the VDJFasta exclusive CDR3s were analyzed. The RegEx pattern was modified to include the VDJFasta exclusive CDR3s as well; the process was repeated until the RegEx was sufficiently inclusive and sensitive, with the final RegEx pattern being:
The pattern represents a balance between including as many CDR3s as possible, while minimizing the number of false positive sequences.
The AbMining ToolBox developed for this article is freely available at Sourceforge (http://sourceforge.net/projects/abmining/). The required software installation guide provides installation information for the necessary software packages, and the user guide contains detailed information on how to use the toolbox’s scripts.
The raw data from the three platforms were used to optimize the quality trimming parameters with the AbMining ToolBox. Table 4 shows the detailed optimization of an Ion Torrent data set. Two parameters were tested: the quality average value (Q) and the window step value (step). The quality average value influences the overall quality of trimmed DNA reads. A low Q setting would allow too many sequencing errors to slip through; a high Q setting would eliminate too many good sequences. The balance between the number of CDR3s identified and the number of CDR3s containing STOP codons (CDRX) was used to determine the optimal Q value.
For the input data, the filtering of the raw sequences was performed and optimized for the outputs of all three platforms. Tables 4, 5 and 6 show the quality trimming analysis for the Ion Torrent, 454 and MiSeq data sets, respectively. For Ion Torrent, the optimal Q value was 21. The step setting can be used to speed up the quality trimming. A bigger step value could result in significant time savings with a modest decrease in output quality (Table 4). For 454, Q20 was the best compromise average quality value (Table 5), while for MiSeq the Q value did not show any significant effect. A Q value of 21 was chosen for all sequence analysis (Table 6).
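The interplay between the average quality threshold and the window step can be illustrated with a generic sliding-window trimmer. This is a sketch of the general technique under stated assumptions, not the ToolBox implementation: the window size (`window`) is a hypothetical parameter that the text does not specify, qualities are assumed to be Phred scores already decoded to integers, and the read is cut at the start of the first window whose mean quality falls below the threshold.

```python
def quality_trim(seq, quals, q_min=21, window=10, step=1):
    """Trim `seq` at the first window whose mean Phred score is below q_min.

    `window` is an assumed parameter; `step` is how far the window advances
    between checks (larger steps mean fewer checks but a coarser cut point).
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if not seq:
        return seq
    last_start = max(len(seq) - window + 1, 1)
    for start in range(0, last_start, step):
        scores = quals[start:start + window]
        if sum(scores) / len(scores) < q_min:
            return seq[:start]
    return seq


if __name__ == "__main__":
    read = "A" * 20
    phred = [30] * 10 + [10] * 10  # quality drops halfway through the read
    print(len(quality_trim(read, phred, window=5, step=1)))
    print(len(quality_trim(read, phred, window=5, step=5)))
```

With a step of 1 every window position is tested, so the cut point is located more precisely; with a larger step fewer windows are evaluated, which is faster but lets a few more low-quality bases through, in line with the trade-off described above.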
Phage display selection and yeast display sorting were performed as described by Ferrara et al.34 The naïve phage antibody library was used to select Ag85 antibodies: biotinylated Ag85 was used at 50 nM concentration in the first round of phage selection, and 5 nM in the second. After two rounds of phage selection, DNA encoding the selected scFv antibodies was recovered and used as template for PCR amplification and recloned into a yeast display vector. The obtained yeast library was further enriched by one round of sorting using flow cytometry (FACSAria, BD). The scFvs displayed on yeast cells showing both antigen binding and scFv display were sorted. Plasmid DNA was recovered from the sorted yeast and sequenced by Ion Torrent. The unique HCDR3s were identified and ranked by abundance using the ToolBox. The clones corresponding to the 15 most abundant HCDR3s found by Ion Torrent were identified by Sanger sequencing and tested for binding specificity by flow cytometry.
This work was supported by the National Institutes of Health [5U54DK093500–02 to ARMB]; and Los Alamos National Laboratory Directed Research Development Directed Research [20120029DR] funds.
No potential conflict of interest was disclosed.
Previously published online: www.landesbioscience.com/journals/mabs/article/27105