In this study, we address the two least efficient steps of phage display library selection: quantification and analysis of the displayed ligands. We introduce an integrated, robust, and readily available set of DNA-based molecular tools that will markedly improve combinatorial analysis in vitro and in vivo, and at an extremely low cost in labor and time.
Phage library quantification is currently scored through phage infection of host bacteria, serial dilution, and TU-counting (i.e., counting of individual colonies or plaques). In this process, phage recovery and library titer depend on a number of factors such as peptide-target affinity, bacterial toxicity of encoded peptides, viability of phage after panning and particle recovery, as well as inherent infection and replication properties of targeted phage clones. Indeed, certain phage-encoded peptide sequences may be inaccurately represented because of codon-usage preferences in the host bacteria. Notably, molecular events of host binding, entry, and infection rates depend on the number of ligands displayed and on the nature of the hybrid fusion partner; these currently unquantifiable factors could affect the infection and replication capabilities of particular phage clones and might influence the prospect of uncovering rare ligands and/or the true library size. Methods suggested to overcome some of these technical challenges include bacterial infection-independent procedures, such as enzyme-linked immunosorbent assay (ELISA) with antibodies that bind specifically to the phage coat and/or phage-DNA-based approaches such as quantitative PCR. However, biochemical approaches clearly still lack the required sensitivity when phage titers are low. Moreover, the concept of “high-throughput” pyrosequencing of phage display libraries has now evolved from a very complex protocol that yielded merely ~10² to a far easier technique that enables the sequencing of 10⁶ amplicons, as presented here. Finally, quantitative PCR plus next-generation sequencing methodologies have not as yet been systematically compared to TU-counting plus Sanger sequencing in terms of speed, cost, and, most importantly, accuracy.
The next-generation phage display approach introduced here relies on DNA analysis of clones from the initial quantification steps through the final large-scale sequencing, and these steps are much faster and far less expensive. Replacing bacterial overnight growth and TU-counting with DNA extraction and qPCR not only permits the quantification of non-infective or degraded phage, but also yields phage homing results only a few hours after tissue removal, for dozens of samples in parallel. In real-time PCR phage quantification (qPhage), reproducible quantification was attained over a broad concentration range and was linear over at least eight orders of magnitude, far better than with the conventional approach and also much more sensitive than previously reported real-time PCR phage quantification methods. Our qPhage strategy allows fast and precise validation of target tissue-specificity in the homing of selected phage or in large-scale evaluation of dozens of samples, which is particularly appealing for studies in vivo, including those in patients. These goals are reachable with only small design modifications through simultaneous administration of multiple independent targeted phage particles in a single animal, followed by specific detection of each one with appropriate primers or probes in a multiplexed PCR. After homing validation, peptides that appear to be specific to a particular target tissue following sequencing saturation of a number of vascular beds can be considered further as promising probes to be developed as agents for imaging and drug delivery in normal or tumor target sites.
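The quantification step described above can be illustrated with a standard-curve calculation. The following is a minimal sketch, not the authors' pipeline: it assumes a dilution series of known phage DNA copy numbers, fits a straight line of Ct against log10(copies), and reads unknown samples off that line. The function names and the synthetic dilution values are hypothetical and chosen only for illustration.

```python
# Minimal sketch of qPCR standard-curve quantification (illustrative only).
# All names and the synthetic standards below are assumptions, not taken
# from the original study.

def fit_standard_curve(log10_copies, ct_values):
    """Ordinary least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate copy number from a Ct value."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Efficiency implied by the slope; 1.0 means perfect doubling per cycle."""
    return 10 ** (-1.0 / slope) - 1.0


# Synthetic ten-fold dilution series spanning eight orders of magnitude.
STANDARD_LOG10_COPIES = [1, 2, 3, 4, 5, 6, 7, 8]
STANDARD_CT = [38.0 - 3.32 * x for x in STANDARD_LOG10_COPIES]

slope, intercept = fit_standard_curve(STANDARD_LOG10_COPIES, STANDARD_CT)
estimated_copies = copies_from_ct(21.4, slope, intercept)
```

A slope close to -3.32 corresponds to an efficiency close to 100%, which is one common way to check that a dilution series behaves linearly before using it to quantify tissue samples.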
Minor drawbacks remain. Host bacteria are still required for phage library generation and amplification between screening rounds; given the phage life cycle, this is unlikely to change. Another potential drawback is the need to re-clone the phage of interest (if desired), after its displayed peptide is determined. Nevertheless, deep sampling of the targeted peptide repertoire leads to a more reliable phage selection, and the regeneration of the selected particle(s) of interest can easily be accomplished with straightforward cloning protocols.
After titration with qPhage, the extracted DNA can be used directly for high-throughput determination of the displayed ligands (i.e., peptides or antibodies). In our tests, the generation of a large number of sequences allowed good coverage of the repertoire in all tissues studied, and included most of the sequences derived from conventional TU-counting. The availability of a larger nucleotide sequence dataset derived from high-throughput sequencing has also allowed the adoption of more stringent criteria to validate sequences. For example, we accept a peptide-encoding phage insert only if its sequence occurs at least twice, which excludes singletons from the final dataset; such a “two-hit requirement” reduces sequencing errors in the final dataset of large-scale DNA sequencing-based approaches, and sharply increases our confidence in the displayed peptide list. This stringent criterion is not applicable to extremely diverse datasets in which repeats are not expected (i.e., first-round selection or library sequencing) or to reduced sequence datasets that are not large enough to cover the entire phage diversity; indeed, in such cases, the presence of sequencing errors or artifacts may be one of the factors explaining why large-scale sequencing may not necessarily exhaust smaller sequence datasets.
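The “two-hit requirement” amounts to counting identical insert sequences and discarding those seen only once. A minimal sketch follows, assuming the inserts have already been extracted from the reads as plain strings; the function name, the `min_count` parameter, and the example sequences are hypothetical.

```python
# Minimal sketch of a "two-hit" filter (illustrative only; names and the
# example sequences are assumptions, not taken from the original study).
from collections import Counter


def two_hit_filter(inserts, min_count=2):
    """Keep only insert sequences observed at least `min_count` times.

    Returns a dict mapping each accepted sequence to its read count.
    """
    counts = Counter(inserts)
    return {seq: n for seq, n in counts.items() if n >= min_count}


# Made-up insert reads: one sequence seen three times, one seen twice,
# and one singleton that the filter drops.
example_reads = ["GGTCGTGCT", "GGTCGTGCT", "GGTCGTGCT",
                 "TGTAATGGT", "TGTAATGGT",
                 "GGTCGTGCA"]
accepted = two_hit_filter(example_reads)
```

As the text notes, such a filter only makes sense when repeats are expected; applied to a naïve library or a first-round selection, it would discard most genuine sequences.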
As the new methodology proposed here is PCR-based (in contrast to a host bacteria-dependent approach), a number of potential advantages and disadvantages emerge. In general, a PCR-based approach is capable of revealing real binding peptides whose representation may be negatively impacted by the requirement for bacterial infection and multiplication. On the other hand, such a PCR-based approach may acquire background noise due to errors in library construction or the assembly of non-infective phage particles. Our analysis shows that most of the rejected sequences are derived from “empty” phage particles or amplification artifacts. However, the bioinformatic filters implemented here have allowed the prompt identification and discarding of these artifacts, revealing the relevant sequences on an unprecedented scale.
As the conventional phage display approach has long been validated, a central concern of this work was to evaluate whether any biases were introduced in the new steps that produced the amplicons to be sequenced. Our comparative analysis based on GC content and homopolymer frequency in the inserts, as well as codon usage, and residue or peptide frequencies and overlaps indicated that (i) there was no preferential amplification of certain inserts and (ii) both datasets share essentially the same sequence properties. As the sequencing of homopolymer-containing regions is a well-known limitation of the 454-Roche pyrosequencing platform used here, this issue was investigated in detail. As presented in the online Supplementary Tables, the rejected 454 sequence dataset contains more homopolymers than the accepted 454 sequence set (chi-square test, P<0.001; Table S2) and the more abundant classes of homopolymer-containing sequences (frequency >3) appear to be somewhat under-estimated (Table S3). This suggests that insert sequences containing homopolymers >5nt are under-estimated after 454-pyrosequencing. However, when all accepted sequences were evaluated (Table S4), we observed a non-statistically significant trend for a reduced frequency of homopolymers ≥5 in the 454-derived dataset when compared to the Sanger-derived sequence set (chi-square test, P = 0.9955). The similarity of the two datasets in terms of homopolymer-containing sequences likely has a simple explanation. During PCR, each phage insert is amplified into millions of copies of the original molecule. A certain percentage (~15%) of the homopolymer-containing amplicons will not be correctly sequenced by 454. However, because of the massive capacity of this approach, enough molecules will still be correctly sequenced and represented in the final dataset. Thus both sequence sets (454-pyrosequencing and Sanger) will be similar when the distinct sets of homopolymer sizes are considered.
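The homopolymer comparison can be sketched in two steps: find the longest single-base run in each insert, then compare the counts of inserts with and without long runs between two datasets using a chi-square test. The sketch below assumes a simple 2×2 layout (dataset × presence of a homopolymer) without continuity correction; the source does not specify how its contingency tables were built, so this layout, the function names, and the example counts are assumptions.

```python
# Minimal sketch of a homopolymer comparison (illustrative only; the 2x2
# layout, names, and counts are assumptions, not the authors' analysis).
import math
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base in `seq`."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom, no continuity correction).

    Table layout:           with run   without run
        dataset 1               a           b
        dataset 2               c           d
    Returns (statistic, p_value).
    """
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    n = row1 + row2
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: inserts with/without a homopolymer of at least 5 bases.
statistic, p_value = chi_square_2x2(20, 10, 10, 20)
```

Comparing several homopolymer size classes at once, as in the supplementary tables, would need a larger contingency table and more degrees of freedom than this 2×2 case.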
To reinforce the similarity of both datasets, when sequences derived from both DNA sequencing methods (N = 1645) or sequences exclusively found by 454-pyrosequencing (N = 1202) were compared, we observed no significant differences in the frequency of homopolymers of all sizes (4 to 7 repeated bases). The comparison of homopolymer-containing inserts between these large groups and the group of sequences exclusively found by Sanger sequencing is not informative, as the latter group lacks precision because of its relatively small size (N = 87, compared to >1000 for the other groups). However, it is interesting to note that the frequency of homopolymers in sequences found only by the Sanger method was higher for all classes of homopolymer sizes. This effect may be real, but is certainly small, as we can see from the small size of this group (Table S5).
As noted above, the sequencing of homopolymers is an established technical issue for the pyrosequencing methodology. However, from the analysis presented here, we can conclude that this technical limitation has had only a very small effect on the universe of phage particles revealed by the large-scale approach. As the 454 method allowed a high coverage of the Sanger dataset for all tissues evaluated (ranging from 78.6 to 96.3%), and also uncovered a significant fraction of peptides (25.3 to 97.7%) not revealed by the low-throughput Sanger-sequencing approach, we conclude that the benefits of this approach certainly compensate for the known disadvantages and challenges of this particular sequencing platform. As demonstrated for other platforms (such as SOLiD, Applied Biosystems), technical alternatives exist for the sequencing of homopolymer-rich regions. In the future, the integrated approach presented here may eventually be used with high-throughput sequencing approaches other than the one developed by 454-Roche.
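The coverage figures above are set comparisons between the peptides found by each method. A minimal sketch follows, assuming each dataset is a collection of peptide strings for one tissue; the function name and example peptides are hypothetical, and the denominator used for the "found only by large-scale sequencing" fraction is an assumption, since the text does not state how that percentage was defined.

```python
# Minimal sketch of dataset overlap statistics (illustrative only; names,
# example peptides, and the choice of denominators are assumptions).

def coverage_stats(sanger_peptides, ngs_peptides):
    """Compare two peptide collections for a single tissue.

    Returns a dict with:
      pct_sanger_covered: % of Sanger peptides also found by large-scale
                          sequencing
      pct_ngs_only:       % of large-scale peptides absent from the Sanger set
    """
    sanger = set(sanger_peptides)
    ngs = set(ngs_peptides)
    if not sanger or not ngs:
        raise ValueError("both peptide sets must be non-empty")
    shared = sanger & ngs
    return {
        "pct_sanger_covered": 100.0 * len(shared) / len(sanger),
        "pct_ngs_only": 100.0 * len(ngs - sanger) / len(ngs),
    }


# Made-up peptides for one tissue.
stats = coverage_stats(
    sanger_peptides=["CGRRAGGSC", "CKGGRAKDC", "CVPELGHEC", "CARAC"],
    ngs_peptides=["CGRRAGGSC", "CKGGRAKDC", "CVPELGHEC",
                  "CSGMARTKC", "CLSGSLSC", "CPRECESIC"],
)
```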
The large sequence dataset presented here has covered over 90% of the phage diversity of all human tissues we investigated and has provided a high-confidence list of tissue-specific ligands. Sets of high-confidence tissue-specific peptides along with improved statistical analysis of longer motifs can be undertaken after the sequencing of phage DNA recovered from a large number of tissues, as well as from specific tissue samples recovered by micro-dissection from paraffin-embedded tissues. In fact, large-scale sequencing of naïve (unselected and unamplified) libraries may--for the first time--provide an accurate measurement of their size (i.e., number of unique sequences), a result allowing the empiric (rather than theoretical) demonstration of the true randomness of insert sequences. In this study, it should be noted that we used targeting peptides for validation, but there is no reason that antibodies would not be as effective. Indeed, one might speculate that the DNA-based approaches introduced in this study will eliminate the need for “helper” phage for phage antibody-display selection, and finally enable its in vivo application.
One technical aspect merits a brief additional comment. Next-generation sequencing approaches are being improved constantly, and the newest chemistry platforms [such as SOLiD™ (Applied Biosystems) or Illumina Genome Analyzer (Illumina, Inc.)] may actually permit the generation of >100 million sequencing “reads” per run without the homopolymer sequencing challenge inherent to the pyrosequencing platform used here. For both sequencing technologies, a major limitation is the short length of possible DNA analytes (<100 nucleotides), which may nonetheless prove suitable for combinatorial phage libraries (displaying small peptides). Nevertheless, because the accuracy and cost-effectiveness of such methods have not been vetted, it remains to be determined whether other massive sequencing platforms may eventually replace the platform used here.
Technological advances have already brought about a new era for genomics, epigenomics, and transcriptome studies. We predict the same will happen for phage display analysis. We show that the integration of DNA-based quantification and large-scale sequencing methodology presented here produces unbiased data and allows the full determination of the whole pool of ligand sequences available after “n” rounds of selection. Our results show that qPhage quantification in tandem with next-generation DNA sequencing will set a new gold standard for phage display in terms of accuracy, running time, diversity coverage, and cost-effectiveness. Overall, the enabling platform introduced and optimized in this work is superior to TU-counting plus Sanger sequencing. As such, it may become the method of choice for a broad range of phage-display applications in silico, in cells, and in vivo; this will be particularly the case if the extreme molecular diversity observed during large-scale screenings in patients is considered.