The sequence of a protein determines its structure and function. Despite this clear correlation, understanding the relationship between protein sequence and function is a complex and largely unsolved problem. The contours of this problem have only expanded as the amount of protein sequence information, derived from DNA sequencing, has exploded. Thus, methods to rapidly couple protein sequence to protein function are needed.
One effective way to understand protein sequence–function relationships is through the examination of mutants. Mutational analysis has been applied both in vitro and in vivo, ranging from exploration of protein-protein interaction interfaces to analysis of the kinetics and thermodynamics of protein folding [1,2]. An example is an alanine scan, in which amino acid residues are individually mutated to alanine [3]. Residues that, when changed to alanine, result in loss or diminution of function (e.g., binding, catalysis or stability) are likely of functional importance.
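To make the alanine scan concrete, the following minimal Python sketch enumerates the single-residue alanine substitutions of a sequence. The function name `alanine_scan`, the labeling scheme (original residue, 1-based position, `A`) and the toy peptide are illustrative assumptions, not part of any published protocol.

```python
def alanine_scan(sequence: str) -> list[tuple[str, str]]:
    """Return (label, mutant_sequence) for every single alanine substitution.

    Positions that are already alanine are skipped, since substituting
    them would reproduce the parent sequence.
    """
    mutants = []
    for i, residue in enumerate(sequence):
        if residue == "A":
            continue  # already alanine; nothing to mutate
        label = f"{residue}{i + 1}A"  # e.g. "W3A": Trp at position 3 -> Ala
        mutant = sequence[:i] + "A" + sequence[i + 1:]
        mutants.append((label, mutant))
    return mutants


# Hypothetical four-residue peptide, used only for illustration.
for label, mutant in alanine_scan("GAWP"):
    print(label, mutant)
```

Each mutant produced this way would then be assayed individually, which is the source of the throughput bottleneck for longer sequences.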
Mutational scanning, however, suffers from bottlenecks that have limited its utility. For example, each mutant typically needed to be cloned, expressed and purified for an in vitro property to be measured. The requirement to purify individual mutant proteins has been largely resolved by technologies in which each protein variant in a library is linked to its encoding DNA sequence. Examples include protein display on the surface of phage, yeast and bacteria as well as ribosome display [4]. These technologies enable a large (10⁶–10¹²) pool of variants to be assayed for a particular function in parallel. Protein display experiments generally involve multiple rounds of selection (e.g., for binding) that eliminate unselected variants to yield a few highly active proteins. In addition, this scheme has been used to implement mutational scanning, including limited sampling of all possible single mutants in a short sequence [5]. Despite these advances, protein display has, until recently [6–8], been limited by the requirement for Sanger sequencing of variants after selection, restricting to a few thousand the number of variants that can be examined.
Here, we demonstrate that protein display employing moderate selection pressure on a library of variants can be combined with high-throughput sequencing to furnish a high-resolution, fine-scale map of protein sequence–function relationships. Using T7 bacteriophage to display over 600,000 variants of the human Yes-Associated Protein 65 (hYAP65) WW domain [9], we performed six rounds of selection for binding to its cognate peptide ligand. The selection parameters were tuned to produce only moderate enrichment for better binding, which maintained a large number of library members rather than converging on a few high-affinity variants. Short-read Illumina sequencing of libraries from the starting pool and after three and six rounds of selection enabled quantitative tracking of the fate of hundreds of thousands of variants simultaneously. The data revealed a detailed sequence–function landscape that is remarkably concordant with known WW domain features. We observed strong agreement between mutational preferences and evolutionary conservation within the WW domain. Furthermore, we comprehensively addressed the question of how mutations impair protein function. The effectiveness of moderate selection combined with high-throughput sequencing encourages a shift toward carrying out protein functional screens in a highly parallel fashion, which may reveal novel aspects of protein function in many in vitro and in vivo contexts.
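One common way to track variants quantitatively across sequencing time points is to compare each variant's read frequency after selection with its frequency in the starting pool. The sketch below illustrates that idea only; it is not necessarily the calculation used in this work. The function `log2_enrichment`, the variant names and the read counts are all hypothetical.

```python
import math


def log2_enrichment(input_counts: dict[str, int],
                    selected_counts: dict[str, int]) -> dict[str, float]:
    """Log2 ratio of selected-pool frequency to input-pool frequency.

    Only variants with at least one read in both pools are scored;
    handling of variants that drop out is deliberately left out of
    this sketch.
    """
    input_total = sum(input_counts.values())
    selected_total = sum(selected_counts.values())
    scores = {}
    for variant, n_in in input_counts.items():
        n_sel = selected_counts.get(variant, 0)
        if n_in == 0 or n_sel == 0:
            continue  # ratio undefined without reads in both pools
        f_in = n_in / input_total
        f_sel = n_sel / selected_total
        scores[variant] = math.log2(f_sel / f_in)
    return scores


# Hypothetical read counts, for illustration only.
input_pool = {"wild_type": 500, "variant_1": 300, "variant_2": 200}
round_six = {"wild_type": 600, "variant_1": 300, "variant_2": 100}

for variant, score in sorted(log2_enrichment(input_pool, round_six).items()):
    print(f"{variant}: {score:+.2f}")
```

A positive score indicates that the variant became more frequent under selection, a negative score that it was depleted. Under moderate selection pressure many variants remain observable in the later rounds, so a score of this kind can be computed for a large fraction of the library rather than only for a few winners.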