Deciphering the mapping between DNA sequence and expression levels is key for understanding transcriptional regulation. However, despite many studies, the quantitative effects on expression of even the most basic organizational features of promoters are still poorly understood. For example, even for a single transcription factor (TF) binding site, we know little about the quantitative effects on expression levels of its location, orientation, and affinity; whether these effects are general, factor-specific, and/or promoter-dependent; and how they depend on the underlying nucleosome organization.
In principle, such questions can be answered through accurate expression measurements of promoters in which the above elements are systematically varied. Indeed, several medium-scale1–3 libraries were created in bacteria and yeast, in which regulatory elements were randomly ligated or mutagenized and the expression of the resulting promoters was measured. These studies provided much insight, but due to their random nature, they are not ideal for addressing the above questions. For example, studying the effect of binding site location on expression requires measurements of promoters that differ only in the location of the site and sampling many such locations. Clearly, many of the desired promoters would be missing from randomly ligated libraries. Indeed, controlled design of such promoter variants7–10 led to profound insights, but since the variants were constructed one by one, time and cost considerations have limited the scale of previous studies to at most dozens of variants.
A recent study demonstrated the benefit of using thousands of designed sequences for analyzing the effect of systematic mutations to six promoters11. However, this method assays promoter strength using in vitro transcription and thus has limited utility for understanding promoter activity in vivo. While our paper was in review, two other methods were devised for parallel measurement of promoter activity in vivo12,13. One method assayed the effect of an impressive library of >100,000 random mutations in three mammalian enhancers12, but the random nature of the libraries limits this method’s utility for systematic dissection of regulatory logic. The other method13 used programmable microarrays14 to measure the effect of systematically designed mutations in two mammalian enhancers.
Here, we devised a high-throughput fluorescence-based method for obtaining parallel and highly accurate expression level measurements of thousands of fully designed promoters. Our approach differs from and has several advantages over previous methods. First, our parallel expression measurements are in excellent agreement with those of isolated strains (R²=0.99), considerably better than the agreement reported by Melnikov et al.13 (R²=0.45–0.75). Highly accurate expression measurements are critical for a quantitative understanding of transcriptional regulation. Second, in contrast to both recent methods12,13 that require a barcode within the RNA reporter, our method can avoid barcodes by fully sequencing each promoter, although our present study incorporated a barcode upstream of the designed promoter. Barcodes within the RNA affect reporter expression and thus limit accuracy13. Third, while both published methods measure mean expression level over a cell population, our method obtains cell-to-cell (noise) expression variability measurements for each promoter, which also agree well with isolated strain measurements (R²=0.43, Fig. S1). Finally, by using protein fluorescence and not RNA as the readout, we can also study translational control, e.g., with libraries that alter the 5’ UTR or the codons of the fluorescent reporter. In addition, the need to physically couple a proximal barcode to the examined variable region limits both previous methods12,13 to studying cis-effects, whereas our method can be used to examine the effects of sequence variation on fluorescent protein expression in trans.
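As an illustration of the two kinds of agreement statistics quoted above, the following minimal Python sketch computes a squared Pearson correlation between pooled and isolated-strain measurements, and a per-promoter noise value from single-cell fluorescence. It rests on assumptions that are not stated in the text: that R² denotes the squared Pearson correlation, and that noise is summarized as the squared coefficient of variation (variance divided by squared mean), which is one common choice. The function names and all numbers are hypothetical and are not data from this study.

```python
from statistics import mean, pvariance


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def noise_cv2(cells):
    """Cell-to-cell variability as variance / mean^2 (an assumed definition)."""
    m = mean(cells)
    return pvariance(cells) / m ** 2


# Hypothetical mean expression values for five promoters, measured in the
# pooled library and again in isolated strains (illustrative numbers only).
pooled = [1.0, 2.1, 2.9, 4.2, 5.0]
isolated = [1.1, 2.0, 3.1, 4.0, 5.1]
print("agreement R^2:", r_squared(pooled, isolated))

# Hypothetical single-cell fluorescence values for one promoter.
print("noise (CV^2):", noise_cv2([90.0, 100.0, 110.0]))
```

The same `r_squared` function could be applied to per-promoter noise values from the pooled and isolated measurements to obtain an agreement statistic of the second kind.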
We applied our approach to design a library of 6500 promoters that directly measures several grammatical rules of transcriptional regulation such as the effect on expression of binding site location, number, orientation, and affinity. Our results provide insights into principles of transcriptional regulation, including a clear logistic function relationship between expression and site number; a dominance of TF identity over site number in determining high expression levels; a surprisingly large effect on expression of even small 1–7 bp changes in site location; and for one TF, a striking ~10 bp periodic relationship between expression and site location. Our approach can be adapted to other genomic regions and organisms to unravel diverse types of both cis and trans mappings between sequence and phenotype.
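To make two of these relationships concrete, the sketch below writes down a generic logistic function of binding site number and a generic periodic function of site location. Only the logistic shape and the ~10 bp period come from the text; the cosine form, the function names, and every parameter value (`max_expr`, `midpoint`, `steepness`, `baseline`, `amplitude`, `phase_bp`) are hypothetical choices for illustration, not fitted values from this study.

```python
import math


def logistic_expression(n_sites, max_expr=1.0, midpoint=3.0, steepness=1.5):
    """Expression as a saturating, S-shaped (logistic) function of site number."""
    return max_expr / (1.0 + math.exp(-steepness * (n_sites - midpoint)))


def periodic_expression(position_bp, baseline=1.0, amplitude=0.3,
                        period_bp=10.0, phase_bp=0.0):
    """Expression oscillating with site location; cosine form is an assumption."""
    angle = 2.0 * math.pi * (position_bp - phase_bp) / period_bp
    return baseline + amplitude * math.cos(angle)


# Expression rises slowly, then steeply, then saturates as sites are added.
for n in range(8):
    print(n, "sites ->", round(logistic_expression(n), 3))

# Moving the site by one full period returns the same expression level,
# while a half-period shift gives the opposite extreme.
for pos in (20, 25, 30):
    print(pos, "bp ->", round(periodic_expression(pos), 3))
```

Under the logistic form, adding sites beyond the saturation point changes expression very little, which is one way in which a different plateau for each TF could matter more than site number at high expression levels.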