Finding genetic loci that are required for optimal growth under specific conditions helps inform the basic understanding of bacterial physiology and efforts to develop new therapeutics for pathogens. Previously, we and others have used transposon mutagenesis to infer the requirement for genes under different growth conditions by utilizing the information provided by genome annotations
[1]–
[6]. Deep sequencing, which allows us to map precisely the insertion site of every mutant, affords a higher-resolution assessment of genetic requirement, beyond just genes. Here, we demonstrate that an unbiased sliding window approach harnesses the full potential of this increased resolution. This approach identified not only whole genes required for optimal growth but also other required elements, such as non-protein-coding RNAs and protein domains within insertion-containing genes, which would otherwise be obscured by gene-centric analysis. An alternative analysis that uses significant gaps in insertion, rather than quantitative insertion counts, was also able to assess the requirement of protein domains (DeJesus et al., unpublished data, submitted). This analysis likely identifies regions absolutely essential for viability rather than all regions required for optimal growth.
We found that many genes contain elements that are important for growth even though other regions are not required. In at least two cases,
ppm1 and
fhaA, published data have shown that the required regions encode specific protein domains. However, in other cases, the required regions might represent non-protein-coding RNAs or cis-regulatory elements. Bacteria encode many small RNAs, many of which could be required for optimal growth and some of which are embedded within genes
[8]–
[10]. In addition, most genes have been annotated computationally, an uncertain pursuit that clearly can lead to misannotated start sites
[19]. Genes with only 5′ insertions could fall into this category.
Similarly, important non-protein-coding regions could have multiple roles. In some cases, we found that known RNAs, such as
rnpB, the catalytic RNA component of RNase P, and the tmRNA were required for optimal growth, supporting previous speculation
[20]–
[21]. Again, some other required regions might encode as yet unidentified non-coding RNA molecules. Still others might be promoters or other regulatory regions.
In this study, our resolution was limited by the specific properties of the
Himar1 transposon in mycobacteria. Our previous studies have shown that insertions are randomly distributed apart from the desired selection against insertion in essential regions
[11],
[22]. Despite this, we cannot assume that all sites lacking insertions represent required regions, since unknown insertional biases of the transposon may exist. Thus, we defined a required region as one with a statistically underrepresented insertion count, using a non-parametric test to account for such potentially unique biases within these data. This allowed us to exclude, for example, windows with 6 or fewer TA sites, which demonstrably lacked power to distinguish a region as essential for growth relative to background variation. In GC-rich protein-coding regions, this limited our scope to windows greater than 400 bp; less GC-rich intergenic regions allowed the assessment of windows greater than 250 bp. Thus, while we were able to identify many required protein domains and RNAs, it is certainly possible that smaller elements required for growth were missed due to these size constraints. This is a particular problem for non-coding RNAs, which are often very small. For example, while we found 10 tRNAs required for growth, the remaining tRNAs reside in non-coding regions that did not have the requisite number of TA sites to determine requirement. Using the
Himar1 transposon in organisms with less of a GC bias, or in organisms in which a less restricted transposon exists, should result in increased resolution
[4].
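The windowing logic described above can be illustrated with a minimal sketch. It is not the analysis code used in this study: the function names, the resampling test, and the toy parameters are all assumptions made for illustration, and the specific non-parametric test shown (comparing a window's total insertion count with the totals of equally sized random draws of TA sites from the whole genome) is a stand-in for whichever test was actually applied. Only the TA-site requirement of Himar1 and the exclusion of windows with 6 or fewer TA sites are taken from the text.

```python
import random

# From the text: windows with 6 or fewer TA sites lack power and are skipped.
MIN_TA_SITES = 7


def ta_sites(sequence):
    """Return 0-based positions of every TA dinucleotide (Himar1 target sites)."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "TA"]


def window_p_value(window_counts, genome_counts, n_resamples=2000, rng=None):
    """Hypothetical resampling test: how often does a random set of TA sites
    of the same size carry as few insertion reads as this window?"""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    observed = sum(window_counts)
    k = len(window_counts)
    hits = 0
    for _ in range(n_resamples):
        if sum(rng.sample(genome_counts, k)) <= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def scan(sites, counts, window_bp, step_bp, genome_length, **kwargs):
    """Slide a fixed-width window along the genome and test every window
    that contains enough TA sites. Returns (start, end, p_value) tuples."""
    results = []
    for start in range(0, genome_length - window_bp + 1, step_bp):
        end = start + window_bp
        in_window = [c for s, c in zip(sites, counts) if start <= s < end]
        if len(in_window) < MIN_TA_SITES:
            continue  # too few TA sites to call requirement either way
        results.append((start, end, window_p_value(in_window, counts, **kwargs)))
    return results
```

Here `sites` and `counts` are parallel lists giving each TA position and its insertion read count. In a sketch like this, the minimum-site filter is what ties resolution to sequence composition: a GC-rich region needs a wider window to accumulate enough TA sites, so small elements in such regions are never tested at all.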
The analysis we used provides a powerful tool to perform functional genome analysis. Importantly, this type of approach is useful not only for single conditions, as we described, but can also be used to identify elements critical under one growth condition but not another
[23]–
[25]. This is particularly important in organisms like Mtb, an obligate pathogen that never grows under conditions precisely comparable to those we use
in vitro. Coupling high-density insertion libraries with deep sequencing and analytic methods such as that described here provides a powerful experimental tool for functional genome annotation.