Insertion sequences (ISs) are mobile genetic elements in bacterial genomes. In general, intergenic IS elements are probably less deleterious for their hosts than intragenic ISs, simply because they have a lower likelihood of disrupting native genes. However, since promoters, Shine–Dalgarno sequences, and transcription factor binding sites are intergenic and upstream of genes, I hypothesized that not all neighboring gene orientations (NGOs) are selectively equivalent for IS insertion. To test this, I analyzed the NGOs of all intergenic ISs in 326 fully sequenced bacterial chromosomes. Of the 116 genomes with enough IS elements for statistical analysis, 68 have significantly more ISs between convergently oriented genes than expected, and 46 have significantly fewer ISs between divergently oriented genes. This suggests that natural selection molds intergenic IS distributions because they are least intrusive between convergent gene pairs and most intrusive between divergent gene pairs.
Insertion sequences (ISs) are common transposable elements in bacterial genomes. Although IS elements can generate beneficial mutations (Cooper et al. 2001; Safi et al. 2004), they are generally considered genomic parasites because they only code for the enzyme required for their own transposition (Doolittle and Sapienza 1980; Orgel and Crick 1980). While an IS element inhabits a chromosomal location, it is inherited along with its host’s native genes, so its fitness is intimately tied to that of its host. Therefore, an IS that causes a deleterious mutation by disrupting an essential gene will probably be quickly eliminated from most natural populations, whereas an IS that inserts into a selectively neutral location will have a much greater chance of long-term survival (Lynch 2006). As a general rule, intergenic IS elements probably enjoy higher survival than those that integrate within genes, simply because they have a lower likelihood of disrupting native genes (Campbell 2002; Zaghloul et al. 2007). However, the question then arises: are all intergenic regions selectively equivalent for IS occupancy?
Bacterial genes can be transcribed from either the top (→) or bottom (←) DNA strand. Therefore, neighboring genes on bacterial chromosomes can occur in three possible orientations: tandem (→→ and ←←), convergent (→←), and divergent (←→). Because promoters, Shine–Dalgarno sequences, and transcription factor binding sites are upstream of genes, I hypothesized that the intergenic regions of the three neighboring gene orientations (NGOs) may not be selectively equivalent for IS insertion. Specifically, the intergenic region between: 1) ←→ neighbors will contain a promoter and a Shine–Dalgarno sequence for both genes, and possibly a transcription factor binding site for both, 2) →→ and ←← neighbors will contain a promoter (if the neighbors are not in the same operon) and a Shine–Dalgarno sequence for the respective downstream gene only, and possibly a transcription factor binding site for that gene, and 3) →← neighbors will contain no promoters, Shine–Dalgarno sequences, or transcription factor binding sites. Therefore, an IS that inserts between ←→ genes has a relatively high likelihood of disrupting the transcription or translation of its neighbors, an IS that inserts between →→ or ←← genes has a moderate likelihood of disrupting its neighbors, and an IS that inserts between →← genes will never disrupt its neighbors. Because of this discrepancy among intergenic regions, I hypothesized that intergenic ISs would be most common between →← oriented genes and least common between ←→ oriented genes in bacterial genomes.
I tested this hypothesis by analyzing the NGOs of all intergenic ISs from 326 fully sequenced bacterial chromosomes. Of these, 116 genomes have enough ISs to meet χ² test assumptions (Cochran 1954). Remarkably, 64% of these genomes (N = 74) have observed intergenic IS quantities that deviate significantly (P ≤ 0.05) from expectations (under the null assumptions of random insertion and no natural selection) (table 1). These deviations are pervasive across the phylogenetic spectrum of Bacteria (table 1) and include a wide variety of IS families. Two NGOs exhibit extraordinary consistency in their contributions to these deviations: →← harbors significant IS excesses in 68 genomes and one significant deficit, and ←→ harbors two significant IS excesses and 46 significant deficits (fig. 1 and table 1). Overall, 105 of the 116 analyzed genomes contain more IS elements in the →← orientation than expected, and 104 contain fewer in the ←→ orientation than expected (the binomial probabilities of having distributions at least this skewed just by chance are 1.1 × 10⁻²⁰ and 9.3 × 10⁻²⁰, respectively) (table 1). These nonrandom IS distributions also extend to bacterial chromosomes that contain relatively few IS elements. Specifically, of the 131 genomes that do not contain enough ISs for statistical analysis (Cochran 1954) but that have ≥1 expected IS in each NGO, 117 genomes contain more IS elements in the →← orientation than expected, and 108 contain fewer in the ←→ orientation than expected (the binomial probabilities of having distributions at least this skewed just by chance are 1.0 × 10⁻²¹ and 1.1 × 10⁻¹⁴, respectively) (supplementary table S1, Supplementary Material online).
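The binomial probabilities for the 116 analyzed genomes can be approximated with a one-sided sign test. The sketch below is illustrative only: it assumes that, under the null, a genome is equally likely to show an excess or a deficit in a given NGO (p = 0.5), which the text does not state explicitly, and the function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 105 of the 116 analyzed genomes contain more ISs than expected between →← genes.
p_convergent = binomial_upper_tail(105, 116)

# 104 of the 116 analyzed genomes contain fewer ISs than expected between ←→ genes.
p_divergent = binomial_upper_tail(104, 116)

print(f"convergent excess: {p_convergent:.1e}")
print(f"divergent deficit: {p_divergent:.1e}")
```

Only the Python standard library is used, so the tail is summed exactly term by term rather than taken from a statistics package.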
One possible explanation for these nonrandom IS distributions is a general insertion bias into →← and away from ←→ intergenic regions. I doubt that such a bias would result from target sequence specificity, largely because IS target site preferences are rarely very stringent or very long (Chandler and Mahillon 2002), so suitable insertion locations for many ISs occur thousands of times in each genome (Zaghloul et al. 2007). Instead, insertion bias could result from chromosomal differences between the three NGOs. For example, as bacterial genes are transcribed, DNA becomes positively supercoiled ahead of the polymerase and negatively supercoiled behind (Liu and Wang 1987). Consequently, the region between →← oriented genes may often be positively supercoiled, more so than between the other NGOs (and conversely, the region between ←→ genes may often be the most negatively supercoiled). If IS elements preferentially insert into positively supercoiled DNA, then this could explain the overabundances and underabundances of ISs between →← and ←→ oriented genes, respectively (fig. 1). However, no evidence exists for such an insertion bias, and some transposons prefer the opposite: negatively supercoiled DNA (Lodge and Berg 1990). Another possibility is that IS elements generally preferentially insert downstream of genes; for example, near transcription termination sequences. At least one IS element exhibits such a preference (Tetu and Holmes 2008), although this is not a ubiquitous tendency among ISs because some exhibit the opposite preference, inserting upstream of genes between Shine–Dalgarno sequences and start codons (Doran et al. 1997; Inglis et al. 2003). Therefore, insertion bias may affect the distribution of some IS elements in some bacterial genomes, although it is unlikely to explain the widespread bias exhibited across Bacteria (table 1).
Without any evidence for systematic IS insertion bias to explain these nonrandom IS distributions (table 1), the most likely explanation at present is that natural selection molds intergenic IS distributions. From a host bacterium’s perspective, not all potential IS insertion locations are equally viable, and natural selection eventually eliminates disadvantageous genotypes from most populations. In fact, few IS elements are probably truly selectively neutral because at the very least they appropriate host resources for transposase expression (Nuzhdin 1999). So unless a particular IS element beneficially impacts its host (Safi et al. 2004), the likely fate of most ISs is eventual extinction from their host population (Wagner 2006). For an individual IS locus, the likelihood of extinction is largely correlated with its fitness cost, with the most deleterious ISs eliminated most quickly, and those inserting in innocuous locations having the greatest potential for long-term survival (Lynch 2006). Therefore, the most innocuous ISs will be overrepresented in bacterial genomes, and the most deleterious will be underrepresented. The remarkable consistency with which intergenic IS elements are overrepresented between →← oriented genes and underrepresented between ←→ oriented genes (fig. 1) suggests that these are generally relatively innocuous and relatively deleterious insertion locations, respectively, thus supporting the hypothesis that differential selection pressure molds global intergenic IS distributions. Further fine-scale analyses of intergenic IS distributions (e.g., ISs may be less common between →→ and ←← neighbors when they are members of the same operon; ISs may be relatively rare next to highly expressed genes, no matter what their orientation) may shed additional light on the fate and impact of IS elements in bacterial genomes.
I obtained the primary annotations of all fully sequenced bacterial chromosomes from the Comprehensive Microbial Resource database (data releases 1.0–20.0) at The Institute for Genomic Research (http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi). Specifically, I obtained the locus name (i.e., the locus number), the common name, the nucleotide sequence, and the nucleotide positions of the 5' and 3' ends of all annotated proteins on each chromosome. My goal for each genome was to assess whether the observed quantities of intergenic IS elements located within each of the three NGOs differ from the quantities expected if insertion is random and not subsequently influenced by natural selection. This required four steps for each fully sequenced genome.
The first step was to find all chromosomal copies of intergenic IS elements. I used the BlastX program in the ISfinder database (http://www-is.biotoul.fr/is.html) (Siguier et al. 2006) to identify all coding sequences (CDSs) in each genome that exhibit homology to IS elements in the database. I considered a CDS with a best BlastX hit E value ≤10⁻¹⁰ to be an IS element (Touchon and Rocha 2007). Because I was only interested in the distribution of ISs between functional native bacterial genes, I took a relatively conservative approach when identifying intergenic IS elements (i.e., it is better to exclude some intergenic ISs than to include any intragenic ISs). Specifically, I eliminated the following IS elements from the analysis: 1) all intragenic ISs, including elements with at least one neighboring gene annotated as being truncated (or similar synonyms), conservatively assuming that the neighboring gene became degenerate following IS insertion into the gene; 2) all ISs bordered by genes with annotated frameshift or point mutations that introduce premature stop codons, conservatively assuming that these mutations preceded IS insertion; that is, the IS was never exposed to selection from two functional neighboring genes; 3) all ISs bordered by nonconsecutively numbered and therefore presumably nonneighboring genes (e.g., some are bordered by nonannotated gene remnants, which may have become degenerate following IS insertion); and 4) all ISs bordered by a phage-annotated gene, and those annotated as being or bordering an integron or an integrative genetic element (for the quantities of ISs eliminated for each of these reasons in each genome, see supplementary table S2, Supplementary Material online). Conversely, I included IS elements with both functional and nonfunctional transposases because ISs can affect their neighboring genes even if they are no longer mobile (e.g., by displacing promoters). Also, multiple IS insertions into the same intergenic space were included only once in the analysis.
The second step was to calculate the observed quantity of intergenic IS elements within each NGO (i.e., assessing whether the two neighboring genes are coded on the top or bottom DNA strand for each IS element). I did this by simply subtracting the nucleotide position of the 5' end from that of the 3' end for each neighbor, which produces a positive number for top strand genes and a negative number for bottom strand genes.
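The strand assignment described in this step can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names `strand` and `classify_ngo` and the example coordinates are hypothetical, and each gene is assumed to be given as a (5' position, 3' position) pair, with the left gene preceding the right gene in chromosome coordinates.

```python
from __future__ import annotations


def strand(five_prime: int, three_prime: int) -> str:
    """Return '+' for a top-strand gene and '-' for a bottom-strand gene.

    Subtracting the 5' position from the 3' position gives a positive
    number for top-strand genes and a negative number for bottom-strand genes.
    """
    return "+" if three_prime - five_prime > 0 else "-"


def classify_ngo(left_gene: tuple[int, int], right_gene: tuple[int, int]) -> str:
    """Classify the orientation of the two genes flanking an intergenic IS."""
    orientation = strand(*left_gene) + strand(*right_gene)
    return {
        "++": "tandem",      # →→
        "--": "tandem",      # ←←
        "+-": "convergent",  # →←
        "-+": "divergent",   # ←→
    }[orientation]


# Hypothetical coordinates: the left gene runs 100 -> 900 (top strand) and the
# right gene runs 2000 -> 1200 (bottom strand), so the pair is convergent.
print(classify_ngo((100, 900), (2000, 1200)))
```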
The third step was to calculate the expected quantity of intergenic IS elements within each NGO, assuming that IS insertion is random and not subsequently affected by natural selection. I calculated these expected quantities based on the premise that large and abundant NGO intergenic regions should receive more ISs than small and rare ones, all things being equal. Therefore, the expected quantities were calculated individually for each genome using the product of 1) the mean intergenic distance between neighboring native bacterial genes in the three NGOs and 2) the global proportion of each native gene pair NGO; for an example of this calculation, see table S3 (Supplementary Material online).
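One way to read this calculation is sketched below. It weights each NGO by the product of its mean intergenic distance and its share of native gene pairs, then scales the weights to the number of intergenic ISs observed in the genome. The scaling step is an assumption made here so that the expected counts sum to the observed total; the worked example in table S3 is not reproduced in the text. The function name and all input values are hypothetical.

```python
def expected_is_counts(
    mean_distance: dict[str, float],
    pair_proportion: dict[str, float],
    total_is: int,
) -> dict[str, float]:
    """Expected intergenic IS count per NGO under random insertion.

    mean_distance: mean intergenic distance (bp) for each NGO.
    pair_proportion: proportion of native gene pairs in each NGO.
    total_is: number of intergenic ISs observed in the genome.
    """
    # Weight = mean intergenic distance x proportion of gene pairs.
    weights = {
        ngo: mean_distance[ngo] * pair_proportion[ngo] for ngo in mean_distance
    }
    total_weight = sum(weights.values())
    # Assumption: scale the weights so the expected counts sum to total_is.
    return {ngo: total_is * w / total_weight for ngo, w in weights.items()}


# Illustrative inputs only; these are not values from any analyzed genome.
expected = expected_is_counts(
    mean_distance={"tandem": 100.0, "convergent": 200.0, "divergent": 300.0},
    pair_proportion={"tandem": 0.5, "convergent": 0.25, "divergent": 0.25},
    total_is=35,
)
print(expected)
```

With these made-up inputs the weights are 50, 50, and 75, so 35 ISs would be split 10, 10, and 15 across tandem, convergent, and divergent regions.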
Finally, the fourth step was to use a χ² goodness-of-fit test to assess whether the observed quantities of intergenic IS elements within each NGO deviate from the expected quantities. The assumptions of the χ² test are that no cell has an expected value <1.0 and that ≤20% of cells have expected values <5.0 (Cochran 1954). Therefore, many fully sequenced genomes do not contain enough intergenic IS elements for statistical analysis (all 116 genomes with enough intergenic ISs are included in table 1, and the remaining 210 genomes are included in table S1, Supplementary Material online). I did not Bonferroni-adjust the χ² test P values (Moran 2003), although all χ² values that would be significant with a Bonferroni correction are indicated in table 1. To identify the NGOs contributing to each significant χ² deviation, I performed cell-by-cell comparisons of observed and expected quantities using an adjusted residual method, considering any adjusted residual with an absolute value >2 to contribute significantly to the overall χ² deviation (Agresti 1996).
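The test and the residual check can be sketched with the Python standard library alone. Several points here are assumptions rather than details given in the text: the three NGO categories are treated as a one-way table with 2 degrees of freedom (for which the χ² upper-tail probability is exactly exp(−χ²/2)), the adjusted residual is taken as (O − E)/√(E(1 − E/N)), and the function names and counts are hypothetical.

```python
from math import exp, sqrt


def meets_cochran(expected: dict[str, float]) -> bool:
    """No expected count < 1.0, and at most 20% of cells with expected < 5.0."""
    values = list(expected.values())
    small_cells = sum(1 for e in values if e < 5.0)
    return min(values) >= 1.0 and small_cells / len(values) <= 0.20


def chi_square_gof(observed: dict[str, float], expected: dict[str, float]):
    """Return (chi2, p_value) for a three-category goodness-of-fit test."""
    if len(expected) != 3:
        raise ValueError("the closed-form p-value below assumes df = 2")
    chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)
    # For df = 2 the chi-square survival function is exactly exp(-x / 2).
    return chi2, exp(-chi2 / 2)


def adjusted_residuals(observed: dict[str, float], expected: dict[str, float]):
    """Adjusted residual per cell; |value| > 2 flags a significant contributor."""
    n = sum(expected.values())
    return {
        k: (observed[k] - expected[k]) / sqrt(expected[k] * (1 - expected[k] / n))
        for k in expected
    }


# Illustrative counts only; these are not values from table 1.
observed = {"tandem": 20, "convergent": 25, "divergent": 5}
expected = {"tandem": 25.0, "convergent": 12.5, "divergent": 12.5}

if meets_cochran(expected):
    chi2, p_value = chi_square_gof(observed, expected)
    residuals = adjusted_residuals(observed, expected)
    flagged = [ngo for ngo, r in residuals.items() if abs(r) > 2]
    print(chi2, p_value, flagged)
```

In this made-up example the convergent excess and the divergent deficit both exceed the |2| threshold, while the tandem cell does not.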
I thank Huansheng Cao, Kevin Dougherty, Evelyn Fetridge, Catherine Ruggiero, Chad Thompson, and several anonymous reviewers for helpful comments on this manuscript, and Elizabeth Coffey for early contributions to this project. This work was supported by the National Institutes of Health (grant number 1R15GM081862-01A1). This is contribution number 246 of the Louis Calder Center—Biological Field Station, Fordham University.