It has been observed that the majority of bacterial genes tend to be located on the leading strand of a genome, and that the percentage of such genes varies widely across bacteria, ranging from ~45% to ~90% (1). A number of studies have sought to explain these observations. A key factor considered in these studies is the different mechanisms that bacterial cells use to replicate the leading and lagging strands when replication and transcription occur simultaneously (3). Specifically, during chromosomal replication, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) polymerases move in the same direction on the leading strand but in opposite directions on the lagging strand. This creates the possibility of head-on collisions between the two polymerases during transcription of some genes on the lagging strand, which makes the lagging strand the less efficient of the two (1). In an earlier study, Brewer (3) suggested that bacterial cells may be under selection pressure to have highly expressed genes reside on the leading strand. Rocha and Danchin (5) later argued that it is the essentiality of genes, rather than their required expression levels, that may have driven certain genes to the leading strand. Although this interpretation seems to be correct, it provides only a partial answer, as essential genes account for only a small portion of the gene set encoded in a bacterial genome, e.g. ~10% in Escherichia coli and ~10% in Bacillus subtilis. Price et al. observed that longer operons tend to be on the leading strand and suggested that there may be selection pressure for this arrangement, which avoids interruptions during transcription of such operons. Furthermore, Rocha (6) observed that the presence or absence of the DNA polymerase PolC in a genome is highly correlated with whether that genome has at least 70% of its genes on the leading strand. Hu et al. proposed that replication-associated purine asymmetry may also contribute to the strand bias in a genome. In addition, Lin et al. found that the essential genes on the leading strand are enriched in only a few subcategories of clusters of orthologous groups (14). Although this analysis provided useful insights into the functional preferences of genes for the leading and lagging strands, a larger analysis involving more genes and organisms is needed to establish the generality of the observation. More importantly, the general questions of why the majority of bacterial genes tend to be located on the leading strand, and why the percentage of leading-strand genes varies so widely across organisms, remain largely unanswered.
We present in this study a computational analysis of all the sequenced bacterial genomes, aiming to provide a more general explanation for the above two observations. Our key findings are that (i) genes of different functional categories differ in their tendency to be on the leading strand; (ii) genes of some functional categories, such as transcription factors, show a preference for the lagging strand; (iii) there is at least one balancing force that keeps genes from all moving to the leading strand during evolution, i.e. a more balanced genome facilitates a higher gene density; and (iv) the percentage of leading-strand genes for a bacterium can be accurately explained in terms of the genes in some of the functional categories outlined in (i) and (ii), genome size and gene density. On the basis of these findings, we believe that the percentage of genes on the leading versus the lagging strand in a genome is the result of two sets of balancing forces: one that tends to drive genes of certain functional categories to the leading strand, making the bacterium more efficient in its responses to environmental changes, and one that tends to keep the genome as compact as possible, so that replicating and maintaining the genome remains energetically efficient.
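To make the central quantity concrete, the following is a minimal sketch of how the percentage of leading-strand genes could be computed for a single circular chromosome. It is an illustration under stated assumptions, not the procedure used in this study: the origin (`ori`) and terminus (`ter`) coordinates are assumed to be known, each gene is assumed to be given as a hypothetical `(start, end, strand)` tuple that does not wrap around coordinate 0, and a gene is assigned to a replichore by its midpoint. A gene is counted as leading when its transcription direction matches the direction of replication fork movement in its replichore.

```python
def in_clockwise_replichore(pos, ori, ter, genome_len):
    """True if pos lies on the arc from ori to ter in increasing coordinates."""
    return (pos - ori) % genome_len < (ter - ori) % genome_len


def is_leading(start, end, strand, ori, ter, genome_len):
    """Classify one gene as leading (True) or lagging (False).

    Assumption: the gene does not span coordinate 0, so its midpoint
    is a valid position for deciding which replichore it belongs to.
    """
    midpoint = ((start + end) // 2) % genome_len
    fork_moves_forward = in_clockwise_replichore(midpoint, ori, ter, genome_len)
    # '+' genes are co-directional with a fork moving in increasing
    # coordinates; '-' genes are co-directional with the opposite fork.
    return (strand == "+") == fork_moves_forward


def leading_fraction(genes, ori, ter, genome_len):
    """Fraction of genes whose transcription is co-directional with replication."""
    if not genes:
        raise ValueError("gene list is empty")
    n_leading = sum(
        is_leading(start, end, strand, ori, ter, genome_len)
        for start, end, strand in genes
    )
    return n_leading / len(genes)


# Toy example with made-up coordinates (not real annotation data).
toy_genes = [
    (100, 200, "+"),  # clockwise replichore, co-directional
    (300, 400, "-"),  # clockwise replichore, head-on
    (600, 700, "-"),  # counter-clockwise replichore, co-directional
    (800, 900, "+"),  # counter-clockwise replichore, head-on
]
toy_fraction = leading_fraction(toy_genes, ori=0, ter=500, genome_len=1000)
```

In practice the origin and terminus are usually not annotated directly and have to be estimated, so any value computed this way depends on how those two positions are chosen.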