In bacteria and archaea, viruses are the primary infectious agents, acting as virulent, often deadly pathogens. A form of adaptive immune defense known as CRISPR-Cas enables microbial cells to acquire immunity to viral pathogens by recognizing specific sequences encoded in viral genomes. The unique biology of this system results in evolutionary dynamics of host and viral diversity that cannot be fully explained by the traditional models used to describe microbe-virus coevolutionary dynamics. Here, we show how the CRISPR-mediated adaptive immune response of hosts to invading viruses facilitates the emergence of an evolutionary mode we call distributed immunity - the coexistence of multiple, equally-fit immune alleles among individuals in a microbial population. We use an eco-evolutionary modeling framework to quantify distributed immunity and demonstrate how it emerges and fluctuates in multi-strain communities of hosts and viruses as a consequence of CRISPR-induced coevolution under conditions of low viral mutation and high relative numbers of viral protospacers. We demonstrate that distributed immunity promotes sustained diversity and stability in host communities and decreased viral population density that can lead to viral extinction. We analyze sequence diversity of experimentally coevolving populations of Streptococcus thermophilus and their viruses where CRISPR-Cas is active, and find the rapid emergence of distributed immunity in the host population, demonstrating the importance of this emergent phenomenon in evolving microbial communities.
Quantifying diversity is of central importance for the study of structure, function and evolution of microbial communities. The estimation of microbial diversity has received renewed attention with the advent of large-scale metagenomic studies. Here, we consider what the diversity observed in a sample tells us about the diversity of the community being sampled. First, we argue that one cannot reliably estimate the absolute and relative number of microbial species present in a community without making unsupported assumptions about species abundance distributions. The reason for this is that sample data do not contain information about the number of rare species in the tail of species abundance distributions. We illustrate the difficulty in comparing species richness estimates by applying Chao's estimator of species richness to a set of in silico communities: they are ranked incorrectly in the presence of large numbers of rare species. Next, we extend our analysis to a general family of diversity metrics ('Hill diversities'), and construct lower and upper estimates of diversity values consistent with the sample data. The theory generalizes Chao's estimator, which we retrieve as the lower estimate of species richness. We show that Shannon and Simpson diversity can be robustly estimated for the in silico communities. We analyze nine metagenomic data sets from a wide range of environments, and show that our findings are relevant for empirically-sampled communities. Hence, we recommend the use of Shannon and Simpson diversity rather than species richness in efforts to quantify and compare microbial diversity.
Chao estimator; Hill diversities; metagenomics; Shannon diversity; Simpson diversity; species abundance distribution
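To make the contrast between richness and Hill diversities concrete, the following minimal sketch computes the standard Chao1 richness estimate and plug-in Hill diversities from a list of species counts. The function names are illustrative assumptions, and the plug-in Hill values are simple sample estimates, not the lower and upper estimates constructed in the abstract above.

```python
import math

def chao1(counts):
    """Chao1 estimate of species richness from per-species sample counts."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # species seen exactly once
    f2 = sum(1 for c in counts if c == 2)  # species seen exactly twice
    if f2 > 0:
        return observed + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used when no doubletons are observed.
    return observed + f1 * (f1 - 1) / 2.0

def hill_diversity(counts, q):
    """Plug-in Hill diversity of order q (q=0 richness, q=1 exp(Shannon), q=2 inverse Simpson)."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if q == 1:
        # The q -> 1 limit is the exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))
```

Chao1 depends only on the singleton and doubleton counts, which is why it is sensitive to the rare tail of the abundance distribution, whereas Hill diversities with q of 1 or 2 are dominated by the common species.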
Leaf vein networks are critical to both the structure and function of leaves. A growing body of recent work has linked leaf vein network structure to the physiology, ecology and evolution of land plants. In the process, multiple institutions and individual researchers have assembled collections of cleared leaf specimens in which vascular bundles (veins) are rendered visible. In an effort to facilitate analysis and digitally preserve these specimens, high-resolution images are usually created, either of entire leaves or of magnified leaf subsections. In a few cases, collections of digital images of cleared leaves are available for use online. However, these collections do not share a common platform nor is there a means to digitally archive cleared leaf images held by individual researchers (in addition to those held by institutions). Hence, there is a growing need for a digital archive that enables online viewing, sharing and disseminating of cleared leaf image collections held by both institutions and individual researchers.
The Cleared Leaf Image Database (ClearedLeavesDB) is a web-based resource for a community of researchers to contribute, access and share cleared leaf images. ClearedLeavesDB leverages resources of large-scale, curated collections while enabling the aggregation of small-scale collections within the same online platform. ClearedLeavesDB is built on Drupal, an open source content management platform. It allows plant biologists to store leaf images online with corresponding meta-data, share image collections with a user community and discuss images and collections via a common forum. We provide tools to upload processed images and results to the database via a web services client application that can be downloaded from the database.
We developed ClearedLeavesDB, a database focusing on cleared leaf images that combines interactions between users and data via an intuitive web interface. The web interface allows storage of large collections and integrates with leaf image analysis applications via an open application programming interface (API). The open API allows uploading of processed images and other trait data to the database, further enabling distribution and documentation of analyzed data within the community. The initial database is seeded with nearly 19,000 cleared leaf images representing over 40 GB of image data. Extensible storage and growth of the database is ensured by using the data storage resources of the iPlant Discovery Environment. ClearedLeavesDB can be accessed at http://clearedleavesdb.org.
Database; Images; Leaves; Digital archiving; Digital curation
Tip-driven growth processes underlie the development of many plants. To date, tip-driven growth processes have been modeled as an elongating path or series of segments, without taking into account lateral expansion during elongation. Instead, models of growth often introduce an explicit thickness by expanding the area around the completed elongated path. Modeling expansion in this way can lead to contradictions in the physical plausibility of the resulting surface and to uncertainty about how the object reached certain regions of space. Here, we introduce fiber walks as a self-avoiding random walk model for tip-driven growth processes that includes lateral expansion. In 2D, the fiber walk takes place on a square lattice and the space occupied by the fiber is modeled as a lateral contraction of the lattice. This contraction influences the possible subsequent steps of the fiber walk. The boundary of the area consumed by the contraction is derived as the dual of the lattice faces adjacent to the fiber. We show that fiber walks generate fibers that have well-defined curvatures, and thus enable the identification of the process underlying the occupancy of physical space. Hence, fiber walks provide a base from which to model both the extension and expansion of physical biological objects with finite thickness.
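As a point of reference for the fiber walk, the sketch below implements only the baseline it extends: a growing self-avoiding random walk on a 2D square lattice. It does not implement the lateral lattice contraction described above. The function name and parameters are illustrative assumptions.

```python
import random

def self_avoiding_walk(max_steps, seed=0):
    """Grow a self-avoiding walk on the square lattice until trapped or max_steps is reached."""
    rng = random.Random(seed)
    pos = (0, 0)
    path = [pos]
    visited = {pos}
    for _ in range(max_steps):
        x, y = pos
        neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        free = [p for p in neighbours if p not in visited]
        if not free:
            break  # the tip is trapped: every neighbouring site is occupied
        pos = rng.choice(free)
        path.append(pos)
        visited.add(pos)
    return path
```

In this baseline the path has no thickness. A fiber walk would additionally alter which sites remain reachable after each step to account for lateral expansion.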
The Spanish government recently announced an official fast-track path to citizenship for any individual who is Jewish and whose ancestors were expelled from Spain during the Inquisition-related dislocation of Spanish Jews in 1492. It would seem that this policy targets a small subset of the global Jewish population, that is, restricted to individuals who retain cultural practices associated with ancestral origins in Spain. However, the central contribution of this manuscript is to demonstrate how and why the policy is far more likely to apply to a very large fraction (i.e., the vast majority) of Jews. This claim is supported using a series of genealogical models that include transmissible “identities” and preferential intra-group mating. Model analysis reveals that even when intra-group mating is strong and even if only a small subset of a present-day population retains cultural practices typically associated with that of an ancestral group, it is highly likely that nearly all members of that population have direct genealogical links to that ancestral group, provided that a sufficient number of generations has elapsed. The basis for this conclusion is that not having a link to an ancestral group must be a property of all of an individual’s ancestors, the probability of which declines (nearly) superexponentially with each successive generation. These findings highlight unexpected incongruities induced by genealogical dynamics between present-day and ancestral identities.
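The superexponential decline can be illustrated with a deliberately simplified toy calculation. It is not the genealogical model of the manuscript. It assumes that every ancestor g generations back is distinct (2^g ancestors, no pedigree collapse) and that each belongs to the ancestral group independently with a hypothetical probability f, ignoring preferential intra-group mating.

```python
def prob_no_link(f, generations):
    """Toy probability that none of the 2**generations ancestors belongs to the ancestral group.

    Assumes independent ancestors and no pedigree collapse (illustrative only).
    """
    n_ancestors = 2 ** generations
    return (1.0 - f) ** n_ancestors

if __name__ == "__main__":
    for g in (1, 5, 10, 15):
        print(g, prob_no_link(0.01, g))
```

Because the exponent doubles every generation, even a small f drives the probability of having no link toward zero within a modest number of generations under these assumptions.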
Recent studies have shown that proximal arrangement of multiple genes can have complex effects on gene expression. For example, in the case of heterologous gene expression modules, certain arrangements of the selection marker and the gene expression cassette may have unintended consequences that limit the predictability and interpretability of module behaviors. The relationship between arrangement and expression has not been systematically characterized within heterologous modules to date. In this study, we quantitatively measured gene expression patterns of the selection marker (KlURA3 driven by the promoter, pKlURA) and the gene expression cassette (GFP driven by the galactose-inducible GAL1 promoter, pGAL1) in all their possible relative arrangements in Saccharomyces cerevisiae. First, we observed that pKlURA activity depends strongly on the relative arrangement and the activity of pGAL1. Most notably, we observed transcriptional suppression in the case of divergent arrangements: pKlURA activity was reduced when pGAL1 was inactive. Based on our nucleosome occupancy data, we attribute the observed transcriptional reduction to nucleosome repositioning. Second, we observed that pGAL1 activity also depends on the relative arrangement of pKlURA. In particular, strains with divergent promoters showed significantly different pGAL1 activation patterns from other strains, but only when their growth was compromised by lack of uracil. We reasoned that this difference in pGAL1 activation patterns arises from arrangement-dependent pKlURA activity that can affect the overall cell physiology (i.e., cell growth and survival in the uracil-depleted condition). Our results underscore the necessity to consider ramifications of promoter arrangement when using synthetic gene expression modules.
bidirectional promoter; coexpression; heterologous gene expression; nucleosome positioning
Gram-positive bacteria can transport molecules necessary for their survival through holes in their cell wall. The holes in cell walls need to be large enough to let critical nutrients pass through. However, the cell wall must also function to prevent the bacteria's membrane from protruding through a large hole into the environment and lysing the cell. As such, we hypothesize that there exists a range of cell wall hole sizes that allow for molecule transport but prevent membrane protrusion. Here, we develop and analyse a biophysical theory of the response of a Gram-positive cell's membrane to the formation of a hole in the cell wall. We predict a critical hole size in the range of 15–24 nm beyond which lysis occurs. To test our theory, we measured hole sizes in Streptococcus pyogenes cells undergoing enzymatic lysis via transmission electron microscopy. The measured hole sizes are in strong agreement with our theoretical prediction. Together, the theory and experiments provide a means to quantify the mechanisms of death of Gram-positive cells via enzymatically mediated lysis and provide insights into the range of cell wall hole sizes compatible with bacterial homeostasis.
enzybiotic; biophysics; membrane dynamics; microbiology
Bacteriophages are the most abundant biological life forms on Earth. However, relatively little is known regarding which bacteriophages infect and exploit which bacteria. A recent meta-analysis showed that empirically measured phage-bacteria infection networks are often significantly nested, on average, and not modular. A perfectly nested network is one in which phages can be ordered from specialist to generalist such that the host range of a given phage is a subset of the host range of the subsequent phage in the ordering. The same meta-analysis hypothesized that modularity, in which groups of phages specialize on distinct groups of hosts, should emerge at larger geographic and/or taxonomic scales. In this paper, we evaluate the largest known phage-bacteria interaction data set, representing the interaction of 215 phage types with 286 host types sampled from geographically separated sites in the Atlantic Ocean. We find that this interaction network is highly modular. In addition, some of the modules identified in this data set are nested or contain submodules, indicating the presence of multi-scale structure, as hypothesized in the earlier meta-analysis. We examine the role of geography in driving these patterns and find evidence that the host range of phages and the phage permissibility of bacteria are driven, in part, by geographic separation. We conclude by discussing approaches to disentangle the roles of ecology and evolution in driving complex patterns of interaction between phages and bacteria.
microbial ecology; viruses; biogeography; networks
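The definition of a perfectly nested network given above translates directly into a subset check. The sketch below is illustrative: the function name and the dictionary representation of host ranges are assumptions, and it tests only for perfect nestedness, not for the statistical nestedness or modularity measures used in the study.

```python
def is_perfectly_nested(host_ranges):
    """Return True if phages can be ordered so that each host range is a subset of the next.

    host_ranges maps a phage name to the set of hosts it infects.
    """
    # Ordering from specialist to generalist means ordering by host-range size.
    ordered = sorted(host_ranges.values(), key=len)
    return all(narrow <= broad for narrow, broad in zip(ordered, ordered[1:]))
```

For example, host ranges {h1}, {h1, h2}, {h1, h2, h3} are perfectly nested, whereas two phages that each infect a different single host form a modular pattern and fail the check.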
The direct “metagenomic” sequencing of genomic material from complex assemblages of bacteria, archaea, viruses and microeukaryotes has yielded new insights into the structure of microbial communities. For example, analysis of metagenomic data has revealed the existence of previously unknown microbial taxa whose spatial distributions are limited by environmental conditions, ecological competition, and dispersal mechanisms. However, differences in genotypes that might lead biologists to designate two microbes as taxonomically distinct need not necessarily imply differences in ecological function. Hence, there is a growing need for large-scale analysis of the distribution of microbial function across habitats. Here, we present a framework for investigating the biogeography of microbial function by analyzing the distribution of protein families inferred from environmental sequence data across a global collection of sites. We map over 6,000,000 protein sequences from unassembled reads from the Global Ocean Survey dataset to protein families, generating a protein family relative abundance matrix that describes the distribution of each protein family across sites. We then use non-negative matrix factorization (NMF) to approximate these protein family profiles as linear combinations of a small number of ecological components. Each component has a characteristic functional profile and site profile. Our approach identifies common functional signatures within several of the components. We use our method as a filter to estimate functional distance between sites, and find that an NMF-filtered measure of functional distance is more strongly correlated with environmental distance than a comparable PCA-filtered measure. We also find that functional distance is more strongly correlated with environmental distance than with geographic distance, in agreement with prior studies. 
We identify similar protein functions in several components and suggest that functional co-occurrence across metagenomic samples could lead to future methods for de-novo functional prediction. We conclude by discussing how NMF, and other dimension reduction methods, can help enable a macroscopic functional description of marine ecosystems.
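The factorization step can be sketched with the classic multiplicative update rules for NMF under a Frobenius-norm objective. This is a minimal pure-Python illustration. The choice of update rule, the function names and the parameters are assumptions, since the abstract does not state which NMF algorithm was used. Here V plays the role of the protein family by site abundance matrix, W holds the functional profiles of the components and H their site profiles.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5

def nmf(V, k, iterations=200, seed=0, eps=1e-9):
    """Approximate non-negative V (n x m) as W (n x k) times H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative, which is what allows each component to be read as an additive functional and site profile.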
Viruses are the most abundant life forms on Earth, with an estimated 10^31 total viruses globally. The majority of these viruses infect microbes, whether bacteria, archaea or microeukaryotes. Given the importance of microbes in driving global biogeochemical cycles, it would seem, based on numerical abundances alone, that viruses also play an important role in the global cycling of carbon and nutrients. However, the importance of viruses in controlling host populations and ecosystem functions, such as the regeneration, storage and export of carbon and other nutrients, remains unresolved. Here, we report on advances in the study of ecological effects of viruses of microbes. In doing so, we focus on an area of increasing importance: the role that ocean viruses play in shaping microbial population sizes as well as in regenerating carbon and other nutrients.
Characterizing root system architecture (RSA) is essential to understanding the development and function of vascular plants. Identifying RSA-associated genes also represents an underexplored opportunity for crop improvement. Software tools are needed to accelerate the pace at which quantitative traits of RSA are estimated from images of root networks.
We have developed GiA Roots (General Image Analysis of Roots), a semi-automated software tool designed specifically for the high-throughput analysis of root system images. GiA Roots includes user-assisted algorithms to distinguish root from background and a fully automated pipeline that extracts dozens of root system phenotypes. Quantitative information on each phenotype, along with intermediate steps for full reproducibility, is returned to the end-user for downstream analysis. GiA Roots has a GUI front end and a command-line interface for interweaving the software into large-scale workflows. GiA Roots can also be extended to estimate novel phenotypes specified by the end-user.
We demonstrate the use of GiA Roots on a set of 2393 images of rice roots representing 12 genotypes from the species Oryza sativa. We validate trait measurements against prior analyses of this image set that demonstrated that RSA traits are likely heritable and associated with genotypic differences. Moreover, we demonstrate that GiA Roots is extensible and an end-user can add functionality so that GiA Roots can estimate novel RSA traits. In summary, we show that the software can function as an efficient tool as part of a workflow to move from large numbers of root images to downstream analysis.
The structure of hierarchical networks in biological and physical systems has long been characterized using the Horton-Strahler ordering scheme. The scheme assigns an integer order to each edge in the network based on the topology of branching such that the order increases from distal parts of the network (e.g., mountain streams or capillaries) to the “root” of the network (e.g., the river outlet or the aorta). However, Horton-Strahler ordering cannot be applied to networks with loops because they create a contradiction in the edge ordering in terms of which edge precedes another in the hierarchy. Here, we present a generalization of the Horton-Strahler order to weighted planar reticular networks, where weights are assumed to correlate with the importance of network edges, e.g., weights estimated from edge widths may correlate to flow capacity. Our method assigns hierarchical levels not only to edges of the network, but also to its loops, and classifies the edges into reticular edges, which are responsible for loop formation, and tree edges. In addition, we perform a detailed and rigorous theoretical analysis of the sensitivity of the hierarchical levels to weight perturbations. In doing so, we show that the ordering of the reticular edges is more robust to noise in weight estimation than is the ordering of the tree edges. We discuss applications of this generalized Horton-Strahler ordering to the study of leaf venation and other biological networks.
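For reference, the classic Horton-Strahler scheme on a loop-free (tree) network can be sketched as follows. This is the standard tree-only rule, not the generalization to weighted reticular networks presented above. The adjacency representation and function name are illustrative assumptions. The order computed for a node can be read as the order of the edge connecting it to its parent.

```python
def strahler_order(children, node):
    """Classic Horton-Strahler order of `node` in a rooted tree.

    children maps a node to the list of its child nodes; tips have no entry.
    """
    kids = children.get(node, [])
    if not kids:
        return 1  # distal tips (e.g., mountain streams) have order 1
    orders = sorted((strahler_order(children, k) for k in kids), reverse=True)
    # The order increases only when the two highest-order branches tie.
    if len(orders) > 1 and orders[0] == orders[1]:
        return orders[0] + 1
    return orders[0]
```

The recursion relies on every node having a unique parent. A loop removes that uniqueness, which is the contradiction the abstract refers to.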
The gene composition of bacteria of the same species can differ significantly between isolates. Variability in gene composition can be summarized in terms of gene frequency distributions, in which individual genes are ranked according to the frequency of genomes in which they appear. Empirical gene frequency distributions possess a U-shape, such that there are many rare genes, some genes of intermediate occurrence, and many common genes. It would seem that U-shaped gene frequency distributions can be used to infer the essentiality and/or importance of a gene to a species. Here, we ask: can U-shaped gene frequency distributions, instead, arise generically via neutral processes of genome evolution?
We introduce a neutral model of genome evolution which combines birth-death processes at the organismal level with gene uptake and loss at the genomic level. This model predicts that gene frequency distributions possess a characteristic U-shape even in the absence of selective forces driving genome and population structure. We compare the model predictions to empirical gene frequency distributions from 6 multiply sequenced species of bacterial pathogens. We fit the model with constant population size to data, matching U-shaped distributions albeit without matching all quantitative features of the distribution. We find stronger model fits in the case where we consider exponentially growing populations. We also show that two alternative models which contain a "rigid" and "flexible" core component of genomes provide strong fits to gene frequency distributions.
The analysis of neutral models of genome evolution suggests that U-shaped gene frequency distributions provide less information than previously suggested regarding gene essentiality. We discuss the need for additional theory and genomic level information to disentangle the roles of evolutionary mechanisms operating within and amongst individuals in driving the dynamics of gene distributions.
Bacteria; Neutral model; Pan-genome; Population genomics; Selection
The processes responsible for the evolution of key innovations, whereby lineages acquire qualitatively new functions that expand their ecological opportunities, remain poorly understood. We examined how a virus, bacteriophage λ, evolved to infect its host, Escherichia coli, through a novel pathway. Natural selection promoted the fixation of mutations in the virus’s host-recognition protein, J, that improved fitness on the original receptor, LamB, and set the stage for other mutations that allowed infection through a new receptor, OmpF. These viral mutations arose only after the host evolved reduced expression of LamB, whereas certain other host mutations prevented the phage from evolving the new function. This study shows the complex interplay between genomic processes and ecological conditions that favor the emergence of evolutionary innovations.
Cell fate determination is usually described as the result of the stochastic dynamics of gene regulatory networks (GRNs) reaching one of multiple steady-states each of which corresponds to a specific decision. However, the fate of a cell is determined in finite time suggesting the importance of transient dynamics in cellular decision making. Here we consider cellular decision making as resulting from first passage processes of regulatory proteins and examine the effect of transient dynamics within the initial lysis-lysogeny switch of phage λ. Importantly, the fate of an infected cell depends, in part, on the number of coinfecting phages. Using a quantitative model of the phage λ GRN, we find that changes in the likelihood of lysis and lysogeny can be driven by changes in phage co-infection number regardless of whether or not there exists steady-state bistability within the GRN. Furthermore, two GRNs which yield qualitatively distinct steady state behaviors as a function of phage infection number can show similar transient responses, sufficient for alternative cell fate determination. We compare our model results to a recent experimental study of cell fate determination in single cell assays of multiply infected bacteria. Whereas the experimental study proposed a “quasi-independent” hypothesis for cell fate determination consistent with an observed data collapse, we demonstrate that observed cell fate results are compatible with an alternative form of data collapse consistent with a partial gene dosage compensation mechanism. We show that including partial gene dosage compensation at the mRNA level in our stochastic model of fate determination leads to the same data collapse observed in the single cell study. Our findings elucidate the importance of transient gene regulatory dynamics in fate determination, and present a novel alternative hypothesis to explain single-cell level heterogeneity within the phage λ lysis-lysogeny decision switch.
Multicellular organisms, single-celled organisms, and even viruses can exhibit alternative responses to various internal and environmental conditions. At the cellular level, alternative fate determination is usually described as the result of the inherent bistability of gene regulatory networks (GRNs). However, the fate of a cell is determined in finite time suggesting the importance of transient dynamics to cellular decision making. Here, we present a quantitative gene regulatory model of how bacteriophages determine the fate of an infected bacterium. We find that increasing the number of infecting phages increases the chance of quiescent (i.e., lysogeny) vs. productive (i.e., lysis) viral growth, in agreement with prior studies. However, unlike previous theoretical studies, the bias in cell fate is a result of the transient divergence of stochastic gene expression dynamics. We compare and contrast our theoretical model with recent observations of cell fate measured at the single-cell level within multiply-infected cells. Predicted heterogeneity in cell fate is shown to agree with data when including a previously unidentified gene dosage compensation mechanism, which represents an alternative hypothesis for how multiple phages interact in influencing cell fate. Together, our results suggest the importance of quantitative details of transient gene regulation in driving stochastic fate determination.
The dual concepts of pan and core genomes have been widely adopted as means to assess the distribution of gene families within microbial species and genera. The core genome is the set of genes shared by a group of organisms; the pan genome is the set of all genes seen in any of these organisms. A variety of methods have provided drastically different estimates of the sizes of pan and core genomes from sequenced representatives of the same groups of bacteria.
We use a combination of mathematical, statistical and computational methods to show that current predictions of pan and core genome sizes may have no correspondence to true values. Pan and core genome size estimates are problematic because they depend on the estimation of the occurrence of rare genes and genomes, respectively, which are difficult to estimate precisely because they are rare. Instead, we introduce and evaluate a robust metric - genomic fluidity - to categorize the gene-level similarity among groups of sequenced isolates. Genomic fluidity is a measure of the dissimilarity of genomes evaluated at the gene level.
The genomic fluidity of a population can be estimated accurately given a small number of sequenced genomes. Further, the genomic fluidity of groups of organisms can be compared robustly despite variation in algorithms used to identify genes and their homologs. As such, we recommend that genomic fluidity be used in place of pan and core genome size estimates when assessing gene diversity within genomes of a species or a group of closely related organisms.
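The three quantities discussed above can be sketched on toy data. Core and pan genomes follow directly from the set definitions given in the text. For genomic fluidity, the sketch assumes one pairwise formulation: the number of gene families found in only one genome of a pair, divided by the total number of gene families in the two genomes, averaged over all pairs. The text itself says only that fluidity measures gene-level dissimilarity, so this formula and the function names are assumptions.

```python
from itertools import combinations

def core_and_pan(genomes):
    """Core genome (genes in every genome) and pan genome (genes in any genome).

    genomes maps an isolate name to its set of gene family labels.
    """
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan

def genomic_fluidity(genomes):
    """Average pairwise ratio of unique gene families to total gene families (assumed definition)."""
    ratios = [
        len(a ^ b) / (len(a) + len(b))  # a ^ b: families present in exactly one genome
        for a, b in combinations(genomes.values(), 2)
    ]
    return sum(ratios) / len(ratios)
```

Core and pan sizes change whenever a rare gene or an unusual genome is added to the sample, whereas the pairwise average is far less sensitive to such additions, consistent with the robustness argument made above.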
The development of effective environmental shotgun sequence binning methods remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods have focused primarily on supervised learning involving extrinsic data, a first-principles statistical model combined with a self-training fitting method has not yet been developed.
We derive an unsupervised, maximum-likelihood formalism for clustering short sequences by their taxonomic origin on the basis of their k-mer distributions. The formalism is implemented using a Markov Chain Monte Carlo approach in a k-mer feature space. We introduce a space transformation that reduces the dimensionality of the feature space and a genomic fragment divergence measure that strongly correlates with the method's performance. Pairwise analysis of over 1000 completely sequenced genomes reveals that the vast majority of genomes have sufficient genomic fragment divergence to be amenable for binning using the present formalism. Using a high-performance implementation, the binner is able to classify fragments as short as 400 nt with accuracy over 90% in simulations of low-complexity communities of 2 to 10 species, given sufficient genomic fragment divergence. The method is available as an open source package called LikelyBin.
An unsupervised binning method based on statistical signatures of short environmental sequences is a viable stand-alone binning method for low complexity samples. For medium and high complexity samples, we discuss the possibility of combining the current method with other methods as part of an iterative process to enhance the resolving power of sorting reads into taxonomic and/or functional bins.
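The k-mer distributions that serve as features for binning can be computed as in the sketch below. It covers only feature extraction: the maximum-likelihood formalism and MCMC fitting are not shown, and the function name is an illustrative assumption, not part of the LikelyBin package.

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Relative frequencies of overlapping k-mers in a DNA sequence."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Discard k-mers containing ambiguous bases such as N.
    counts = {kmer: c for kmer, c in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}
```

A 400 nt fragment contains 397 overlapping tetramers spread over 256 possible types, which gives a sense of why short fragments carry a noisy compositional signal.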
Theoretical models for allometric relationships between organismal form and function are typically tested by comparing a single predicted relationship with empirical data. Several prominent models, however, predict more than one allometric relationship, and comparisons among alternative models have not taken this into account. Here we evaluate several different scaling models of plant morphology within a hierarchical Bayesian framework that simultaneously fits multiple scaling relationships to three large allometric datasets. The scaling models include: inflexible universal models derived from biophysical assumptions (e.g. elastic similarity or fractal networks), a flexible variation of a fractal network model, and a highly flexible model constrained only by basic algebraic relationships. We demonstrate that variation in intraspecific allometric scaling exponents is inconsistent with the universal models, and that more flexible approaches that allow for biological variability at the species level outperform universal models, even when accounting for relative increases in model complexity.
Allometry; elastic similarity; fractal; geometric similarity; hierarchical Bayes; leaves; scaling; stress similarity; trees
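A single allometric relationship of the form y = a * x^b is commonly estimated by least-squares regression on log-transformed data. The sketch below shows that basic building block only. It is not the hierarchical Bayesian framework described above, and the function name is an illustrative assumption.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (b, a)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor
```

Fitting each relationship separately in this way is the single-relationship approach that the abstract contrasts with fitting multiple scaling relationships simultaneously.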
Shifting the perspective of the questions we ask will ensure that network theory continues to excite the network theorists, but more importantly, that it remains vital to progress in biological research.
Trade-offs have been put forward as essential to the generation and maintenance of diversity. However, variation in trade-offs is often determined at the molecular level, outside the scope of conventional ecological inquiry. In this study, we propose that understanding the intracellular basis for trade-offs in microbial systems can aid in predicting and interpreting patterns of diversity. First, we show how laboratory experiments and mathematical models have unveiled the hidden intracellular mechanisms underlying trade-offs key to microbial diversity: (i) metabolic and regulatory trade-offs in bacteria and yeast; (ii) life-history trade-offs in bacterial viruses. Next, we examine recent studies of marine microbes that have taken steps toward reconciling the molecular and the ecological views of trade-offs, despite the challenges in doing so in natural settings. Finally, we suggest avenues for research where mathematical modelling, experiments and studies of natural microbial communities provide a unique opportunity to integrate studies of diversity across multiple scales.
Ecological genomics; experimental evolution; mathematical models; micro-organisms; metabolism; parasitism; trade-offs; viruses
The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) system is a recently discovered type of adaptive immune defense in bacteria and archaea that functions via directed incorporation of viral and plasmid DNA into host genomes. Here, we introduce a multiscale model of dynamic coevolution between hosts and viruses in an ecological context that incorporates CRISPR immunity principles. We analyze the model to test whether and how CRISPR immunity induces host and viral diversification and the maintenance of many coexisting strains. We show that hosts and viruses coevolve to form highly diverse communities. We observe the punctuated replacement of existent strains, such that populations have very low similarity compared over the long term. However, in the short term, we observe evolutionary dynamics consistent with both incomplete selective sweeps of novel strains (as single strains and coalitions) and the recurrence of previously rare strains. Coalitions of multiple dominant host strains are predicted to arise because host strains can have nearly identical immune phenotypes mediated by CRISPR defense albeit with different genotypes. We close by discussing how our explicit eco-evolutionary model of CRISPR immunity can help guide efforts to understand the drivers of diversity seen in microbial communities where CRISPR systems are active.
Evolutionary biology; host–parasite interactions; immune defense; microbial ecology; viral evolution
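The statement that host strains can share an immune phenotype while differing in genotype can be illustrated with a toy matching rule. The sketch assumes that a host is immune to a virus whenever at least one of its spacers exactly matches one of the virus's protospacers. This rule, the labels and the function name are simplifying assumptions and not the full multiscale model.

```python
def is_immune(host_spacers, virus_protospacers):
    """Toy rule: a host is immune if any spacer matches any viral protospacer."""
    return not set(host_spacers).isdisjoint(virus_protospacers)

if __name__ == "__main__":
    host_a = {"s1"}
    host_b = {"s2"}
    virus = {"s1", "s2", "s3"}
    # Different genotypes (spacer sets), same phenotype (both immune).
    print(is_immune(host_a, virus), is_immune(host_b, virus))
    # A viral mutation at protospacer s1 escapes host_a but not host_b.
    print(is_immune(host_a, {"s2", "s3"}), is_immune(host_b, {"s2", "s3"}))
```

Under this rule a single viral mutation defeats only the hosts that target the mutated protospacer, so a population whose members target different protospacers remains partly protected.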