Consider the two GPMs depicted in . The key feature differentiating the two mappings is that the first is injective (one-to-one), whereas the second is not. The non-injective mapping is associated with a greater complexity. This association is in agreement with an information theoretic analysis. The injective mapping has maximal entropy, indicating that much information is gained about the mapping by performing a single measurement. On the other hand, the non-injective mapping has lower entropy, indicating that less information is gained by making a measurement. It is this lack of information gained via measurement that gives biologists the intellectual sensation of “complexity”. The lower entropy of the non-injective map also reflects the fact that such a map has greater order. Complex genetic interactions must be present in the system to give rise to this order. On the other hand, an injective map is less ordered, and a genetic system wherein the genes are acting independently of one another will typically lead to such a map. This provides further justification for the association of non-injectivity with complexity.
With this as a guiding principle, a rigorous set-theoretic definition for the genetic complexity of a GPM is derived in Text S1
. Because a GPM involves a mapping between two sets, set theory is the appropriate framework with which to formulate a definition of the genetic complexity. The derivation begins with the assumptions of 1) an organism with sexual reproduction and a set of founder genotypes, leading to a set of recombinants, 2) a GPM in which each genotype maps to one phenotype, and 3) the association of complexity with non-injectivity of the GPM, as motivated above. The application of the definition to situations outside this set of assumptions is discussed at length in Text S1
. The association of complexity with non-injectivity leads to a definition which is proportional to the difference between a metric of ‘genotype complexity’ and a metric of ‘phenotype complexity’ in the GPM. The bulk of the derivation concerns rigorously establishing the sets of relevant genotypes and phenotypes for a given GPM. The end result is
is the genotype complexity of the map: n
is the number of phenotypically relevant loci in the system and m
is the geometric mean of the number of phenotypically unique alleles per informative locus. p
is the phenotype complexity and is equal to the number of measurably distinguishable phenotypes produced by the system across all relevant genotypes. See Text S1
for rigorous formulations of these informal definitions. In practice, the genetic complexity is computed by enumerating the phenotypically relevant genotypes, subtracting the number of phenotypes produced, and dividing by the normalization factor. We thus see that the complexity measures the surplus of genotypic diversity over phenotypic diversity. Our definition therefore captures the increasing complexity of non-injective mappings. The denominator of Equation 1 simply provides a normalization factor that insures that the complexity doesn't grow linearly with the size of the genetic system. In Text S1
, we discuss considerations and approaches for the experimental application of Equation 1. An example GPM involving a simple diploid system and its corresponding genetic complexity is shown in .
An example genotype-to-phenotype map and the calculation of its genetic complexity.
We stress that the genetic complexity is a property a GPM, not of a single genotype and its corresponding genetic network. Thus, the genetic complexity of a GPM is specific to a defined set of genotypes and the experimental protocol and precise parameters of the phenotypic measurements to be carried out on the genotypes. Multiple phenotype measurements can be accommodated readily. In order for the complexity to be a meaningful quantity, the set of genotypes in the GPM should consist of all possible combinations of the alleles under consideration. We refer to this theoretical set of genotypes as a ‘Mendelian library’.
The definition also allows for stochastic behavior. If the phenotype of a particular genotype is measured several times and it is found that two or more distinct phenotypes are produced, then each distinct phenotype will contribute to the quantity p
. See Text S1
for details. Note that as a consequence the complexity C
can be negative, indicating that more phenotypes are produced than the number of genotypes in a mapping. GPMs with negative complexity are simply GPMs that are less complex than a GPM with zero complexity. If a single genotype gives rise to multiple phenotypes in a GPM with negative complexity, then not only will the underlying genotype be determinable from a phenotype measurement, but additional information will be gained in the form of which of the multiple phenotypes was realized in a particular case. Although in its initial formulation the definition treats a GPM as a weightless bipartite graph, the definition readily accommodates extensions in which multiple phenotypes arising from a single genotype are weighted, i.e., occur with different frequencies. Phenotypic ambiguity could result also from outlier measurements and from biological (true) noise. Methods
for precluding the impact of outliers while accommodating biological noise are presented in Text S1
, Kruglyak discussed complexity in the context of genome-wide association studies and noted that genetic complexity can be thought of as fractal, with complexity manifest at more than one level. Equation 1 captures much of this idea, with both gene variation and gene combinatorics contributing positively to complexity. However, by the implications of Equation 1, gene variation and gene combinatorics are not complexity per se. Genetic complexity is a surplus of genotype diversity for a given level of phenotype diversity.
Having considered some general features of the definition of C
, we next wish to investigate its consequences when applied to genetic systems. Though C
can be estimated from real-world data (Text S1
), exact calculation of C
requires a complete tabulation of a GPM for a system in which the variables in the definition are well-defined. Therefore, as an initial investigation, we applied the definition to model genetic systems for which the relevant quantities can be computed easily. We first carried out a systematic examination of small Boolean network models. We then turned our attention to various features of the complexity of the cell-cycle network of yeast.
Genetic Complexity of Boolean Networks
Dynamic Boolean network models (“Boolean networks”, henceforth) have long been used to model biological networks 
. Boolean networks consist of a set of nodes which, at any moment of time, can either be ‘on’ or ‘off’. For example, the formalism is often applied to genetic networks, where the two states correspond to the gene being expressed or not expressed. Although this is a great simplification of the actual biological state of a genetic network and much dynamical information is sacrificed in this formalism, Boolean networks have proven to be a surprisingly fruitful class of models for capturing the large-scale properties and behaviors of genetic networks (see 
for a review and further references). Boolean networks are an ideal arena in which to apply the definition of genetic complexity, as they provide models of genetic systems wherein all relevant quantities are well-defined and discrete. In the context of Boolean network models, a genotype is specified by selecting an update function for each node, which determines how the node updates as a function of the network state. Once a genotype is specified, any initial state will flow into an attractor state, which corresponds to a phenotype. The attractor can either be a single state which updates to itself (a fixed attractor) or a series of states through which the network continuously cycles (a periodic attractor). The Boolean network framework allows also for asynchronous update rules for different nodes. In this work, we deal only with synchronous Boolean networks. Details on our employment of Boolean networks are given in the Methods
As with any mapping, a GPM is not fully specified until the domain on which the mapping acts is delineated. In the case of a GPM, this consists of a specification of the genotypes present in the system. In the abstract arena of Boolean networks, we could conceivably choose any arbitrary set of genotypes to form genotype libraries. However, to identify the effects of some property of the networks on the complexity, we should compare genotype libraries (in which each genotype takes the form of a truth table specifying a Boolean network) that differ only in the property of interest, and that otherwise include all relevant genotypes. In this section, we compare the complexities of network libraries as a function of order
(number of nodes in the networks), size
(number of edges in the networks) and topology
. In all cases, the library of truth tables is generated by selecting the subset of all possible truth tables with the desired number of edges and topology from the set of all truth tables with a given number of nodes. The determination of the edge structure of a network from its truth table is described in the Methods
Complexity by order and size
Initial questions to ask are how the complexity depends on the order and on the size of networks. The natural expectation is that if all other properties of networks are held fixed while the number of nodes and edges is increased the genetic complexity will increase. We obtained results in accord with this expectation.
To test the dependence of complexity on network order, for each order we computed the complexity of the library of networks that comprises all possible genotypes with the appropriate number of nodes. Calculating the complexity of such libraries is computationally tractable only for orders of three or less (for example, there are 1018
possible genotypes with order four). We therefore calculated the complexity for all libraries with order less than four. The results are presented in Table S1
. The complexity increases monotonically with the order of the library.
Another question to ask is how the complexity of a map depends on the size of the networks considered. To address this, for a given order and size we computed the complexity of the library of networks that comprises all possible genotypes with the appropriate number of nodes and edges. Our method of counting edges in a Boolean network is described in the Methods
section. The complexities by size for all order-three libraries are given in Table S2
. We found that the complexity increases monotonically as the size increases.
Complexity by topology
The topology of a genetic network encodes much information about the function of the network. Conserved topological motifs impart similar properties in different biological systems 
. The importance of network topology to function suggests that there may be a relationship between topology and genetic complexity. The final question that we addressed using small Boolean networks is how complexity depends on the topology of the genotype libraries. To this end, we computed the complexity for libraries that comprise all genotypes within a topology class. A given topology class consists of all genotypes with the same order and the same set of directed edges between nodes. The precise formulation is given in the Methods
We found that, for a fixed order and size, the relative complexity of topology classes is determined by two pieces of information:
- The variance of the distribution of in-degrees in the topology class
- The number of periodic attractors produced by the topology class
The in-degree distribution is the set of numbers of incoming edges incident to each node (see the Methods
section for further details). Topology classes that have a low-variance distribution of in-degrees (recall that we are comparing classes with equal order and size, and thus the same number of inputs to distribute among the nodes) always have a lower complexity than classes with higher-variance in-degree distributions. This result was found to hold for every combination of network order and size considered, which range from 2 nodes and 1 edge up to 7 nodes and 4 edges. In total, the complexity was computed for 2585 topology classes containing 109
networks. No exception to the above rule was found. This strongly suggests that the rule will continue to hold for yet larger networks. A sample of data supporting this result is given in .
The relationship between in-degree distribution and complexity for all topology classes with three nodes and three edges.
The difference in complexity associated with different in-degree distributions is most pronounced when one topology class has a node with zero inputs and the second does not. This difference can be understood heuristically as follows. If a topology class has a node with zero in-degree, then that node's update rule does not depend on the current state of any node. The node is immediately locked into whatever state it is in at time t
0. Having one node locked into an invariant state reduces the total number of phenotypes that can be realized by the topology class, resulting in a genotypic surplus. This increases the complexity. Similar reasoning applies to other cases of differing in-degree distributions.
This effect is also active across libraries of different sizes, and accounts for the unexpectedly small difference in Table S2
between the library of size 6 and that of size 7. For an order-three network, if there are 7 or more edges it is not possible to distribute the edges such that any node has zero inputs. Therefore no nodes will be locked into an initial state. Thus, the increase in complexity when adding the seventh edge is smaller than the general trend leads one to expect.
If two classes have the same distribution of in-degrees but a different topology, then the relative complexity is determined by which class realizes more periodic attractors. The class with more periodic attractors has a lower complexity. An example of this is given in . The relationship between complexity and number of periodic attractors is proven formally in the Methods
section. It rests on two facts:
The relationship between complexity and the number of loops for all topology classes with in-degrees (1, 1, 1).
- Any two Boolean-network topology classes with the same in-degree distribution have the same number of genotypes.
- Every topology class will realize all fixed point attractors.
Because of fact 1 above, the relative genetic complexity of two topology classes with the same in-degree distribution is determined solely by the number of attractors each class realizes. Because of fact 2, the difference in the number of attractors between two such classes is completely determined by the difference in the number of periodic attractors. The topology class with more periodic attractors will have more phenotypes overall, and thus will have a lower complexity.
The number of periodic attractors realized provides the second level of structure for the relative genetic complexity of topology classes. However, calculating the number of periodic attractors produced is typically just as difficult as calculating the complexity in the first place. One would like to have a way to predict the relative complexity of two libraries based solely on topological features that can be immediately assessed when presented with two libraries. We found that the number of loops in a topology class provides a qualitative predictor for the number of periodic attractors produced, and thus of the relative complexity of topology classes. An example of the relationship between number of loops and complexity is shown in and .
All topology classes with in-degrees (1,1,1), listed in order of decreasing complexity.
It was proposed by R. Thomas that negative feedback loops are necessary for the dynamical realization of periodic attractors 
. Furthermore, it was recently shown that certain types of loop structures are sufficient to cause dynamical behaviors such as periodic attractors 
. The relationship between number of loops and periodic attractors identified in this work is in line with these results. The relationship can also be understood heuristically, as a topology class with many loops has many feedback mechanisms. Feedback mechanisms are necessary to create the oscillatory behavior that gives rise to periodic attractors. Though the relationship between number of loops and number of periodic attractors produced does not hold strictly in every case, it does apply in most cases.
Thus, given two topology classes with the same order and the same size, we can make an informed prediction about their relative complexity. If the in-degree distributions of the two topology classes differ, then we know with certainty which will have lower complexity. If the in-degree distributions are the same, the topology class with more loops is likely to have lower complexity. One class of biological networks that exhibit a paucity of loops are gene-regulation networks employing master regulators 
. These networks tend to utilize a hierarchical structure, whereby information flows strictly from one level (the master regulators) to the next (the regulated target genes). A target of one master regulator can also be a master regulator in its own right, leading to a hierarchical structure with multiple levels. Such networks will feature few loops, and therefore have greater complexity. These hierarchical network structures frequently drive developmental processes that must be robust to environmental fluctuations and that are strongly unidirectional 
. Our results on small Boolean networks therefore suggest that genetically robust networks in general have greater genetic complexity.
Relationship to controllability
In a recent paper 
, Liu et al.
presented an analysis of the controllability of genetic networks with different topological structures. Though their analysis took place in the framework of continuous linear systems, and can therefore not be rigorously applied to the discrete nonlinear Boolean networks that we have been considering here, their qualitative results suggest a relationship between genetic complexity and controllability. In this context, controllability measures how many nodes of a network must be controlled exogenously in order to be able to realize all possible states in the state space of the network from any possible initial state. A network where fewer nodes must be directly controlled is more controllable.
The authors of 
found that the minimal number of nodes that need to be controlled in order to fully control the network is equal to the number of ‘unmatched’ nodes. Unmatched nodes are nodes that either have no inputs from other nodes in the network (in-degree of zero), or have inputs only from nodes that have multiple outputs, and thus cannot be individually controlled by any other node in the network. It was shown that the degree distribution plays a primary role in determining the controllability of a network, and that the difficulty of control increases monotonically with degree heterogeneity. These results are highly reminiscent of the relationship between complexity and in-degree distribution that we identified above. We also found that the in-degree distribution plays the primary role in determining genetic complexity, and that an increase in heterogeneity (an increase in in-degree variance) leads to greater complexity. These observations strongly suggest that more genetically complex topologies are also more refractory to exogenous control.
Complexity and the Cell Cycle Network
Having studied the complexity of libraries of generic Boolean networks, we examined the complexity of a specific biological system. The cell cycle network (CCN) of S. cerevisiae
has been studied extensively as a model network and is well understood. Furthermore, there exists for the CCN a Boolean network formulation 
, allowing the straightforward computation of its genetic complexity. The CCN Boolean network formulation of Li et al.
uses a threshold network framework. The network is shown in and the details of the threshold network formalism are discussed in the Methods
The threshold network formulation of the cell-cycle network of S. cerevisiae.
Genetic complexity characterizes a library of networks, not a single network, so in order to analyze the complexity of the CCN we must first construct a library which the yeast CCN determines. In the following analysis, for any given threshold network, its corresponding library consists of all threshold networks that have the same topology as the base network and where each node can exist in one of three states: 1) the wild type allele, where the update rule for the node is given by its inputs as defined by Li et al.
; 2) the null allele, where the node always updates to ‘off’; and 3) a constitutively active allele, where the node always updates to ‘on’. This choice of library is natural, given its biological and experimental relevance, and its symmetry between null and constitutively active states. In this section, when we mention the complexity of a network, we are referring to the complexity of the library generated from the network in the manner described above. For the yeast CCN, which has 11 nodes, this leads to a library of 3∧11
We found that for the library generated by the yeast CCN, when all 177147 genotypes were allowed to update from all 2048 initial states, no periodic attractors were produced. Because all fixed attractors were guaranteed to be produced by our construction of the CCN library, the lack of periodic attractors indicates that the complexity of the yeast CCN is maximal. It might seem surprising that a model of a cyclic process gives rise to no periodic attractors. However, this is consistent within the framework of the model, where the G1 phase of the cell cycle is a steady state of the system and initiation of the cell cycle corresponds to an external perturbation. This reflects the biological reality that, at any given time, most cells are not actively progressing through the cell cycle. We then systematically perturbed a number of features of the yeast CCN in order to identify which aspects of the network were most crucial to maintaining its maximal complexity.
First, we calculated the complexity of the 29 networks created by removing a single edge from the network. We found two edges which when removed resulted in a network with lower complexity than the CCN. One of these edges is a positive edge from Clb1,2 to Cdc20, and the other is a negative edge from Clb5,6 to Sic1. Both edges relay information from B-type cyclins, which govern the transition from the G2 phase to the M phase, suggesting that this process contributes significantly to the complexity of the CCN. It is known that the G2/M transition is a key checkpoint of the cell cycle, and that the B-type cyclins play a crucial role in this process 
We next systematically perturbed the inputs to each node. For each node, we considered all possible reassignments of its inputs, holding all other features of the network fixed. We then calculated how often the complexity is decreased by perturbing the inputs to each node. We found that there is a clear separation between nodes whose inputs are more or less important to maintaining maximal complexity, as shown in . The core set of nodes that have the greatest impact on the complexity of the network are highlighted in . We found that the nodes whose inputs are more important for complexity are precisely those nodes with out-degree greater than two. This can be understood intuitively, as those nodes with greater out-degree play a larger role in determining the states of other nodes in the network. Perturbing their inputs will therefore have a larger effect on the dynamics of the network as a whole.
The average effect of perturbations of the incoming interactions for each node in the CCN of S. cerevisiae.
We saw evidence in the previous section that networks with more loops tend to give rise to more periodic attractors. Investigations of the CCN also support this conclusion. This can be seen in two ways. In one analysis, we have added all possible single edges to the network and calculated the complexity. When adding an edge, we can also compute how many loops are created, and of what size. We found that those edge additions which lead to the creation of several small loops or very many large loops are more likely to produce periodic attractors (and thereby decrease the complexity) than a random edge addition.
We can also make a connection between the large-scale loop structure and information flow in the CCN and the number of periodic attractors produced. The dominant modes of information flow in the CCN are as shown in Figure S1
. The flow of information reflects that fact that the majority of interactions present in the CCN relay information in the pattern indicated. 86% of the edges in the network account for information flowing as indicated. Initial input flows from the top of the network, down either side, and up into the center. There are few edges that connect the bottom of the network to the top, thereby completing the loops in the large-scale information flow. We have found that the addition of edges from the bottom nodes to the top nodes is more likely to decrease complexity than a random edge addition, once again confirming that topologies with more loop structure give rise to lower GPM complexity.