In order to quantify how much information can be transmitted, we must first define it. In Shannon’s theory, information is conceptually considered to be knowledge that enables the state of a system (or signal or input) to be distinguished from among many available potential states. Examples include whether a coin flip turns up heads or tails, or whether a roll of a die is 1, 2, 3, 4, 5, or 6, or (as Paul Revere famously employed) whether one light is lit to signal a land invasion versus two lights to signal invasion by sea. The more states that are available for selection, the more information can be obtained when the selection is made. Importantly, in this model of information, the meaning or identity of a state is irrelevant. It only matters that each state can be encoded as a unique symbol, such as heads/tails, the numbers 1 through 6, or one or two lights.

As a consequence of the above definition, any system capable of taking on multiple states can be a source of information, and that state can be mathematically represented as a random variable that can take on multiple values. If we are initially uncertain about the value of this random variable, but then later ascertain its value, we will have resolved the system’s state and thus gained information. The amount of information that can be gained, or equivalently the amount of uncertainty associated with the random variable that can be reduced, can be quantified by the Shannon entropy which is described further below.

The analysis of how well information can be transmitted relies on the concept of a communication channel, which is a system that links an input source of information to some output. Any channel can be mathematically represented by a random variable for the input and another random variable for the output, where the values of the two variables depend on each other. Consequently, measuring the output value can help resolve the input value, and the amount of information thus gained can be quantified using the mutual information, described below.

Entropy

A central concept in information theory is the Shannon entropy, a separate concept from the thermodynamic entropy. The Shannon entropy (hereafter referred to simply as entropy) quantifies how unpredictable the value of a random variable is, and hence can be thought of as a *measure of uncertainty*. For a discrete random variable *X* which can take on the values *x*_{1}, *x*_{2}, …, *x*_{n}, with the respective probabilities *p*(*x*_{1}), *p*(*x*_{2}), …, *p*(*x*_{n}), the entropy is defined as [9]

$$H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i) \tag{1}$$

We have chosen, by the usual convention, to define entropy using a base 2 logarithm so that the entropy is measured in bits. Note also that since each of the *p*(*x*_{i}) is between zero and one inclusive, each term in the sum is non-positive (taking, by convention, the value of 0 log 0 to be identically zero), and thus entropy is necessarily non-negative.
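The definition above translates almost directly into code. The following is a minimal sketch, not part of the original text; the function name `shannon_entropy` and the two example distributions are illustrative assumptions.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    probabilities = list(probabilities)
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Skipping p == 0 implements the convention 0 log 0 = 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical example distributions (not taken from the text).
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # four equally likely values
print(shannon_entropy([0.7, 0.2, 0.1]))           # a skewed three-valued variable
```

Because every term that is summed is non-negative, the returned value can never fall below zero, in line with the remark above.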

A few case examples can help to demonstrate that this formula indeed provides an intuitive measure of uncertainty. In the first example, let *X* be the outcome of a flip of an evenly weighted coin. This random variable has two outcomes each with probability of ½, i.e. *p*(*x*_{1}) = *p*(*x*_{2}) = ½. Thus, the entropy of *X* is

$$H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$

The 1 bit entropy is consistent with the uncertainty associated with the two equally probable outcomes for the flip of a fair coin.

Now, in the second example, consider instead rolling a fair 6-sided die. In comparison to the fair coin, there are more outcomes (also equally probable), hence we would expect the entropy to be greater. Indeed, for this example, *p*(*x*_{1}) = *p*(*x*_{2}) = … = *p*(*x*_{6}) = 1/6, so that

$$H(X) = -\sum_{i=1}^{6} \tfrac{1}{6}\log_2\tfrac{1}{6} = \log_2 6 \approx 2.58 \text{ bits}$$

As anticipated, the value of 2.58 bits confirms our expectation that the greater number of possible outcomes confers higher entropy to the roll of a fair die than to the flip of a fair coin.

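In our third example, consider the flip of an unfair coin. If the coin is now weighted so that it is three times as likely to land heads as tails, i.e. *p*(heads) = ¾ and *p*(tails) = ¼, the entropy of a single coin flip decreases from 1 bit to 0.81 bits:

$$H(X) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.81 \text{ bits}$$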

The reduction in entropy reflects a greater degree of certainty as to the outcome of the flip of the unfair coin compared to a fair coin. More generally, we can compute entropy as a function of the probability of landing heads, which yields a concave down graph. The plot shows that the entropy is maximized when the probability is exactly 50%, and provides a clear example of a more general property of entropy: it is maximized when all outcomes are equally probable. Stated differently, when all outcomes are equally likely, then uncertainty is at its greatest.
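The dependence of the coin's entropy on the probability of heads can be tabulated numerically. The sketch below is illustrative only; the function name `binary_entropy` and the 0.01 grid spacing are assumptions made for the example.

```python
import math

def binary_entropy(p_heads):
    """Entropy in bits of a coin that lands heads with probability p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # 0 log 0 is taken to be 0
            total -= p * math.log2(p)
    return total

# Sweep the probability of heads from 0 to 1 in steps of 0.01.
grid = [i / 100 for i in range(101)]
curve = [(p, binary_entropy(p)) for p in grid]
p_at_max, h_max = max(curve, key=lambda pair: pair[1])

print(f"fair coin: {binary_entropy(0.5):.2f} bits")
print(f"3:1 coin:  {binary_entropy(0.75):.2f} bits")
print(f"maximum at p = {p_at_max:.2f} ({h_max:.2f} bits)")
```

Plotting `curve` with any plotting tool should give the concave shape described above, with the peak at a probability of one half.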

Frequently, new users of information theory are confused as to the interpretation of partial bits. In computers, information storage is measured in a whole number of bits, but in information theory entropy (and related quantities) can take on non-integer values. Thus, 0.81 bits of entropy can be interpreted as being equivalent to a system that can take on between 1 and 2 equally probable states (specifically, 2^{0.81} ≈ 1.75 states).

Continuing the example of an unfair coin, in the extreme, if the coin can only land on heads then the entropy is −1log_{2} 1 − 0log_{2} 0 = 0. The zero entropy reflects the fact that in this fourth example, the random variable can only take on one predetermined value and thus there is no uncertainty in its value. This is an example of entropy taking on its lowest possible value (recall from above that the entropy is non-negative).

Extending the concept of Shannon entropy towards biology, we again imagine an experiment in which a population of cells is exposed to various stimuli and the resulting individual cell responses are recorded. We may observe that, depending on the stimulus, the population of cells may exhibit relatively wide or narrow response distributions. The response entropy can then be used as a metric of dispersion, because, as the variance of the response of a population of cells increases, there is higher uncertainty as to what response value an individual cell will take on and hence, greater associated entropy.
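To illustrate entropy as a measure of dispersion, the sketch below compares a narrow and a wide response distribution. It rests on assumptions that are not in the text: responses are modeled as Gaussian, measured in arbitrary units, and grouped into bins of width 5. The resulting entropy values depend on the bin width, so only the comparison between the two is meaningful.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def binned_response_entropy(mean, sd, bin_edges):
    """Entropy in bits of a Gaussian response after grouping it into bins."""
    cdf = [normal_cdf(edge, mean, sd) for edge in bin_edges]
    # Probability mass below the first edge, between edges, and above the last edge.
    masses = [cdf[0]] + [b - a for a, b in zip(cdf, cdf[1:])] + [1.0 - cdf[-1]]
    return sum(-m * math.log2(m) for m in masses if m > 0)

# Hypothetical response scale: 0 to 100 arbitrary units, bins of width 5.
edges = [float(e) for e in range(0, 101, 5)]
narrow = binned_response_entropy(mean=50.0, sd=5.0, bin_edges=edges)
wide = binned_response_entropy(mean=50.0, sd=15.0, bin_edges=edges)

print(f"narrow distribution: {narrow:.2f} bits")
print(f"wide distribution:   {wide:.2f} bits")
```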

Together, these examples demonstrate that the entropy is an intuitive measure of the uncertainty associated with a random variable, and is a simple function of the number of possible states and the probabilities associated with those states. The specific form given in Eq. 1 can be shown to be required for any measure of uncertainty which satisfies certain sensible axioms, particularly continuity, symmetry (the entropy does not depend on the assignment of the *x*_{i}’s), maximality (entropy is maximized for a uniform distribution), and additivity (if the possible outcomes are partitioned into subsets, the overall entropy is the probability-weighted sum of the entropy of subsets) [9]. Armed with this understanding of entropy, we now turn to a special use of the entropy: mutual information.

Mutual Information

A communication channel, such as a phone cable, fiber-optic line, or biochemical pathway, allows information to be transmitted from one place to another. Regardless of its underlying physical basis or complexity, any channel can be reduced to a “black box” that maps an input to an output. Because the input is not known *a priori*, it can be considered to be a random variable; the output, whose value depends on the input, is therefore also a random variable. Useful communication occurs if knowing the output value allows the input value sent through the channel to be fully or partially determined.

Mutual information quantifies this concept in terms of the amount of information that the value of one random variable contains about the value of another random variable. Using *S* to represent the input (signal or sender, depending on context) and *R* to represent the output (response or receiver), we can define their mutual information *I*(*R;S*) as:

$$I(R;S) = H(S) - H(S|R) \tag{2}$$

where *H* designates entropy. As discussed above, entropy is a measure of uncertainty, thus *H*(*S*) can be interpreted to be the overall uncertainty one has about the input *S*, and *H*(*S*|*R*) is the residual uncertainty about *S* after the value of the response *R* is known. Hence, the above definition can be interpreted to mean that mutual information is the reduction in the uncertainty about *S* given the value of *R*. Equivalently, the mutual information measures how accurately the input value can be determined based upon the output value.

To illustrate how mutual information measures communication accuracy, consider the simple example in which Samantha relays the result of a fair coin flip (*S*) over her phone to Roy who then records the result (*R*). For a fair coin flip, *H*(*S*) *=* 1 bit since both sides are equally probable as discussed above. In the first scenario, assume that the phone has no static and there is never any miscommunication between Samantha and Roy. In this case, *R* tells us exactly the value of *S*. When *R* is heads, *S* is always heads; and when *R* is tails, *S* is always tails. Hence in this example there is no residual uncertainty about *S* once *R* is known and *H*(*S*|*R*) *=* 0 bits. Together, the information that *R* provides about *S* is *I*(*R;S*) *=* 1 − 0 = 1 bit, as one would expect.

In the second scenario, imagine that the phone has static and there is sometimes miscommunication, e.g. suppose that 25% of the time Roy records the incorrect result. Then, knowing the value of *R* still leaves some uncertainty as to the value of *S*. Specifically, the conditional entropy is

$$H(S|R) = p(R{=}\text{heads})\,H(S|R{=}\text{heads}) + p(R{=}\text{tails})\,H(S|R{=}\text{tails}) = \tfrac{1}{2}(0.81) + \tfrac{1}{2}(0.81) = 0.81 \text{ bits}$$

where both *H*(*S*|*R*=heads) and *H*(*S*|*R*=tails) correspond to the unfair coin example whose entropy was computed to be 0.81 bits in the previous section. Together, this means that *R* provides only *I*(*R;S*) = 1 − 0.81 = 0.19 bits of information about *S*. As expected, the mutual information is lower compared to the case of perfect communication.

In the third scenario, consider the extreme case in which Roy cannot tell at all what Samantha says on the phone and must guess randomly as to the result of the coin flip. It is easy to determine that the residual uncertainty about *S* is quite high, equal in fact to *H*(*S*):

$$H(S|R) = p(R{=}\text{heads})\,H(S|R{=}\text{heads}) + p(R{=}\text{tails})\,H(S|R{=}\text{tails}) = 1 \text{ bit}$$

and hence *I*(*R;S*) = 1 − 1 = 0 bits. The mutual information in this case implies zero transmission of information, matching the expected result for this example. Together, these three examples help illustrate the concept that mutual information measures communication accuracy in terms of a reduction in uncertainty.
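The three phone scenarios can be reproduced with a few lines of code. This is a minimal sketch with hypothetical names (`coin_over_phone`, `error_rate`); it assumes that Roy's errors are symmetric (heads and tails are misrecorded equally often) and it models pure guessing as a 50% error rate.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits, with 0 log 0 taken as 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def coin_over_phone(error_rate):
    """Return (H(S), H(S|R), I(R;S)) in bits for a fair coin relayed
    with a symmetric probability of misrecording (assumed noise model)."""
    p_s = {"heads": 0.5, "tails": 0.5}
    flipped = {"heads": "tails", "tails": "heads"}
    # Joint distribution p(s, r).
    joint = {}
    for s, ps in p_s.items():
        joint[(s, s)] = ps * (1.0 - error_rate)
        joint[(s, flipped[s])] = ps * error_rate
    # Marginal distribution of the recorded result, p(r).
    p_r = {r: sum(joint[(s, r)] for s in p_s) for r in p_s}
    # H(S|R) is the p(r)-weighted average of the entropies H(S | R = r).
    h_s_given_r = 0.0
    for r, pr in p_r.items():
        if pr > 0:
            h_s_given_r += pr * entropy([joint[(s, r)] / pr for s in p_s])
    h_s = entropy(p_s.values())
    return h_s, h_s_given_r, h_s - h_s_given_r

for label, err in [("no static", 0.0), ("25% errors", 0.25), ("pure guessing", 0.5)]:
    h_s, h_cond, info = coin_over_phone(err)
    print(f"{label}: H(S)={h_s:.2f}, H(S|R)={h_cond:.2f}, I(R;S)={info:.2f} bits")
```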

We can extend the idea of mutual information to biochemical signaling processes by continuing with our previous example of entropy in cellular response distributions. The population of cells mentioned earlier is now subject to different stimuli and seeking to identify them. In particular, consider one strong and one weak stimulus, each generating a corresponding response distribution. If we were to randomly select a cellular response from either distribution and then attempt to resolve which stimulus was used, it is evident that the more the response distributions are separated, the greater the accuracy in predicting the original stimulus. Conceptually, mutual information measures this accuracy. Thus, a large overlap between the weak and strong stimulus response distributions confounds our ability to discern the original stimulus, leading to a corresponding drop in the mutual information between the stimulus and cellular response.
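The effect of overlap can be made concrete with a small numerical sketch. The assumptions here are illustrative and not from the text: the weak and strong stimuli are equally likely, each produces a Gaussian response with the same spread, and responses are grouped into bins of width 2 in arbitrary units.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def binned_gaussian(mean, sd, edges):
    """Probability mass of a Gaussian in each bin, including both tails."""
    cdf = [normal_cdf(e, mean, sd) for e in edges]
    return [cdf[0]] + [b - a for a, b in zip(cdf, cdf[1:])] + [1.0 - cdf[-1]]

def stimulus_response_information(mean_weak, mean_strong, sd, edges):
    """I(R;S) in bits for two equally likely stimuli with binned Gaussian responses."""
    p_s = [0.5, 0.5]
    p_r_given_s = [binned_gaussian(mean_weak, sd, edges),
                   binned_gaussian(mean_strong, sd, edges)]
    n_bins = len(edges) + 1
    p_r = [sum(ps * row[k] for ps, row in zip(p_s, p_r_given_s)) for k in range(n_bins)]
    info = 0.0
    for ps, row in zip(p_s, p_r_given_s):
        for k, q in enumerate(row):
            if q > 0:
                # p(s, r) log2[ p(r|s) / p(r) ]
                info += ps * q * math.log2(q / p_r[k])
    return info

edges = [float(e) for e in range(0, 101, 2)]
overlapping = stimulus_response_information(45.0, 55.0, sd=10.0, edges=edges)
separated = stimulus_response_information(20.0, 80.0, sd=10.0, edges=edges)

print(f"overlapping responses: {overlapping:.2f} bits")
print(f"separated responses:   {separated:.2f} bits")
```

With only two equally likely stimuli the result cannot exceed 1 bit, and it approaches that value as the two response distributions stop overlapping.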

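Mutual information has additional mathematical properties which are relevant to understanding its application to biological systems. First, by substituting in the definition of entropy, and expanding and rearranging the probabilities, we can arrive at an alternative definition of mutual information:

$$\begin{aligned} I(R;S) &= H(S) - H(S|R) \\ &= -\sum_{s} p(s)\log_2 p(s) + \sum_{s,r} p(s,r)\log_2 p(s|r) \\ &= \sum_{s,r} p(s,r)\log_2 \frac{p(s,r)}{p(s)\,p(r)} \end{aligned} \tag{3}$$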

The alternative definition, given by the last equation above, is symmetric with respect to *R* and *S*. This symmetry implies that *R* gives as much information about *S* as *S* gives about *R*. Since this implies that *I*(*R;S*) and *I*(*S;R*) are identical, we have used the conventional semicolon notation to indicate that the order of the arguments within the parentheses is irrelevant. Mathematically, another important consequence of the symmetry is that

$$I(R;S) = H(S) - H(S|R) = H(R) - H(R|S)$$

This relation is critical since we usually wish to quantify the reduction in uncertainty about the signal provided by the response (represented by the former equality), but it is usually far easier to experimentally measure the distribution of responses (represented by the latter equality).
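The equivalence of the different expressions for mutual information can be checked numerically. The sketch below computes *H*(*S*) − *H*(*S*|*R*), *H*(*R*) − *H*(*R*|*S*), and the sum over the joint distribution of *p*(*s*, *r*) log_{2}[*p*(*s*, *r*)/(*p*(*s*)*p*(*r*))]. The joint distribution used (two stimuli, three response levels) is a made-up example, and the function names are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits, with 0 log 0 taken as 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def information_three_ways(joint):
    """joint[i][j] = p(s_i, r_j). Returns I(R;S) in bits computed as
    H(S) - H(S|R), as H(R) - H(R|S), and directly from the joint distribution."""
    n_s, n_r = len(joint), len(joint[0])
    p_s = [sum(row) for row in joint]
    p_r = [sum(col) for col in zip(*joint)]
    h_s_given_r = sum(
        p_r[j] * entropy([joint[i][j] / p_r[j] for i in range(n_s)])
        for j in range(n_r) if p_r[j] > 0
    )
    h_r_given_s = sum(
        p_s[i] * entropy([joint[i][j] / p_s[i] for j in range(n_r)])
        for i in range(n_s) if p_s[i] > 0
    )
    from_joint = sum(
        joint[i][j] * math.log2(joint[i][j] / (p_s[i] * p_r[j]))
        for i in range(n_s) for j in range(n_r) if joint[i][j] > 0
    )
    return entropy(p_s) - h_s_given_r, entropy(p_r) - h_r_given_s, from_joint

# Hypothetical joint distribution: rows are stimuli, columns are response levels.
example_joint = [
    [0.30, 0.15, 0.05],
    [0.05, 0.15, 0.30],
]
via_signal, via_response, via_joint = information_three_ways(example_joint)
print(via_signal, via_response, via_joint)
```

The three printed values should agree up to floating-point rounding.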

Next, we note that the lower bound of mutual information is zero. This lower bound is achieved if the quantities *S* and *R* are statistically independent of each other, for then *p*(*s*, *r*) = *p*(*s*)*p*(*r*) and the logarithm in Eq. 3 is always zero. Such independence might be achieved in the case of a communication channel affected by a large amount of noise (see the third example above), and for such channels it is sensible that if there is no statistical dependence between the input and output, the value of one cannot provide any information (reduction in uncertainty) about the other. The converse is also true: if their mutual information is zero, then *S* and *R* are statistically independent. A practical implication of this property is that, outside its use in quantifying information transmission fidelity, mutual information can be used as a general tool for detecting whether there is any statistical dependency between two variables of interest.

At the other extreme, the upper bound of mutual information is the smaller of *H*(*S*) and *H*(*R*). The proof stems from the fact that entropy cannot be negative, so *H*(*S*|*R*) ≥ 0, thereby implying that *I*(*R;S*) ≤ *H*(*S*), and symmetrically, *I*(*R;S*) ≤ *H*(*R*). This upper bound can be reached, for instance, if *S*|*R* can only take on a single outcome, implying unambiguous identification of the signal that generated the specific response. In such a case *H*(*S*|*R*) will equal zero and *I*(*R;S*) will equal *H*(*S*) (see the first example provided above). Conceptually, the upper bound is only reached when there is no “noise” in the communication channel between *S* and *R*, such that the response leaves no residual uncertainty about the signal. More importantly, the upper bound also implies that the range of values that the input and output can take can limit the effectiveness of the communication channel. For instance, if we have a signal *S* that can take on 1,000,000 values (an entropy as high as log_{2} 1,000,000 ≈ 20 bits) but an output *R* that can only take on one of two values (entropy at most 1 bit), then the mutual information between *S* and *R* is necessarily 1 bit or less. As a result, a communication channel relying on a rich signal but poor output, or vice versa, can be limited in its ability to transmit information.
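A scaled-down version of the rich-signal, poor-output case illustrates the bound. The sketch assumes a hypothetical noiseless channel in which eight equally likely signals are reported only as "even" or "odd"; the numbers are chosen for convenience and are not from the text.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits, with 0 log 0 taken as 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def mutual_information(joint):
    """I(R;S) in bits from a joint distribution joint[i][j] = p(s_i, r_j)."""
    p_s = [sum(row) for row in joint]
    p_r = [sum(col) for col in zip(*joint)]
    return sum(
        p_sr * math.log2(p_sr / (p_s[i] * p_r[j]))
        for i, row in enumerate(joint)
        for j, p_sr in enumerate(row)
        if p_sr > 0
    )

# Eight equally likely signals; the output only reports the parity of the signal index.
n_signals = 8
joint = [[0.0, 0.0] for _ in range(n_signals)]
for s in range(n_signals):
    joint[s][s % 2] = 1.0 / n_signals

h_s = entropy([sum(row) for row in joint])        # entropy of the signal
h_r = entropy([sum(col) for col in zip(*joint)])  # entropy of the output
info = mutual_information(joint)

print(f"H(S) = {h_s:.2f} bits, H(R) = {h_r:.2f} bits, I(R;S) = {info:.2f} bits")
```

Even though the channel adds no noise, the information transmitted is capped by the entropy of the two-valued output.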

Notably, in most real world examples, there is a statistical dependence between *S* and *R* but the relationship is not “noiseless”. In such cases, the mutual information is positive but not as large as the entropy of either *S* or *R*. The exact amount of mutual information will depend on the structure of the noise, that is, the particular way in which *R* is a noisy representation of *S*. In the biological context, the noise may include both intrinsic and extrinsic noise, as all sources of noise can potentially confound accurate signaling. The effect of noise is fully encapsulated in the joint probability distribution between *S* and *R*, which, as shown in Eq. 3, also fully determines the mutual information. Thus, to compute the mutual information for a real world communication channel we must be able to measure the complete joint distribution between its input and output. Indeed, we note that one of the fundamental abstractions in information theory is that any channel can be represented by such a joint distribution, thus enabling an information theoretic analysis to be performed when the input-output properties of the channel are known but the underlying mechanisms generating those properties are unknown. Thus, by carefully choosing the input and output of interest, one can apply information theory to a broad variety of cell signaling systems, although one must be mindful of whether the properties of the channel would change depending on the context (e.g., whether the channel properties might be different in different cell types, or whether it might be retroactively affected by the presence of downstream processes, etc.).

Channel Capacity

The mutual information is the amount of information transmitted through a channel for a particular input probability distribution, and is not purely an intrinsic property of the channel. To see why, note that the mutual information *I*(*R;S*) is fully specified by the joint distribution *p*(*r*, *s*), which can be decomposed as the product of two probabilities: *p*(*r*, *s*) = *p*(*r*|*s*)*p*(*s*). The conditional distribution *p*(*r*|*s*) reflects uncertainty resulting from noise in the communication channel and is a property of the channel itself. In comparison, the marginal distribution *p*(*s*) reflects the range of the signals imposed on the channel which might be different for different uses of the channel. In other words, *p*(*s*) is a property of the source of the signal rather than a property of the channel itself.

For many real-world communication channels, it can be of interest to know the maximum amount of information that can possibly be transmitted through the channel. For a given channel (i.e. for a fixed *p*(*r*|*s*)), this quantity is known as the channel capacity *C*(*R;S*) and is mathematically defined as

$$C(R;S) = \max_{p(s)} I(R;S)$$

In other words, the channel capacity is the mutual information maximized over all possible distributions of the signal (i.e., all possible signal sources or uses). Thus, capacity is effectively the data bandwidth and is an intrinsic property of the channel itself. The relevance of the capacity is further bolstered by the Noisy Channel Coding Theorem, a fundamental result in information theory. This theorem states that, despite some degree of noise in the system, a message can be sent across the channel and properly discriminated from other potential messages with an arbitrarily small amount of error, provided that the entropy of the potential messages (i.e., the information source) is below the channel capacity. For an information source with an entropy greater than the capacity, there exists no way to transmit it so that messages can be discriminated from each other in an errorless manner. The theorem thereby ensures that capacity places a hard upper bound on how accurately data can be transmitted through a channel. Turning back to our original example, if Samantha would like to communicate to Roy the result of a die roll using a channel with a binary output, the Noisy Channel Coding Theorem confirms our belief that there exists no way to do so in an errorless manner, as the entropy of the die is ~2.58 bits, which is greater than the capacity of the binary channel (at most 1 bit).
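For a discrete channel with a known *p*(*r*|*s*), the maximization over *p*(*s*) can be carried out numerically. The sketch below uses the Blahut–Arimoto iteration, a standard algorithm that is not discussed in the text; the function names, the iteration count, and the two example channels are assumptions chosen for illustration.

```python
import math

def mutual_information(p_s, p_r_given_s):
    """I(R;S) in bits for input distribution p_s and channel p_r_given_s[s][r]."""
    n_r = len(p_r_given_s[0])
    p_r = [sum(p_s[s] * p_r_given_s[s][r] for s in range(len(p_s))) for r in range(n_r)]
    total = 0.0
    for s, ps in enumerate(p_s):
        for r, q in enumerate(p_r_given_s[s]):
            if ps > 0 and q > 0:
                total += ps * q * math.log2(q / p_r[r])
    return total

def blahut_arimoto(p_r_given_s, iterations=500):
    """Estimate the capacity (bits) and a maximizing p(s) for a discrete channel."""
    n_s, n_r = len(p_r_given_s), len(p_r_given_s[0])
    p_s = [1.0 / n_s] * n_s  # start from a uniform input distribution
    for _ in range(iterations):
        p_r = [sum(p_s[s] * p_r_given_s[s][r] for s in range(n_s)) for r in range(n_r)]
        # Reweight each input by exp of the divergence between p(r|s) and p(r).
        weights = []
        for s in range(n_s):
            divergence = sum(
                q * math.log(q / p_r[r])
                for r, q in enumerate(p_r_given_s[s]) if q > 0
            )
            weights.append(p_s[s] * math.exp(divergence))
        norm = sum(weights)
        p_s = [w / norm for w in weights]
    return mutual_information(p_s, p_r_given_s), p_s

# Symmetric channel: the phone line that garbles 25% of coin-flip reports.
noisy_phone = [[0.75, 0.25], [0.25, 0.75]]
# Asymmetric (hypothetical) channel: input 0 is always received correctly,
# input 1 is misread half of the time.
asymmetric = [[1.0, 0.0], [0.5, 0.5]]

cap_phone, best_phone = blahut_arimoto(noisy_phone)
cap_asym, best_asym = blahut_arimoto(asymmetric)
print(f"noisy phone: capacity ~ {cap_phone:.3f} bits with p(s) = {best_phone}")
print(f"asymmetric:  capacity ~ {cap_asym:.3f} bits with p(s) = {best_asym}")
```

For the symmetric channel the maximizing input is the fair coin itself, so the capacity coincides with the mutual information of the 25% error scenario. For the asymmetric channel the maximizing input is not uniform, which illustrates why capacity is defined through a maximization over *p*(*s*).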

We note that for many biological signaling channels, *p*(*r*|*s*) can be readily experimentally measured, whereas *p*(*s*) cannot be easily estimated, particularly if *S* corresponds to commonly very low ligand concentrations and infrequent signaling events. Hence, the amount of information corresponding to a particular signal source can be difficult to evaluate. However, channel capacity can be easily inferred by determining which *p*(*s*) yields the maximum amount of information. Typically, *p*(*r*|*s*) can be sampled by providing a controlled input stimulus to the system and measuring the distribution of responses, which can then be repeated for many different stimulus values. Such an experiment implicitly requires imposing an artificial set of stimuli on the biological system of interest. On the other hand, the relevant *p*(*s*) is the natural frequency at which each stimulus value would be encountered by the system. The frequencies may be unknown or, at present, not easily experimentally determined. Nonetheless, for biological channels the capacity may yield insights as to the magnitude of the actual amount of information transmitted, because under the efficient coding hypothesis, biological systems whose primary function is communication can be expected to have evolved to be optimally matched to the information sources that feed them [13]. For instance, in the anterior-posterior patterning system of the embryos of the *Drosophila melanogaster* fruit fly, the measured mutual information between an input morphogen signal and an output transcription factor was ~90% of the capacity of the system [7] (see below for further discussion). Thus, examination of the maximum capability of a biological system may shed light on its actual data throughput.

Rate Distortion Theory

Often any improvement in the capacity or quality of a communication channel comes at an associated cost (e.g., increased energy required), or alternatively it may be that an errorless communication channel would be impractical or infeasible to construct. In such scenarios, knowing how much the required channel capacity decreases as the tolerable channel distortion increases would aid in the understanding of biological communication channels. Fortunately, Rate Distortion Theory, a branch of information theory, provides a mathematical framework to examine the trade-off between capacity and the acceptable error (i.e., distortion) limit *D*.

Rate distortion analysis is performed in the context of a specific distortion function, which measures the error between the sent and received message. For instance, if the sent message is a scalar *s* and the received message is a scalar *r*, then a commonly used distortion function *d(s,r)* is the square of the difference of the sent and received message, *d*(*s*, *r*) = (*s* − *r*)^{2}.

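The rate distortion function *R*(*D*) is then defined as the minimum amount of mutual information required to ensure that the average level of distortion is less than or equal to *D*. Mathematically, this is represented as:

$$R(D) = \min_{p(r|s)\,:\,\langle d(s,r)\rangle \le D} I(R;S)$$

where the minimization is taken over all conditional distributions *p*(*r*|*s*) whose average distortion ⟨*d*(*s*, *r*)⟩ does not exceed *D*.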

We note that the choice of the distortion function is important, as different measures of distortion will generally lead the minimization to different results. A major result of rate distortion theory is that the capacity of a communication channel must be at least *R*(*D*) in order to ensure that the average distortion is less than or equal to *D*. Furthermore, *R*(*D*) is a continuous non-increasing function, thus if the acceptable level of error is increased then the required capacity stays the same or decreases.

In a typical application of rate distortion theory, Roy and Samantha are once again attempting to relay the results of a coin flip over a telephone line having some degree of static. In this scenario, they have the option of incrementally improving the communication channel, at an associated economic cost, to reduce the chance of miscommunication. Rate distortion theory helps solve the problem of determining the minimal channel quality such that the communication error does not exceed an amount that is acceptable to both parties.
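For the coin-flip case the trade-off can be written down explicitly. The sketch below assumes that distortion is measured as the fraction of flips relayed incorrectly (Hamming distortion, rather than the squared-error example mentioned earlier), for which the standard textbook result is *R*(*D*) = *H*(*p*) − *H*(*D*) when *D* is below min(*p*, 1 − *p*), and zero otherwise. The function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome variable with probabilities p and 1 - p."""
    return sum(-q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def rate_distortion_coin(p_heads, max_error):
    """Minimum mutual information (bits per flip) needed so that the fraction of
    misrelayed flips does not exceed max_error (assumes Hamming distortion)."""
    if not 0.0 <= max_error <= 1.0:
        raise ValueError("max_error must lie between 0 and 1")
    # Always reporting the likelier side already achieves this error rate for free.
    free_error = min(p_heads, 1.0 - p_heads)
    if max_error >= free_error:
        return 0.0
    return binary_entropy(p_heads) - binary_entropy(max_error)

for tolerated in (0.0, 0.05, 0.10, 0.25, 0.50):
    needed = rate_distortion_coin(0.5, tolerated)
    print(f"tolerated error {tolerated:.2f}: at least {needed:.2f} bits per flip")
```

Under these assumptions, tolerating a 25% error rate for a fair coin requires about 0.19 bits per flip, the same value as the mutual information of the 25% static scenario discussed earlier, while demanding errorless relay requires the full 1 bit.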

Data Processing Inequality

Finally, we discuss the data processing inequality, which essentially states that at every step of information processing, information cannot be gained, only lost. More precisely, for a Markov chain *X* → *Y* → *Z*, the data processing inequality states that *I*(*X;Z*) ≤ *I*(*X;Y*). That is, *Z* contains no more information about *X* than *Y* does. Colloquially, the data processing inequality is analogous to the children’s game of “broken telephone”. As individuals are lined up and told to pass a message by whispering it to the next person in line, the addition of any extra people can only serve to distort the original message, not improve upon it, hence the “broken telephone”.

The relevance of the data processing inequality is twofold. First, it places bounds on the performance of a biological system that contains multiple communication channels in series. For instance, consider *X* → *Y* to represent cytokine signaling to a transcription factor and *Y* → *Z* to represent transcription factor signaling to the concentration of an expressed protein. Assuming no other sources of information, the amount of information that the expressed protein (*Z*) provides about the cytokine signal (*X*) cannot be more than the information that the transcription factor (*Y*) provides about the cytokine (*X*). If the information between *X* and *Y* is particularly limiting, this can place strict bounds on the fidelity of the response *Z*.
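The inequality can be verified for a simple two-stage cascade. In the sketch below each variable is reduced to two levels (e.g. low/high) and each stage is modeled as a channel that flips the level with a fixed probability; these simplifications, the flip probabilities, and the function names are assumptions made purely for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information in bits from a joint distribution joint[i][j]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p_ab * math.log2(p_ab / (p_a[i] * p_b[j]))
        for i, row in enumerate(joint)
        for j, p_ab in enumerate(row)
        if p_ab > 0
    )

def flipping_stage(flip_probability):
    """Two-level channel that reports the wrong level with the given probability."""
    f = flip_probability
    return [[1.0 - f, f], [f, 1.0 - f]]

def chain_informations(p_x, p_y_given_x, p_z_given_y):
    """Return (I(X;Y), I(X;Z)) in bits for the Markov chain X -> Y -> Z."""
    n_x, n_y, n_z = len(p_x), len(p_z_given_y), len(p_z_given_y[0])
    joint_xy = [[p_x[x] * p_y_given_x[x][y] for y in range(n_y)] for x in range(n_x)]
    # Markov property: p(x, z) = sum over y of p(x, y) p(z|y).
    joint_xz = [
        [sum(joint_xy[x][y] * p_z_given_y[y][z] for y in range(n_y)) for z in range(n_z)]
        for x in range(n_x)
    ]
    return mutual_information(joint_xy), mutual_information(joint_xz)

# Hypothetical cascade: cytokine (X) -> transcription factor (Y) -> protein (Z).
i_xy, i_xz = chain_informations(
    p_x=[0.5, 0.5],
    p_y_given_x=flipping_stage(0.10),
    p_z_given_y=flipping_stage(0.20),
)
print(f"I(X;Y) = {i_xy:.3f} bits, I(X;Z) = {i_xz:.3f} bits")
```

Setting the second stage's flip probability to zero makes the two values equal, which is the only way the downstream variable can retain everything the intermediate one knows about *X*.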

Second, the data processing inequality has implications for experimental measurements. For instance, consider the chain *S* (signal) *→ R* (actual response) *→ R′* (measured response). Although an experimentalist might wish to quantify the mutual information between the signal and actual response, *I*(*S;R*)*,* she is confined to measuring *I*(*S;R′*). For *I*(*S;R′*) to be close in value to *I*(*S;R*) the noise between *R* and *R′* resulting from experimental error must be minimal. Thus, it is critical to pay close attention to the degree of experimental noise when attempting to measure mutual information.