It is widely agreed that neural activity in the brain is more than the sum of its parts—coherent percepts, thoughts, and actions require the coordinated activity of many neurons in a network, not the independent activity of many individual neurons. It is not so clear, however, how to build bridges between this intuition about collective behavior and the activity of individual neurons.
One set of ideas is that the activity of the network as a whole may be confined to some very low dimensional trajectory, such as a global, coherent oscillation. Such oscillatory activity is observable in the summed electrical activity of large numbers of neurons—the EEG—and should be reflected as oscillations in the (auto–)correlation functions of spike trains from individual neurons. On a more refined level, dimensionality reduction techniques like PCA allow the activity patterns of a neural network to be viewed on a low-dimensional manifold, facilitating visualization and intuition 
. A very different idea is provided by the Hopfield model, in which collective behavior is expressed in the stabilization of many discrete patterns of activity, combinations of spiking and silence across the entire network 
. Taken together, these many patterns can span a large fraction of the full space of possibilities, so that there need be no dramatic dimensionality reduction in the usual sense of this term.
The claim that a network of neurons exhibits collective behavior is really the claim that the distribution of states taken on by the network has some nontrivial structure that cannot be factorized into contributions from individual cells or perhaps even smaller subnetworks. Our goal in this work has been to build a model of this distribution, and to explore the structure of that model. We emphasize that building a model is, in this view, the first step rather than the last step. But building a model is challenging, because the space of states is very large and data are limited.
An essential step in searching for collective behavior has been to develop experimental techniques that allow us to record not just from a large number of neurons, but from a large fraction of the neurons in a densely interconnected region of the retina 
. In large networks, even measuring the correlations among pairs of neurons can become problematic: individual elements of the correlation matrix might be well determined from small data sets, but much larger data sets are required to be confident that the matrix as a whole is well determined. Thus, long, stable recordings are even more crucial than usual.
To use the maximum entropy approach, we have to be sure that we can actually find the models that reproduce the observed expectation values (, ) and that we have not, in the process, fit to spurious correlations that arise from the finite size of our data set (). Once these tests are passed, we can start to assess the accuracy of the model as a description of the network as a whole. In particular, we found that the pairwise model began to break down at a network size
(). However, by adding the K-spike constraint that reproduces the probability of K
out of N
neurons spiking synchronously (), which is a statistic that is well-sampled and does not greatly increase the model's complexity, we could again recover good performance (–). Although the primary goal of this work was to examine the responses of the retina under naturalistic stimulation, we also checked that the K-pairwise models are able to capture the joint behavior of retinal ganglion cells under a very different, random checkerboard stimulation (Figure S7
). Despite a significantly smaller amount of total correlation (and a complete lack of long-range correlation) in the checkerboard stimuli compared to natural scenes, the pairwise model still deviated significantly from data at large N
and the K-spike constraint proved necessary; this happened even though the total amount of correlation in the codewords is smaller for the checkerboard stimulus. Characterizing more completely the dependence of the statistical properties of the neural code on the stimulus type therefore seems like one of the interesting avenues for future work.
How can we interpret the meaning of the K-spike constraint and its biological relevance? One possibility would be to view it as a global modulatory effect of, e.g., inhibitory interneurons with dense connectivity. Alternatively,
might be an important feature of the neural code for downstream neurons. For example, if a downstream neuron sums over its inputs and fires whenever the sum exceeds a threshold,
will be informative about the rate of such threshold crossings. We note that
is a special case of a more general linear function,
are arbitrary weights and θ
is a threshold (
). An interesting question to explore in the future would be to ask if the K
-statistic really is the most informative choice that maximally reduces the entropy of the K-pairwise model relative to the pairwise model, or whether additional modeling power could be gained by optimizing the weights
, perhaps even by adding several such projection vectors as constraints. In any case, the K-pairwise model should not be regarded as “the ultimate model” of the retinal code: it is a model that emerged from pairwise constructions in a data-driven attempt to account for systematic deficiencies of Ising-like models on large populations. Similarly, systematic deviations that remain, e.g., at the low and high ends of the effective field
as in , might inform us about further useful constraints that could improve the model. These could include either new higher-order interactions, global constraints, or couplings across time bins, as suggested by Refs 
Perhaps the most useful global test of our models is to ask about the distribution of state probabilities: how often should we see combinations of spiking and silence that occur with probability P? This has the same flavor as asking for the probability of every state, but does not suffer from the curse of dimensionality. Since maximum entropy models are mathematically identical to the Boltzmann distribution in statistical mechanics, this question about the frequency of states with probability P is the same as asking how many states have a given energy E; we can avoid binning along the E axis by asking for the number of states with energies smaller (higher probability) or larger (lower probability) than E. shows that these cumulative distributions computed from the model agree with experiment far into the tail of low probability states. These states are so rare that, individually, they almost never occur, but there are so many of these rare states that, in aggregate, they make a measurable contribution to the distribution of energies. Indeed, most of the states that we see in the data are rare in this sense, and their statistical weight is correctly predicted by the model.
The maximum entropy models that we construct from the data do not appear to simplify along any conventional axes. The matrix of correlations among spikes in different cells () is of full rank, so that principal component analysis does not yield significant dimensionality reduction. The matrix of “interactions” in the model () is neither very sparse nor of low rank, perhaps because we are considering a group of neurons all located (approximately) within the radius of the typical dendritic arbor, so that all cells have a chance to interact with one another. Most importantly, the interactions that we find are not weak (), and together with being widespread this means that their impact is strong. Technically, we cannot capture the impact within low orders of perturbation theory (Methods: Are the networks in the perturbative regime?
), but qualitatively this means that the behavior of the network as a whole is not in any sense “close” to the behavior of non–interacting neurons. Thus, the reason that our models work is likely not because the correlations are weak, as had been suggested 
Having convinced ourselves that we can build models which give an accurate description of the probability distribution over the states of spiking and silence in the network, we can ask what these models teach us about function. As emphasized in Ref 
, one corollary of collective behavior is the possibility of error correction or pattern completion—we can predict the spiking or silence of one neuron by knowing the activity of all the other neurons. With a population of N
100 cells, the quality of these predictions becomes quite high (). The natural way of testing these predictions is to look at the probability of spiking vs. time in the stimulus movie. Although we make no reference to the stimulus, we reproduce the sharp peaks of activity and extended silences that are so characteristic of the response to naturalistic inputs, and so difficult to capture in conventional models where each individual neuron responds to the visual stimulus as seen through its receptive field 
One of the dominant concepts in thinking about the retina has been the idea that the structure of receptive fields serves to reduce the redundancy of natural images and enhance the efficiency of information transmission to the brain 
(but see 
). While one could argue that the observed redundancy among neurons is less than expected from the structure of natural images or movies, none of what we have described here would happen if the retina truly “decorrelated” its inputs. Far from being almost independent, the activity of single neurons is predicted very well by the state of the remaining network, and the combinations of spiking and silence in different cells cluster into basins of attraction defined by the local minima of energy in our models. While it is intriguing to think about the neural code as being organized around the “collective metastable states,” some of which we have identified using the maximum entropy model, further work is necessary to explore this idea in detail. Unlike our other results, where we could either compare parameter-free predictions to data, or put a bound on the entropy of the code, it is harder to compare the model's energy landscape (and its local minima) to the true energy landscape, for which we would need to be able to estimate all pattern probabilities directly from data. It is therefore difficult to assess how dependent the identified collective states are on the form of the model. Nevertheless, for any particular model assignment of activity patterns to collective states, one could ask how well those collective modes capture the information about the stimuli, and use that as a direct measure of model performance. We believe this to be a promising avenue for future research.
120 neurons, our best estimate of the entropy corresponds to significant occupancy of roughly one million distinct combinations of spiking and silence. Each state could occur with a different probability, and (aside from normalization) there are no constraints—each of these probabilities could be seen as a separate parameter describing the network activity. It is appealing to think that there must be some simplification, that we won't need a million parameters, but it is not obvious that any particular simplification strategy will work. Indeed, there has been the explicit claim that maximum entropy approach has been successful on small (
) groups of neurons simply because a low-order maximum model will generically approximate well any probability distribution that is sufficiently sparse, and that we should thus not expect it to work for large networks 
. Thus, it may seem surprising that we can write down a relatively simple model, with parameters that number less than a percent of the number of effectively occupied states (
effective states at N
120) and whose values are directly determined by measurable observables, and have this model predict so much of the structure in the distribution of states. Surprising or not, it certainly is important that, as the community contemplates monitoring the activity of ever larger number of neurons 
, we can identify theoretical approaches that have the potential to tame the complexity of these large systems.
Some cautionary remarks about the interpretation of our models seem in order. Using the maximum entropy method does not mean there is some hidden force maximizing the entropy of neural activity, or that we are describing neural activity as being in something like thermal equilibrium; all we are doing is building maximally agnostic models of the probability distribution over states. Even in the context of statistical mechanics, there are infinitely many models for the dynamics of the system that will be consistent with the equilibrium distribution, so we should not take the success of our models to mean that the dynamics of the network corresponds to something like Monte Carlo dynamics on the energy landscape. It is tempting to look at the couplings
between different neurons as reflecting genuine, mechanistic interactions, but even in the context of statistical physics we know that this interpretation need not be so precise: we can achieve a very accurate description of the collective behavior in large systems even if we do not capture every microscopic detail, and the interactions that we do describe in the most successful of models often are effective interactions mediated by degrees of freedom that we need not treat explicitly. Finally, the fact that a maximum entropy model which matches a particular set of experimental observations is successful does not mean that this choice of observables (e.g., pairwise correlations) is unique or minimal. For all these reasons, we do not think about our models in terms of their parameters, but rather as a description of the probability distribution
itself, which encodes the collective behavior of the system.
The striking feature of the distribution over states is its extreme inhomogeneity. The entropy of the distribution is not that much smaller than it would be if the neurons made independent decisions to spike or be silent, but the shape of the distribution is very different; the network builds considerable structure into the space of states, without sacrificing much capacity. The probability of the same state repeating is many orders of magnitude larger than expected for independent neurons, and this really is quite startling (). If we extrapolate to the full population of ~250 neurons in this correlated, interconnected patch of the retina, the probability that two randomly chosen states of the system are the same is roughly one percent. Thus, some combination of spiking and silence across this huge population should repeat exactly every few seconds. This is true despite the fact that we are looking at the entire visual representation of a small patch of the world, and the visual stimuli are fully naturalistic. Although complete silence repeats more frequently, a wide range of other states also recur, so that many different combinations of spikes and silence occur often enough that we (or the brain) can simply count them to estimate their probability. This would be absolutely impossible in a population of nearly independent neurons, and it has been suggested that these repeated patterns provide an anchor for learning 
. It is also possible that the detailed structure of the distribution, including its inhomogeneity, is matched to the statistical structure of visual inputs in a way that goes beyond the idea of redundancy reduction, occupying a regime in which strongly correlated activity is an optimal code 
Building a precise model of activity patterns required us to match the statistics of global activity (the probability that K
out of N
neurons spike in the same small window of time). Several recent works suggested alternative means of capturing the higher-order correlations 
. Particularly promising and computationally tractable amongst these models is the dichotomized Gaussian (DG) model 
that could explain correctly the distribution of synchrony in the monkey cortex 
. While DG does well when compared with pairwise models on our data, it is significantly less successful than the full K-pairwise models that we have explored here. In particular, the DG predictions of three-neuron correlations are much less accurate than in our model, and the probability of coincidences is underestimated by an amount that grows with increasing N
). Elsewhere we have explored a very simple model in which we ignore the identity of the neurons and match only the global behavior 
. This model already has a lot of structure, including the extreme inhomogeneity that we have emphasized here. In the simpler model we can exploit the equivalence between maximum entropy models and statistical mechanics to argue that this inhomogeneity is equivalent to the statement that the population of neurons is poised near a critical surface in its parameter space, and we have seen hints of this from analyses of smaller populations as well 
. The idea that biological networks might organize themselves to critical points has a long history, and several different notions of criticality have been suggested 
. A sharp question, then, is whether the full probability distributions that we have described here correspond to a critical system in the sense of statistical physics, and whether we can find more direct evidence for criticality in the data, perhaps without the models as intermediaries.
Finally, we note that our approach to building models for the activity of the retinal ganglion cell population is entirely unsupervised: we are making use only of structure in the spike trains themselves, with no reference to the visual stimulus. In this sense, the structures that we discover here are structures that could be discovered by the brain, which has no access to the visual stimulus beyond that provided by these neurons. While there are more structures that we could use—notably, the correlations across time—we find it remarkable that so much is learnable from just an afternoon's worth of data. As it becomes more routine to record the activity of such (nearly) complete sensory representations, it will be interesting to take the organism's point of view 
more fully, and try to extract meaning from the spike trains in an unsupervised fashion.