Sensory processing in the brain includes three key operations: multisensory integration—the task of combining cues into a single estimate of a common underlying stimulus; coordinate transformations—the change of reference frame for a stimulus (e.g., retinotopic to body-centered) effected through knowledge about an intervening variable (e.g., gaze position); and the incorporation of prior information. Statistically optimal sensory processing requires that each of these operations maintains the correct posterior distribution over the stimulus. Elements of this optimality have been demonstrated in many behavioral contexts in humans and other animals, suggesting that the neural computations are indeed optimal. That the relationships between sensory modalities are complex and plastic further suggests that these computations are learned—but how? We provide a principled answer, by treating the acquisition of these mappings as a case of density estimation, a well-studied problem in machine learning and statistics, in which the distribution of observed data is modeled in terms of a set of fixed parameters and a set of latent variables. In our case, the observed data are unisensory-population activities, the fixed parameters are synaptic connections, and the latent variables are multisensory-population activities. In particular, we train a restricted Boltzmann machine with the biologically plausible contrastive-divergence rule to learn a range of neural computations not previously demonstrated under a single approach: optimal integration; encoding of priors; hierarchical integration of cues; learning when not to integrate; and coordinate transformation. The model makes testable predictions about the nature of multisensory representations.
Over the first few years of their lives, humans (and other animals) appear to learn how to combine signals from multiple sense modalities: when to “integrate” them into a single percept, as with visual and proprioceptive information about one's body; when not to integrate them (e.g., when looking somewhere else); how they vary over longer time scales (e.g., where in physical space my hand tends to be); as well as more complicated manipulations, like subtracting gaze angle from the visually-perceived position of an object to compute the position of that object with respect to the head—i.e., “coordinate transformation.” Learning which sensory signals to integrate, or which to manipulate in other ways, does not appear to require an additional supervisory signal; we learn to do so, rather, based on structure in the sensory signals themselves. We present a biologically plausible artificial neural network that learns all of the above in just this way, but by training it for a much more general statistical task: “density estimation”—essentially, learning to be able to reproduce the data on which it was trained. This also links coordinate transformation and multisensory integration to other cortical operations, especially in early sensory areas, that have have been modeled as density estimators.