Object vision in human and nonhuman primates is often cited as a primary example of adult plasticity in neural information processing. It has been hypothesized that visual experience leads to single neurons in the monkey brain with strong selectivity for complex objects, and to regions in the human brain with a preference for particular categories of highly familiar objects. This view suggests that adult visual experience causes dramatic local changes in the response properties of high-level visual cortex. Here, we review the current neurophysiological and neuroimaging evidence and find that the available data support a different conclusion: adult visual experience introduces moderate, relatively distributed effects that modulate a pre-existing, rich and flexible set of neural object representations.
Sensory information processing in adult mammalian brains is highly malleable, with neural processing at all levels adapting to both the short- and long-term properties of the incoming information. In vision, prominent examples include short-term adaptation to input statistics in the retina, primary visual cortex (V1) and subsequent cortical stages, and long-term reorganization due to changed visual input.
Nevertheless, cortical neural plasticity has often been viewed as more likely in visual regions selective for complex objects than in the input stage of processing, V1. It is unlikely that cortical representations could be constructed, a priori, to represent all possible objects that might be encountered throughout life. Indeed, human beings can recognize an almost infinite number of objects, most of us sharing the ability to individuate thousands of faces despite their similarity (two eyes, nose, and mouth in a standard configuration). Further, we often develop expertise in the recognition of particular types of objects such as cars, birds or wild mushrooms.
In apparent support of extensive learning processes, the past two decades have seen numerous electrophysiological and brain imaging studies reporting effects of learning on high-level visual representations. By learning we mean the effects of any form of experience, whether through passive exposure or some sort of explicit training, such as learning to categorize or discriminate between objects. Most of these studies have focused on the lateral occipital, occipitotemporal and inferior temporal cortices in humans and the inferior temporal cortex in monkeys, which we collectively refer to as “IT cortex”. However, some studies have argued for a role of the whole cortical visual processing hierarchy in learning about objects and their features [10,11,12]. In this review, we critically evaluate the evidence and hypotheses about visual object learning in IT cortex in the context of the functional properties of these brain regions. We focus on long- rather than short-term learning effects (Text Box 1), in adults rather than during development (Text Box 2). We will conclude that the empirical evidence points to moderate and distributed changes, despite the common suggestion that object learning is supported by strong and often focal changes in neural representations [13,14], such as the creation by learning of a subpopulation of units tuned to complex images [15,16].
Individual neurons and subregions in IT cortex are selective for complex or moderately complex stimuli, responding more strongly to particular images or categories of images than others [17,18,19]. The observed properties suggest that IT cortex contains explicit representations of stimulus dimensions such as shape, at least some visual categories and possibly even semantic categories, with a relative tolerance for changes in other stimulus characteristics such as position, size, lighting, and, to a lesser degree, clutter and viewpoint (see Fig. 1). These representations are surprisingly versatile, with easy “read-out” of object category and identity, while simultaneously containing information about other stimulus characteristics like position [21,22].
Many studies of IT cortex have employed images of complex, real-life objects. With these images, it has proven difficult to determine the underlying dimensions that best describe the neural tuning functions (Fig. 1). Other experiments have included parametric variations of shapes and have reported systematic tuning functions for a variety of shape dimensions [23,24,25,26], analogous to the tuning in V1 for more simple stimulus dimensions (e.g. grating orientation). However, such tuning is demonstrated in relatively restricted stimulus sets, and we are far from understanding the representational space of more complex images. In this review we will describe the stimulus selectivity of IT neurons and its modulation by learning in terms of tuning curves for dimensions and features, but this selectivity could also be characterized in terms of more abstract principles, such as the statistical structure of images [27,28].
At the single-neuron level, we differentiate two tuning properties that might be modified by learning: the optimal stimulus, and the tuning function around that optimum. For example, consider a neuron that shows moderate tuning for two separate stimulus dimensions (Fig. 2a–b). Learning might increase or decrease the selectivity across one or both dimensions, it might change the optimal stimulus, and/or it might change the dimensions that are encoded by the neuron (Fig. 2c–e). Importantly, these changes in selectivity will interact with changes in response strength (Fig. 2f–h). For example, a sharpening of tuning could be achieved either by reducing responses to non-preferred stimuli, by increasing responses to preferred stimuli, or a combination of both.
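The two routes to sharper tuning described above can be made concrete with a toy model. The sketch below is purely illustrative and not taken from any of the reviewed studies: it assumes a Gaussian tuning curve over a single hypothetical stimulus dimension, and the function names, parameter values and selectivity index are all assumptions chosen for the example.

```python
import math

def tuning(x, preferred=0.0, width=1.0, gain=20.0, baseline=5.0):
    """Hypothetical Gaussian tuning curve: firing rate (spikes/s) for stimulus value x."""
    return baseline + gain * math.exp(-((x - preferred) ** 2) / (2 * width ** 2))

def selectivity(responses):
    """Illustrative index: contrast between the best and worst response, in [0, 1]."""
    hi, lo = max(responses), min(responses)
    return (hi - lo) / (hi + lo)

# Five stimuli along one assumed dimension; 0.0 is the preferred stimulus.
stimuli = [-2.0, -1.0, 0.0, 1.0, 2.0]

before = [tuning(x) for x in stimuli]
# Route 1: narrower tuning lowers responses to non-preferred stimuli, peak unchanged.
narrowed = [tuning(x, width=0.5) for x in stimuli]
# Route 2: higher gain raises the response to the preferred stimulus.
boosted = [tuning(x, gain=40.0) for x in stimuli]

for label, resp in (("before", before), ("narrowed", narrowed), ("boosted", boosted)):
    print(f"{label:9s} peak={max(resp):5.1f}  selectivity={selectivity(resp):.3f}")
```

In this toy case both manipulations raise the selectivity index, but only the second one changes the peak response, which is why selectivity and response strength have to be interpreted together.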
What factors are likely to be important in driving these sorts of changes? Whereas theories of coding in V1 propose an optimal representation of visual images given the local spatiotemporal statistical properties of natural images, theories of object coding typically propose that these representations reflect more global statistical properties: the dependencies between more distant image patches and at a longer time scale. Many of these spatial dependencies might be picked up by bottom-up Hebbian learning mechanisms sensitive to conjunctions of features, possibly augmented with mechanisms sensitive to probability statistics. The temporal dependencies might require a similar learning mechanism with a memory trace. In addition to these bottom-up mechanisms, the properties of object representations must also be sensitive to top-down information about which distinctions between images are most relevant or informative. Recent computational modeling emphasizes that tuning for moderately complex features would be the optimal compromise between the bottom-up statistics and the top-down requirements imposed on object representations.
At the population level there are two important aspects to consider: sparseness and clustering. Sparseness refers to the distribution of learning changes across the population of IT neurons. At one extreme, learning could modify the response properties of only a small number of neurons. At the other extreme, learning could modify the properties of all neurons. Clustering, on the other hand, refers to the spatial organization of learning-related changes. Learning could increase the clustering of neurons with similar selectivity, perhaps optimizing interactions between neurons.
If learning effects are sparse, which factors determine which neurons will be modified most by learning? At least three factors have been proposed. First, learning might specifically involve a small subset of neurons that prior to learning showed limited responsiveness. After training, these initially unresponsive neurons would be tuned to complex novel objects such as ‘paperclips’ [13,37,38], requiring both sharpening of tuning (Fig. 2c) and a shift in the optimal stimulus (Fig. 2d). Second, learning might specifically involve face-selective neurons and regions [39,40]. This prediction is rooted in the hypothesis that expertise is the major cause of the selectivity of face-selective regions. Before training, these brain regions would be characterized by strong face selectivity, but not necessarily by a response to the to-be-trained objects. Third, studies of orientation discrimination learning in retinotopic areas have suggested that learning might modify the tuning of the neurons that are most informative for solving the discrimination task, and a similar mechanism could be at work during object learning.
To understand visual object learning, it is critical to consider the degree of sparseness and clustering and the factors that determine which neurons are most affected. With the limited sampling afforded by single unit recording, it would be very easy to miss sparse learning-related changes, even if the few modified neurons show very large changes in tuning properties. Similarly, given the spatial resolution of functional imaging methods, it may only be possible to detect distributed learning-related changes and even then, only when there is significant clustering of those changes.
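The sampling problem can be illustrated with simple arithmetic. The sketch below assumes, hypothetically, that a fixed fraction of IT neurons is modified by learning and that each recorded neuron is an independent random draw from the population; neither assumption comes from the reviewed studies, and the numbers are placeholders.

```python
def prob_no_modified_neuron(fraction_modified, n_recorded):
    """Chance that n independent random recordings all miss the modified neurons."""
    if not 0.0 <= fraction_modified <= 1.0:
        raise ValueError("fraction_modified must lie between 0 and 1")
    return (1.0 - fraction_modified) ** n_recorded

# Hypothetical sample of 200 neurons and three assumed levels of sparseness.
for fraction in (0.001, 0.01, 0.05):
    p_miss = prob_no_modified_neuron(fraction, 200)
    print(f"fraction modified {fraction:.3f}: P(sample misses all) = {p_miss:.3f}")
```

Under these assumptions, if only 0.1% of neurons were modified, a sample of 200 neurons would contain none of them about 82% of the time, regardless of how large the change in each modified neuron is.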
We will show below that the current empirical evidence is consistent with general predictions from the computational proposals: learning changes neural tuning and responsiveness according to both bottom-up stimulus characteristics and top-down task constraints. However, almost no studies have been computationally motivated to investigate more specific hypotheses, so the link between the empirical findings and the theories is tenuous.
There is one important caveat to the learning mechanisms discussed above. The diverse selectivity observed in IT cortex suggests that for any given task some neurons will be more informative than others. One possible mechanism underlying object learning might be to optimize the read-out of the most informative IT neurons [42,43,44], as has also been proposed in the domains of orientation and motion learning [45,46], without any changes of the response properties within IT cortex.
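A minimal sketch can show how read-out optimization alone might improve behavior. It assumes a hypothetical population of four neurons with fixed mean responses to two objects, independent Gaussian noise of equal size in every neuron, and a linear read-out with an unbiased criterion. All names and values are illustrative assumptions, not the models proposed in the cited work.

```python
import math

def dprime(weights, mean_a, mean_b, noise_sd):
    """Discriminability of a linear read-out, assuming independent Gaussian noise."""
    signal = sum(w * (a - b) for w, a, b in zip(weights, mean_a, mean_b))
    noise = noise_sd * math.sqrt(sum(w * w for w in weights))
    return abs(signal) / noise

def accuracy(d):
    """Proportion correct with an unbiased criterion: Phi(d'/2)."""
    return 0.5 * (1.0 + math.erf(d / (2.0 * math.sqrt(2.0))))

# Fixed (hypothetical) mean responses in spikes/s; only neuron 0 separates A from B.
mean_a = [20.0, 15.0, 15.0, 15.0]
mean_b = [10.0, 15.0, 15.0, 15.0]
noise_sd = 5.0

weights_before = [1.0, 1.0, 1.0, 1.0]  # read-out pools all neurons equally
weights_after = [1.0, 0.0, 0.0, 0.0]   # read-out relies on the informative neuron

d_before = dprime(weights_before, mean_a, mean_b, noise_sd)
d_after = dprime(weights_after, mean_a, mean_b, noise_sd)
print(f"before: d'={d_before:.2f}, accuracy={accuracy(d_before):.3f}")
print(f"after:  d'={d_after:.2f}, accuracy={accuracy(d_after):.3f}")
```

In this sketch discriminability doubles although the response properties of the four model neurons are identical before and after learning; only the read-out weights change.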
Experiments typically compare neural responses to learned objects with an unlearned baseline, either obtained before learning [9,47], after learning to a set of unlearned objects [48,49] or in different, ‘naive’ subjects [50,51]. Importantly, studies do not typically track changes in the properties of single neurons across days and weeks because of technical limitations, and longitudinal studies have been limited to 1–2 hours at most [52,53,54]. fMRI allows changes to be tracked across long periods of time in individual voxels, but each voxel contains hundreds of thousands of neurons. So, all inferences are necessarily based on the comparison of population statistics between experimental and baseline conditions.
Learning changes the selectivity or tuning for experienced objects, with or without an additional role for task constraints. In monkeys, some studies have reported a strong increase in selectivity for trained compared to untrained objects [38,50] as well as an increased selectivity for relevant compared to irrelevant stimulus dimensions (Fig. 3a–b). Later studies, which controlled for pre-existing selectivity biases and spatial attention, reported much smaller effects, often only a few spikes per second [47,48,55,56] (Fig. 3c–d, 4a–b). Using an indirect fMRI adaptation method to determine average changes in selectivity in a neural population [57,58,59], human studies have confirmed a general increase in selectivity for trained objects in object-selective cortex, but without a difference in selectivity between relevant and irrelevant dimensions. Overall, effects have been found consistently by comparing trained with untrained objects, but more specific effects such as differences between relevant and irrelevant dimensions seem to be small and harder to detect.
Studies have also reported decreased selectivity for stimuli that become associated with or predictive of each other [60,61]. The mechanism behind such associative coding might underlie the creation of tolerance for image transformations such as position or orientation changes. However, the effects have mostly been reported in multimodal cortical regions (perirhinal and entorhinal cortex) rather than unimodal visual areas (area TE in monkeys), appear weaker in TE and are dependent on the integrity of perirhinal and entorhinal cortex. Thus, there is ambiguity as to whether associative coding is an instance of visual object learning or of semantic memory.
While the temporal association in the classic paired-associate task is artificial, other studies have used temporal dynamics during training that resemble natural vision. For example, a comparison of view invariance between familiar and unfamiliar objects, where the familiar objects had been placed in the monkeys' living environment, suggested stronger invariance for the familiar objects. However, too few neurons were tested in the critical comparisons to allow strong conclusions. A recent within-session longitudinal study induced associations between objects with temporal dynamics that come close to those encountered in free viewing, and demonstrated that tolerance can be modulated in this situation (Fig. 3e–f). Nevertheless, it remains unclear whether these findings reflect the same mechanisms that are responsible for the degree of tolerance typically observed in IT cortex.
As noted earlier, learning may not just modify the degree of selectivity, but also the optimal stimuli and tuning dimensions (Fig. 2d–e). Some studies have provided circumstantial evidence for such changes. For example, Logothetis and colleagues found neurons in monkey IT cortex with a strong preference for trained views of complex paperclip objects, but noted that “no selective responses were ever encountered for those views that the animal systematically failed to recognize” (p. 558). Similarly, other studies [50,65] have also highlighted subpopulations of neurons with particularly high and selective responses for trained objects, but not for novel objects. However, none of these studies provided sufficient control conditions to firmly establish the nature of the underlying effects. Perhaps the best evidence for at least some minor changes in tuning dimensions comes from a study in which monkeys were trained to categorize multi-part baton stimuli based on the combination of parts present. Comparing the patterns of selectivity for trained and untrained batons revealed greater coding for the specific combination of parts, rather than the presence of individual parts, in trained batons. With single-part properties as the original dimensions, the coding for part combinations is an example of the generation of a new tuning dimension reflecting the interaction between parts (as illustrated in Fig. 2e).
The simplest characteristic at the population level is the overall response in experimental and baseline conditions. Some studies in monkeys have reported an increased population response [38,50,65,66,67], others a decreased response [48,49,68], and others little or no effect. Human imaging experiments suggest that both increases and decreases in response magnitude may occur across distributed areas of cortex [9,10,70,71]. Given the theoretical considerations above, it is not surprising that both increases and decreases have been observed: the results may depend on the initial tuning of the neurons with respect to the learned stimuli. The disparity across physiology studies probably also reflects the limited and non-uniform sampling (a few hundred neurons at best). Further, if the effects of learning are sparse with limited clustering, it will be difficult to isolate them.
These considerations bring us to the question of how sparse training effects are. Plotting the distribution of effects of interest, such as selectivity for trained and untrained objects, often reveals a general shift in this distribution (Fig. 4b,e). However, this does not necessarily imply that all neurons are affected equally by learning. Without tracking single neurons over time it is very difficult to establish the nature and distribution of learning effects. Nevertheless, fMRI has already revealed that training changes the pattern of selectivity across voxels (Fig. 4f), indicating heterogeneous training effects across voxels that are not fully distributed. One other fMRI study has suggested relatively focal effects of learning to read a particular alphabet (Hebrew) in one part of IT cortex, the visual word-form area. In addition, a single-unit study in monkeys has suggested that visual experience increases the clustering of neurons in perirhinal cortex with similar response properties, which should also change and even enhance the pattern of selectivity as measured with fMRI.
Since the training effects are not fully distributed, the question is which neurons are affected most. In the theoretical section we mentioned three possible factors. Almost all empirical efforts so far have focused on the role of face selectivity, and some fMRI studies have indeed found strong training effects in face-selective cortex [39,40,73]. However, most of these studies did not systematically compare face-selective cortex with other regions in IT cortex. Other, more recent studies have shown that relatively moderate learning effects are distributed throughout IT cortex [9,58,74,75], without any relationship to face selectivity.
Another candidate factor is how informative a neuron is for the task at hand. This hypothesis is supported by studies of the effect of orientation discrimination learning in retinotopic areas V1–V4. The preferred orientation of a neuron turned out to be a strong predictor of the strength of training effects. Without reference to the preferred orientation of neurons, these studies of orientation discrimination would support conclusions similar to those drawn from the IT data: small effects distributed across a broad neural population (e.g., compare Fig. 4b with Fig. 3 in ref. ). The problem with applying this approach to the learning of more complex objects is that, as described above, visual objects occupy a multi-dimensional space whose dimensions are not well known. One potential way out of this problem would be to abandon the notion of an explicit representation of features or dimensions, and describe the tuning curves of neurons and experience-related changes in terms of relatively abstract notions of the statistical properties of complex visual images [27,28].
Nevertheless, ‘informativeness’ is a promising candidate to encompass all existing findings about the neural basis of visual object learning, and is closely related to a formal computational model [35,76]. Furthermore, it can also explain at which level in the visual processing hierarchy effects would be most abundant: the level at which the selectivity of neurons fits best with the task at hand [50,77]. Finally, it might also explain why, sometimes, neural training effects are small despite large behavioral effects. Indeed, given the diverse tuning properties observed in IT cortex, learning may merely modify the read-out of this rich, ‘informative’ neural population [42,43].
Despite the prevalent view that IT cortex is highly plastic, the current evidence remains limited. Well-controlled studies find relatively small effects that seem to be widely distributed. The evidence is strong enough to uphold the view that learning modulates at least some aspects of object encoding in IT cortex, but more detailed questions remain unanswered (Text Box 3). Future studies need to combine computational models with empirical experiments in order to predict when the available representations are sufficient for task performance (no further changes necessary), and when not. Furthermore, studies should relate effects of learning to the specific properties of neurons in order to pinpoint the potentially small sub-population of IT neurons that are targeted by learning. This sub-population might be defined by tuning properties (e.g., preferred objects or features), type of neurons (e.g., inter-neurons), or anatomical position relative to the organizational units in IT cortex (e.g., columns and patches). Answering such questions would be much easier if learning studies could track the response properties of single neurons over days and weeks. This methodology is currently being developed [78,79].
We have highlighted one specific hypothesis, that the strength of learning effects might be related to the pre-learning usefulness of neurons for the learned task and stimuli. This hypothesis seems quite obvious, given that learning effects are intimately bounded by the nature of the pre-learning representation. Nevertheless, it has several benefits: it provides a link to computational arguments, and it offers a clear view on the expected strength of learning effects given the already existing representations. To test this hypothesis convincingly, however, further technical developments are necessary to track single-neuron responses over long periods of time.
There are very good reasons to differentiate learning effects according to the time scale over which those effects are seen. The molecular mechanisms are very different for modifications over seconds and minutes compared to those over hours and days. For example, long-term changes have to involve protein synthesis and gene transcription. The distinction between short and long term has been made in the literature on perceptual learning of basic visual properties such as orientation selectivity, and there are a few prototypical examples. Typical short-term effects are visual after-effects in which the continuous presentation of one adapting stimulus (e.g., a leftward oriented line) changes the perception of another stimulus (e.g., a vertical line) presented immediately after the adapting stimulus. Typical long-term effects are the decrease in orientation or texture discrimination thresholds that are induced gradually during several weeks of training [81,82]. At least some studies have suggested a role of sleep in these long-term paradigms [83,84], pointing to a qualitative difference between short-term and long-term learning in the underlying mechanisms. The literature on visual object learning also contains studies of short-term adaptation and multiple-day learning. Nevertheless, the distinction is less clear. Object adaptation and priming studies often include intervening stimuli, and such effects tend to integrate over many tens of stimulus presentations [85,86,87]. Is it appropriate to refer to this as a short-term effect? On the other hand, object learning studies in humans tend to include a relatively short training period of on average a few hours (ranging from less than 1 hour to at most 10 one-hour sessions) (e.g. [9,88,89,90,91,92]), and the role of sleep in these paradigms has not been investigated. Is this a long-term effect?
Whereas we can make an operational distinction between within-session and between-session learning, for now it is premature to attach too much weight to this distinction. Nevertheless, we will restrict our discussion to studies traditionally believed to focus on long-term learning effects, and refer to other reviews for a discussion of more short-term, within-session adaptation effects.
Object representations in the adult brain are the result of a long developmental history that affects all levels of the visual processing hierarchy. Even in V1, many of the properties of neurons and maps, such as orientation selectivity, directional selectivity, and ocular dominance, depend on visual experience early in life (first few months after birth) [94,95]. After this time period, some complex response properties are already apparent in monkey IT cortex, but the physiological data in animals are not systematic enough to make quantitative comparisons of object representations as a function of age. Evidence from several non-invasive studies suggests that face processing is the consequence of an interplay between face-specific innate mechanisms and visual experience. Further, behavioral studies in human and monkey infants have revealed experience-dependent face-specific biasing mechanisms that are further tuned by visual experience [99,100]. Early visual experience has a special importance for face processing given that visual deprivation in the first few months of life disrupts the development of the holistic processing that is characteristic of face perception in normal adults. Nevertheless, neural markers of category-specific object processing in the human brain change up to adolescence. In sum, although plasticity during development is probably much higher than during adulthood, the currently available data suggest for both periods that visual experience introduces incremental effects that modulate a pre-existing set of neural object representations.
We thank Annie Chan, Assaf Harel, Dwight Kravitz, Sue-Hyun Lee and Rufin Vogels for comments on the manuscript, and Wouter De Baene, James DiCarlo, Chou Hung, Xiong Jiang, Nuo Li, Christopher Moore, Charan Ranganath, Maximilian Riesenhuber, and Rufin Vogels for providing stimuli and data values. Support was provided by the Human Frontier Science Program (grant CDA 0040/2008 to H.O.d.B.), the Fund for Scientific Research – Flanders (grant 1.5.022.08 to H.O.d.B.), and the NIH Intramural Research Program at NIMH.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.