The study of visual perception abounds with surprising results, and perhaps none has generated more controversy than the speed of object recognition. Some complex objects can be recognized remarkably quickly even while attention is engaged on a different task, whereas some simple objects require lengthy attentional scrutiny, and performance on them breaks down in dual-task experiments. These results are fundamental to our understanding of the visual cortex, as they clearly show the interplay among the representation of information in the brain, attentional mechanisms, binding, and consciousness.
We argue that the lack of a common terminology is a significant contributor to this controversy, and we therefore define four levels of visual task: detection (is a particular item present in the stimulus, yes or no?); localization (detection plus accurate location); recognition (localization plus a detailed description of the stimulus); and understanding (recognition plus the role of the stimulus in the context of the scene).
It is clear from performance results that detection is not possible for all stimuli, so the difference must lie in the internal representation of the different stimuli. For detection to be possible, the fast, feed-forward activation of a neuron (or pool of neurons) must represent the detected stimulus, which is consistent with the experimental finding that only highly over-learned and biologically relevant stimuli, or broad stimulus categories, can be detected. In detection tasks localization is poor or absent, so location needs to be recovered on the basis of this initial representation. Given that detailed location and extent information is available only in the early processing areas, this recovery must be accomplished by the ubiquitous feedback connections in the visual cortex. Once the location of a stimulus has been recovered and distracters inhibited, one or more subsequent feed-forward passes through the system can create a detailed representation of the selected stimulus.
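The two stages described above can be illustrated with a toy sketch (not the actual Selective Tuning implementation): a max-pooling pyramid stands in for the feed-forward hierarchy, so the top-level activation supports fast detection without location, and a hierarchical winner-take-all trace back down the feedback connections recovers the location of the stimulus that drove the decision. The pyramid depth and 2x2 pooling are illustrative assumptions.

```python
import numpy as np

def feedforward(image, levels=3):
    """Feed-forward pass: repeated 2x2 max-pooling builds a pyramid whose
    top-level activation supports detection but carries no location."""
    pyramid = [image]
    for _ in range(levels):
        a = pyramid[-1]
        h, w = a.shape
        pyramid.append(a.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3)))
    return pyramid

def feedback_localize(pyramid):
    """Feedback pass: trace the winning unit down level by level
    (a hierarchical winner-take-all), recovering the stimulus location
    from the early layers where spatial detail still exists."""
    top = pyramid[-1]
    y, x = np.unravel_index(top.argmax(), top.shape)
    for a in reversed(pyramid[:-1]):
        # restrict attention to the winner's receptive field one level down
        patch = a[2 * y:2 * y + 2, 2 * x:2 * x + 2]
        dy, dx = np.unravel_index(patch.argmax(), patch.shape)
        y, x = 2 * y + dy, 2 * x + dx
    return y, x
```

In this sketch, detection amounts to thresholding the single top-level value, while localization requires the extra downward traversal, mirroring the claim that location must be recovered after the initial categorical decision.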
Here we present a computational demonstration of how attention forms the glue between the sparse, fast, and parallel initial representation that supports object detection and the slow, serial, and detailed representations needed for full recognition. The Selective Tuning (ST) model of (object-based) visual attention can be used to recover the spatial location and extent of the visual information that contributed to a categorical decision. This allows the selective, detailed processing of that information at the expense of other stimuli present in the image. The feedback and selective processing together create the detailed population code corresponding to the attended stimulus. We suggest and demonstrate a possible binding mechanism by which this is accomplished in the context of ST, and show how this solution can account for existing experimental results.
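The final step, in which feedback-guided selection enables a detailed second pass, can be caricatured as follows. This is a minimal sketch under stated assumptions (a square attention window and a crude energy-plus-centroid descriptor, both invented here for illustration), not the binding mechanism the paper demonstrates: input outside the attended region is inhibited, and the representation is recomputed so that it describes only the selected stimulus.

```python
import numpy as np

def attend_and_reprocess(image, center, radius=1):
    """Toy second feed-forward pass: inhibit everything outside the
    attended region, then recompute a 'detailed' description (here just
    total energy and centroid) from the surviving input alone."""
    yy, xx = np.mgrid[:image.shape[0], :image.shape[1]]
    mask = (np.abs(yy - center[0]) <= radius) & (np.abs(xx - center[1]) <= radius)
    gated = np.where(mask, image, 0.0)  # distracters suppressed to zero
    total = gated.sum()
    cy = (yy * gated).sum() / total     # centroid of the attended stimulus
    cx = (xx * gated).sum() / total
    return gated, (cy, cx)
```

The point of the sketch is that the same feed-forward machinery, fed a gated input, now yields a representation uncontaminated by distracters, which is the role attention plays in linking detection to recognition.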