In a visual scene, we perceive similar visual objects as being grouped together. We propose a high-level, mechanistic model of perceptual grouping in which sufficiently similar stimuli are dynamically clustered, forming a visual category. The model assumes that each stimulus can be characterized by a continuous feature, such as orientation, which is encoded in the neuronal activity of a recurrent network. The connectivity of the network is sufficiently structured to allow for the generation of feature-selective stable states, or bumps, while maintaining the stability of a non-specific, low-activity state associated with spontaneous activity. The emergence of bumps corresponds to the formation of stimulus categories and is triggered by external inputs, idealized as temporal sequences of feature values. We suppose that each feature value in the sequence is sub-threshold: it does not trigger a bump on its own, but it elicits a response that, after the stimulus is removed, decays slowly relative to the interval between stimulus presentations. When the sequence of stimuli is presented, the network temporally sums the responses triggered separately by each item. Only if the input stream contains an adequate number of similar feature values will successive hits occur within a critical time window in one or more localized regions of feature space, giving rise to bump activity states. The network therefore encodes the distribution of feature values of the temporal input stream and, by extension, of the visual scene.
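The temporal-summation idea can be illustrated with a minimal sketch. This is not the recurrent network model itself: it replaces the recurrent dynamics with a leaky trace over feature space and a latching threshold that stands in for bump formation. The function names, the Gaussian response profile, and all parameter values (`width`, `amplitude`, `decay`, `threshold`) are assumptions chosen for illustration only.

```python
import numpy as np


def circular_distance(a, b, period=180.0):
    """Distance between orientations on a ring of the given period (degrees)."""
    d = np.abs(a - b) % period
    return np.minimum(d, period - d)


def present_sequence(features, n_units=180, width=10.0, amplitude=0.4,
                     decay=0.9, threshold=1.0):
    """Leaky temporal summation of sub-threshold responses (illustrative only).

    features  : sequence of orientations in degrees, one per presentation
    amplitude : peak response to a single item (assumed below threshold)
    decay     : fraction of the trace surviving one inter-stimulus interval
    threshold : level above which a unit is marked as part of a bump
    """
    preferred = np.arange(n_units) * (180.0 / n_units)  # preferred orientations
    trace = np.zeros(n_units)              # slowly decaying summed response
    bump = np.zeros(n_units, dtype=bool)   # latched "bump" units

    for f in features:
        trace *= decay  # slow decay between presentations
        # Each item adds a localized, sub-threshold response around its feature.
        trace += amplitude * np.exp(-0.5 * (circular_distance(preferred, f) / width) ** 2)
        # Latch: once the summed response crosses threshold, the unit stays on.
        bump |= trace > threshold

    return preferred, trace, bump


# Similar orientations arrive in close succession and sum above threshold.
_, _, bump_clustered = present_sequence([44.0, 46.0, 45.0, 47.0, 43.0])
# Dissimilar orientations never accumulate enough in any one region.
_, _, bump_scattered = present_sequence([10.0, 50.0, 90.0, 130.0, 170.0])

print("clustered sequence forms a bump:", bool(bump_clustered.any()))
print("scattered sequence forms a bump:", bool(bump_scattered.any()))
```

In this sketch, `decay` close to 1 plays the role of a response that outlasts the inter-stimulus interval, and the latch is a crude substitute for the bistability that the structured connectivity is assumed to provide in the full model.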