The visual recognition of goal-directed movements is crucial for the learning of actions, and possibly for the understanding of the intentions and goals of others. The discovery of mirror neurons has stimulated a vast amount of research investigating possible links between action perception and action execution [1,2]. However, it remains largely unknown to what extent this putative visuo-motor interaction actually contributes to the visual perception of actions, and which relevant computational functions are instead accomplished by purely visual processing.
We present a neurophysiologically inspired model for the visual recognition of hand movements. It demonstrates that several experimentally established properties of mirror neurons can be explained by the analysis of spatio-temporal visual features within a hierarchical neural system that reproduces fundamental properties of the visual pathway and premotor cortex. The model integrates several physiologically plausible computational mechanisms within a common architecture that is suitable for the recognition of grasping actions from real videos: (1) A hierarchical neural architecture that extracts 2D form features of successively increasing complexity and position invariance along the hierarchy [3-5]. (2) Extraction of optimized features at different hierarchy levels by eliminating features that do not contribute to correct classification. (3) Simple recurrent neural circuits that realize temporal sequence selectivity [6-8]. (4) A simple neural mechanism that combines spatial information about the goal object and its affordance with information about the end effector and its movement. The model is validated with video sequences of both monkey and human grasping actions.
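To make mechanism (1) concrete, the following is a minimal sketch of a feed-forward form hierarchy: orientation-selective Gabor-like filters followed by local maximum pooling, which yields increasing position invariance along the hierarchy. All filter parameters, pooling sizes, array shapes, and function names are illustrative assumptions for this sketch and are not taken from the published model.

```python
# Minimal sketch of a feed-forward form hierarchy (mechanism 1):
# Gabor-like orientation filters followed by local maximum pooling.
# Filter sizes, orientations, and pooling radii are illustrative choices.
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter


def gabor_kernel(size=11, wavelength=5.0, theta=0.0, sigma=3.0):
    """Odd-symmetric Gabor filter selective for edges of orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.sin(2 * np.pi * xr / wavelength)


def simple_layer(frame, orientations=8):
    """S-like layer: local orientation energy for a bank of Gabor filters."""
    responses = []
    for k in range(orientations):
        kern = gabor_kernel(theta=k * np.pi / orientations)
        responses.append(np.abs(convolve2d(frame, kern, mode="same")))
    return np.stack(responses)          # shape (orientations, H, W)


def complex_layer(responses, pool=7):
    """C-like layer: maximum pooling over space gives position invariance."""
    return np.stack([maximum_filter(r, size=pool)[::pool, ::pool] for r in responses])


# Example: one gray-level frame (random data standing in for a grasping video frame).
frame = np.random.rand(64, 64)
s1 = simple_layer(frame)                # orientation-selective "simple" responses
c1 = complex_layer(s1)                  # position-invariant, downsampled responses
print(s1.shape, c1.shape)
```

In the same spirit, the feature selection of mechanism (2) would correspond to keeping only those pooled responses that are diagnostic for discriminating the trained action classes.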
We show that simple, well-established, physiologically plausible mechanisms can account for important aspects of visual action recognition and for experimental data on the mirror neuron system. Notably, the model does not rely on explicit 3D representations of objects or of the action; instead, it realizes predictions over time based on learned 2D pattern sequences arising in the visual input. Our results complement those of existing models and motivate a more detailed analysis of the complementary contributions of visual pattern analysis and motor representations to the visual recognition of imitable actions.
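The prediction over time mentioned above can be illustrated with the sequence-selectivity principle of mechanism (3): neurons encoding successive 2D snapshots of a learned movement are coupled by asymmetric lateral connections, so a snapshot neuron is facilitated only if its predecessor was active shortly before. The discrete-time sketch below is a deliberate simplification under these assumptions (the model itself uses continuous recurrent neural dynamics); all parameter values are illustrative.

```python
# Schematic sketch of temporal sequence selectivity (mechanism 3):
# asymmetric coupling from snapshot i to i+1 facilitates a response only
# when the learned 2D pattern sequence arrives in the trained order.
import numpy as np

n = 10                                   # number of learned 2D snapshots
decay = 0.6                              # per-frame decay of the activity trace
gain = 1.5                               # facilitation by the predecessor's trace


def run(order):
    """Summed response of the snapshot chain when frames arrive in a given order."""
    trace = np.zeros(n)                  # decaying memory of each neuron's response
    total = 0.0
    for idx in order:
        trace *= decay                   # traces decay from one frame to the next
        pre = trace[idx - 1] if idx > 0 else 0.0
        response = 1.0 + gain * pre      # feedforward drive, boosted by the predecessor
        total += response
        trace[idx] = response            # refresh this neuron's trace
    return total


forward = run(range(n))                  # learned temporal order
reverse = run(reversed(range(n)))        # time-reversed sequence
print(f"forward: {forward:.1f}  reverse: {reverse:.1f}")  # forward response is much larger
```

Because the coupling is asymmetric, playing the sequence backwards never provides pre-activation at the right moment, which is the sense in which the circuit predicts the next 2D pattern in time.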
Supported by DFG, the Volkswagenstiftung, and Hermann und Lilly Schilling Foundation.