Making predictions about future rewards and adapting the behavior accordingly is crucial for any higher organism. One theory specialized for prediction problems is temporal-difference (TD) learning. Experimental findings suggest that TD learning is implemented by the mammalian brain. In particular, the resemblance of dopaminergic activity to the TD error signal  and the modulation of corticostriatal plasticity by dopamine  lend support to this hypothesis. We recently proposed the first spiking neural network model to implement actor-critic TD learning , enabling it to solve a complex task with sparse rewards. However, this model calculates an approximation of the TD error signal in each synapse, rather than utilizing a neuromodulatory system.
Here, we propose a spiking neural network model which dynamically generates a dopamine signal based on the actor-critic architecture proposed by Houk . This signal modulates as a third factor the plasticity of the synapses encoding value function and policy. The proposed model simultaneously accounts for multiple experimental results, such as the generation of a TD-like dopaminergic signal with realistic firing rates in conditioning protocols , and the role of presynaptic activity, postsynaptic activity and dopamine in the plasticity of corticostriatal synapses . The excellent agreement between the predictions of our synaptic plasticity rules and the experimental findings is particularly noteworthy, as the update rules were postulated employing a purely top-down approach.
We performed simulations in NEST  to test the learning behavior of the model in a two dimensional grid-world task with a single rewarded state. The network learns to evaluate the states with respect to its reward proximity and adapt its policy accordingly. The learning speed and equilibrium performance are comparable to those of a discrete time algorithmic TD learning implementation.
The proposed model paves the way for investigations of the role of the dynamics of the dopaminergic system in reward-based learning. For example, we can use lesion studies to analyze the effects of dopamine treatment in Parkinson's patients. Finally, the experimentally constrained model can be used as the centerpiece of closed-loop functional models.
Partially funded by EU Grant 15879 (FACETS), BMBF Grant 01GQ0420 to BCCN Freiburg, Next-Generation Supercomputer Project of MEXT, Japan, and the Helmholtz Alliance on Systems Biology.