Much of the success of single-agent deep reinforcement learning (DRL) in recent years can be attributed to the use of experience replay memories (ERM), which allow Deep Q-Networks (DQNs) to be trained efficiently through sampling stored state transitions. However, care is required when using ERMs for multi-agent deep reinforcement learning (MA-DRL), as stored transitions can become outdated when agents update their policies in parallel. In this work we apply leniency to MA-DRL. Lenient agents map state-action pairs to decaying temperature values that control the amount of leniency applied towards negative policy updates that are sampled from the ERM. This introduces optimism in the value-function update, and has been shown to facilitate cooperation in tabular fully-cooperative multi-agent reinforcement learning problems. We evaluate our Lenient-DQN (LDQN) empirically against the related Hysteretic-DQN (HDQN) algorithm, as well as a modified version we call scheduled-HDQN, which uses average reward learning near terminal states. Evaluations take place in extended variations of the Coordinated Multi-Agent Object Transportation Problem (CMOTP). We find that LDQN agents are more likely to converge to the optimal policy in a stochastic reward CMOTP compared to standard and scheduled-HDQN agents.
Recommended citation: Palmer, G., Tuyls, K., Bloembergen, D., & Savani, R. (2018). Lenient Multi-Agent Deep Reinforcement Learning. In AAMAS (pp. 443-451). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, USA / ACM.
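To illustrate the leniency mechanism the abstract describes, here is a minimal tabular sketch, not the paper's LDQN implementation: per-pair temperatures decay with visits, and negative updates are forgiven with probability given by the current leniency. The class name, hyperparameter values, and the exact leniency mapping used here are illustrative assumptions.

```python
import math
import random

# Minimal tabular sketch of lenient Q-learning. Hyperparameter values
# (K, beta, init_temp) are illustrative assumptions, not taken from
# the paper's experiments.
class LenientAgent:
    def __init__(self, n_actions, alpha=0.1, gamma=0.95,
                 K=2.0, beta=0.99, init_temp=1.0):
        self.n_actions = n_actions
        self.q = {}        # Q-values keyed by (state, action)
        self.temp = {}     # decaying temperature per (state, action)
        self.alpha, self.gamma = alpha, gamma
        self.K = K                 # maps temperature to leniency
        self.beta = beta           # temperature decay factor per visit
        self.init_temp = init_temp

    def update(self, s, a, r, s_next, done):
        q_sa = self.q.get((s, a), 0.0)
        best_next = max(self.q.get((s_next, b), 0.0)
                        for b in range(self.n_actions))
        target = r if done else r + self.gamma * best_next
        delta = target - q_sa
        t = self.temp.get((s, a), self.init_temp)
        leniency = 1.0 - math.exp(-self.K * t)  # high temperature -> lenient
        # Positive updates are always applied; negative updates are
        # forgiven (skipped) with probability equal to the leniency,
        # giving the optimistic value-function update the abstract mentions.
        if delta > 0 or random.random() > leniency:
            self.q[(s, a)] = q_sa + self.alpha * delta
        self.temp[(s, a)] = t * self.beta       # leniency decays over time
```

In the deep setting described in the abstract, the same forgiveness test would be applied to TD errors of transitions sampled from the ERM, with leniency values stored alongside each transition so that older, more outdated samples are treated more leniently.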