
Publication

Way off-policy batch deep reinforcement learning of implicit human preferences in dialog

Dec. 1, 2019

Jaques, N., Ghandeharioun, A., Shen, J., Ferguson, C., Jones, N., Lapedriza, A., Gu, S., Picard, R. (2019). Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog. NeurIPS Workshop on Conversational AI.

Abstract

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive and models must be tested offline before being deployed to interact with the environment – e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation – a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.
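The two ideas named in the abstract – penalizing KL divergence from a pre-trained prior, and using dropout-based uncertainty to lower bound target Q-values – can be illustrated with a minimal sketch. The function names, the additive form of the KL penalty, and the use of a minimum over Monte Carlo dropout samples are illustrative assumptions for exposition, not the paper's exact implementation:

```python
import numpy as np

def kl_control_reward(task_reward, log_p_prior, log_p_policy, kl_weight=0.1):
    """Shaped reward with a KL-control penalty (illustrative form).

    Subtracts a per-step estimate of the policy's divergence from the
    pre-trained prior: r' = r - c * (log pi(a|s) - log p_prior(a|s)).
    Actions the prior considers unlikely are penalized, keeping the
    RL policy close to the language model it started from.
    """
    return task_reward - kl_weight * (log_p_policy - log_p_prior)

def lower_bound_target_q(q_samples):
    """Uncertainty-aware target Q-value (illustrative form).

    q_samples: array of shape (num_dropout_samples, num_actions),
    obtained by running the target network with dropout active.
    Taking the minimum across samples gives a pessimistic lower bound
    on the target, in place of a second network as in Double Q-Learning.
    """
    return q_samples.min(axis=0)

# Example: one transition with two candidate next actions.
r_shaped = kl_control_reward(1.0, log_p_prior=-2.0, log_p_policy=-1.0)
q_mc = np.array([[1.0, 2.0],    # dropout sample 1
                 [0.5, 3.0]])   # dropout sample 2
q_target = r_shaped + 0.99 * lower_bound_target_q(q_mc).max()
```

Under these assumptions, the policy is rewarded for the task while being pulled back toward the prior, and the bootstrap target is computed from the pessimistic lower bound rather than a single (possibly overestimated) Q-value.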
