Off-policy Learning in Large-scale POMDP-based Dialogue Systems

Lucie Daubigney 1,2, Matthieu Geist 2, Olivier Pietquin 2
1 MAIA - Autonomous Intelligent Machines
Inria Nancy - Grand Est, LORIA - AIS - Department of Complex Systems, Artificial Intelligence & Robotics
2 IMS - Information, Multimodality and Signal Team
UMI 2958 - Georgia Tech - CNRS [Metz], Supélec - Metz Campus
Abstract: Reinforcement learning (RL) is now part of the state of the art in spoken dialogue system (SDS) optimisation. The best-performing RL methods, such as those based on Gaussian processes, require testing small changes to the policy in order to assess whether they are improvements or degradations. This process is called on-policy learning. However, it can result in system behaviours that are unacceptable to users. Ideally, a learning algorithm should infer an optimal strategy by observing interactions generated by a non-optimal but acceptable strategy, that is, it should learn off-policy. Such methods usually fail to scale up and are thus not suited to real-world systems. In this contribution, a sample-efficient, online and off-policy RL algorithm is proposed to learn an optimal policy. The algorithm is combined with a compact non-linear value function representation (namely a multilayer perceptron), enabling it to handle large-scale systems.
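To make the abstract's idea concrete, below is a minimal sketch of off-policy value learning with an MLP value function representation. It does not reproduce the paper's actual algorithm or its dialogue POMDP: the 5-state chain task, the network sizes, the learning rate and the use of plain Q-learning are all illustrative assumptions. A uniform-random behaviour policy (standing in for a "non-optimal but acceptable strategy") generates the transitions, and the greedy target policy is estimated from them, i.e. learning is off-policy.

import numpy as np

N_STATES, N_ACTIONS, HIDDEN = 5, 2, 16   # toy sizes, not the paper's setup
GAMMA, LR = 0.9, 0.01

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (HIDDEN, N_STATES))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_ACTIONS, HIDDEN))
b2 = np.zeros(N_ACTIONS)

def one_hot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def q_values(s):
    """One-hidden-layer MLP: Q(s, .) = W2 tanh(W1 x_s + b1) + b2."""
    h = np.tanh(W1 @ one_hot(s) + b1)
    return W2 @ h + b2, h

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; reaching the
    rightmost state pays reward 1 and restarts at state 0."""
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    if s2 == N_STATES - 1:
        return 0, 1.0
    return s2, 0.0

s = 0
for _ in range(50_000):
    a = int(rng.integers(N_ACTIONS))   # behaviour policy: uniform random
    s2, r = step(s, a)
    q, h = q_values(s)
    q2, _ = q_values(s2)
    target = r + GAMMA * np.max(q2)    # max over actions -> off-policy target
    td_err = target - q[a]
    # Semi-gradient descent on the squared TD error, backpropagated
    # through the MLP (dh is computed before W2 is updated).
    dq = np.zeros(N_ACTIONS)
    dq[a] = td_err
    dh = (W2.T @ dq) * (1.0 - h ** 2)
    W2 += LR * np.outer(dq, h)
    b2 += LR * dq
    W1 += LR * np.outer(dh, one_hot(s))
    b1 += LR * dh
    s = s2

greedy = [int(np.argmax(q_values(s)[0])) for s in range(N_STATES)]
print("greedy policy:", greedy)  # typically all 1s (always move right)

Because the update bootstraps on the greedy value max Q(s', .) rather than on the action the behaviour policy actually took, the learnt policy can differ from, and improve upon, the policy that generated the data; this is the property the abstract refers to as off-policy learning.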

https://hal-supelec.archives-ouvertes.fr/hal-00684819
Contributor: Sébastien van Luchene
Submitted on: Tuesday, June 5, 2012 - 8:36:05 AM
Last modification on: Wednesday, July 31, 2019 - 4:18:03 PM
Long-term archiving on: Thursday, September 6, 2012 - 2:20:30 AM

File

Supelec763.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00684819, version 1

Citation

Lucie Daubigney, Matthieu Geist, Olivier Pietquin. Off-policy Learning in Large-scale POMDP-based Dialogue Systems. ICASSP 2012, Mar 2012, Kyoto, Japan. pp. 4989-4992. ⟨hal-00684819⟩

Metrics

Record views: 605
File downloads: 218