Sample Efficient On-line Learning of Optimal Dialogue Policies with Kalman Temporal Differences

Abstract: Designing dialogue policies for voice-enabled interfaces is a tailoring job that is most often left to natural language processing experts. This work generally has to be redone for every new dialogue task because cross-domain transfer is not possible. For this reason, machine learning methods for dialogue policy optimization have been investigated over the last 15 years. In particular, reinforcement learning (RL) is now part of the state of the art in this domain. Standard RL methods require testing more or less random changes to the policy on users in order to assess them as improvements or degradations; this is called on-policy learning. However, it can result in system behaviors that are not acceptable to users. Learning algorithms should ideally infer an optimal strategy by observing interactions generated by a non-optimal but acceptable strategy, that is, learn off-policy. In this contribution, a sample-efficient, online and off-policy reinforcement learning algorithm is proposed to learn an optimal policy from a few hundred dialogues generated with a very simple handcrafted policy.
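For readers unfamiliar with Kalman Temporal Differences (KTD), the sketch below illustrates the core idea in its simplest, linear policy-evaluation form: the value-function weights are treated as the hidden state of a Kalman filter, and each observed transition provides a noisy measurement of a temporal difference. This is only a minimal illustration under stated assumptions, not the paper's algorithm: the contribution relies on the KTD-Q variant, which handles the max operator of Q-learning through an unscented transform. The class name and parameter values (LinearKTD, prior_var, process_noise, obs_noise) are illustrative assumptions, not names from the paper.

```python
import numpy as np


class LinearKTD:
    """Simplified Kalman Temporal Differences with linear value approximation.

    The weight vector theta is the hidden state of a Kalman filter; each
    transition reward is observed through the model
        r ~ (phi(s) - gamma * phi(s'))^T theta + noise.
    The full KTD-Q algorithm used in the paper additionally applies an
    unscented transform to cope with the nonlinearity of the max operator.
    """

    def __init__(self, n_features, gamma=0.95,
                 prior_var=10.0, process_noise=1e-3, obs_noise=1.0):
        self.gamma = gamma
        self.theta = np.zeros(n_features)           # weight estimate (filter mean)
        self.P = prior_var * np.eye(n_features)     # weight uncertainty (filter covariance)
        self.process_noise = process_noise          # random-walk noise on the weights
        self.obs_noise = obs_noise                  # TD observation noise

    def update(self, phi, phi_next, reward, done=False):
        """One Kalman update from a single transition (phi -> phi_next, reward)."""
        # Prediction step: weights follow a random walk.
        P = self.P + self.process_noise * np.eye(len(self.theta))

        # Linear observation model of the temporal difference.
        h = phi - (0.0 if done else self.gamma) * phi_next
        innovation = reward - h @ self.theta        # TD error under current estimate
        s = h @ P @ h + self.obs_noise              # innovation variance
        k = P @ h / s                               # Kalman gain

        # Correction step.
        self.theta = self.theta + k * innovation
        self.P = P - np.outer(k, h @ P)
        return innovation

    def value(self, phi):
        return phi @ self.theta


if __name__ == "__main__":
    # Toy usage: two binary features, a single observed transition.
    ktd = LinearKTD(n_features=2, gamma=0.9)
    ktd.update(phi=np.array([1.0, 0.0]), phi_next=np.array([0.0, 1.0]), reward=1.0)
    print(ktd.value(np.array([1.0, 0.0])))
```

Because the update is a full second-order (covariance-tracking) step rather than a stochastic-gradient step, each transition is used much more effectively, which is what makes this family of methods attractive when only a few hundred dialogues are available.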
Document type: Conference papers

Cited literature: 20 references

https://hal-supelec.archives-ouvertes.fr/hal-00618252
Contributor: Sébastien van Luchene
Submitted on: Thursday, September 1, 2011 - 11:30:43 AM
Last modification on: Wednesday, July 31, 2019 - 4:18:02 PM
Long-term archiving on: Sunday, December 4, 2016 - 5:17:24 PM

File

IJCAI_2011_OPMGSC.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00618252, version 1

Citation

Olivier Pietquin, Matthieu Geist, Senthilkumar Chandramohan. Sample Efficient On-line Learning of Optimal Dialogue Policies with Kalman Temporal Differences. IJCAI 2011, Jul 2011, Barcelona, Spain. pp.1878-1883. ⟨hal-00618252⟩
