Around Inverse Reinforcement Learning and Score-based Classification

Matthieu Geist (1), Edouard Klein (2), Bilal Piot (2), Yann Guermeur (3), Olivier Pietquin (2)
(2) IMS - Equipe Information, Multimodalité et Signal, UMI2958 - Georgia Tech - CNRS [Metz], SUPELEC-Campus Metz
(3) ABC - Machine Learning and Computational Biology, LORIA - ALGO - Department of Algorithms, Computation, Image and Geometry
Abstract: Inverse reinforcement learning (IRL) aims at estimating an unknown reward function optimized by some expert agent, from interactions between this expert and the system to be controlled. One of its major application fields is imitation learning, where the goal is to imitate the expert, possibly in situations not encountered before. A classic and simple way to handle this problem is to cast it as a classification problem, mapping states to actions. The potential issue with this approach is that classification does not naturally take into account the temporal structure of sequential decision making. Yet, many classification algorithms work by learning a score function, mapping state-action couples to values, such that the value of the action chosen by the expert is higher than that of the other actions. The decision rule of the classifier maximizes the score over actions for a given state. This is curiously reminiscent of the state-action value function in reinforcement learning, and of the associated greedy policy. Based on this simple observation, we propose two IRL algorithms that incorporate the structure of the sequential decision-making problem into a classifier, in different ways. The first one, SCIRL (Structured Classification for IRL), builds on the fact that linearly parameterizing the reward function by some features imposes a linear parametrization of the Q-function by the so-called feature expectation. SCIRL simply uses (an estimate of) the expert feature expectation as the basis functions of the score function. The second algorithm, CSI (Cascaded Supervised IRL), applies a reversed Bellman equation (expressing the reward as a function of the Q-function) to the score function output by any score-based classifier, which reduces reward recovery to a simple (and generic) regression step. Both algorithms come with theoretical guarantees and perform competitively on toy problems.
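As a quick sketch of the two identities the abstract alludes to (in standard discounted-MDP notation that is ours, not necessarily the paper's: features φ, parameters θ, discount γ, feature expectation μ^π):

\[
R_\theta(s,a) = \theta^\top \phi(s,a)
\quad\Longrightarrow\quad
Q^\pi_\theta(s,a) = \theta^\top \mu^\pi(s,a),
\qquad
\mu^\pi(s,a) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^t \phi(s_t,a_t) \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Big],
\]

which follows from linearity of expectation; this is the structure SCIRL exploits when it uses an estimate of the expert feature expectation as the basis functions of a linear score function. The reversed Bellman equation used by CSI writes the reward in terms of the Q-function,

\[
R(s,a) = Q^\pi(s,a) - \gamma \, \mathbb{E}_{s' \sim P(\cdot\,|\,s,a)}\big[ Q^\pi(s', \pi(s')) \big],
\]

so that substituting the classifier's score function for Q^\pi and fitting the right-hand side reduces to a regression step.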
Document type: Conference papers

https://hal-supelec.archives-ouvertes.fr/hal-00916936
Contributor: Sébastien van Luchene
Submitted on: Wednesday, December 11, 2013 - 8:44:38 AM
Last modification on: Wednesday, July 31, 2019 - 4:18:03 PM

Identifiers

  • HAL Id: hal-00916936, version 1

Citation

Matthieu Geist, Edouard Klein, Bilal Piot, Yann Guermeur, Olivier Pietquin. Around Inverse Reinforcement Learning and Score-based Classification. 1st Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2013), Oct 2013, Princeton, New Jersey, United States. ⟨hal-00916936⟩
