Posted on 2017-11-01. Authored by Andrea Tirinzoni.
Most work on inverse reinforcement learning, the problem of recovering the unknown reward function being optimized by a decision-making agent, has focused on cases where optimal
demonstrations are provided under a single dynamics. We analyze the more general setting in which the learner has access to sub-optimal demonstrations collected under several different dynamics.
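As a rough formalization of this setting (the notation here is illustrative, not taken verbatim from the thesis), the learner observes demonstrations generated under K Markov decision processes that share states, actions, and reward but differ in their transition dynamics,

\[
\mathcal{M}_k = (\mathcal{S}, \mathcal{A}, P_k, r, \gamma), \qquad k = 1, \dots, K,
\]

and the goal is to recover the common reward function r, even though the demonstrated behavior need not be optimal in any single \(\mathcal{M}_k\).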
We argue that several problems, such as learning under covariate shift or risk aversion, can be modeled in this way.
We propose an adversarial formulation where the learner tries to imitate a constrained, worst-case estimate of the demonstrator’s control policy. We adopt the method of Lagrange multipliers to remove the constraints and produce a convex optimization problem.
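A minimal sketch of one such constrained adversarial formulation, in the spirit of feature-matching IRL (the exact objective used in the thesis may differ), is

\[
\min_{\hat{\pi}} \; \max_{\pi} \; \mathrm{loss}(\hat{\pi}, \pi)
\quad \text{s.t.} \quad
\sum_{k=1}^{K} \mathbb{E}_{\pi, P_k}\!\Big[ \sum_{t} \gamma^{t} \phi(s_t, a_t) \Big]
= \sum_{k=1}^{K} \tilde{\mu}_k,
\]

where \(\phi\) is a feature map and \(\tilde{\mu}_k\) are the empirical feature expectations of the demonstrations collected under dynamics \(P_k\). Introducing a Lagrange multiplier for the matching constraint moves it into the objective, leaving an unconstrained problem that is convex in the multiplier.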
We prove that the constraints imposed by the multiple dynamics lead to an NP-hard optimization subproblem: the computation of a deterministic policy maximizing the total expected reward across several different Markov decision processes. We propose a tractable approximation by reducing the latter to the optimal control of a partially observable Markov decision process.
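Under the assumption that the K processes share states, actions, and reward (again with illustrative notation), this hard subproblem reads

\[
\max_{\pi \in \Pi_{\mathrm{det}}} \; \sum_{k=1}^{K} \mathbb{E}_{\pi, P_k}\!\Big[ \sum_{t} \gamma^{t} r(s_t, a_t) \Big],
\]

i.e., a single deterministic policy must perform well simultaneously under every dynamics \(P_k\). One way to view the reduction is to treat the index k as an unobserved component of the state, so that the problem becomes the optimal control of a partially observable Markov decision process, to which approximate POMDP solvers can be applied.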
We evaluate the performance of our algorithm on two synthetic problems. In the first, we try to recover the reward function of a randomly generated Markov decision process, while in the second we try to rationalize the goal-directed behavior of a robot navigating through a grid.