Imitation Learning Under Suboptimal Demonstrations
Thesis posted on 2022-05-01, authored by Mohammad Ali Bashiri
The purpose of imitation learning (IL) is to efficiently learn a desired behavior by imitating an expert’s behavior. The behaviors of interest are usually complex and goal-oriented, and in practice the goals or rewards for such behaviors are difficult for a human to specify. Experts, however, can provide demonstrations of the desired task even without knowing the underlying mathematical model of the learning problem. The main challenge in imitation learning is the scarcity of high-quality demonstrations: demonstrations are often noisy or suboptimal, especially when humans are involved in the data-collection process or when collecting high-quality data is expensive. In this thesis, we study the problem of imitation learning under noisy demonstrations, or when demonstrations of varying quality are available. We first study distributionally robust imitation learning (DRoIL), an adversarial approach to imitation learning that is naturally designed to perform robustly against noisy demonstrations. We establish a close connection between DRoIL and Maximum Entropy Inverse Reinforcement Learning, a well-studied imitation learning method, and show that DRoIL can be seen as a framework that maximizes a generalized notion of entropy. We develop a novel approach that transforms the objective function into a convex optimization problem over a polynomial number of variables for a certain class of loss functions. We also study the problem of imitation learning when demonstrations of varying quality are available, assuming additional information on demonstration quality, such as rankings or pairwise preferences. For this setting, we develop Multiple Ranked Distributionally Robust Imitation Learning (MRDRoIL), a novel IL method that directly incorporates ranked demonstrations by employing inverse reinforcement learning techniques.
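The adversarial formulation described above can be sketched as a minimax game. The notation below is schematic and is not taken from the thesis: the learner commits to a policy estimate, an adversary chooses the worst-case policy consistent with the demonstrated feature statistics, and the learner minimizes the resulting loss.

```latex
% Schematic minimax objective (illustrative notation, not the thesis's exact
% formulation): \hat{P} is the learner's estimated policy distribution,
% \check{P} the adversary's, \phi a feature function over state-action pairs,
% and \tilde{P} the empirical distribution of the demonstrations.
\min_{\hat{P}} \; \max_{\check{P} \in \Xi} \;
  \mathbb{E}_{\check{P}}\!\left[\,\mathrm{loss}\big(\hat{P}, \check{P}\big)\right],
\qquad
\Xi = \Big\{\, \check{P} \;:\;
  \mathbb{E}_{\check{P}}\big[\phi(s,a)\big]
  = \mathbb{E}_{\tilde{P}}\big[\phi(s,a)\big] \,\Big\}.
```

Under this reading, the stated connection to Maximum Entropy Inverse Reinforcement Learning is plausible for a logarithmic choice of loss, where the worst-case inner problem reduces to an entropy-maximization over feature-matching distributions; the exact class of losses for which the convex reformulation holds is specified in the thesis itself.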
In our method, we robustly learn a higher-quality reward function by minimizing a given loss with respect to the worst-case estimated policy that matches the features of the demonstrated data while preserving its rankings. We provide two efficient optimization algorithms for solving the resulting problem. In our experiments, we show the significant benefits of DRoIL’s new optimization method on synthetic data and on a highway driving environment. We also compare MRDRoIL with other preference-based and ranking-based imitation learning methods and show that it performs competitively against them.
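The rank-preservation requirement mentioned above can be illustrated, again in hypothetical notation rather than the thesis's own, as ordering constraints on the learned reward: if the demonstration groups are ordered from highest to lowest quality, the reward assigned to higher-ranked groups should dominate in expectation.

```latex
% Schematic rank-preserving constraints (illustrative assumption, not the
% thesis's formulation): demonstration groups D_1 \succ D_2 \succ \cdots are
% ordered from best to worst, \tau denotes a demonstrated trajectory, and
% r_\theta is the learned reward function.
\mathbb{E}_{\tau \sim D_i}\big[\, r_\theta(\tau) \,\big]
\;\ge\;
\mathbb{E}_{\tau \sim D_j}\big[\, r_\theta(\tau) \,\big]
\qquad \text{for all } i < j .
```

Coupling constraints of this form with the worst-case (adversarial) policy estimation is one natural way to read the abstract's description of learning a reward that both matches demonstrated features and preserves the given rankings.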
Degree Grantor: University of Illinois at Chicago
Degree Name: PhD, Doctor of Philosophy
Committee Members: Zhang, Xinhua; Kash, Ian; Reyzin, Lev; Ratliff, Nathan
Submitted Date: May 2022