Reward Estimation of Risk Sensitive Agents via Gradient Based Inverse Reinforcement Learning
Thesis posted on 01.05.2020, 00:00 by Federico Sandrelli
In a standard Reinforcement Learning problem, an agent learns how to act in an environment in order to maximize some function of the return, called the Objective Function. While this is adequate for most use cases, there are scenarios in which it is useful to include some notion of risk in the agent's objective. For example, a car learning how to drive itself should not head full speed towards a wall just because it hasn't tried that yet. This issue has led to more conservative approaches to exploration, known as Risk-Averse Reinforcement Learning. The literature offers many examples of risk measures used to train risk-sensitive agents, such as the Conditional Value at Risk, the Sharpe Ratio and Mean-Variance. In other scenarios, instead, the agent learns through observation rather than through experience. Given demonstrations from an expert, the agent learns an objective function from which it can derive an effective behavior. The core of our thesis is to merge two fundamental concepts within Reinforcement Learning: Risk-Aversion and the Inverse optimization problem. The idea is to assume a Risk-Averse expert who is minimizing some notion of Risk, and to retrieve the expert's behavior by deriving both the reward function and the expert's degree of Risk-Aversion. In this work, we assume an expert that maximizes the Mean-Variance Objective Function. In the second chapter of this document, the reader will find a comprehensive overview of the Reinforcement Learning framework, starting from the more basic concepts and gradually building up to the main focus of the work, which is Risk-Averse Inverse Reinforcement Learning. This is followed by the literature on this topic, focusing in particular on the papers that have inspired this project. Finally, the novel contribution of the research is presented, along with the specification of the algorithm, the experiments and the empirical results.
The last chapter is left for our final remarks and thoughts about future improvements and implementations of the work.
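As an illustration of the Mean-Variance Objective Function mentioned above, the following minimal sketch scores a set of sampled returns as their mean minus a penalty on their variance. It assumes the common form J = E[G] - lambda * Var[G]; the exact definition used in the thesis is not stated in this abstract, and the names `discounted_return`, `mean_variance_objective` and `risk_aversion` are hypothetical.

```python
from statistics import fmean, pvariance


def discounted_return(rewards, gamma=0.99):
    """Discounted sum of one trajectory's rewards: sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def mean_variance_objective(returns, risk_aversion):
    """Assumed form: mean of the returns minus risk_aversion times their variance.

    `returns` holds one return per sampled trajectory. A larger
    `risk_aversion` penalizes spread more heavily; 0 recovers the
    risk-neutral expected return.
    """
    return fmean(returns) - risk_aversion * pvariance(returns)


# Two made-up policies: one always earns 1, the other usually 0 and rarely 5.
safe_returns = [1, 1, 1, 1]
risky_returns = [0, 0, 0, 5]

for lam in (0.0, 0.5):
    print(
        lam,
        mean_variance_objective(safe_returns, lam),
        mean_variance_objective(risky_returns, lam),
    )
```

With `risk_aversion = 0` the risky policy scores higher because its mean return is larger; with `risk_aversion = 0.5` the variance penalty reverses the ranking. In the inverse problem described above, both the reward function and this risk-aversion coefficient would be unknowns to recover from the expert's demonstrations.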