MDP Transformation for Risk Averse Reinforcement Learning via State Space Augmentation
Thesis, posted on 2020-05-01, 00:00, authored by Davide Santambrogio
While the standard Reinforcement Learning objective is to minimize (maximize) the expected cumulative cost (reward) over a finite or infinite horizon, the Risk-Averse RL approach optimizes not only this common objective function but also minimizes what is known as Risk. There are several ways to express the Risk of the random cost (reward) received from the environment; the main ones are utility functions and proper risk measures. However, minimizing the Risk is not as simple as the standard optimization problem, and in some cases it becomes NP-hard, affecting the optimality of the solution found. The approach historically used in the literature to solve this problem is to design new ad hoc algorithms capable of optimizing the new objective function instead of the cumulative cost alone. This, of course, makes each algorithm non-general and specific to one kind of Risk measure or Utility function. Moreover, there are more standard algorithms than specific ones, and they are usually better studied and optimized because they are more frequently used. This is why some researchers proposed an alternative approach based on transforming the Markov Decision Process itself. Transforming the MDP means modifying the formalization of the optimization problem in terms of state space, transition kernel, or reward function so that Risk Aversion is incorporated into the structure of the problem. In this way, standard Reinforcement Learning algorithms can be applied as if they were optimizing the usual objective function, while in fact they also take the Risk into account. The drawback of this method is the cost of the transformation itself in terms of computational power (and therefore time).
Furthermore, the optimization itself may become more expensive after the Markov Decision Process has been transformed. In both cases, the approach is not worthwhile, and it would be better to adopt an ad hoc algorithm following the basic strategy for Risk-Averse optimization. This thesis deals with these problems in Risk-Averse Reinforcement Learning. The idea is to modify the Markov Decision Process via a state-space augmentation that provides partial information about the history induced by the current policy, and then to find the optimal solution of the modified problem through a standard Reinforcement Learning algorithm. After the background needed to understand the research work, we describe some of the relevant papers on Risk Aversion and MDP transformation that shaped our view of the state of the art in this field. We then explain the transformation we decided to adopt and, finally, how we applied a standard RL algorithm to find the optimal Risk-Sensitive policy. We conclude by discussing the practical results and some possible future work in this field.
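As a minimal illustration of this style of transformation (not the thesis's actual construction, whose details are given in later chapters), the sketch below augments the state with the time step and the reward accumulated so far, and encodes risk aversion through an exponential utility applied once at the end of the episode. The toy MDP, the utility choice, and all parameter values are illustrative assumptions; because the augmented state carries the running return, plain tabular Q-learning can optimize the risk-sensitive criterion unchanged.

```python
import math
import random
from collections import defaultdict

def utility(total_reward, beta=0.5):
    """Concave exponential utility: an assumed, common risk-averse criterion."""
    return -math.exp(-beta * total_reward)

def step(action, rng):
    """Toy dynamics: action 0 is safe (+1 always), action 1 is risky (+3 or -3)."""
    if action == 0:
        return 1.0
    return 3.0 if rng.random() < 0.5 else -3.0

def q_learning_augmented(episodes=20000, alpha=0.1, eps=0.1, horizon=2, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # keys: ((t, cumulative_reward), action)
    for _ in range(episodes):
        cum = 0.0
        for t in range(horizon):
            s = (t, cum)  # augmented state: time step plus accumulated reward
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda x: Q[(s, x)])
            cum += step(a, rng)
            if t == horizon - 1:
                # Risk aversion enters only here: utility of the whole return.
                target = utility(cum)
            else:
                s2 = (t + 1, cum)
                target = max(Q[(s2, 0)], Q[(s2, 1)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```

Under this concave utility, the all-safe trajectory has expected utility -e^{-1} ≈ -0.37, while opening with the risky action yields about -1.43, so the greedy policy learned at the initial augmented state (0, 0.0) prefers the safe action even though the risky one has higher expected reward.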