Bayes-Optimal, Robust, and Distributionally Robust Policies for Uncertain MDPs
We explore the performance of different policy-making strategies within the framework of uncertain Markov decision processes (uMDPs). In an uMDP, the agent interacts with an MDP sampled from the uncertainty set according to a distribution referred to as the true prior distribution. We analyze the performance of Bayes-optimal, robust, and distributionally robust policies. Bayes-optimal policies aim to maximize expected rewards under a given subjective prior distribution over the uncertainty set. Robust policies are designed to perform well in the worst case across all MDPs in the uncertainty set. Distributionally robust policies combine elements of both, optimizing worst-case expected performance over a neighborhood of a given subjective prior distribution. We test these policies in scenarios where the subjective prior is either an estimate of the true prior or chosen manually, such as a uniform or Dirichlet distribution. Our findings indicate that no subjective prior can guarantee that the corresponding Bayes-optimal policy consistently outperforms robust policies. Thus, manually choosing a subjective prior, such as the uniform distribution commonly used in many applications, may result in poor performance for Bayes-optimal policies compared to robust policies. Furthermore, we show that, for any estimated subjective prior with non-zero estimation error, there exist uMDPs on which robust policies still outperform Bayes-optimal policies. Additionally, we find that distributionally robust policies, which integrate the advantages of both robust and Bayes-optimal approaches, typically yield superior performance, particularly when the subjective prior closely approximates the true prior distribution. These theoretical findings are empirically validated through experiments conducted in a simulated highway driving environment.
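As a minimal sketch of the three objectives, with illustrative notation not taken from the paper: writing $\mathcal{M}$ for the uncertainty set, $V^{\pi}_{M}$ for the value of policy $\pi$ in MDP $M \in \mathcal{M}$, and $\hat{P}$ for the subjective prior, the Bayes-optimal, robust, and distributionally robust policies can be viewed as solving, respectively,
$$\max_{\pi}\; \mathbb{E}_{M \sim \hat{P}}\big[V^{\pi}_{M}\big], \qquad \max_{\pi}\; \min_{M \in \mathcal{M}} V^{\pi}_{M}, \qquad \max_{\pi}\; \min_{P \in \mathcal{B}_{\varepsilon}(\hat{P})} \mathbb{E}_{M \sim P}\big[V^{\pi}_{M}\big],$$
where $\mathcal{B}_{\varepsilon}(\hat{P})$ denotes an assumed ambiguity set (neighborhood of radius $\varepsilon$) around the subjective prior $\hat{P}$; the true prior corresponds to the distribution from which the environment's MDP is actually drawn.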