Posted on 2023-12-01 00:00, authored by Parshan Pakiman
We revisit the well-established approximate linear programming approach to Markov decision processes (MDPs). This model-based reinforcement learning (RL) algorithm has strong theoretical properties and has been applied successfully to many Operations Management and Operations Research applications. However, ensuring that this approach delivers near-optimal control policies on new applications and problem instances poses known challenges. These challenges include (i) designing suitable approximation architectures and (ii) formulating approximate linear programs (ALPs) and fine-tuning their parameters. Specifically, designing approximation architectures that deliver near-optimal MDP bias/value function approximations (B/VFAs) involves tedious trial and error and requires exploiting problem structure. In addition, formulating an ALP whose B/VFA is guaranteed to translate into high-quality control policies requires improving previously suggested ALP formulations and designing innovative parameter-tuning approaches. While prior research has proposed solutions to these challenges for specific applications and problem instances, designing an application-agnostic ALP method that is computationally effective and carries theoretical guarantees remains a significant open question.
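For context, the generic discounted-cost ALP underlying this line of work (not the specific formulations developed in this thesis) replaces the MDP value function with a weighted combination of basis functions, $V(s) \approx \sum_j \beta_j \phi_j(s)$, and optimizes the weights $\beta$ via a linear program; the symbols below ($\nu$ for state-relevance weights, $c$ for the stage cost, $\gamma$ for the discount factor, $P$ for the transition kernel) are introduced here only for illustration:

\[
\max_{\beta}\ \sum_{s}\nu(s)\sum_{j}\beta_j\phi_j(s)
\quad\text{subject to}\quad
\sum_{j}\beta_j\phi_j(s)\;\le\; c(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\sum_{j}\beta_j\phi_j(s')\ \ \text{for all } (s,a).
\]

Challenge (i) concerns choosing the basis functions $\phi_j$; challenge (ii) concerns how this linear program is formulated and tuned (e.g., the choice of $\nu$ and of the constraints actually enforced) so that the resulting B/VFA yields a near-optimal policy.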
This doctoral thesis research presents novel ALP methodologies for discounted-cost and average-cost MDPs that mitigate the aforementioned practical challenges. Our methods leverage random basis functions, commonly used in Machine Learning, and extend them to the ALP framework. Random basis functions allow us to replace the hand-engineering of B/VFAs with computationally low-cost sampling of basis-function parameters from known distributions. Our methods also generate multiple randomized approximations of the MDP bias/value function instead of constructing a single deterministic B/VFA, as is predominantly done in the existing literature. Accordingly, our randomized multiple-shot approximation approaches iteratively solve a sequence of ALP models. We connect two consecutive ALP models in this sequence using "guiding" constraints that use the B/VFA obtained in the previous iteration to guide the computation of the B/VFA in the current iteration. We show that these self-guiding mechanisms lead to near-optimal B/VFAs and control policies by iteratively refining their own ALP formulations and parameters. We establish several theoretical properties of our methods, including probabilistic convergence rates and policy performance bounds, that are new to the ALP and RL literature. We also apply our application-agnostic algorithms to challenging inventory control and options pricing problems, where they deliver excellent control policies and performance bounds and improve upon or compete with existing problem-specific benchmarks. More broadly, our research takes a meaningful step toward easy-to-implement model-based RL methods that are guaranteed to compute near-optimal policies and performance bounds in both discounted-cost and average-cost MDP settings.
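To make the preceding description concrete, below is a minimal, self-contained sketch (in Python, using scipy's LP solver) of one way a self-guided, sampled ALP iteration with random basis functions could look. Everything here is an illustrative assumption rather than the exact formulation, sampling distribution, or guarantees developed in the thesis: the toy two-dimensional MDP, the cosine random basis, the box bounds on the weights, and the form of the guiding constraints (pointwise lower bounds from the previous iteration's B/VFA at sampled states).

```python
# Illustrative sketch only: a sampled, self-guided discounted-cost ALP with
# random (Fourier-type cosine) basis functions on a toy 2-D MDP.
# All modeling choices below are assumptions made for this sketch.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# --- toy problem data (assumed for illustration) ---
dim_state, gamma = 2, 0.95
n_states, n_actions, n_next = 200, 5, 20
states = rng.uniform(0.0, 1.0, size=(n_states, dim_state))   # sampled states
actions = np.linspace(-1.0, 1.0, n_actions)                   # sampled actions

def cost(s, a):
    """Illustrative convex stage cost."""
    return float(np.sum((s - 0.5) ** 2) + 0.1 * a ** 2)

def sample_next(s, a):
    """Illustrative stochastic transition: contraction toward zero plus noise."""
    drift = 0.9 * s + 0.02 * a
    return np.clip(drift + 0.05 * rng.standard_normal((n_next, dim_state)), 0.0, 1.0)

# Pre-sample next states once so every ALP iteration enforces the same constraints.
next_samples = {(i, j): sample_next(states[i], actions[j])
                for i in range(n_states) for j in range(n_actions)}

def features(S, thetas, shifts):
    """Random cosine basis functions evaluated at the rows of S."""
    return np.cos(np.atleast_2d(S) @ thetas.T + shifts)       # (n_points, n_basis)

def solve_alp(thetas, shifts, v_guide):
    """One sampled ALP; v_guide holds the previous iteration's B/VFA at the sampled states."""
    phi = lambda S: features(S, thetas, shifts)
    nu = np.full(n_states, 1.0 / n_states)                    # state-relevance weights
    c_obj = -(phi(states).T @ nu)                             # linprog minimizes, so negate
    A_ub, b_ub = [], []
    for i, s in enumerate(states):
        for j, a in enumerate(actions):
            e_phi_next = phi(next_samples[i, j]).mean(axis=0) # Monte Carlo estimate of E[phi(s')]
            A_ub.append(phi(s)[0] - gamma * e_phi_next)       # (Phi b)(s) - gamma E[(Phi b)(s')] <= c(s,a)
            b_ub.append(cost(s, a))
    A_ub.append(-phi(states))                                 # guiding constraints:
    b_ub.append(-v_guide)                                     # (Phi b)(s) >= previous B/VFA at sampled states
    res = linprog(c_obj,
                  A_ub=np.vstack([np.atleast_2d(r) for r in A_ub]),
                  b_ub=np.hstack([np.atleast_1d(r) for r in b_ub]),
                  bounds=[(-50.0, 50.0)] * thetas.shape[0],   # box on beta guards against unboundedness
                  method="highs")
    assert res.success, res.message
    return res.x

# --- self-guided iterations: grow the random basis and guide each ALP with the last B/VFA ---
thetas, shifts = np.empty((0, dim_state)), np.empty(0)
v_guide = np.full(n_states, -1e6)                             # vacuous guide in the first iteration
for it in range(3):
    thetas = np.vstack([thetas, rng.normal(size=(30, dim_state))])  # append new random basis functions
    shifts = np.hstack([shifts, rng.uniform(0.0, 2 * np.pi, 30)])
    beta = solve_alp(thetas, shifts, v_guide)
    v_guide = features(states, thetas, shifts) @ beta               # guiding values for the next iteration
    print(f"iteration {it}: mean B/VFA over sampled states = {v_guide.mean():.4f}")
```

The design point this sketch is meant to illustrate is that each iteration reuses the previous iteration's approximation only through the guiding constraints and a grown random basis, so no hand-engineered basis functions or per-instance redesign of the approximation architecture is required.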