Multiagent Stochastic Planning with Bayesian Policy Recognition
thesisposted on 01.07.2016, 00:00 by Alessandro Panella
We consider an autonomous agent facing a stochastic, partially observable, multiagent environment. In order to compute an optimal plan, the agent must accurately predict the actions of the other agents, since they influence the state of the environment and ultimately the agent's utility. To do so, we propose a special case of interactive partially observable Markov decision process (I-POMDP), in which the agent does not explicitly model the other agents' beliefs and intentions, and instead models the other agents as stochastic processes implemented by probabilistic deterministic finite state controllers (PDFCs). The agent maintains a probability distribution over the PDFC models of the other agents, and updates this distribution using Bayesian inference. Since the number of nodes of these PDFCs is unknown and unbounded, the agent places a Bayesian nonparametric prior distribution over the infinitely dimensional set of PDFCs. This allows the size of the learned models to adapt to the complexity of the observed behavior. Deriving the posterior distribution is in this case too complex to be amenable to analytical computation; therefore, we provide a Markov chain Monte Carlo (MCMC) algorithm that approximates the posterior beliefs over the other agents PDFCs, given a sequence of (possibly imperfect) observations about their behavior. Experimental results show how the learned models converge behaviorally to the true ones. Moreover, we describe how the learned PDFCs can be embedded in the learning agent's own decision making process. We consider two settings, one in which the agent first learns, then interacts with other agents, and one in which learning and planning are interleaved. We show how the agent's performance increases as a result of learning in both situations. Moreover, we analyze the dynamics that ensue when two agents are simultaneously learning about each other while interacting, showing in an example environment that coordination emerges naturally from our approach. Moreover, we demonstrate how an agent can exploit the learned models to complement its possibly noisy observations about the environment.