Bayesian Model of Communication for Partially Observable Multi-agent Environment
Thesis
Posted on 2024-08-01. Authored by Sarit Adhikari.
We study the problem of communication and interaction among theory of mind (ToM) agents acting in a partially observable stochastic environment under the principled framework of Communicative Interactive Partially Observable Markov Decision Processes (CIPOMDPs). Under the CIPOMDP formalism, agents higher in the ToM hierarchy can exchange messages. However, the agent at the bottom of the cognitive hierarchy cannot inherently model other agents and therefore cannot participate in the exchange of messages. To address this, we initially assume a fixed discrete message distribution for agents at the bottom of the cognitive hierarchy, which can be folded into the observation function. Leveraging this assumption, we employ an offline point-based value iteration approach to solving CIPOMDPs. We show results on several variants of the multi-agent tiger game and establish the usefulness of CIPOMDPs, particularly in scenarios where the agents' preferences do not align. Unsurprisingly, it may be optimal for agents to attempt to mislead others when their preferences differ. However, a higher depth of reasoning turns out to allow an agent to detect insincere communication and guard against it. Specifically, in some scenarios, the agent can distinguish a truthful friend from a deceptive foe when the received message contradicts the agent's observations, even when the message does not directly reveal the opponent's type.

Further, we propose a Bayesian learning methodology for the literal listener: the agent models and learns the initially unknown message-generation mechanism by recording counts of the messages received in each state while interacting with the environment. We first propose a parameterization of the message distribution, then a Bayesian update procedure that is integrated with the usual POMDP belief update and can be approximated using a particle filtering approach.
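The count-based learning idea can be sketched as follows. This is an illustrative toy, not the implementation from the thesis: the class and variable names (`LiteralListenerBelief`, `T`, `O`) are hypothetical, and the message model is assumed to be a per-state Dirichlet-multinomial whose posterior-mean likelihood multiplies into a standard discrete POMDP belief update.

```python
import numpy as np

class LiteralListenerBelief:
    """Toy literal listener: a POMDP belief plus per-state Dirichlet
    pseudo-counts over received messages (all sizes illustrative)."""

    def __init__(self, n_states, n_messages, alpha=1.0):
        self.belief = np.full(n_states, 1.0 / n_states)   # uniform state prior
        self.counts = np.full((n_states, n_messages), alpha)  # Dirichlet counts

    def message_likelihood(self, msg):
        # Posterior-predictive P(msg | state): mean of each state's
        # Dirichlet posterior over messages.
        return self.counts[:, msg] / self.counts.sum(axis=1)

    def update(self, T, O, action, obs, msg):
        # T[a][s, s'] = P(s' | s, a); O[a][s', o] = P(o | s', a).
        # Joint update: physical observation (usual POMDP belief update)
        # combined with the learned message likelihood, followed by a
        # soft count update weighted by the posterior state probabilities.
        predicted = T[action].T @ self.belief             # sum_s T(s'|s,a) b(s)
        likelihood = O[action][:, obs] * self.message_likelihood(msg)
        posterior = predicted * likelihood
        posterior /= posterior.sum()
        self.counts[:, msg] += posterior                  # Bayesian count update
        self.belief = posterior
        return self.belief
```

In a sampled-state (particle) representation, the same count update would be applied per particle instead of against the full posterior vector.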
Then, we derive the error introduced by assuming a finite approximate model and discuss the challenges of applying value iteration in a bounded but large state space. We empirically study the properties of the lit-POMDP and, using a d-step lookahead approach, examine the behavior of a lit-POMDP and of a CIPOMDP modeling a lit-POMDP. Specifically, we consider an instance of a deceptive persuasion problem in which the literal listener is susceptible to a bait-and-switch strategy by a smarter, omniscient agent. Finally, we discuss a Monte Carlo tree search-based approach to CIPOMDPs, which provides scalability to longer time horizons but suffers from high variance; as a result, optimality is compromised across nesting levels.
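As a minimal illustration of the d-step lookahead idea, the sketch below does an exact expectimax over actions and observations on the belief MDP of a generic discrete POMDP. It is not the CIPOMDP solver from the thesis; the tensor conventions (`T[a][s, s']`, `O[a][s', o]`, `R[a][s]`) and the function name are assumptions for this toy.

```python
import numpy as np

def lookahead_value(belief, d, T, O, R, gamma=0.95):
    """d-step lookahead value of a belief via exact expectimax.
    Enumerates every action/observation branch, so it is only
    feasible for small d and small models (toy sketch)."""
    if d == 0:
        return 0.0
    best = -np.inf
    n_actions, n_obs = T.shape[0], O.shape[2]
    for a in range(n_actions):
        value = belief @ R[a]                    # expected immediate reward
        predicted = T[a].T @ belief              # sum_s T(s'|s,a) b(s)
        for o in range(n_obs):
            p_o = predicted @ O[a][:, o]         # P(o | b, a)
            if p_o < 1e-12:
                continue                         # unreachable branch
            b_next = predicted * O[a][:, o] / p_o
            value += gamma * p_o * lookahead_value(b_next, d - 1, T, O, R, gamma)
        best = max(best, value)
    return best
```

The recursion branches over all action-observation pairs, so its cost grows as (|A||O|)^d; this exponential blow-up in the horizon is what motivates the Monte Carlo tree search alternative, at the cost of the sampling variance noted above.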