Periodic agent-state based Q-learning for POMDPs : GERAD

July 4, 2024, 10:00 to 11:00

Amit Sinha – Université McGill, Canada

Hybrid seminar at Université McGill or on Zoom.

The traditional approach to POMDPs is to convert them into fully observed MDPs by treating the belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting, where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. A unified treatment of these approaches involves considering the "agent state", which is a model-free, recursively updatable function of the observation history. Examples of agent states include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a deterministic stationary policy. Since the agent state is not an information state, results that hold for MDPs do not carry over directly, so we must first consider how the different policy classes compare: stationary versus non-stationary and deterministic versus stochastic. Our main thesis, which we illustrate via examples, is that because the agent state is not an information state, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), a variant of agent-state based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
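
To make the idea of a periodic agent-state based policy concrete, here is a minimal tabular sketch in Python. It is not taken from the talk. The abstract does not state the update rule, so the code assumes one plausible form: keep one Q-table per phase ℓ = t mod L and bootstrap each update from the table of the next phase. The toy environment, the function names (`pasql`, `toy_reset`, `toy_step`, `greedy_average_reward`), and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical toy POMDP (not from the talk): the hidden state alternates
# 0, 1, 0, 1, ... regardless of the action, the observation is always 0,
# and the reward is 1 when the action equals the hidden state.
def toy_reset():
    return 0, 0  # (hidden state, observation)

def toy_step(state, action):
    reward = 1.0 if action == state else 0.0
    return 1 - state, 0, reward  # (next hidden state, observation, reward)

def pasql(period, n_agent_states=1, n_actions=2, n_steps=20000,
          gamma=0.9, alpha=0.1, epsilon=0.2, seed=0):
    """Sketch of periodic agent-state based Q-learning (assumed update rule).

    One Q-table is kept per phase; the table for phase t % period is
    updated using the greedy value of the table for phase (t + 1) % period.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((period, n_agent_states, n_actions))
    state, z = toy_reset()  # agent state z = most recent observation
    for t in range(n_steps):
        phase, next_phase = t % period, (t + 1) % period
        # Epsilon-greedy exploration on the current phase's table.
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[phase, z]))
        state, obs, r = toy_step(state, a)
        z_next = obs  # model-free, recursive agent-state update
        target = r + gamma * np.max(Q[next_phase, z_next])
        Q[phase, z, a] += alpha * (target - Q[phase, z, a])
        z = z_next
    return Q

def greedy_average_reward(Q, horizon=100):
    """Average reward of the greedy periodic policy read off from Q."""
    period = Q.shape[0]
    state, z = toy_reset()
    total = 0.0
    for t in range(horizon):
        a = int(np.argmax(Q[t % period, z]))
        state, z, r = toy_step(state, a)
        total += r
    return total / horizon

Q_periodic = pasql(period=2)    # policy may depend on t mod 2
Q_stationary = pasql(period=1)  # ordinary agent-state Q-learning
print("period 2:", greedy_average_reward(Q_periodic))
print("period 1:", greedy_average_reward(Q_stationary))
```

In this toy problem the agent state is constant, so a stationary deterministic policy must repeat a single action and can match the hidden state only every other step, whereas a period-2 policy can match it at every step. Setting `period=1` reduces the sketch to ordinary agent-state based Q-learning, which makes the two policy classes easy to compare.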