Jayakumar Subramanian – McGill University, Canada
We present reinforcement learning for large-population multi-agent systems with exchangeable agents, such as mean-field systems. A multi-agent system is said to have exchangeable agents if permuting the indices of the agents does not affect the dynamics and rewards. Such systems are also called systems with symmetric or homogeneous agents. A key feature of exchangeable multi-agent systems is that the agents' dynamics and rewards are coupled only through the mean-field (i.e., the empirical distribution of the agents' states). The planning solution for such systems has been considered in the literature, both for the case where the agents are strategic (mean-field games) and for the case where they are cooperative (mean-field teams). In this work, we consider reinforcement learning with two different classes of solution concepts.
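To make the notion of exchangeability concrete, the following minimal Python sketch computes the empirical mean-field of a small population and checks that it does not change when the agents are relabeled. The two-element state space (0 = healthy, 1 = infected) and the function name `empirical_mean_field` are assumptions chosen for illustration; they are not taken from the work described here.

```python
from collections import Counter
import random


def empirical_mean_field(states, state_space):
    """Return the fraction of agents occupying each state in state_space."""
    counts = Counter(states)
    n = len(states)
    return [counts[s] / n for s in state_space]


# Hypothetical binary state space: 0 = healthy, 1 = infected.
state_space = [0, 1]
states = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

mf = empirical_mean_field(states, state_space)

# Permute the agent indices; the mean-field only counts agents per state,
# so it is unaffected by the relabeling.
shuffled = states[:]
random.Random(0).shuffle(shuffled)
mf_shuffled = empirical_mean_field(shuffled, state_space)

print(mf, mf_shuffled)
```

Because the dynamics and rewards depend on the other agents only through this vector, any quantity computed from it is automatically invariant to the ordering of the agents.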
In the first case, we present offline policy-gradient-based reinforcement learning algorithms that converge to a local stationary mean-field equilibrium (in the case of games) or a local stationary mean-field team-optimal solution (in the case of teams). The algorithms are demonstrated using a stylized model of malware spread in networks. The example also illustrates that the team and game solutions may, in general, differ.
In the second case, we develop reinforcement learning (RL) algorithms for mean-field teams (MFTs). We consider MFTs with a mean-field sharing information structure, i.e., each agent knows its local state and the empirical mean-field at each time step. We use a dynamic programming (DP) decomposition for MFTs from the literature, based on a hierarchical decomposition called the common information approach. In this approach, a centralized coordination rule yields prescriptions, which each agent then follows using its local information. We develop an RL approach for MFTs under the assumption of parametrized prescriptions. The novelty of this work, compared to the approach in the literature that uses an equivalent MDP/POMDP formulation, is that by using parametrized functions for prescriptions we are able to consider stochastic prescriptions in addition to stochastic coordination laws. We treat the multi-dimensional continuous prescription parameters as the action space. This enables us to use conventional RL algorithms, both policy-based (such as TRPO, PPO, DDPG) and value-based (such as NAF-DQN), to solve the problem. We illustrate the use of these algorithms through two examples based on stylized models of the demand response problem in smart grids and disease spread in communities.
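The split between a coordination rule and parametrized prescriptions can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: binary local states and actions, a sigmoid parametrization of the prescription, and a fixed linear `coordinator` standing in for whatever policy an RL algorithm such as PPO would learn. It is not the formulation used in the work itself.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prescription(theta, local_state):
    """Stochastic prescription: probability of taking action 1 in local_state.

    theta holds one real-valued parameter per local state (an assumed
    parametrization).
    """
    return sigmoid(theta[local_state])


def coordinator(mean_field, weights):
    """Hypothetical coordination rule mapping the mean-field to theta.

    A fixed linear map is used here; in an RL setting this would be the
    learned policy, and theta would be its continuous action.
    """
    return [sum(w * m for w, m in zip(row, mean_field)) for row in weights]


def step_actions(states, weights, rng):
    n = len(states)
    # Common information: the empirical mean-field over states {0, 1}.
    mean_field = [states.count(s) / n for s in (0, 1)]
    theta = coordinator(mean_field, weights)
    # Each agent applies the same prescription to its own local state.
    actions = [1 if rng.random() < prescription(theta, s) else 0 for s in states]
    return theta, actions


rng = random.Random(0)
states = [0, 1, 1, 0, 1]
weights = [[1.0, -1.0], [-2.0, 2.0]]  # illustrative values only
theta, actions = step_actions(states, weights, rng)
print(theta, actions)
```

The point of the sketch is the shape of the interface: the learner only sees the mean-field and outputs a continuous vector `theta`, which is why standard continuous-action RL algorithms can be applied, while the randomness in each agent's action comes from the prescription itself.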
Everyone is welcome!