IQ-flow: Mechanism design for inducing cooperative behavior to self-interested agents in sequential social dilemmas

Publisher

Graduate School

Abstract

Achieving and maintaining cooperation between agents in order to accomplish a common objective is one of the central goals of Multi-Agent Reinforcement Learning (MARL). Although many methods in the literature promise high performance, they are mainly concerned with obtaining that performance in the same agent setup as the one used for training. In real-world scenarios, however, the environment is open-ended, so that any number of agents can enter it. Furthermore, in many real-world scenarios, separately trained and specialized agents are deployed into a shared environment, or the environment requires multiple objectives to be achieved by different coexisting parties. These variations in specialties and objectives are likely to cause mixed motives that eventually result in a social dilemma in which all parties are at a loss. Nevertheless, when there is a single specialty and the objectives do not create a mixed-motive problem, we can approach the situation as a transfer and generalization problem in cooperative MARL with decentralized execution. Thus, we first examine scenarios with a single objective and determine whether an external mechanism is necessary to promote cooperation in them. We then turn our focus to cases with an underlying social dilemma in the environment, studying and proposing incentivization-based methods to promote cooperation under sequential social dilemmas. Centralization and decentralization are two approaches used for cooperation in MARL. While fully decentralized methods are prone to converging to suboptimal solutions due to partial observability and nonstationarity, methods involving centralization suffer from scalability limitations and the lazy-agent problem. The centralized training with decentralized execution (CTDE) paradigm brings out the best of both approaches; however, centralized training still has an upper limit of scalability, not only in coordination performance but also in model size and training time. Since we want to study the situation where any number of agents with a single cooperative objective can be deployed into a shared environment, we adopt the CTDE paradigm for our first study and investigate the generalization and transfer capacity of the trained models across a variable number of agents. This capacity is assessed by training with a variable number of agents on a specific MARL problem and then performing greedy evaluations with a variable number of agents for each training configuration. We thus analyze the evaluation performance for every combination of training and evaluation agent counts. We perform experimental evaluations on the predator-prey and traffic junction environments and demonstrate that it is possible to obtain similar or higher evaluation performance by training with fewer agents. We deduce that the optimal number of agents for training may differ from the target number of agents and argue that transfer across a large number of agents can be a more efficient way to scale up than directly increasing the number of agents during training. We therefore conclude that deploying trained agents into an open-ended environment does not constitute a problem or necessitate an external incentivizing mechanism when there is a single objective and all agents use the same policy.
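As a rough illustration of the cross-evaluation protocol described above, the following minimal Python sketch (not the thesis code; train_policy and evaluate_greedy are placeholder stand-ins for the actual CTDE training and greedy-evaluation routines) shows how a shared policy trained with one agent count can be evaluated with different agent counts:

```python
import numpy as np

def train_policy(env_name: str, n_agents: int):
    """Placeholder for the CTDE training routine; returns a dummy shared-policy handle."""
    return {"env": env_name, "trained_with": n_agents}

def evaluate_greedy(policy, env_name: str, n_agents: int) -> float:
    """Placeholder for greedy evaluation; returns a dummy mean episode return."""
    return float(np.random.default_rng(n_agents).normal())

train_counts = [3, 5, 10]       # agent counts used during training (illustrative)
eval_counts = [3, 5, 10, 20]    # agent counts used during greedy evaluation

results = np.zeros((len(train_counts), len(eval_counts)))
for i, n_train in enumerate(train_counts):
    policy = train_policy("predator_prey", n_agents=n_train)
    for j, n_eval in enumerate(eval_counts):
        # Deploy the same shared policy with a different agent count at test time.
        results[i, j] = evaluate_greedy(policy, "predator_prey", n_agents=n_eval)

print(results)   # row i, column j: trained with train_counts[i], evaluated with eval_counts[j]
```

Each row of the resulting matrix shows how a model trained with a given number of agents transfers to evaluation setups with other agent counts, which is the quantity analyzed in the first study.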
Turning the focus to the deployment of separately trained and specialized agents into a shared environment necessitates the study of Sequential Social Dilemmas (SSDs), since agents with different specializations are prone to have mixed motives. Sequential social dilemmas have been gaining attention in recent years. Current approaches either focus on engineering incentive functions that modify rewards to reach general welfare, or on developing learning-based approaches that modify the reward function by accounting for the impact of the incentives on policy updates. One of the most significant works in the learning-based direction is LIO, which enables independent self-interested agents to incentivize each other through an additive incentive reward. LIO assumes that agents continually learn and adapt according to the changing incentives they give each other, and it has demonstrated success in several sequential social dilemma environments. We investigate LIO's performance under a variety of setups in the public goods game Cleanup in order to analyze its robustness to the necessity of including inductive bias in the incentive function and to randomness in the initial agent positions (optionally with asymmetric incentive potential), and to assess its stability under frozen incentive functions after the agents' exploration is reset. We observe and demonstrate empirically that LIO is indeed sensitive to these settings and that it does not reliably produce incentives that keep the system stable once the incentive function is static. We conclude with research directions that would improve the robustness of the method and of incentive-learning research. Finally, we study using a single incentivizing mechanism instead of giving every agent the ability to incentivize the others. We aim to preclude the suboptimal consequences of agents with mixed motives by using a central mechanism that learns its incentives adaptively while the agents in question learn their policies. Thus, we propose the Incentive Q-Flow (IQ-Flow) algorithm, which modifies the system's reward setup with an incentive regulator agent so that the cooperative policy also corresponds to the self-interested policy for the agents. Unlike existing methods that learn to incentivize self-interested agents or adaptive mechanisms, IQ-Flow does not make any assumptions about the agents' policies or learning algorithms, which enables the framework to generalize to a wider array of applications. IQ-Flow performs offline evaluation of the optimality of the learned policies using the data provided by the other agents in order to determine cooperative and self-interested policies. Next, IQ-Flow uses meta-gradient learning to estimate how the policy evaluation changes with respect to the given incentives and modifies the incentives so that the greedy policies for the cooperative and the self-interested objectives yield the same actions. We present the operational characteristics of IQ-Flow in Iterated Matrix Games. We demonstrate that IQ-Flow outperforms the state-of-the-art incentive design algorithm in the Escape Room and Cleanup environments. We further demonstrate that a pretrained IQ-Flow mechanism significantly outperforms the shared-reward setup in the Cleanup environment.
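To make the alignment idea concrete, the following toy Python sketch (a simplification, not the actual IQ-Flow implementation, which relies on offline policy evaluation and meta-gradient learning) learns additive incentives in a one-shot prisoner's dilemma so that each agent's greedy action under its incentivized self-interested values matches the greedy action under the cooperative, summed-reward values:

```python
import torch

# Payoff matrices for a one-shot prisoner's dilemma; payoff[i][a0][a1] is agent i's
# reward when agent 0 plays a0 and agent 1 plays a1 (0 = defect, 1 = cooperate).
payoff = torch.tensor([[[1., 3.],
                        [0., 2.]],
                       [[1., 0.],
                        [3., 2.]]])

# Self-interested value of each agent's own actions, averaging over the opponent.
q_self = torch.stack([payoff[0].mean(dim=1), payoff[1].mean(dim=0)])        # (2, 2)
# Cooperative value of each agent's own actions under the summed reward.
q_coop = torch.stack([(payoff[0] + payoff[1]).mean(dim=1),
                      (payoff[0] + payoff[1]).mean(dim=0)])                  # (2, 2)
target = q_coop.argmax(dim=1)          # cooperative greedy action per agent

incentives = torch.zeros(2, 2, requires_grad=True)    # additive incentive per agent/action
opt = torch.optim.Adam([incentives], lr=0.1)

for _ in range(500):
    q_incentivized = q_self + incentives
    # Push each agent's incentivized greedy action toward the cooperative one,
    # while keeping the incentives small.
    loss = torch.nn.functional.cross_entropy(q_incentivized, target) \
           + 1e-2 * incentives.pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print((q_self + incentives).argmax(dim=1), target)    # both agents now pick "cooperate"
```

After training, both agents' incentivized greedy actions coincide with the cooperative choice, which mirrors the condition the IQ-Flow incentive regulator enforces in the sequential setting.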

Description

Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2022

Subjects

IQ-flow, Machine learning, Multiagent systems
