Hierarchical reinforcement learning in complex wargame environments
Files
Date
2024-01-23
Authors
Kömürcü, Kubilay Kağan
Journal Title
Journal ISSN
Volume Title
Publisher
Graduate School
Abstract
In recent times, Reinforcement Learning (RL) agents have achieved remarkable success in tackling difficult games, sometimes outperforming human players. This suggests that RL methods are well-suited for wargames, which are characterized by long decision-making periods, infrequent rewards, and extensive sets of possible actions. However, wargames are highly complex, and even with RL, convergence to a near-optimum solution requires an immense amount of experience, making the approach sample-inefficient. To address this inefficiency, we propose dividing the game into simpler sub-games, each focusing on a specific core skill of the overall game. These sub-games have shorter decision horizons and smaller action sets compared to the main game. To guide the learning process, we adopt a curriculum learning approach, employing a hierarchical control structure where the curriculum comprises these simpler sub-games. For our experimentation, we select StarCraft II as the test environment, as it shares common characteristics with wargames and has been extensively used in such scenarios. Through empirical evaluation, we demonstrate that our hierarchical architecture can successfully solve the complex wargame environment based on StarCraft II, whereas a non-hierarchical agent fails to do so. Additionally, we conduct an ablation study to investigate the impact of action frequency on training quality, which we believe to be a crucial factor. Since StarCraft II is a real-time strategy game rather than a turn-based one, actions can be taken at different time intervals; we believe that taking actions too rarely, as well as too frequently, hinders training quality.

The recent achievements of RL methods have garnered widespread attention across various domains. Among the notable applications, wargames stand out, encompassing a diverse array of games, from board games such as chess and Go to strategy games such as StarCraft and MicroRTS. Despite this broad spectrum, wargames share common features that distinguish them from other domains: large action spaces (discrete or hybrid), branching actions, adversarial opponent dynamics, and notably sparse reward functions. Additionally, decisions made by an agent have a delayed impact on the game dynamics, posing challenges for optimization. To tackle these obstacles, several methods have been proposed, including Hierarchical Reinforcement Learning (HRL), forward planning, and curriculum learning. In this study, we primarily employ curriculum learning and a straightforward HRL approach in the real-time strategy game StarCraft II, comparing these techniques with a non-hierarchical agent.

Although RL has exhibited remarkable success across various applications, the significance of selecting what we term "decision frequency" in constructing Markov Decision Processes (MDPs) remains underappreciated in real-world scenarios. This thesis sheds light on the crucial role of decision frequency and its impact on RL training through a thorough analysis of a toy experiment, followed by the application of this knowledge to solve a complex StarCraft II environment. Our findings underscore that finely tuning decision frequency can be pivotal in determining the success or failure of RL training.
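As a concrete illustration of what decision frequency means in practice, the minimal sketch below repeats each chosen action for a fixed number of environment steps, so the agent decides less often than the simulation ticks. The gym-style interface and the repeat parameter are illustrative assumptions, not the exact implementation used in the thesis.

class DecisionFrequencyWrapper:
    """Repeat each selected action for `repeat` underlying environment steps,
    so the agent's MDP runs on a coarser time scale (lower decision frequency)."""

    def __init__(self, env, repeat=8):
        self.env = env          # any gym-style environment (illustrative assumption)
        self.repeat = repeat    # number of game steps per agent decision

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, obs, done, info = 0.0, None, False, {}
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:            # stop repeating once the episode ends
                break
        return obs, total_reward, done, info

Sweeping the repeat value is one simple way to expose the effect discussed above: too small a value makes credit assignment over long horizons harder, while too large a value leaves the agent unable to react to the game.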
To illustrate our insights, we propose an intuitive method for decision frequency tuning, showcasing its effectiveness both in a controlled toy experiment and within the context of StarCraft II minigames, which we refer to as core skills. Utilizing these core skills, we employ a hierarchical approach to train a model for one of the most challenging types of StarCraft II games. Benchmarking results highlight the advantage of our method over similar approaches, achieving competitive scores with approximately 30 days of real-time experience, compared to the approximately 30 years required by comparable methods to reach similar results.

In this study, our initial focus is the dissection of the primary task into sub-tasks, each representing a fundamental skill that can be independently addressed by a non-hierarchical agent and is crucial for mastering the primary task. Our contribution encompasses two key aspects. Firstly, we demonstrate that the proposed hierarchical approach effectively resolves complex challenges presented in wargame environments, challenges that remain insurmountable for a non-hierarchical agent; moreover, optimizing dedicated agents for each individual sub-task and combining their policies within a hierarchical framework yields strong performance scores in the StarCraft II environment, even without optimizing the hierarchical controller. Secondly, we observe that expanding the set of sub-tasks beyond the core-skill set does not yield a substantial improvement in performance.

We introduce a curriculum setting and implement hierarchical control for the sub-policies trained in their respective sub-tasks. The overarching objective of the main task is partitioned into a more manageable yet still comprehensive set of sub-tasks. Each sub-task involves an objective function that is simpler to achieve in terms of the required sample size. Despite uniform game mechanics across all sub-tasks, variations in opponent behaviors lead to differences in the transition distribution and the initial state distribution within the Markov Decision Process (MDP) of each sub-task. Owing to distinct objective functions, each sub-task is associated with a unique reward set and discount factor. To address this complexity, we independently train an agent, with a corresponding policy, for each sub-task, subsequently merging the policies using a hierarchical control structure within the main task.

Our experimentation is conducted within the StarCraft II Learning Environment (SC2LE). StarCraft II, a real-time strategy game, features complex military dynamics, an extensive observation space, and a diverse set of actions, and it has been widely used in RL research due to its inherent challenges. Moreover, StarCraft II is recognized as one of the most realistic environments in terms of military dynamics, offering the flexibility to create custom game scenarios and objectives across a varied selection of maps. In our experiments, we utilize both the environments provided by SC2LE and custom environments built with the game's map editor. We introduce a Core Skill Decomposition algorithm, a form of Hierarchical Reinforcement Learning, which learns individual sub-policies for each sub-task and a manager policy that selects among these sub-policies to create a hierarchical agent.
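A minimal sketch of this kind of hierarchical control is shown below, assuming the sub-policies are already trained and frozen and that both the manager and the sub-policies expose an act(obs) method; the class and parameter names are ours, chosen for illustration rather than taken from the thesis code.

class HierarchicalAgent:
    def __init__(self, manager, sub_policies, hold_steps=10):
        self.manager = manager            # picks an index into sub_policies
        self.sub_policies = sub_policies  # one frozen policy per core skill
        self.hold_steps = hold_steps      # decisions a sub-policy keeps control for
        self._active = None
        self._remaining = 0

    def act(self, obs):
        if self._remaining == 0:
            # The manager chooses which core-skill policy should act next.
            self._active = self.sub_policies[self.manager.act(obs)]
            self._remaining = self.hold_steps
        self._remaining -= 1
        return self._active.act(obs)

Keeping the sub-policies fixed and only switching among them is consistent with the observation above that good scores are reached even without optimizing the hierarchical controller.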
Our approach involves decomposing a complex wargame environment into sub-tasks that can be solved by a non-hierarchical A2C algorithm, while retaining the core aspects of the environment. We evaluate our method in three challenging StarCraft II environments and demonstrate that, when these environments are decomposed into sub-tasks, our hierarchical architecture successfully solves the environment, whereas a non-hierarchical agent fails to do so. Additionally, we observe that expanding the core set of skills results in only a marginal increase in performance. This thesis also addresses the often underestimated yet pivotal factor of decision frequency in the construction of Markov Decision Processes (MDPs) in Reinforcement Learning. Our investigation shows that careful adjustment of decision frequency has a substantial effect on the efficacy of RL training, with potential implications across a diverse array of applications.
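As a hedged sketch of the overall recipe, each core-skill sub-task can be trained independently with a standard A2C implementation and the resulting frozen policies handed to a hierarchical controller such as the one sketched earlier. Here we assume gymnasium-compatible sub-task environments and the stable-baselines3 A2C implementation; the thesis works with SC2LE and its own agents, and the environment id below is a placeholder.

import gymnasium as gym
from stable_baselines3 import A2C

def train_subtask_policy(env_id, total_timesteps=1_000_000):
    # Train a non-hierarchical A2C agent on a single core-skill sub-task.
    env = gym.make(env_id)
    model = A2C("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    return model  # frozen afterwards and reused as a sub-policy

# Example usage with a placeholder sub-task id:
# policy = train_subtask_policy("SubTaskEnv-v0")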
Description
Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2024
Keywords
Machine learning,
Reinforcement learning