LEE - Mechatronics Engineering Graduate Program
Multi-agent planning with automated curriculum learning (Graduate School, 2025-06-11) Akgün, Onur ; Üre, Nazım Kemal ; 518182018 ; Mechatronics Engineering

Reinforcement learning (RL) represents a formidable paradigm for training autonomous agents to master sequential decision-making tasks. Its core principle, learning through trial and error guided by a reward signal, has proven successful in a variety of domains. However, the efficacy of standard RL algorithms diminishes drastically in environments characterized by sparse rewards or complex, high-dimensional state spaces. In these challenging settings, an agent receives meaningful feedback only after executing a long and specific sequence of correct actions. This "credit assignment problem" makes exploration, the process of discovering rewarding behaviors, profoundly inefficient. An agent may wander aimlessly without ever stumbling upon the feedback necessary to learn, preventing standard algorithms from developing effective policies.

To overcome this fundamental limitation, this thesis turns to curriculum learning (CL), a strategy inspired by the principles of human pedagogy. Just as we teach students arithmetic before calculus, CL structures the learning process by initially presenting the agent with simpler tasks and gradually increasing the difficulty as its competence grows. This guided approach helps the agent build foundational skills that can be leveraged to solve more complex problems. The primary bottleneck of traditional CL, however, is its reliance on manual design; creating an effective curriculum requires significant human expertise, intuition, and domain-specific knowledge, making it a process that is both laborious and difficult to generalize.

This thesis addresses this critical gap by proposing a novel framework for the automated and adaptive generation of learning curricula. The central objective was to develop, implement, and rigorously evaluate an algorithmic framework, termed Bayesian Curriculum Generation (BCG), designed to dynamically construct and adapt a curriculum based on an understanding of the task's underlying structure and the agent's real-time progress. The aim is to significantly enhance the performance, stability, and sample efficiency of RL agents, particularly in complex, sparse-reward scenarios where traditional methods falter.

The proposed BCG algorithm is built upon a synergistic integration of several key concepts. At its heart, the framework utilizes Bayesian Networks (BNs), a type of probabilistic graphical model, to represent the structural dependencies among the key parameters that define the tasks within an environment. For instance, in a navigation task, these parameters might include map size, the number of obstacles, or the presence of adversaries. The BN captures the probabilistic relationships between these parameters, serving as a powerful generative model. This allows the framework to sample a diverse yet coherent set of task configurations, moving beyond simple parameter randomization to generate tasks with a principled structure.

A critical component of the framework is its ability to handle diverse input modalities through flexible task representation techniques. For visual environments like MiniGrid, where the state is an image, a convolutional autoencoder (CAE) is trained to compress high-dimensional observations into a low-dimensional latent feature vector. This vector captures the essential semantic content of the state, providing a compact and meaningful representation for analysis.
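To make the task-generation step concrete, the sketch below performs ancestral sampling from a small, hand-specified Bayesian Network over navigation-style task parameters. The parameter names (map_size, n_obstacles, has_adversary), the network structure, and all probabilities are illustrative assumptions rather than the configuration used in the thesis.

```python
# Minimal sketch of ancestral sampling from a hand-specified Bayesian Network
# over task parameters. All names and probabilities are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # Root node: map size, drawn from a simple prior.
    map_size = rng.choice([5, 8, 16], p=[0.5, 0.3, 0.2])

    # Child node: obstacle count depends on map size (larger maps allow more obstacles).
    max_obstacles = {5: 2, 8: 5, 16: 10}[int(map_size)]
    n_obstacles = rng.integers(0, max_obstacles + 1)

    # Child node: adversary presence is more likely on larger, more cluttered maps.
    p_adversary = min(0.8, 0.1 + 0.05 * n_obstacles + 0.02 * map_size)
    has_adversary = rng.random() < p_adversary

    return {"map_size": int(map_size), "n_obstacles": int(n_obstacles),
            "has_adversary": bool(has_adversary)}

# Draw a pool of candidate task configurations for the curriculum.
candidate_tasks = [sample_task() for _ in range(100)]
```

Because each parameter is sampled conditionally on its parents, the pool stays diverse yet structurally coherent, which is the property the BN is meant to provide over plain parameter randomization.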
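The representation step can be sketched in a similar spirit. Below is a minimal convolutional autoencoder, assuming small RGB observations (3x48x48) and a 32-dimensional latent vector; the input resolution, layer sizes, and latent dimension are placeholders rather than the architecture reported in the thesis.

```python
# Minimal convolutional autoencoder sketch for compressing image observations
# into a low-dimensional latent vector. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1),   # 48 -> 24
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),  # 24 -> 12
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 12 * 12, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 12 * 12),
            nn.Unflatten(1, (32, 12, 12)),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 12 -> 24
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),   # 24 -> 48
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent feature vector used to represent the task
        return self.decoder(z), z

# Reconstruction training step (sketch): minimize MSE between input and output,
# then use the encoder output z as the task's feature vector.
model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.rand(8, 3, 48, 48)      # stand-in for a batch of observations
recon, z = model(batch)
loss = nn.functional.mse_loss(recon, batch)
loss.backward()
optimizer.step()
```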
For environments defined by a set of scalar parameters, such as the physics-based AeroRival simulator, normalized parameter vectors are used directly.

Once tasks are represented in a common feature space, their difficulty is quantified. This is typically achieved by measuring the distance (e.g., Euclidean distance) between a given task's representation and that of the final target task. The intuition is that tasks with representations closer to the target are more similar in the skills they require. These raw distance values are then normalized and processed using unsupervised clustering algorithms, such as K-Means, to automatically group tasks into a discrete number of difficulty levels or "bins." This process effectively creates the structured stages of the curriculum.

A defining feature of BCG is its adaptability. The curriculum is not a static, predefined sequence. Instead, the selection of tasks for the agent to train on is performed probabilistically, guided by the agent's real-time performance metrics, such as its average reward or task success rate. If an agent consistently succeeds at a certain difficulty level, the probability of sampling tasks from the next, more challenging level increases. Conversely, if the agent struggles, the framework can present it with easier tasks to help it consolidate its skills. This closed-loop system ensures the agent is always training at the edge of its capabilities, preventing both stagnation and frustration.

Crucially, the BCG framework implicitly and effectively leverages transfer learning to accelerate skill acquisition. The policy and value function parameters, learned by the base RL agent (in our evaluations, Proximal Policy Optimization, PPO) on tasks from one curriculum stage, are used to initialize the learning process for the subsequent, more challenging stage. This prevents the agent from having to learn from scratch at each step, allowing it to build upon previously acquired knowledge and dramatically speeding up convergence to an optimal policy for the final task.
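As an illustration of the difficulty-binning step, the following sketch measures each candidate task's Euclidean distance to the target task and clusters the normalized distances into a fixed number of curriculum stages with K-Means. The feature vectors are random placeholders for the autoencoder latents or normalized parameter vectors described above, and the number of stages is an assumption.

```python
# Sketch of distance-based difficulty scoring and K-Means binning into stages.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
task_features = rng.random((100, 32))   # one row per candidate task (placeholder)
target_features = rng.random(32)        # representation of the final target task

# Euclidean distance to the target task, normalized to [0, 1].
distances = np.linalg.norm(task_features - target_features, axis=1)
distances = (distances - distances.min()) / (distances.max() - distances.min())

# Cluster the scalar difficulty scores into a fixed number of curriculum stages.
n_stages = 4
labels = KMeans(n_clusters=n_stages, n_init=10, random_state=0).fit_predict(
    distances.reshape(-1, 1)
)

# Relabel clusters so stage indices increase as tasks get closer to the target
# (in this sketch, closeness to the target task is treated as higher difficulty).
order = np.argsort([-distances[labels == c].mean() for c in range(n_stages)])
stage_of_task = np.empty_like(labels)
for stage, cluster in enumerate(order):
    stage_of_task[labels == cluster] = stage
```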
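The adaptive, performance-driven stage selection can likewise be sketched as a simple sampling rule. The promotion threshold and the specific probabilities below are illustrative choices, not the scheme used in the thesis.

```python
# Sketch of performance-driven stage selection: sampling probabilities shift
# toward harder stages as the agent's recent success rate improves.
import numpy as np

rng = np.random.default_rng(0)

def stage_probabilities(current_stage: int, success_rate: float, n_stages: int,
                        promote_at: float = 0.8) -> np.ndarray:
    """Return a sampling distribution over curriculum stages (illustrative)."""
    probs = np.full(n_stages, 0.02)          # small chance of revisiting any stage
    if success_rate >= promote_at and current_stage < n_stages - 1:
        probs[current_stage + 1] += 0.7      # mostly sample the next, harder stage
        probs[current_stage] += 0.2          # keep consolidating the current one
    else:
        probs[current_stage] += 0.7          # keep training at the current stage
        if current_stage > 0:
            probs[current_stage - 1] += 0.2  # fall back to easier tasks if struggling
    return probs / probs.sum()

# Example: the agent succeeds 85% of the time at stage 1 of 4.
probs = stage_probabilities(current_stage=1, success_rate=0.85, n_stages=4)
next_stage = rng.choice(4, p=probs)
```

Keeping a small probability mass on every stage preserves the closed-loop behavior described above: the agent mostly trains at the edge of its competence but can still revisit easier tasks or preview harder ones.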
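Finally, the cross-stage transfer of PPO parameters can be illustrated with Stable-Baselines3 as a stand-in implementation; the thesis's own PPO setup is not reproduced here, make_stage_env is a hypothetical helper, and CartPole is used only so the sketch runs end to end.

```python
# Sketch of cross-stage transfer: the policy/value weights learned on one
# curriculum stage initialize training on the next, instead of restarting.
import gymnasium as gym
from stable_baselines3 import PPO

def make_stage_env(stage: int):
    # Placeholder: a real curriculum would build the environment from the task
    # configuration sampled for this stage.
    return gym.make("CartPole-v1")

n_stages = 4
model = PPO("MlpPolicy", make_stage_env(0), verbose=0)

for stage in range(n_stages):
    # Keep the learned weights and swap in the next stage's tasks.
    model.set_env(make_stage_env(stage))
    model.learn(total_timesteps=50_000)    # per-stage budget is illustrative
    model.save(f"ppo_stage_{stage}")       # checkpoint the weights for this stage
```

Because the same model object is carried across stages, optimization at each stage starts from the previous stage's policy and value parameters rather than from a random initialization, which is the transfer effect described above.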
The practical efficacy and robustness of the BCG framework were empirically validated through comprehensive experiments in two distinct and demanding RL environments. The first, MiniGrid (specifically, the DoorKey variant), provided a discrete, grid-based navigation challenge characterized by partial observability (the agent can only see a small portion of its surroundings) and a hierarchically sparse reward (the agent must first find a key, then navigate to a door, and only then receive a reward). The second, AeroRival Pursuit, offered a continuous control task involving high-speed adversarial interaction, dynamic hazard avoidance, and sparse rewards, simulating an aerial combat scenario. In both testbeds, BCG's performance was rigorously benchmarked against a baseline PPO agent (with no curriculum) and a diverse set of relevant contemporary algorithms designed to address similar challenges.

The experimental results consistently and unequivocally demonstrated the superiority of the BCG approach. Across both the discrete MiniGrid and continuous AeroRival environments, agents trained with BCG achieved significantly higher final performance levels and converged on successful policies more reliably than all tested baselines. Furthermore, BCG exhibited greater learning stability, as evidenced by a lower variance in performance across multiple independent training runs, indicating that its success is not due to random chance. Notably, in MiniGrid, BCG enabled the agent to master tasks of progressively increasing complexity where many baselines failed to scale. In the highly complex AeroRival environment, BCG was the only method that consistently enabled the agent to learn a successful policy, whereas most baselines failed to obtain any positive rewards at all. This success across environments with fundamentally different dynamics underscores the versatility and generality of the framework.

In conclusion, this research makes a significant contribution to the field of reinforcement learning by developing, implementing, and validating the Bayesian Curriculum Generation algorithm. BCG presents a robust, principled, and effective solution for automated and adaptive curriculum learning, particularly in challenging domains hampered by sparse rewards and complex state spaces. By synergistically combining probabilistic modeling of the task space, adaptive task selection driven by agent performance, and efficient knowledge transfer between stages, BCG successfully guides exploration and accelerates the acquisition of complex skills. While acknowledging certain limitations, such as the initial need for domain knowledge to identify parameters for the BN and the added computational overhead, the presented results are highly promising. Future work will focus on automating parameter selection, extending the framework to non-stationary environments, and further improving computational efficiency. Ultimately, BCG offers a powerful approach that advances the potential for training more capable, efficient, and autonomous AI agents in the complex scenarios of tomorrow.