Data efficient offline reinforcement learning & reinforcement learning with imitation learning applications to autonomous driving and robotics

Publisher

Graduate School

Abstract

The aim of Reinforcement Learning (RL) is to learn a behavior that maximizes the expected cumulative reward. The learner in the environment is called the agent, and it learns the optimal strategy by trial and error: the agent takes an action in the environment and receives feedback, called the reward in this context. The fully data-driven subfield of RL is known as Offline Reinforcement Learning. Offline RL extracts good policies from static datasets that were previously collected with various behavior policies. In terms of learning from previously collected data it resembles other Deep Learning problem formulations, but the objectives are different: Offline RL still tries to find a policy that maximizes the expected cumulative reward, whereas most other Deep Learning tasks target classification or regression.

When an expert behavior policy is available, we can generate samples from it and imitate the expert behavior. In Imitation Learning (IL), however, the available expert policy or data may not cover all cases, and the learned policy may not perform as well as the expert policy. Reinforcement Learning, on the other hand, is practical because of its explorative behavior: it does not strictly follow another behavior policy as Imitation Learning does, and its exploration capability lets us recover the failed actions of a learned policy in a given environment. When expert data is available, starting with Imitation Learning and recovering the limited failure cases with Reinforcement Learning therefore lets us use the best of both worlds. A starting policy is first trained with Imitation Learning and evaluated on full trajectories, and the failure points and scenarios observed during evaluation are saved. For each specific failure scenario, a Reinforcement Learning agent is trained to optimize its behavior for that particular scenario, and the trained policy is added to a library of trained policies. A policy classifier is then trained with the most up-to-date library of Reinforcement Learning agents, treating each agent as a separate class. At inference time, we evaluate the reward of the state at each step: when the reward is above zero, we always use the pre-trained IL agent, and when it is below zero, we predict the class from the trained policy classifier to decide which RL agent in the library is suitable for the given state; this routing logic is sketched below. Even when only the stuck-vehicle failure case is introduced, the method outperforms or is competitive with previous benchmarks on different driving metrics and road conditions of the CARLA simulator.

We must also consider data efficiency when transforming datasets into decision-making engines. If we can identify which data points or subsets of the data contribute to improving evaluation performance, we can use our computational and memory resources efficiently during data collection or when selecting training data from the dataset. We can also explore the role of individual data points in training, design more effective learning systems, and work on finding data points that will improve performance. In this study, we present a new pruning metric that calculates the value of a data point in the dataset using Conservative Q-Learning (CQL), one of the Offline Reinforcement Learning algorithms.
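The inference-time policy selection described above can be summarized in a few lines. The sketch below is an illustrative reading of that logic, not code from the thesis; the names reward_fn, il_policy, policy_classifier, and rl_policy_library are hypothetical placeholders for the components the abstract mentions.

```python
# Minimal sketch of the inference-time routing between the IL policy and the
# library of scenario-specific RL policies. All interfaces are assumptions.
def select_action(state, reward_fn, il_policy, policy_classifier, rl_policy_library):
    """Route a state either to the IL policy or to one RL policy from the library.

    reward_fn(state)         -> scalar reward signal for the current state
    il_policy(state)         -> action from the imitation-learned base policy
    policy_classifier(state) -> index of the failure-scenario RL policy to use
    rl_policy_library        -> list of scenario-specific RL policies
    """
    if reward_fn(state) > 0:
        # Nominal case: the IL policy is trusted whenever the reward is positive.
        return il_policy(state)
    # Failure case: pick the scenario-specific RL policy predicted by the classifier.
    scenario_idx = policy_classifier(state)
    return rl_policy_library[scenario_idx](state)


if __name__ == "__main__":
    # Toy stand-ins just to exercise the routing logic.
    state = [0.0, 0.0, 0.0, 0.0]
    il_policy = lambda s: "il_action"
    rl_policy_library = [lambda s: "rl_action_stuck_vehicle"]
    policy_classifier = lambda s: 0      # always predicts scenario 0
    reward_fn = lambda s: -1.0           # simulate a failure state
    print(select_action(state, reward_fn, il_policy, policy_classifier, rl_policy_library))
```

The zero-reward threshold follows the description above: positive reward keeps the IL agent in control, while non-positive reward hands control to a scenario-specific RL agent chosen by the classifier.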
The pruning metric is computed from the change that each data point induces on the Conservative Q-Learning critic network; we call this score the Off_critic score. The score is calculated for each data point (s_t, a_t, r(s_t, a_t), s_{t+1}). Because the score includes a value estimate from the critic network, pre-training is required: the network is trained until the scores it produces are highly correlated across checkpoints, and a preliminary experiment determines how long the network must be trained before the scores are computed. The Off_critic score obtained from this early training stage is the metric used to decide which training samples to prune from the dataset. Since the score of a data point is interpreted as its contribution to the change in the critic network, it is reasonable to hypothesize that the points contributing the most should be kept and those contributing the least should be deleted. The dataset is therefore pruned by deleting the lowest-scoring data points; see the sketch below.

During the preliminary experiments, different pruning strategies were tried. Deleting the lowest-scoring training points gives the best performance and deleting the highest-scoring ones gives the worst, which experimentally supports the proposed pruning score as a meaningful metric. Another strategy, lowest-stochastic pruning, was assumed to reduce correlation between data points compared to plain lowest pruning and to provide diversity by still sampling a small amount of low-scoring data, but it performed poorly in the preliminary experiments.

Experiments were conducted with four different datasets, because in Offline Reinforcement Learning the characteristics of the collected data depend on the behavior policy used to collect it. The success of the proposed pruning metric is demonstrated on the APS, DIAYN, PROTO, and RANDOM datasets. The datasets were not mixed together, and for fairness each dataset's results are compared only against other results on the same dataset. Each dataset contains one million data points. For every dataset, an Offline Reinforcement Learning run that uses all data points with no pruning was conducted; each run takes one million training steps. The resulting evaluation score of each unpruned dataset serves as the baseline for measuring the performance change after pruning. These runs also confirmed that learning is possible on all four datasets and fixed the hyperparameters; all subsequent experiments use the same hyperparameter set for a fair comparison. Based on the preliminary experiments, the pre-trained model checkpoint used to compute the score was set to 90,000 steps and the pruning strategy was kept as lowest; not all checkpoints and pruning strategies could be tested on all datasets due to limited resources. These parameters are fixed in the subsequent experiments. To demonstrate the performance of the proposed pruning scores and pruning method, new datasets were created by gradually deleting data from each dataset, with the pruning ratio increasing from 0.1 to 0.9 in steps of 0.1.
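The abstract defines the Off_critic score only as a data point's change effect on the pre-trained CQL critic. The sketch below is one possible reading of that definition, assuming the effect is measured as the gradient norm of a single-transition critic loss on a pre-trained critic; the exact formulation in the thesis may differ, and off_critic_score and prune_lowest are hypothetical helper names.

```python
# Hedged sketch: approximate a transition's "change effect" on the critic as the
# gradient norm of a single-transition temporal-difference loss. This is an
# assumption, not the thesis's exact formula.
import torch
import torch.nn as nn


def off_critic_score(critic: nn.Module, transition, gamma: float = 0.99) -> float:
    """Score one transition (s, a, r, s_next) on a pre-trained critic."""
    s, a, r, s_next = transition
    critic.zero_grad()
    q = critic(torch.cat([s, a], dim=-1))
    with torch.no_grad():
        # Simplified bootstrap target: reuses the same critic and the same action
        # at s_next; a full CQL update would use a target network, the policy's
        # next action, and the conservative regularizer.
        target = r + gamma * critic(torch.cat([s_next, a], dim=-1))
    loss = (q - target).pow(2).mean()
    loss.backward()
    grads = [p.grad.norm() for p in critic.parameters() if p.grad is not None]
    return float(torch.stack(grads).norm())


def prune_lowest(dataset, scores, keep_fraction: float):
    """'Lowest' pruning: keep the highest-scoring fraction, drop the rest."""
    order = sorted(range(len(dataset)), key=lambda i: scores[i], reverse=True)
    keep = order[: int(len(dataset) * keep_fraction)]
    return [dataset[i] for i in keep]


if __name__ == "__main__":
    # Tiny critic and a single random transition, just to show the score runs.
    critic = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
    tr = (torch.randn(4), torch.randn(2), torch.tensor(1.0), torch.randn(4))
    print(off_critic_score(critic, tr))
```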
At each step, 100,000 data points were deleted from the dataset and training was performed from scratch on the newly created dataset, without reusing any network weights. The evaluation score of the learned policy after one million training steps was used as the comparison metric. Since this metric captures only the final evaluation score, an additional smoothed metric that includes the evaluation scores recorded during training was also defined. Random-pruning baseline experiments were conducted to judge how successful pruning with the proposed metric is; in these experiments, data up to the corresponding pruning percentage is deleted from the dataset at random. Pruning the lowest-scoring training samples is the best pruning strategy: when pruning 50% of the DIAYN dataset, deleting the lowest-scoring samples outperforms all the other pruning strategies. For DIAYN, a significant drop in performance is observed only after pruning 70% of the data; its final performance is not harmed until 60% of the dataset is deleted, and it even increases. PROTO behaves similarly to DIAYN, with performance remaining almost unchanged up to 50% pruning, and even with only 100,000 training points left after 90% pruning the performance degradation is only 24.68%. APS and RANDOM show non-monotonic behavior and require further investigation; in particular, the RANDOM dataset shows that pruning does not necessarily hurt performance and can even increase the final performance. We empirically evaluated the Off_critic pruning score against no pruning and random pruning, and our pruning method is 61% to 67% more successful than random pruning; however, theoretical studies beyond these experimental results are still needed.
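The evaluation protocol above can be outlined as a simple sweep. The sketch below assumes a hypothetical train_cql_and_evaluate helper that stands in for a full one-million-step CQL training run from scratch followed by policy evaluation; it illustrates the protocol rather than reproducing the thesis code.

```python
# Sketch of the pruning sweep: for each pruning ratio, retrain from scratch on
# the Off_critic-pruned dataset and on a random-pruning baseline of equal size.
import random


def run_pruning_sweep(dataset, scores, train_cql_and_evaluate, seed: int = 0):
    rng = random.Random(seed)
    # Rank transitions once, highest Off_critic score first.
    order = sorted(range(len(dataset)), key=lambda i: scores[i], reverse=True)
    results = {}
    for prune_fraction in [i / 10 for i in range(1, 10)]:  # 0.1 ... 0.9
        n_keep = int(len(dataset) * (1.0 - prune_fraction))

        # "Lowest" pruning: drop the lowest-scoring transitions.
        pruned = [dataset[i] for i in order[:n_keep]]

        # Random-pruning baseline with the same data budget.
        baseline = rng.sample(dataset, n_keep)

        results[prune_fraction] = {
            "off_critic": train_cql_and_evaluate(pruned),
            "random": train_cql_and_evaluate(baseline),
        }
    return results
```

Each call to the helper would correspond to one full training run in the experiments, so the sweep compares the proposed pruning and the random baseline at an identical dataset size for every pruning ratio.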

Description

Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2023

Subject

Reinforcement learning, Robotics
