A comparative study of nonlinear model predictive control and reinforcement learning for path tracking
Date
2022
Authors
Türkmen, Gamze
Publisher
Graduate School
Abstract
The automotive industry is one of the most economically significant industries, both because of the benefits it provides and because it is constantly evolving. Advances in computing and sensing hardware have contributed to this evolution and led to the development of autonomous driving technology, which offers considerable potential for improving transportation safety and energy and fuel efficiency, and for reducing traffic congestion. These benefits, together with the growing attention paid to autonomous vehicles, encourage the development of advanced driving systems. In this thesis, the path tracking problem of autonomous vehicles is investigated and a comparative analysis of two path tracking methods is presented: nonlinear model predictive control (NMPC) and the soft actor-critic (SAC) reinforcement learning algorithm.

Model predictive control (MPC) is applied to a wide variety of path tracking problems because of its high performance and its advantages over other control methods: it can handle multi-input multi-output systems, optimize multiple objectives, work with nonlinear models, incorporate future steps into the optimization problem, reject disturbances, and respect constraints on the inputs, outputs, and states. In essence, MPC determines optimal control inputs over a given prediction horizon by minimizing a cost function while taking the system constraints and objectives into account. The system model is used to predict future states, and these predictions enter the cost function that encodes the desired behaviour of the system. The optimization problem is solved for the current time step and system state, producing an optimal sequence of control inputs; only the first input of this sequence is applied to the system, and the procedure is repeated at every time step. In this thesis the problem is handled as a nonlinear model predictive control problem, since a nonlinear vehicle model is used. The NMPC problem is expressed as an optimal control problem (OCP), and the multiple shooting method is used to transform the OCP into a nonlinear programming problem (NLP), which is solved with the optimization software package IPOPT.

A vehicle model is one of the main requirements of MPC, and it may be formulated with varying degrees of complexity depending on the problem and the performance requirements. Vehicles can be modelled in several ways: a kinematic model describes the vehicle motion purely geometrically and ignores the forces acting on the vehicle, whereas a dynamic model includes the forces affecting the motion; in addition, vehicle models can be combined with different tire models. In general, the kinematic model performs poorly at high speeds because lateral forces are neglected, whereas the dynamic model performs well at high speeds but cannot be used in stop-and-go situations because tire models become singular at low speeds. Moreover, system identification is easier for the kinematic model, since it has only two parameters. One of the objectives of this thesis is to show that vehicles can be controlled with minimal knowledge of the vehicle model; therefore, a kinematic model is employed, as it requires only the distances from the center of mass to the axles.
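For reference, a standard center-of-mass formulation of the kinematic bicycle model with state (x, y, ψ, v), inputs (a, δ), and axle distances l_f and l_r can be written as follows; this is the generic textbook form of the model class described above, not necessarily the exact formulation used in the thesis:

    \beta = \arctan\!\left(\frac{l_r}{l_f + l_r}\tan\delta\right), \qquad
    \dot{x} = v\cos(\psi + \beta), \qquad
    \dot{y} = v\sin(\psi + \beta), \qquad
    \dot{\psi} = \frac{v}{l_r}\sin\beta, \qquad
    \dot{v} = a.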
Control methods require parameters to be tuned manually or by optimization algorithms, and such approaches are not always able to generalize to new conditions; intelligent methods, in contrast, stand out precisely for their ability to generalize. In addition, while a vehicle model is needed for the controller, it is not always needed for intelligent methods. Intelligent methods such as machine learning and deep learning have therefore been incorporated into autonomous driving studies to automate the driving task. These methods enable researchers to specify the desired behavior, teach the system to perform it, and generalize the learned behavior. Reinforcement learning is selected here to automate the driving task: a learner agent interacts with the environment and collects experiences, while the environment provides feedback through reward signals. The agent is motivated to maximize positive reward signals and learns what to do from its own experience, without explicit instructions. However, the reinforcement learning problem becomes intractable as the number of agent states grows. This is addressed by combining deep learning with reinforcement learning, which gave rise to deep reinforcement learning. Deep reinforcement learning methods can be classified according to whether they use an environment model, how they optimize the policy, or whether the policy used to collect data differs from the policy being learned.

Among the many alternatives, the soft actor-critic (SAC) method is chosen for this thesis because it outperforms many other powerful methods in terms of both efficiency and stability. SAC is an off-policy method that combines the actor-critic framework with maximum entropy reinforcement learning. To generate stochastic policies with stronger exploration, an entropy term is added to the objective function, so the agent learns by maximizing both the expected reward and the entropy, rather than only the expected reward as in other standard reinforcement learning algorithms. A key difficulty in training reinforcement learning agents is that they require a lot of data and take a long time to learn. Experience replay, a mechanism that allows past experiences to be reused, is therefore employed; it has been observed to stabilize learning and reduce the amount of experience required. In this thesis, SAC is implemented with different replay buffers and their efficiency is examined. In vanilla experience replay, the experiences in the buffer are sampled uniformly during parameter updates. Prioritized experience replay (PER), one of the replay methods used in this thesis, samples highly important experiences more frequently. Emphasizing recent experience (ERE) is another strategy, which samples more aggressively from recent experiences to emphasize the importance of recently observed data. These methods were chosen because PER has been shown to be effective in numerous studies, while ERE outperforms PER in some applications in terms of efficiency. However, the performance of ERE on the path tracking problem has not been compared with PER, and one of the aims of this thesis is to examine their efficiency on the vehicle driving task. The CARLA simulator, which aims to be as realistic as possible in terms of both control and visual elements, is chosen as the simulation environment.
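For concreteness, the entropy-regularized objective maximized by SAC and the proportional sampling rule of PER can be written as below; this follows the original papers (Haarnoja et al. for SAC, Schaul et al. for PER) rather than the thesis's specific implementation, with temperature α, priority exponent α_p, TD error δ_i, and a small constant ε:

    J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\,\mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right],
    \qquad
    p_i = |\delta_i| + \epsilon, \quad
    P(i) = \frac{p_i^{\alpha_p}}{\sum_k p_k^{\alpha_p}}.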
Several towns are available in CARLA, and two different ones are chosen for this thesis. It is also necessary to establish the reference values that the vehicle will follow. For this purpose, paths are created in the selected towns and waypoints are produced for the vehicle to follow. Cubic spline interpolation is then applied to the waypoints, since the reference waypoints should be smooth and continuous. These operations yield the reference yaw angle and the reference x and y positions. In addition, the speed reference is given as a fixed value that is varied between experiments. NMPC and SAC are responsible for both lateral and longitudinal control in following the given path: as longitudinal controllers they command the acceleration to reach the target speed, and as lateral controllers they adjust the steering to track the reference path. Both therefore have two action outputs, the steering angle and the acceleration command. The states in NMPC are the states of the kinematic bicycle model, and the parameters of the Tesla Model 3 vehicle provided by CARLA are used. The states in SAC are chosen to be similar to the NMPC states and consist of the steering and acceleration commands, the target speed, and the reference tracking errors up to 10 steps ahead, to reflect horizon information.

For NMPC, a cost function consisting of tracking errors is constructed to minimize the error between the reference and the followed path, and the best weight coefficients of the cost function are found after several experiments. Furthermore, steering angle and acceleration constraints are included in the optimization problem. The symbolic framework CasADi is used to formulate the NMPC problem and provides an interface to the IPOPT solver for solving it. For the SAC agent, on the other hand, an appropriate reward function, which the agent maximizes through its actions, is prepared after many trials. Terminal conditions are also defined so that an episode ends if the agent leaves the lane, moves too slowly, or hits something. The networks used to train the SAC agent consist of an actor network that decides on the actions and a critic network that evaluates how good those actions are. These networks are implemented with the PyTorch library, and the hyperparameters of the networks and buffers are taken from the original papers of the methods. The SAC agent is trained in CARLA on 5 and on 10 different paths for 2000 episodes; the agent trained on 10 different paths converges faster, so training with the other buffers is carried out on 10 different paths. After training with these buffers, SAC+PER and SAC+ERE converge faster than SAC with the vanilla buffer, showing that the advanced buffer implementations enhance sample efficiency. These trainings use random target velocities of 5 and 6 m/s; for the SAC+PER agent, which converges fastest, training is then continued with target velocities from 5 to 8 m/s. Simulations are carried out on 5 different paths to investigate path tracking performance. The results are discussed for each method, and it is shown that the vehicle can follow the reference trajectory with a small margin of error for all approaches. This demonstrates that the SAC agents are able to generalize, since they perform well on unseen tracks.
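A minimal sketch of a multiple-shooting NMPC of the kind described above is given below, in Python using CasADi's Opti interface and IPOPT. The state ordering, horizon length, sampling time, weight matrices, input bounds, and axle distances are illustrative assumptions, not the values used in the thesis.

    # Minimal multiple-shooting NMPC sketch for path tracking with a kinematic
    # bicycle model, using CasADi's Opti stack and the IPOPT solver.
    # All numerical values below are illustrative assumptions.
    import casadi as ca

    lf, lr = 1.5, 1.5            # assumed distances from center of mass to front/rear axle [m]
    N, dt = 20, 0.1              # assumed prediction horizon and step size [s]

    def f(x, u):
        # continuous-time kinematic bicycle dynamics: state (x, y, yaw, v), input (a, delta)
        beta = ca.atan(lr / (lf + lr) * ca.tan(u[1]))
        return ca.vertcat(x[3] * ca.cos(x[2] + beta),
                          x[3] * ca.sin(x[2] + beta),
                          x[3] / lr * ca.sin(beta),
                          u[0])

    opti = ca.Opti()
    X = opti.variable(4, N + 1)       # predicted states over the horizon
    U = opti.variable(2, N)           # acceleration and steering sequence
    x0 = opti.parameter(4)            # current measured state
    xref = opti.parameter(4, N + 1)   # reference x, y, yaw, speed along the horizon

    Q = ca.diag(ca.DM([1.0, 1.0, 0.5, 0.5]))   # assumed tracking weights
    R = ca.diag(ca.DM([0.1, 0.1]))             # assumed input weights
    cost = 0
    for k in range(N):
        e = X[:, k] - xref[:, k]
        cost += ca.mtimes([e.T, Q, e]) + ca.mtimes([U[:, k].T, R, U[:, k]])
        # multiple-shooting continuity constraint (forward-Euler discretization)
        opti.subject_to(X[:, k + 1] == X[:, k] + dt * f(X[:, k], U[:, k]))

    opti.subject_to(X[:, 0] == x0)                       # start from the measured state
    opti.subject_to(opti.bounded(-3.0, U[0, :], 3.0))    # acceleration limits (assumed)
    opti.subject_to(opti.bounded(-0.6, U[1, :], 0.6))    # steering limits (assumed)
    opti.minimize(cost)
    opti.solver("ipopt")

    # At every control step: set the parameters, solve, and apply only the first input:
    # opti.set_value(x0, current_state); opti.set_value(xref, reference_window)
    # sol = opti.solve(); a_cmd, steer_cmd = sol.value(U)[:, 0]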
Although the performances of the NMPC and SAC agents are very close to each other, the SAC agents outperform NMPC in target velocity tracking, while NMPC performs better in yaw angle tracking. Also, as expected, NMPC with the kinematic model performs worse as the speed increases. Furthermore, it is observed that SAC+ERE and SAC+PER increase sample efficiency without reducing performance.
Description
Thesis (M.Sc.) -- Istanbul Technical University, Graduate School, 2022
Keywords
Model predictive control, Artificial neural networks