Long-horizon value gradient methods on Stiefel manifold

Date
2022
Authors
Ok, Tolga
Journal Title
Journal ISSN
Volume Title
Publisher
Graduate School
Abstract
Sequential decision-making algorithms play an essential role in building autonomous and intelligent systems. One of the most prominent research fields in this direction is Reinforcement Learning (RL). The long-term dependencies between the actions performed by a learning agent and the rewards returned by the environment introduce a challenge for RL algorithms. One way of overcoming this challenge is the introduction of value function approximation, which allows policy optimization in RL algorithms to rely on immediate state-value estimates and simplifies policy learning. In practice, however, we use value approximations that combine future rewards and values from truncated future trajectories with a decaying weighting scheme, as in TD($\lambda$), to achieve a better trade-off between the bias and variance of the value estimator. Policy Gradient (PG) methods, a prominent approach in the model-free paradigm, rely on the correlation between past actions and future rewards within truncated trajectories to form a gradient estimator of the objective function with respect to the policy parameters. However, as the length of the truncated trajectories increases or the $\lambda$ parameter approaches 1, akin to the use of Monte Carlo (MC) sampling, the variance of the gradient estimator of the PG objective increases drastically. Although the gradient estimator in PG methods is unbiased, the increase in variance leads to sample inefficiency, since the approximated value of an action may contain future rewards within the truncated trajectory that are unrelated to that action. One alternative to PG algorithms that does not introduce high variance during policy optimization is the Value Gradient (VG) algorithm, which utilizes the functional relation between past actions and future state values along a trajectory. This estimation requires a differentiable model function; hence, VG algorithms are known as model-based Reinforcement Learning (MBRL) approaches. Although it is possible to apply the VG objective to simulated sub-trajectories, as is the case for most MBRL approaches, the most effective approach is to apply the VG objective to observed trajectories by means of reparameterization. When observed trajectories are used, VG algorithms avoid the compounding errors that occur when the model function is called iteratively on its own previous predictions to simulate future trajectories.
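
For readers unfamiliar with the estimators mentioned in the abstract, the displayed equations below give their standard textbook forms; the notation is a common convention and not necessarily the one used in the thesis itself.

The $n$-step return bootstrapped from an approximate value function $V$, and the TD($\lambda$) target that mixes such returns with exponentially decaying weights:

$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n}), \qquad G_t^{\lambda} = (1-\lambda) \sum_{n \geq 1} \lambda^{n-1} G_t^{(n)}$$

The likelihood-ratio Policy Gradient estimator, which is unbiased but whose variance grows as the horizon lengthens or $\lambda$ approaches 1:

$$\nabla_{\theta} J(\theta) = \mathbb{E}\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \big(G_t^{\lambda} - V(s_t)\big)\right]$$

A Value Gradient recursion in its simplest deterministic form, where the policy $a_t = \pi_{\theta}(s_t)$ and a differentiable transition $s_{t+1} = f(s_t, a_t)$ let the gradient flow through future state values by the chain rule:

$$V_t = r(s_t, a_t) + \gamma V_{t+1}, \qquad \frac{dV_t}{d\theta} = \nabla_{a} r\, \nabla_{\theta}\pi_{\theta}(s_t) + \gamma\left(\nabla_{s} V_{t+1}\, \nabla_{a} f\, \nabla_{\theta}\pi_{\theta}(s_t) + \frac{dV_{t+1}}{d\theta}\right)$$

Reparameterizing stochastic policies and transitions, as described in the abstract, recovers the same structure on observed trajectories rather than simulated ones.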
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2022
Keywords
sequential decision-making problems, Markov Decision Processes (MDP), Generalized Policy Iteration (GPI)
Citation