Long-horizon value gradient methods on Stiefel manifold

dc.contributor.advisor: Üre, Nazım Kemal
dc.contributor.author: Ok, Tolga
dc.contributor.authorID: 772336
dc.contributor.department: Computer Engineering Programme
dc.date.accessioned: 2025-02-05T08:22:31Z
dc.date.available: 2025-02-05T08:22:31Z
dc.date.issued: 2022
dc.description: Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2022
dc.description.abstract: Sequential decision-making algorithms play an essential role in building autonomous and intelligent systems. In this direction, one of the most prominent research fields is Reinforcement Learning (RL). The long-term dependencies between the actions performed by a learning agent and the rewards returned by the environment introduce a challenge in RL algorithms. One way of overcoming this challenge is the introduction of value function approximation. This allows policy optimization in RL algorithms to rely on immediate state value approximation and simplifies policy learning. In practice, however, we use value approximations that involve future rewards and values from truncated future trajectories with a decaying weighting scheme, such as in TD($\lambda$), to strike a better trade-off between the bias and variance of the value estimator. Policy Gradient (PG) methods, a prominent approach in the model-free paradigm, rely on the correlation between past actions and future rewards within truncated trajectories to form a gradient estimator of the objective function with respect to the policy parameters. However, as the length of the truncated trajectories increases or the $\lambda$ parameter approaches 1, akin to the use of Monte Carlo (MC) sampling, the variance in the gradient estimator of the PG objective increases drastically. Although the gradient estimator in PG methods has zero bias, the increase in variance leads to sample inefficiency, since the approximated value of an action may contain future rewards within the truncated trajectory that are not related to that action. One of the alternatives to PG algorithms that does not introduce high variance during policy optimization is the Value Gradient (VG) algorithm, which utilizes the functional relation between past actions and future state values on a trajectory. This estimation requires a differentiable model function; hence, VG algorithms are known as model-based Reinforcement Learning (MBRL) approaches. Although it is possible to apply the VG objective on simulated sub-trajectories, as is the case for most MBRL approaches, the most effective approach is to apply the VG objective on observed trajectories by means of reparameterization. If observed trajectories are used, VG algorithms avoid the issue of compounding errors that occurs when the model function is called iteratively on its previous prediction to simulate future trajectories.
dc.description.degree: M.Sc.
dc.identifier.uri: http://hdl.handle.net/11527/26361
dc.language.iso: en
dc.publisher: Graduate School
dc.sdg.type: Goal 9: Industry, Innovation and Infrastructure
dc.subject: sequential decision-making problems
dc.subject: Markov Decision Processes (MDP)
dc.subject: Generalized Policy Iteration (GPI)
dc.title: Long-horizon value gradient methods on Stiefel manifold
dc.title.alternative: Stiefel manifoldu üzerinde uzun ufuklu değer gradyanı yöntemleri
dc.type: Master Thesis
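
As a reading aid, here is a minimal notational sketch of the two estimators contrasted in the abstract above; the symbols used ($\gamma$, $r$, $V$, $f$, $\pi_\theta$, $\epsilon_t$) are illustrative assumptions and are not taken from the thesis itself. The TD($\lambda$) target blends $n$-step returns with geometrically decaying weights,
$$ G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} G_t^{(n)}, \qquad G_t^{(n)} = \sum_{k=0}^{n-1}\gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n}), $$
so $\lambda \to 1$ approaches the Monte Carlo return (unbiased but high variance) while $\lambda = 0$ reduces to the one-step bootstrap (low variance but biased). A value gradient objective, in the style of stochastic value gradients, instead differentiates the value recursion through a differentiable model $s_{t+1} = f(s_t, a_t)$ and a reparameterized policy $a_t = \pi_\theta(s_t, \epsilon_t)$:
$$ V(s_t) = \mathbb{E}_{\epsilon_t}\!\left[\, r(s_t, a_t) + \gamma\, V\big(f(s_t, a_t)\big) \,\right], \qquad a_t = \pi_\theta(s_t, \epsilon_t), $$
so the gradient with respect to $\theta$ is obtained by backpropagating through $r$, $f$, and $\pi_\theta$ along observed trajectories, rather than from the score-function estimator used by PG methods.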

Files

Original bundle

Name: 772336.pdf
Size: 1.06 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.58 KB
Format: Item-specific license agreed to upon submission