ISTANBUL TECHNICAL UNIVERSITY ⋆ GRADUATE SCHOOL

REAL-TIME HUMAN MANIPULATION ACTION RECOGNITION
WITH A FACTORIZED GRAPH SEQUENCE ENCODER

M.Sc. THESIS

Enes ERDOĞAN

Department of Computer Engineering

Computer Engineering Programme

FEBRUARY 2026


ISTANBUL TECHNICAL UNIVERSITY ⋆ GRADUATE SCHOOL

REAL-TIME HUMAN MANIPULATION ACTION RECOGNITION
WITH A FACTORIZED GRAPH SEQUENCE ENCODER

M.Sc. THESIS

Enes ERDOĞAN
(504211555)

Department of Computer Engineering

Computer Engineering Programme

Thesis Advisor: Prof. Dr. Sanem SARIEL
Co-advisor: Assoc. Prof. Dr. Eren Erdal AKSOY

FEBRUARY 2026


ISTANBUL TECHNICAL UNIVERSITY ⋆ GRADUATE SCHOOL

AYRIŞTIRILMIŞ ÇİZGE DİZİ KODLAYICISI İLE İNSAN NESNE
ETKİLEŞİMLERİNİN GERÇEK ZAMANLI OLARAK TANINMASI

M.Sc. THESIS

Enes ERDOĞAN
(504211555)

Department of Computer Engineering

Computer Engineering Programme

Thesis Advisor: Prof. Dr. Sanem SARIEL
Co-advisor: Assoc. Prof. Dr. Eren Erdal AKSOY

FEBRUARY 2026


Enes ERDOĞAN, an M.Sc. student of the ITU Graduate School with student ID 504211555,
successfully defended the thesis entitled “REAL-TIME HUMAN MANIPULATION ACTION
RECOGNITION WITH A FACTORIZED GRAPH SEQUENCE ENCODER”, which he prepared
after fulfilling the requirements specified in the associated legislation, before the
jury whose signatures are below.

Thesis Advisor : Prof. Dr. Sanem SARIEL
Istanbul Technical University

Co-advisor : Assoc. Prof. Dr. Eren Erdal AKSOY
Halmstad University

Jury Members : Prof. Dr. Sinan KALKAN
Middle East Technical University

Assoc. Prof. Dr. Yusuf YASLAN
Istanbul Technical University

Asst. Prof. Dr. Yusuf Hüseyin ŞAHİN
Istanbul Technical University

Date of Submission : 29 December 2025
Date of Defense : 2 February 2026

To my dear family,

FOREWORD

First and foremost, I would like to express my sincere gratitude to my advisors, Sanem
Sarıel and Eren Erdal Aksoy. Their confidence in me, even at times when I doubted
myself, has meant more than I can fully convey. Their guidance, encouragement, and
expertise throughout my master’s studies have been invaluable.

I am also grateful to be part of the AIR Lab, where I had the opportunity to learn a great
deal from A. Cihan Ak and Arda İnceoğlu. I am equally thankful to my dear friends,
Püren Tap and Tuğçe Temel. I feel lucky to be around such kind and fun people. There
may be others whose names I have unintentionally overlooked, but whose support I
deeply appreciate.

I would also like to thank M. Alpaslan Tavukçu. His positive and insightful perspective
on life makes difficult times more bearable. I feel fortunate to have met such a
distinguished individual.

Finally, I want to extend my heartfelt thanks to dear friends who have stood by me
through the ups and downs of life: Osman M. Tekin and Doğan Turan. Their support
has been a steady source of strength.

This thesis was supported by a grant from the Scientific and Technological Research
Council of Turkey (TUBITAK), Grant No. 119E-436. This work was also supported by
the Turkcell-Istanbul Technical University Researcher Funding Program. This research
also received funding from the Vinnova FFI project SMILE-IV (agreement no.
2023-00789). Some of the computing resources used in this work were provided by
the National Center for High Performance Computing of Turkey (UHeM) under grant
number 4019762024.

February 2026 Enes ERDOĞAN
(Research Assistant)

TABLE OF CONTENTS

FOREWORD
TABLE OF CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
ÖZET
1. INTRODUCTION
  1.1 Motivation
  1.2 Why Graph Representations?
  1.3 Current Approaches & Our Contributions
2. BACKGROUND
  2.1 Scene Graphs
  2.2 Graph Neural Networks
    2.2.1 A generic formulation
    2.2.2 Graph convolutional networks
    2.2.3 Transformer-like GNN
  2.3 Related Works: Graph-based Action Recognition Models
    2.3.1 Offline models
    2.3.2 Real-time models
3. FACTORIZED GRAPH SEQUENCE ENCODER
  3.1 Model Architecture
  3.2 Sliding Window with Majority Voting (SW-MV)
4. EXPERIMENTS
  4.1 Datasets
  4.2 Training Setup
  4.3 Evaluation
  4.4 Results
  4.5 Qualitative Evaluation
5. DISCUSSION
  5.1 Impact of Window Length
  5.2 Comparison with an RGB-Only Model
  5.3 Contribution of Majority Voting
  5.4 Comparison with Other Pooling Methods
  5.5 Ablations with Sequence Encoder
  5.6 Limitations
6. CONCLUSIONS
REFERENCES
APPENDICES
CURRICULUM VITAE

ABBREVIATIONS

AR : Action Recognition
ASSIGN : Asynchronous-Sparse Interaction Graph Networks
Bimacs : KIT Bimanual Action Dataset
CoAx : Collaborative Action Dataset
ESEC : Enriched Semantic Event Chain
FGSE : Factorized Graph Sequence Encoder
GCN : Graph Convolutional Network
GNN : Graph Neural Network
HRC : Human-Robot Collaboration
MLP : Multi-Layer Perceptron
PGCN : Pyramidal Graph Convolutional Network
RNN : Recurrent Neural Network
RT-AR : Real-Time Action Recognition
RT-MR : Real-Time Manipulation Recognition
ST-GCN : Spatial-Temporal Graph Convolutional Networks
UQ-TFGCN : Uncertainty Quantified Temporal Fusion Graph Convolution Network
ViViT : Video Vision Transformer

SYMBOLS

V : Set of nodes
E : Set of edges
G_t : Graph at time t
z : Embedding vector
W : Window size
D : Frame down-sampling ratio
y : Ground-truth label
L_CE : Cross-entropy loss

LIST OF TABLES

Table 4.1: Manipulation Recognition results on Bimacs dataset.
Table 4.2: Manipulation Recognition results on CoAx dataset.
Table 5.1: The F1-macro scores as window length increases on Bimacs dataset.
Table 5.2: Comparison with an RGB-only model on Bimacs dataset.
Table 5.3: Impact of the sliding window with majority voting on Bimacs dataset.
Table 5.4: The comparison with alternative pooling methods.
Table 5.5: The comparison with alternatives of Sequence Encoder on Bimacs dataset.

LIST OF FIGURES

Figure 1.1: The difference between Action Recognition and Real-time Action Recognition.
Figure 2.1: An example scene graph.
Figure 2.2: Example illustration that shows RGB-D data to graph representation.
Figure 2.3: Illustrations for spatiotemporal relation in the graph representation.
Figure 3.1: The proposed Factorized Graph Sequence Encoder (FGSE) network.
Figure 3.2: Majority voting usage.
Figure 4.1: Overview of the Bimacs dataset.
Figure 4.2: Overview of the CoAx dataset.
Figure 4.3: An example for qualitative result.
Figure 4.4: An example result where our model performs poorly.
Figure 4.5: Our graph extraction pipeline.
Figure 4.6: Sample frames from the qualitative evaluation of a video recorded in our lab.
Figure 5.1: Sequence of graphs to temporally concatenated graph representation.
Figure A.1: 3D bounding box visualizations from Bimacs.
Figure A.2: Fold-wise performance of our model in original Bimacs dataset versus the
dataset that comes from our graph extraction pipeline.

REAL-TIME HUMAN MANIPULATION ACTION RECOGNITION
WITH A FACTORIZED GRAPH SEQUENCE ENCODER

SUMMARY

This thesis addresses the problem of real-time manipulation action recognition, a
specialized subfield of human action recognition. Human action recognition is a
fundamental computer vision problem concerned with classifying short, trimmed
video clips that depict a single, well-defined human action, such as walking, running,
jumping, or waving. In contrast, real-time action recognition is performed on continuous,
untrimmed data streams. No prior information about action boundaries is available,
and instant predictions with low latency are required. In other words, the model must
continuously make predictions on input that may contain consecutive actions or, at times,
no action at all. This requirement is critically important in domains such as human-robot
collaboration, assistive robotics, industrial assembly lines, and intelligent surveillance
systems. This thesis further narrows the scope of real-time action recognition by
focusing on manipulation actions, which involve deliberate object interactions carried
out via human hands. This problem, also referred to as human-object interaction
recognition in the literature, aims to instantly recognize everyday or goal-oriented
hand-object interactions such as pushing a cup, cutting bread with a knife, or stirring
tea with a spoon. In this context, the problem is defined as real-time manipulation
recognition. Especially in settings where people and robots operate side by side, such
as factory assembly lines, it is essential for robots to recognize human manipulation
actions in real time to ensure both safety and efficient collaboration.

Manipulation actions are inherently object-centric, meaning that the relationships
between objects and the hands are more informative than the visual appearance of
individual objects. This characteristic motivates the use of semantic scene graph
representations, where nodes correspond to objects and edges encode spatiotemporal
relations between them. Scene graphs offer several advantages over raw RGB
representations. First, they provide an abstract yet semantically rich representation
that captures the underlying structure of the manipulation scene. Second, by reducing
the dimensionality from high-dimensional image data to compact graph structures,
they enable efficient real-time processing. Third, they naturally filter out irrelevant
variability such as changes in illumination, camera viewpoint, background clutter, and
object appearance, thereby improving generalization, especially when training data is
limited, which is a common situation in human-robot collaboration applications where
domain-specific manipulations may have only a few examples.

The graph representation employed in this work defines semantic relations describing
both static properties such as spatial arrangements between objects, and dynamic
properties such as how objects move relative to each other over time. These relations
are computed from the three-dimensional bounding boxes of objects across consecutive
frames and are stored as edge features in the resulting scene graph.

A key limitation of existing graph-based manipulation recognition models is their
approach to handling temporal information. Prior real-time capable methods
concatenate sequential graphs by adding temporal edges between the same objects
in consecutive frames. This design inherently limits temporal scalability because
graph neural networks can only propagate information to nodes within a limited
neighborhood determined by the number of network layers. Increasing the number
of layers to compensate leads to over-smoothing, a well-known phenomenon where
node embeddings become indistinguishable, preventing the network from learning
meaningful representations.

To address these challenges, this thesis introduces a novel architecture called the
Factorized Graph Sequence Encoder that separates spatiotemporal feature extraction
into two distinct components: a Graph Encoder and a Sequence Encoder. The
Graph Encoder processes each scene graph independently using attention-based graph
convolutional layers, which refine node embeddings while incorporating edge features
that characterize spatial and temporal relationships. Since each graph is processed
separately, the model can flexibly scale across the temporal dimension without requiring
deeper graph network architectures. A novel parameter-free operation called Hand
Pooling is introduced to extract graph-level embeddings. Based on the observation
that hands are the primary manipulators in any manipulation action, Hand Pooling
selects only the node embeddings corresponding to hands, rather than aggregating all
node embeddings as in traditional pooling methods. This focused extraction yields
more discriminative graph-level representations and reduces the computational burden
on the subsequent temporal encoder. The Sequence Encoder is a transformer-based
architecture that applies self-attention to the sequence of hand embeddings, enabling the
model to learn temporal context across the input window. This factorized design allows
the model to efficiently propagate information across graphs regardless of the temporal
length of the input, achieving temporal scalability that prior approaches lack. During
inference, a sliding window strategy combined with majority voting is employed. As
the window slides along the temporal axis, multiple predictions are generated for each
frame, and majority voting combines these predictions into a final label. This approach
resembles ensemble learning and helps reduce noisy predictions while preventing
over-segmentation.
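
The voting step described above can be illustrated with a minimal sketch. It assumes a
window stride of one frame and that each window already yields one predicted label per
frame it covers; the function name, the input layout, and the tie-breaking rule are
illustrative assumptions and are not taken from the thesis implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def sliding_window_majority_vote(
    window_predictions: Sequence[Sequence[str]],
) -> List[str]:
    """Fuse overlapping per-frame predictions into one label per frame.

    window_predictions[s][k] is the label predicted for frame s + k by the
    window that starts at frame s (a stride of one frame is assumed).
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for start, labels in enumerate(window_predictions):
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # Ties go to the label that received its first vote earliest, because
    # Counter.most_common keeps insertion order among equal counts.
    return [votes[frame].most_common(1)[0][0] for frame in sorted(votes)]


# Three windows of length 3 sliding over 5 frames (hypothetical labels).
windows = [
    ["idle", "idle", "cut"],  # frames 0-2
    ["idle", "cut", "cut"],   # frames 1-3
    ["idle", "cut", "cut"],   # frames 2-4, first label is a noisy prediction
]
print(sliding_window_majority_vote(windows))
# ['idle', 'idle', 'cut', 'cut', 'cut']
```

In this toy input the isolated "idle" predicted for frame 2 by the last window is
outvoted by the two earlier windows, which is the smoothing effect the paragraph
attributes to majority voting.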

The proposed model is evaluated on two publicly available datasets covering
manipulation actions in kitchen, workshop, and industrial assembly environments.
These datasets include scenarios relevant to human-robot collaboration, with actions
such as pouring, cutting, stirring, screwing, and assembling components. A
cross-validation approach is employed to ensure robust evaluation across different
subjects. Experimental results demonstrate that the proposed model achieves significant
improvements over previous state-of-the-art real-time methods on both datasets. When
a slightly larger window size is allowed, the model achieves results comparable to
offline models that have access to entire videos at once. With a compact architecture
of approximately 269 thousand parameters that runs at approximately 66 frames per
second on a standard GPU, the model is lightweight enough for practical real-time
deployment.

An extensive ablation study validates the design choices of the proposed architecture.
The analysis of window length confirms that the proposed model successfully scales
with increasing temporal context, while competing methods based on temporal
graph concatenation exhibit performance degradation as the input length grows.
Comparisons with alternative pooling methods demonstrate that Hand Pooling
outperforms both simple averaging approaches and more sophisticated learnable pooling
operations. Experiments with different sequence encoder architectures show that the
transformer-based encoder outperforms recurrent alternatives, while removing the
temporal encoder entirely results in significant performance drops, confirming the
importance of temporal context modeling. A comparison with an architecturally
similar RGB-based model reveals the limitations of image-only approaches on
object-centric manipulation datasets with limited training samples. The RGB-based
model significantly underperforms compared to the proposed graph-based approach,
underscoring the advantages of scene graph representations for manipulation recognition
in scenarios where data efficiency is crucial.

In conclusion, this thesis presents a novel approach to real-time manipulation
recognition that achieves state-of-the-art performance while maintaining computational
efficiency and temporal scalability. The proposed Factorized Graph Sequence
Encoder, combined with the Hand Pooling operation and sliding window with majority
voting, provides an effective solution for recognizing human manipulation actions in
human-robot collaboration scenarios. Future work will explore the application of this
architecture to skeleton-based human whole-body manipulation tasks and investigate
methods to handle noisy scene graph extraction by incorporating estimation confidence
into the model.

AYRIŞTIRILMIŞ ÇİZGE DİZİ KODLAYICISI İLE İNSAN NESNE
ETKİLEŞİMLERİNİN GERÇEK ZAMANLI OLARAK TANINMASI

ÖZET

This thesis addresses the problem of real-time manipulation action recognition, a
specialized subfield of human action recognition. Human action recognition is a
fundamental computer vision problem that aims to classify specific, single actions,
usually from short, pre-segmented video clips. In this setting, video segments
containing a single, clearly defined action such as walking, running, jumping, or
waving are analyzed, and the type of the action is determined. In contrast, real-time
action recognition operates on a continuous, untrimmed data stream. No prior
information about action boundaries is available, and instant predictions must be
produced with low latency. In other words, the model must continuously make
predictions on input that contains several consecutive actions or, at times, no action
at all. This requirement is critically important in domains such as human-robot
collaboration, service robots, industrial assembly lines, and intelligent surveillance
systems. This thesis further narrows the real-time action recognition problem by
focusing on the deliberate interactions that humans carry out on objects with their
hands. This problem, also referred to as human-object interaction recognition in the
literature, aims to instantly recognize everyday or goal-oriented hand-object
interactions such as pushing a cup, cutting bread with a knife, or stirring tea with a
spoon. In this context, the problem is defined as real-time manipulation recognition.
Especially in environments where humans and robots work together, such as factory
lines, the ability of the robot to perceive human manipulation actions in real time is
critical for safe and efficient collaboration.

Real-time manipulation recognition models are expected both to operate with low
latency and to learn semantically rich representations. However, methods that operate
directly on raw RGB video struggle to meet these requirements because of the
high-dimensional input space, large data requirements, and limited generalization
ability. In human-robot collaboration scenarios in particular, specific manipulations
are expected to be learned from a limited number of examples, and RGB-based methods
are observed to fail under such data scarcity. In addition, RGB-based approaches are
adversely affected by factors not directly related to the manipulation, such as
lighting conditions, camera viewpoint, background clutter, and variations in object
appearance. For this reason, instead of raw visual data, this work uses symbolic scene
graphs that explicitly model the objects in the scene and the spatial and temporal
relations between them.

Scene graphs are graph-based structures that represent the objects in a scene as nodes
and the relations between these objects as edges. In this representation, the relative
positions of objects and their interactions over time are expressed through
low-dimensional, semantically rich features. The specific graph representation adopted
in this work contains fourteen distinct semantic relation types in total. These
relations fall into two main categories: static and dynamic. Static relations describe
the spatial positions of objects relative to each other and cover cases such as above,
below, inside, around, and surrounding. Dynamic relations express the motion-based
interactions of objects over time and include cases such as moving together, staying
still together, approaching each other, and moving away from each other. These
relations are extracted with a rule-based approach from the three-dimensional object
bounding boxes in consecutive frames and are stored as edge features in the scene
graph. This approach largely filters out unnecessary detail in the RGB data, allowing
the model to focus directly on the essence of the action.

In the literature, graph-based manipulation recognition work is broadly divided into
two main groups: offline and real-time. Offline methods typically process the entire
video at once and can achieve high accuracy, but they incur latencies that are
unacceptable for real-time systems. Real-time methods, in turn, mostly rely on
approaches that merge consecutive scene graphs along the temporal axis into a single
large graph. In this approach, however, the number of graph neural network layers must
be increased to gather information from temporally distant nodes. As the number of
layers grows, the over-smoothing problem arises, in which node representations become
similar to one another and lose their discriminative properties, and as a result the
temporal scalability of the model is severely limited.

To overcome these problems, this work proposes a network architecture called the
Factorized Graph Sequence Encoder. The proposed architecture consists of two
components that process the spatial and temporal dimensions of the graph
representations separately: a graph encoder and a sequence encoder. In the first
stage, the scene graph at each time step is passed independently through the graph
encoder; in the second stage, the resulting feature sequence is processed in temporal
context by the sequence encoder. Unlike existing real-time methods, this factorized
design makes it possible to operate scalably on long temporal sequences without
increasing the depth of the graph neural network. This property is one of the main
contributions of the thesis and is verified experimentally: model performance improves
as the input window length increases.

The graph encoder uses transformer-based graph convolution layers with an attention
mechanism. During message passing between nodes, these layers take both neighboring
node features and edge features into account and can assign a different importance
weight to each neighbor. This yields node representations with greater expressive
power than standard graph convolution layers with fixed weights.

Various pooling methods exist in the literature for converting the node
representations obtained from the graph encoder, whose number can vary dynamically,
into a single representation of the whole graph. However, the naive pooling methods
commonly used in the literature weaken the resulting representation, while more
advanced approaches add unnecessary computational load. As an alternative, this work
proposes a new, parameter-free pooling method called Hand Pooling. This method selects
only the node representations corresponding to the hands. Since, by the definition of
manipulation, the hands are the only actors that interact with the objects in the
scene, they can be regarded as the nodes carrying the most information about the
action. Comparative experiments show that this simple approach, which requires no
additional computational cost, performs better than both classical pooling techniques
and advanced ones with learnable parameters.
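
As a rough illustration of the selection idea, the sketch below contrasts hand-node
selection with plain mean pooling on plain Python lists. The function names, the node
labels "left_hand" and "right_hand", and the choice to return the selected embeddings
as a list are assumptions made for this example; how the thesis model identifies hand
nodes and combines their embeddings is not specified here.

```python
from typing import List, Sequence


def hand_pooling(
    node_embeddings: Sequence[Sequence[float]],
    node_labels: Sequence[str],
    hand_labels: Sequence[str] = ("left_hand", "right_hand"),  # assumed names
) -> List[List[float]]:
    """Keep only the embeddings of hand nodes; no learnable parameters."""
    if len(node_embeddings) != len(node_labels):
        raise ValueError("one label is required per node embedding")
    return [
        list(embedding)
        for embedding, label in zip(node_embeddings, node_labels)
        if label in hand_labels
    ]


def mean_pooling(node_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Baseline for comparison: average every node embedding per dimension."""
    count = len(node_embeddings)
    return [sum(column) / count for column in zip(*node_embeddings)]


# Toy graph with three nodes and two-dimensional embeddings (made-up values).
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = ["left_hand", "cup", "right_hand"]

print(hand_pooling(embeddings, labels))  # [[1.0, 2.0], [5.0, 6.0]]
print(mean_pooling(embeddings))          # [3.0, 4.0]
```

The selection keeps the hand embeddings intact and discards the object node, whereas
the averaging baseline blends all nodes into one vector.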

The hand representations are passed to a transformer-based, encoder-only sequence
encoder to learn temporal context. Through its self-attention mechanism, this
structure effectively models short- and medium-term temporal relations. The model
produces a separate prediction for each time step, so that action transitions are
handled properly across long sequences.

To improve real-time performance, a sliding window approach with majority voting is
used at inference time. With the sliding window mechanism, multiple predictions are
produced for each scene graph, and these predictions are combined by majority voting
to obtain the final decision. This approach, which resembles ensemble learning,
reduces the effect of transient erroneous predictions and prevents over-segmentation,
while adding only a fixed delay that depends on the window length.

The proposed method is evaluated with extensive experiments on two different datasets.
The first dataset consists of tasks performed in kitchen and workshop environments and
covers fourteen atomic manipulation categories in total. It includes manipulation
actions such as approaching, lifting, placing, holding, stirring, pouring, cutting,
and drinking, and twelve different objects such as a bowl, bottle, cutting board,
knife, hammer, saw, and screwdriver. The other dataset focuses on industrial
human-robot collaboration scenarios. Similarly, it contains ten different atomic
manipulation actions and sixteen objects. The results show that the proposed model
significantly outperforms existing real-time methods on both datasets. Moreover, the
model has only about 269 thousand parameters, which is tens of times fewer than the
offline models in the literature. The model reaches about 66 frames per second even on
a mid-range GPU, demonstrating its real-time capability.

The discussion chapter of the thesis presents a comprehensive analysis of different
aspects of the model. First, the effect of the input window length on performance is
investigated, and it is shown that the proposed factorized encoder design improves
performance as the window length increases. Existing real-time methods lose
performance as the window length increases, whereas no such problem is observed in the
proposed architecture. Second, a comparison is made with an RGB-based deep learning
model, showing that the scene-graph-based approach generalizes much better under
limited data. Although partially pretrained, the RGB-based model falls short on
object-centric manipulation datasets. Third, the contribution of majority voting is
examined, and it is confirmed that this mechanism substantially improves model
performance. Fourth, the proposed Hand Pooling method is compared with alternative
pooling techniques and is shown to outperform both simple mean pooling and more
complex learnable pooling methods. Finally, the contribution of the sequence encoder
component is measured, and it is found to perform better than alternatives based on
recurrent neural networks.

In conclusion, this thesis proposes a new graph-based architecture for the real-time
manipulation action recognition problem that is both computationally efficient and
temporally scalable. The main contributions can be summarized as a new network
architecture that factorizes spatial and temporal processing and thereby provides temporal
scalability, and a parameter-free pooling method that exploits the central role of the
hands in manipulation actions. The findings show that the proposed model offers a strong
alternative in human-robot collaboration scenarios.



1. INTRODUCTION

In this chapter, we begin by explaining the importance of the problem and clarifying

the scope and boundaries of the task we aim to address. Next, we motivate graph-based

representation and discuss why it is well-suited for our task. Finally, we provide a

brief overview of existing research, highlighting how our approach sets itself apart, and

present a concise outline of our key contributions.

1.1 Motivation

Action Recognition is the task of categorizing human movements based on sensory

inputs that typically capture a brief and focused segment of activity. These inputs

often come from sources like sequences of images or other motion data that represent a

single, clearly defined action. The main goal is to analyze this short and uniform clip

to determine what specific action a person is performing, such as walking, running,

jumping, or waving.

On the other hand, Real-Time Action Recognition (RT-AR) refers to a more advanced and
demanding field that focuses on identifying actions with very low delay as soon as they
emerge. Unlike traditional approaches that rely on short and neatly segmented clips, this
setting must operate on continuous, untrimmed streaming data. The model is required to
make ongoing predictions while handling inputs that may contain multiple consecutive
actions (and sometimes no action at all), without any prior information about the action
boundaries. This capability is essential for intelligent systems that need to interact
with or operate alongside humans in dynamic environments, such as assistive robotics,
human-robot collaboration, human-computer interaction, autonomous vehicles, and real-time
video surveillance.

In this thesis, we focus on enabling robots to collaborate effectively with humans,
whether in structured settings like factory assembly lines or more flexible environments
such as kitchens. Narrowing our focus within RT-AR, our goal is to detect actions centered
on interacting with objects, which we refer to as Real-Time Manipulation Recognition
(RT-MR). The real-time requirement is crucial in scenarios that demand immediate system
responses, particularly in Human-Robot Collaboration (HRC), where people handle objects to
achieve a goal with robotic assistance.

Figure 1.1: The difference between Action Recognition and Real-time Action Recognition.

It is important to clarify terminology in this domain, since the word action can refer to
general behaviors like walking, jumping, or pushing. Here, our attention is specifically
on human activities that involve deliberate object interactions using the hands, such as
pushing a cup, cutting bread with a knife, or stirring tea with a spoon. For this reason,
we use the more precise terms manipulation or manipulation action, also known as
human-object interaction in the literature.

1.2 Why Graph Representations?

RT-MR models must operate with high computational efficiency to maintain smooth
performance and low latency. Within HRC settings, these recognition models are further
expected to encode semantically rich and abstract knowledge [1]–[3], enabling robots to
act with greater autonomy. However, relying directly on raw RGB observations presents
several challenges in this regard. Such data provides no inherent semantic understanding
of manipulations, and the high-dimensional nature of the representation demands large
training sets and compute resources. In an HRC context, by contrast, a model might be
expected to recognize, for instance, a very specific cooking action efficiently from very
few data points so that the robot can collaborate with the human user.

Semantic scene graphs offer a way to alleviate these limitations by explicitly capturing
the underlying structure of the scene. By reducing the representation to meaningful
entities and relations, they both lower the dimensional burden and make it feasible to
train and deploy models in real time. This abstraction also filters out irrelevant
variability, including changes in illumination, camera viewpoint, background clutter, and
object appearance. As a result, the model can focus on the relational cues among objects
to infer the intended action and generalize more easily.

With this motivation in mind, we study the real-time recognition of human manipulation

actions using symbolic scene graphs [3], where nodes represent objects and edges store

semantic embeddings for spatial and/or temporal relations between objects, such as

touching, being above, moving together, and getting close, among others.

1.3 Current Approaches & Our Contributions

Most existing work on graph-based manipulation recognition either overlooks the real-time
constraint [4]–[7] or uses temporally concatenated graph representations that do not scale
well over longer horizons [1,8], which restricts these models to recognizing only
relatively extended manipulation episodes.

Therefore, to address these challenges, we introduce a new Factorized Graph Sequence

Encoder network to recognize manipulation actions in real-time using the scene graph

representation only. Inspired by the factorized encoder design in ViViT [9], more

specifically ViViT Model 2, our model separates spatiotemporal feature extraction into

Graph Encoder and Sequence Encoder combined with a new Hand Pooling operation.

Because our model processes each graph independently, it can flexibly scale across

the temporal dimension without requiring deeper graph neural network architectures,

unlike prior approaches [1,8].

Our novel parameter-free Hand Pooling operation extracts node embeddings associated

with hands, enhancing recognition performance. Moreover, we apply a sliding window



strategy with majority voting to boost inference performance, introducing only a

minimal constant delay.

The summary of our contributions is as follows:

• We introduce a new Factorized Graph Sequence Encoder combined with a new

Hand Pooling operation that improves the F1-macro score by 14.3% and 5.6% in

comparison to the nearest competitor [8] on Bimacs [1] and CoAx [10] datasets,

respectively. Furthermore, when allowing a slightly higher delay, our model achieves

results comparable to offline models that process entire videos at once.

• Addressing the limitations of previous approaches, we demonstrate that our network

design supports temporal scalability, meaning that as the input sequence length

increases, the model performs better.

We also note that the results presented in this thesis were published at the

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) in

2025. Specifically, the sections on scene graphs (Section 2.1), graph neural networks

(Section 2.2), qualitative evaluation (Section 4.5), and the appendix titled An Attempt to

Re-Extract the Graph Data (Appendix A) were written from scratch. The remaining

chapters are extended versions of the aforementioned publication.



2. BACKGROUND

In this chapter, we first provide a brief background on scene graphs and graph neural

networks, and describe the computations performed by the specific GNN layer employed

in this work. We then present an overview of related literature on graph-based action

recognition methods.

2.1 Scene Graphs

Early computer vision systems primarily focused on recognizing objects in isolation.

While object detection and classification have achieved remarkable accuracy, such

representations are insufficient for capturing the rich semantics of real-world scenes.

Understanding a scene often requires reasoning about how objects interact, not just

which objects are present. For instance, distinguishing between a man riding a horse

and a man standing next to a horse depends critically on relational information rather

than object identities alone.

Scene graphs were introduced to address this limitation by providing a structured,

explicit representation of objects and their relationships in a visual scene. By modeling

objects, their attributes, and their pairwise relations, scene graphs enable higher-level

reasoning and support downstream tasks such as semantic image retrieval, visual

question answering, image captioning, and image generation. Empirical results

show that incorporating relational structure leads to significant improvements over

representations that rely solely on object-level or low-level visual features.

The origin of scene graphs is commonly attributed to [11], which introduced them in the
context of semantic image retrieval. They were later extended by [12], which demonstrated
the benefits of contextual reasoning over scene graphs for relationship prediction.

A scene graph is a visually grounded graph representation of an image in which nodes
correspond to object instances localized in the image, and directed edges represent
semantic relationships between pairs of objects. Each object node is typically associated
with a category label and may include attributes, while each edge encodes a predicate
describing how two objects are related, such as spatial, functional, or action-based
relations. By explicitly modeling objects and their pairwise relationships within a
unified graph structure, scene graphs provide a structured representation that supports
contextual reasoning about the contents of a visual scene. An example scene graph is
provided in Figure 2.1.

Figure 2.1: An example scene graph. Taken from [12].

A wide range of alternative scene graph representations has been proposed in the

literature, as surveyed in [13]. In this thesis, we adopt an ESEC-based representation,

as it provides a favorable trade-off between expressive power for action discrimination

and simplicity of extraction from raw video data.

ESEC-based Graph Representation

Given a human manipulation demonstration captured from a third-person viewpoint (as

shown in Figure 2.2), we represent each scene as a graph following the manipulation

action ontology presented in [14]. In this representation, nodes correspond to objects,

and edges encode the spatiotemporal relations between them.

Figure 2.2: Example illustration that shows RGB-D data to graph representation



Figure 2.3: Static relations include (a1) Above/Below, (a2) Around, and (a3)
Inside/Surround. Dynamic relations include (b1) Moving Together, (b2) Halting Together,
(b3) Fixed-Moving Together, (b4) Getting Close, (b5) Moving Apart, and (b6) Stable.
Taken from [3].

As detailed in [3], a total of 14 distinct semantic relations describe both the static or

spatial properties (for example, above, below, inside) and the dynamic or temporal

interactions (such as moving together, getting close, moving apart), as illustrated in

Figure 2.3. These relations are computed from the 3D bounding boxes of objects across

consecutive frames and are stored as binary edge features in the resulting scene graph.

It is important to recognize that any graph generation approach faces a fundamental

trade-off. On one hand, the representation must be easy to extract from the data. On

the other hand, it must be expressive enough to distinguish between different actions,

meaning it must possess sufficient representational richness. Since the preferred graph

extraction method is rule-based (as opposed to deep learning based methods), it is

easy to work with. Regarding its expressiveness, we partially show its effectiveness in

Section 5 by comparing with an RGB-based method.

Formally, the stream of scene graphs can be defined as $S = \{G_0, G_1, \ldots, G_t\}$,
where $G_t$ represents the extracted scene graph at time step $t$. At any specific time
$\tau$, the extracted graph is denoted as $G_\tau = (V_\tau, E_\tau)$, where $V_\tau$ is
the set of nodes, expressed as $V_\tau = \{v_\tau^i\}$. Each node $v_\tau^i$ is a
one-hot-encoded object category. Similarly, $E_\tau$ denotes the set of edges, given by
$E_\tau = \{e_\tau^j\}$, where each edge $e_\tau^j$ is represented as a 14-dimensional
binary feature vector, i.e., $e_\tau^j \in \{0,1\}^{14}$.
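As an illustration of the definition above, the following is a minimal sketch of how one graph of the stream could be held in memory. The object vocabulary, the class and function names, and the relation indices are assumptions made for this example only; they are not taken from the dataset or from the implementation used in this thesis.

```python
from dataclasses import dataclass, field

NUM_RELATIONS = 14  # length of the binary edge feature vector, as stated in the text

# Hypothetical object vocabulary, used only for illustration.
CATEGORIES = ["hand", "cup", "bowl", "knife"]


def one_hot(category: str) -> list[int]:
    """Encode an object category as a one-hot node feature."""
    return [1 if c == category else 0 for c in CATEGORIES]


@dataclass
class SceneGraph:
    """One graph G_tau = (V_tau, E_tau) of the stream."""
    nodes: list[list[int]] = field(default_factory=list)  # one-hot node features
    edges: dict[tuple[int, int], list[int]] = field(default_factory=dict)

    def add_node(self, category: str) -> int:
        """Append a node and return its index."""
        self.nodes.append(one_hot(category))
        return len(self.nodes) - 1

    def add_edge(self, src: int, dst: int, active_relations: set[int]) -> None:
        """Store a 14-dimensional binary vector with ones at the active relations.

        The relation indices are placeholders; the actual index-to-relation
        mapping is not specified in this section.
        """
        self.edges[(src, dst)] = [
            1 if r in active_relations else 0 for r in range(NUM_RELATIONS)
        ]


graph = SceneGraph()
hand = graph.add_node("hand")
cup = graph.add_node("cup")
graph.add_edge(hand, cup, active_relations={0, 9})

stream = [graph]  # S = {G_0, ..., G_t}, here with a single time step
```

Because nodes and edges are stored per graph, the number of objects is free to change from one time step to the next.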

2.2 Graph Neural Networks

Deep learning architectures such as Multilayer Perceptrons (MLPs), Convolutional Neural
Networks (CNNs), and Recurrent Neural Networks (RNNs) have achieved remarkable success in
domains including computer vision, speech recognition, and natural language processing. A
common characteristic of these architectures is the assumption that data lies on a regular
and well-defined structure, such as a grid or a sequence. However, many real-world
problems involve data that is inherently irregular and relational, such as social
networks, molecular graphs, recommender systems, and knowledge graphs. In those domains,
data points are best represented as nodes connected by edges, forming a graph structure.

Traditional neural networks cannot process graph-structured data directly because graphs
lack a fixed topology, have variable neighborhood sizes, and require permutation
invariance. These limitations motivate the development of new neural network architectures
that can operate directly on graphs while preserving their relational structure.

Graph Neural Networks (GNNs) address these challenges by extending deep learning

techniques to graph domains. They enable learning on graphs by iteratively aggregating

and transforming information from neighboring nodes. This allows GNNs to capture

both node attributes and graph topology in a unified framework.

2.2.1 A generic formulation

At the core of most GNN models lies the message passing mechanism. Let a graph be defined
as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. A generic
message passing GNN updates node representations iteratively across layers. Denoting the
hidden representation of each node $v$ at layer $k$ as $h_v^{(k)}$, the update rule can be
written as

$$h_v^{(k+1)} = \phi\Big( h_v^{(k)}, \bigoplus_{u \in \mathcal{N}(v)} \psi\big( h_v^{(k)}, h_u^{(k)}, e_{uv} \big) \Big) \tag{2.1}$$

where $\psi$ is the message function, $\phi$ is the update function, $e_{uv}$ denotes edge
features between nodes $u$ and $v$, and $\bigoplus$ represents a permutation-invariant
aggregation operator such as summation or averaging. Also, the neighborhood of node $v$ is
denoted by $\mathcal{N}(v)$.

This formulation ensures permutation invariance with respect to node ordering, which is a
fundamental requirement for graph-based learning. By stacking multiple layers, each node
representation gradually incorporates information from increasingly distant neighbors,
allowing the network to capture higher-order structural patterns. More explicitly, in an
$n$-layer GNN, node $v$ receives information from all nodes within $n$ hops.
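The update rule in Equation 2.1 can be sketched in a few lines of plain Python. This is only an illustrative toy, assuming summation as the aggregation operator, scalar edge features, and messages with the same dimension as the node features; the function names and the toy choices of ψ and φ are assumptions of this example.

```python
def message_passing_layer(h, edges, psi, phi):
    """Apply one layer of Eq. (2.1) with summation as the aggregation operator.

    h     : dict mapping node -> feature vector (list of floats)
    edges : dict mapping (u, v) -> edge feature e_uv; u sends a message to v
    psi   : message function psi(h_v, h_u, e_uv) -> vector
    phi   : update function phi(h_v, aggregated) -> vector
    """
    dim = len(next(iter(h.values())))
    aggregated = {v: [0.0] * dim for v in h}
    for (u, v), e_uv in edges.items():
        message = psi(h[v], h[u], e_uv)
        aggregated[v] = [a + m for a, m in zip(aggregated[v], message)]
    return {v: phi(h[v], aggregated[v]) for v in h}


# Toy message and update functions (assumptions for illustration only).
def psi(h_v, h_u, e_uv):
    return [e_uv * x for x in h_u]  # neighbor features scaled by the edge weight


def phi(h_v, agg):
    return [a + b for a, b in zip(h_v, agg)]  # residual-style sum


# Path graph 0 - 1 - 2 with bidirectional edges; only node 0 carries a signal.
h0 = {0: [1.0], 1: [0.0], 2: [0.0]}
edges = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}

h1 = message_passing_layer(h0, edges, psi, phi)
h2 = message_passing_layer(h1, edges, psi, phi)
```

After one layer, node 2 is still unaffected by node 0, because the two nodes are two hops apart. After the second layer the signal arrives, which is the n-hop behavior described above.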

2.2.2 Graph convolutional networks

One of the widely used examples of a GNN layer is the Graph Convolutional Network (GCN)
[15]. GCNs simplify the message passing mechanism by using a linear aggregation of
normalized neighbor features followed by a nonlinear activation. Let $A$ be the adjacency
matrix of the graph, and let $I$ denote the identity matrix. Self-loops are added by
defining $\tilde{A} = A + I$. The corresponding degree matrix is denoted by $\tilde{D}$,
where $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and it is used to normalize the adjacency
matrix as $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$.

Finally, the GCN update rule in matrix form is given by

$$H^{(k+1)} = \sigma\big( \hat{A} H^{(k)} W^{(k)} \big) \tag{2.2}$$

where $H^{(k)}$ is the node feature matrix at layer $k$, $W^{(k)}$ is a trainable weight
matrix, and $\sigma(\cdot)$ is a nonlinear activation function such as ReLU.
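To make the normalization concrete, the sketch below computes the normalized adjacency matrix and one GCN layer (Equation 2.2) for a three-node path graph using plain Python lists. The helper names are assumptions of this example, and ReLU is chosen as the activation; a practical implementation would rely on a tensor library with sparse operations.

```python
import math


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_tilde = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    degree = [sum(row) for row in a_tilde]  # diagonal of D~
    return [
        [a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, h, w):
    """One GCN layer: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(normalized_adjacency(adj), h), w)
    return [[max(0.0, x) for x in row] for row in z]


# Path graph 0 - 1 - 2.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
a_hat = normalized_adjacency(adj)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = gcn_layer(adj, identity, identity)  # equals a_hat for identity H and W
```

With self-loops, the degrees are 2, 3, and 2, so the entry between nodes 0 and 1 becomes $1/\sqrt{2 \cdot 3}$ and the self-loop of node 1 becomes $1/3$. Every neighbor is weighted by degree alone, independent of its features.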

2.2.3 Transformer-like GNN

While GCNs are computationally efficient and effective for many tasks, they implicitly

assign equal importance to all neighboring nodes after normalization. This uniform

weighting limits their expressiveness, particularly in graphs where some neighbors are



more informative than others. To address this limitation, attention-based graph neural

networks introduce mechanisms that allow the model to learn adaptive, data-dependent

importance weights for neighboring nodes during aggregation, rather than treating all

neighbors equally.

Another limitation of standard GCNs is that they typically operate only on scalar edge
weights. This restricts their ability to model more complex relationships between
entities, particularly when edges carry high-dimensional attributes, as in our graph
representation.

Both of these limitations can be addressed by the TransformerConv layer [16]. Inspired by
the success of transformers in sequence modeling, TransformerConv extends attention
mechanisms to graph-structured data using scaled dot-product attention and multi-head
architectures. For a node $i$, the feature update rule is given by

$$h'_i = W_1 h_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j} \big( W_2 h_j + W_6 e_{ij} \big) \tag{2.3}$$

where $h_i \in \mathbb{R}^{d}$ denotes the input feature vector of node $i$, and
$e_{ij} \in \mathbb{R}^{d_e}$ represents the edge feature between nodes $i$ and $j$. The
matrices $W_1, W_2 \in \mathbb{R}^{d' \times d}$ and $W_6 \in \mathbb{R}^{d' \times d_e}$
are learnable linear transformations applied to the central node features, neighbor node
features, and edge features, respectively.

The attention coefficients $\alpha_{i,j}$ are computed using scaled dot-product attention
as

$$\alpha_{i,j} = \operatorname{softmax}_{j \in \mathcal{N}(i)} \left( \frac{(W_3 h_i)^{\top} (W_4 h_j + W_6 e_{ij})}{\sqrt{d'}} \right) \tag{2.4}$$

where $W_3, W_4 \in \mathbb{R}^{d' \times d}$ are learnable projection matrices that map
node features to query and key representations, respectively, and $d'$ denotes the
dimensionality of the attention space. The softmax operation is applied over all neighbors
of node $i$, ensuring that the attention coefficients form a normalized distribution.

Through this mechanism, TransformerConv allows the model to selectively focus on

the most relevant neighboring nodes while naturally incorporating rich, vector-valued

edge attributes, resulting in a more expressive and flexible graph representation.
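The following single-head sketch traces Equations 2.3 and 2.4 for one node using plain Python lists. It is a didactic approximation rather than the library layer: it assumes one attention head, no bias terms, and at least one neighbor, and all function names and toy weights are assumptions of this example.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def transformer_conv_node(i, h, neighbors, e, W1, W2, W3, W4, W6):
    """Single-head update of node i following Eqs. (2.3) and (2.4).

    h         : dict node -> feature vector
    neighbors : list of neighbor indices of node i (assumed non-empty)
    e         : dict (i, j) -> edge feature vector
    Returns the updated feature h'_i and the attention coefficients.
    """
    d_out = len(W3)  # dimensionality d' of the attention space
    query = matvec(W3, h[i])
    scores, values = [], []
    for j in neighbors:
        edge_term = matvec(W6, e[(i, j)])
        key = add(matvec(W4, h[j]), edge_term)
        scores.append(dot(query, key) / math.sqrt(d_out))
        values.append(add(matvec(W2, h[j]), edge_term))
    # Numerically stable softmax over the neighbors of node i.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]
    out = matvec(W1, h[i])
    for a, v in zip(alpha, values):
        out = add(out, [a * x for x in v])
    return out, alpha


# Toy example: d = d' = 2 and one-dimensional edge features.
I2 = [[1.0, 0.0], [0.0, 1.0]]
W6 = [[1.0], [0.0]]
h = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
e = {(0, 1): [0.0], (0, 2): [0.0]}

new_h0, alpha = transformer_conv_node(0, h, [1, 2], e, I2, I2, I2, I2, W6)
```

In this toy setting, neighbor 1 matches the query of node 0 more closely than neighbor 2, so it receives the larger attention weight, while the weights still sum to one.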



2.3 Related Works: Graph-based Action Recognition Models

In addition to the representational and architectural background, reviewing existing

approaches in this domain provides useful context for the problem we address.

Our focus is on recognizing human manipulation actions through scene graph

representations; therefore, RGB-based action recognition models [17]–[22] fall outside

the scope of this thesis and are not reviewed here. Accordingly, in the following

sections we review prior work on graph-based manipulation recognition, organizing the

discussion into two categories: offline models and real-time models.

2.3.1 Offline models

There exists a large corpus of work on graph-based scene representation for manipulation
recognition [23]–[26]. Most of these works, however, operate offline in batch mode. For
instance, Akyol et al. [4] propose a two-headed manipulation recognition and prediction
network based on Variational Graph Autoencoders [27], where reconstruction is not
necessary. However, this work assumes that the key scene graphs are known a priori. In
addition, the proposed model accepts only a single graph as input and therefore lacks
temporal understanding, which makes it infeasible for real-time applications.

Morais et al. [5] follow a different approach and model each entity in the scene with a
state that evolves sparsely and asynchronously throughout the video as the entities
interact with each other. The state is a manipulation label for humans and an affordance
label for objects. Node features are derived from low-level visual features extracted
using a pre-trained Faster-RCNN [28] model, and the messages between nodes are modeled as
a type of attention, i.e., the cosine similarity between node features. This architecture
is extended in [6] with a position-based object graph to improve its performance.

The work in [7] proposes an encoder-decoder architecture for joint learning of both

manipulation recognition and temporal segmentation tasks. Their contribution involves

a novel attention-based graph convolution layer to encode scene graphs and a temporal

pyramidal pooling module to decode these graph embeddings into framewise labels.



Spatial position information is the only cue employed as node embedding to represent

skeletons and objects in the scene. The edges are dynamically created between highly

correlated nodes during manipulation, except for those between skeleton joints, which

are defined naturally. Conventional 2D convolution operations are then applied to a

generated V ×T dimensional feature map, where V is the number of objects and T

defines the length of the video. However, this design strictly assumes that the number

of nodes throughout the video is constant, which is a highly restrictive assumption for

complex manipulation sequences. Building on [7], the work in [29] enhances temporal

segmentation by introducing a Temporal Feature Fusion decoder while preserving feature

space distances with Spectral Normalized Residual connections. However, the model in [29]

is 3.7 times larger than the one in [7], leading to higher computational complexity.

In contrast to these works, our model operates in real-time and does not rely on any

prior knowledge about the number of graph nodes/edges, nor does it require low-level

RGB features.

2.3.2 Real-time models

In the context of online manipulation recognition, Dreher et al. [1] propose a model

based on the graph encoder-decoder architecture [30]. They first extract graphs for each

frame separately, using spatiotemporal relations introduced in [3]. Next, in order to

combine the sequential graphs, they introduce the temporal connections between the

same nodes in consecutive graphs. However, considering that graph neural networks are

capable of propagating information to n-hop distant nodes where n denotes the number

of layers, this design exhibits scalability limitations when the temporal length of the

input increases. One might suggest that new layers could be added to compensate, but

in return, over-smoothing [31,32] might occur, which is a well-known phenomenon in

deep graph networks, where no meaningful and distinguishable node embeddings are

learned.
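The scalability argument can be illustrated with a small counting sketch. It assumes, as a simplification, that only the temporal edges linking the same node in consecutive frames are followed, and that each GNN layer moves information by exactly one hop; the function name is hypothetical.

```python
def frames_reached(num_frames, num_layers):
    """Frames whose copy of a node can influence the last frame.

    Consecutive frames are linked by temporal edges, and each GNN layer
    propagates information by one hop along these edges.
    """
    last = num_frames - 1
    reached = {last}
    frontier = {last}
    for _ in range(num_layers):
        frontier = {
            t + step
            for t in frontier
            for step in (-1, 1)
            if 0 <= t + step < num_frames
        } - reached
        reached |= frontier
    return sorted(reached)


short_context = frames_reached(num_frames=10, num_layers=3)
full_context = frames_reached(num_frames=4, num_layers=10)
```

With three layers and ten temporally linked frames, the last frame can only draw on itself and the three preceding frames. Covering a longer window therefore requires a deeper network, which is where over-smoothing becomes a concern.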

Another recent attempt [8] proposes a joint model for manipulation recognition and
manipulation-conditioned motion forecasting, with a two-stage training procedure.
Initially, the manipulation action recognition module is trained; subsequently, to predict
the motion of the objects and hands, the model employs the predicted manipulation
information in addition to the current graph sequence. In this graph representation, node
embeddings consist of 3D object positions concatenated with one-hot encoded object
categories. Furthermore, only edges between the hands and other objects are considered,
where the edge feature is simply the distance between the hand and the object. As in the
case of [1], the consecutive graphs are linked with temporal edges. Consequently, the
aforementioned criticisms regarding limited temporal scalability also apply to [8].
Additionally, discarding the edges between objects may prevent the model from learning
more complex manipulations.

The recent work in [33] employs skeleton data and applies the sliding window with a

majority voting approach on top of the Spatial-Temporal Graph Convolutional Networks

(ST-GCN) [34]. The scalability issue is also valid for the ST-GCN model due to the

temporal concatenation of sequential graphs.

Our proposed model also differs from these real-time capable works due to our factorized

encoder design, which enables temporal scaling of the network to enhance accuracy.



3. FACTORIZED GRAPH SEQUENCE ENCODER

In this chapter, we explain the proposed model architecture in detail. It is worth noting

that the final design emerged through extensive empirical exploration, with numerous

components such as the choice of GNN layer, parameter count, and normalization

strategy evaluated through ablation studies. Some of these design choices are supported

by the findings presented in Section 4. We conclude this chapter by describing how the

model operates in real time using a sliding window mechanism combined with majority

voting.

3.1 Model Architecture

Figure 3.1: The proposed Factorized Graph Sequence Encoder (FGSE) network.

We propose a new Factorized Graph Sequence Encoder (FGSE) network to recognize

manipulation actions in real-time from a stream of graph data. FGSE consists of two

distinct encoder types: Graph Encoder and Sequence Encoder, combined with a new

Hand Pooling operation, as illustrated in Figure 3.1.

Our Graph Encoder (GE) is built on a transformer-inspired graph convolutional operator
called TransformerConv [16]. As described in Section 2.2.3, this layer employs
attention-based message passing, enabling each node to assign different weights to
information coming from its neighbors while also integrating the edge features that
characterize their spatial and temporal relationships. In doing so, the layer produces
refined node embeddings that reflect both the graph topology and the semantics of the
relations.

To stabilize training and maintain consistent representation scaling across layers, each

TransformerConv block is followed by LayerNorm [35]. This choice aligns with

mainstream transformer architectures, where LayerNorm plays a key role in ensuring

numerical stability and smoother gradient flow when stacking many attention-based

layers.

We also apply the SELU activation function [36] after each convolutional layer. SELU is

a self-normalizing activation that drives activations toward zero mean and unit variance

during training. This property reduces the risk of vanishing or exploding activations as

the network deepens, while eliminating the need for explicit normalization within the

activation pathway. Combining SELU with LayerNorm enhances stability and helps the

model converge more reliably.

Repeating this sequence of TransformerConv, SELU, and LayerNorm N times yields

the complete GE module, illustrated in Figure 3.1. Through these stacked layers, the

encoder progressively enriches the node embeddings, enabling the network to capture

increasingly intricate relational and structural cues from the scene graph. Also, note that

the extent to which node information can propagate, measured in n-hop neighborhoods,

is determined by the number of layers in the GNN.
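The ordering of operations inside the GE module can be sketched as follows. This is a schematic outline only: the convolution is passed in as a placeholder callable standing in for TransformerConv, LayerNorm is shown without learnable scale and shift, and the function names are assumptions of this example. The SELU constants are the standard published values.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def layer_norm(vec, eps=1e-5):
    """Normalize one embedding to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def graph_encoder(node_feats, conv_layers):
    """Apply conv -> SELU -> LayerNorm once per entry of conv_layers.

    node_feats  : dict node -> feature vector
    conv_layers : list of callables mapping node features to node features;
                  each one is a stand-in for a graph convolution layer
    """
    h = node_feats
    for conv in conv_layers:
        h = conv(h)
        h = {v: layer_norm([selu(x) for x in vec]) for v, vec in h.items()}
    return h


def identity_conv(h):
    """Placeholder convolution that leaves the node features unchanged."""
    return h


encoded = graph_encoder({0: [1.0, 2.0, 3.0]}, [identity_conv, identity_conv])
```

Here the length of `conv_layers` plays the role of N, so it also fixes how many hops information can travel within a single scene graph.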

As mentioned above, the number of nodes varies from one graph to another, which makes it
difficult to pass graph representations directly into standard neural network components
that expect constant-sized inputs. Pooling functions address this issue by reducing a
variable-sized set of node embeddings to a constant-sized representation that can be
processed by downstream modules. The simplest and most common pooling strategy is average
pooling, where the final graph embedding is obtained by taking the mean of all node
embeddings in the graph. This provides a straightforward, permutation-invariant summary of
the entire graph. However, it has an obvious drawback: it treats every node as equally
important, causing informative or task-critical nodes to be diluted by less relevant ones.

In manipulation action scenarios, hands are the main and only manipulators interacting

with the objects in the scene [14]. Therefore, it is reasonable to assume that hands

accumulate more descriptive embeddings to infer types of performed manipulations.

With this assumption, to obtain graph-level embeddings, we propose a simple and

parameter-free operation, named Hand Pooling (HP), that selects node embeddings

belonging to the hands in the initial graph. The combination of these two stages can be

expressed as:

$$\mathrm{HP}(\mathrm{GE}_{\theta}(G_\tau)) = z_{h,\tau} \tag{3.1}$$

where the Graph Encoder network, GE, is parametrized by $\theta$ and $z_{h,\tau}$ is the
hand-corresponding ($h$) embedding vector pooled by HP at time $\tau$ from the
corresponding scene graph $G_\tau$.
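A minimal sketch of the idea behind Hand Pooling, next to average pooling for comparison, is given below. The category labels `left_hand` and `right_hand`, the function names, and the toy embeddings are assumptions of this example; the actual implementation may identify hand nodes differently.

```python
def hand_pooling(node_embeddings, node_categories,
                 hand_labels=("left_hand", "right_hand")):
    """Select the embeddings of hand nodes; the operation has no parameters.

    node_embeddings : list of embedding vectors produced by the graph encoder
    node_categories : list of category names aligned with node_embeddings
    """
    return {
        category: embedding
        for embedding, category in zip(node_embeddings, node_categories)
        if category in hand_labels
    }


def average_pooling(node_embeddings):
    """Mean over all node embeddings, shown for comparison."""
    count = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(vec[k] for vec in node_embeddings) / count for k in range(dim)]


embeddings = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2], [0.4, 0.4]]
categories = ["left_hand", "right_hand", "cup", "bowl"]

pooled_hands = hand_pooling(embeddings, categories)
pooled_mean = average_pooling(embeddings)
```

Hand Pooling returns the hand embeddings unchanged, whereas the mean blends them with the embeddings of the cup and the bowl, which is the dilution effect discussed above.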

The Sequence Encoder (SE) is an Encoder-only Transformer [37] that enables the model to
learn temporal context by applying self-attention to the hand embeddings ($z_{h,\tau}$) in
the input sequence. Finally, for each graph, a linear layer is applied to map those
embeddings to manipulation labels. Formally, these two layers combined can be stated as:

$$\mathrm{SEL}_{\phi}\big( z_{h,\tau-(W-1)}, \cdots, z_{h,\tau} \big) = \big( \hat{y}^{0}_{\tau-(W-1)}, \cdots, \hat{y}^{W-1}_{\tau} \big) \tag{3.2}$$

where the Sequence Encoder network (SE) together with the linear layer (L), denoted SEL,
is parametrized by $\phi$, and $W$ is the input window length of the model. The model
prediction $\hat{y}$ is the output vector of the Softmax layer (which is omitted in the
notation for the sake of simplicity), and its superscript denotes the relative position of
the prediction within the given input. Note that, alternatively, the model could have
predicted a single label for the whole input sequence by using the mean of the output
embeddings or by employing, for instance, a class token. However, we observed that for
long sequences this strategy significantly reduces the model's performance due to natural
transitions among different types of manipulations throughout a long scenario. This is
elaborated further in the discussion chapter.



The hallmark of our design is the separation of Graph and Sequence Encoders. This

design allows the model to efficiently pass information among graphs even when the

temporal length increases, regardless of the number of layers in the GE module. In

addition, our new HP operation reduces the workload of SE by exclusively returning

hand embeddings. This is further discussed in Section 5.

3.2 Sliding Window with Majority Voting (SW-MV)

The FGSE network returns a manipulation label for each corresponding input graph, as
depicted in Figure 3.1. During inference, we utilize a sliding window approach, which
generates $W$ labels for a given graph $G_\tau$. To combine these predictions, the
majority voting algorithm is leveraged, as illustrated in Figure 3.2. More formally, let
$\hat{y}^{w}_{\tau}$ be the prediction vector, i.e., the output of the Softmax activation
for $G_\tau$ when it is the $w$-th element in the sliding window. Majority voting then
combines all predictions into a final one as:

$$\tilde{y}_{\tau} = \operatorname*{argmax}_{c} \sum_{w=0}^{W-1} \mathbb{1}\big( \operatorname{argmax}(\hat{y}^{w}_{\tau}) = c \big) \tag{3.3}$$

where $\tilde{y}_{\tau}$ is the combined label at time $\tau$, and $\mathbb{1}$ represents
the indicator function. Note that applying SW-MV, which resembles ensemble learning, helps
reduce noisy predictions and prevents over-segmentation over time.

As can be noticed, applying majority voting to a sliding window with a length of W

introduces a delay of W/FPS seconds for the model output. Consequently, while a

larger window enables the model to capture a richer local context, it comes with a cost

of delay proportional to W.

Figure 3.2: Majority voting is used to combine shifted predictions. As the window
slides along the temporal axis, new predictions are generated. Here, the window size
(W) is 5, and each colored box denotes a different predicted manipulation label.
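
As an illustration of Equation 3.3 and of the resulting delay, the following is a minimal sketch in plain Python, not the thesis implementation. The function names, the list-of-labels input layout, and the toy labels are assumptions made for this example; each window is assumed to have already been reduced to per-graph argmax labels. Graphs near the sequence boundaries are covered by fewer than W windows, so they simply receive fewer votes here.

```python
from collections import Counter, defaultdict


def sliding_window_majority_vote(window_predictions, window_length):
    """Combine per-window labels into one label per time step.

    window_predictions[s][w] is the label predicted for graph index s + w
    by the window that starts at graph index s (hypothetical layout).
    """
    votes = defaultdict(Counter)
    for start, labels in enumerate(window_predictions):
        if len(labels) != window_length:
            raise ValueError("each window must contain window_length labels")
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # Ties go to the label that received its first vote earliest (arbitrary choice).
    return [votes[t].most_common(1)[0][0] for t in sorted(votes)]


def voting_delay_seconds(window_length, fps):
    """Delay introduced by waiting for all W votes: W / FPS seconds."""
    return window_length / fps


# Toy example with W = 3 and three consecutive window positions.
windows = [
    ["idle", "idle", "lift"],
    ["idle", "lift", "lift"],
    ["hold", "lift", "lift"],
]
combined = sliding_window_majority_vote(windows, window_length=3)
```

With a hypothetical stream of 30 graphs per second and W = 30, `voting_delay_seconds(30, 30)` gives a one-second delay, in line with the W/FPS relation stated above. Note how the isolated "hold" prediction in the third window is outvoted, which is the noise-suppression effect described in this section.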

4. EXPERIMENTS

In this chapter, we describe the complete experimental setup, including the datasets

used, the model training procedure, the evaluation methodology, and a comparison of

our results with leading approaches reported in the literature.

4.1 Datasets

We benchmark the proposed model on two publicly available datasets described below.

The KIT Bimanual Action (Bimacs) Dataset [1] features 6 subjects performing 9 distinct

manipulation tasks relevant to kitchen and workshop environments, with each task

repeated 10 times. We borrow Figure 4.1 from [1], which illustrates three representative

videos using selected frames. In total, the dataset contains 2 hours and 18 minutes

of RGB-D recordings and covers 14 atomic manipulation categories: idle, approach,

retreat, lift, place, hold, stir, pour, cut, drink, wipe, hammer, saw, and screw. The videos

are fully annotated for each hand individually and involve interactions with 12 distinct

objects, namely: cup, bowl, whisk, bottle, banana, cutting board, knife, sponge, hammer,

saw, wood, and screwdriver.

Figure 4.1: Sample videos from the Bimacs dataset. The first row presents breakfast
preparation, the second row depicts a cooking task involving stirring and pouring, and
the third row shows hard drive disassembly by unscrewing a screw. Taken from [1].

The Bimacs dataset already provides extracted graphs with ESEC relations [3], so we feed

these graphs directly to our FGSE model. Note

that since manipulations in Bimacs are labeled for each hand separately, we employ

two linear layers to predict each manipulation performed by the left and right hands

individually.

The Collaborative Action (CoAx) Dataset [10] involves 6 subjects executing 3 industrial

assembly manipulation tasks, one of which involves interaction with a collaborative

robot. Each manipulation task is repeated 10 times. We borrow Figure 4.2 from [10],

which illustrates one representative video for each of the three tasks. The

dataset contains a total of 1 hour and 58 minutes of RGB-D video data. It

comprises 10 distinct manipulation actions and 16 objects, with frames annotated as

action-object pairs. Although this setup yields 160 possible action-object combinations,

only 23 pairs actually occur in the CoAx dataset. To reduce model complexity, we

identify these existing combinations and merge each action-object pair into a single

unified label. The resulting labels are: approach, grab screwdriver, plug screwdriver,

grab valve, screw screwdriver, release valve, grab soldering iron, plug soldering iron,

retreat, join screwdriver, grab valve terminal, plug valve terminal, place screwdriver,

grab box with screws, place box with screws, grab hose, wait for robot, plug hose,

grab box with membrane, grab soldering station, solder hose, release soldering station,

release box with membrane.
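
The merging step can be sketched as follows. This is a hypothetical illustration only: the tuple format of the annotations and the function name are assumptions, and the four toy frames stand in for the real CoAx annotations.

```python
def build_unified_labels(frame_annotations):
    """Map every (action, object) pair that occurs in the data to one class index."""
    pair_to_index = {}
    for pair in frame_annotations:
        if pair not in pair_to_index:
            pair_to_index[pair] = len(pair_to_index)
    return pair_to_index


# Toy annotations (hypothetical); the real dataset has 10 actions and 16 objects.
frames = [
    ("grab", "screwdriver"),
    ("plug", "screwdriver"),
    ("grab", "screwdriver"),
    ("grab", "valve"),
]
label_map = build_unified_labels(frames)
encoded = [label_map[pair] for pair in frames]
possible_combinations = 10 * 16  # upper bound if every pairing occurred
```

Only the pairs that are actually observed receive a class index, so the classifier needs one output per occurring pair (23 for CoAx) rather than one per possible combination (160).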

Figure 4.2: An overview of the CoAx dataset tasks is shown. From top to bottom, the
rows depict Tasks 1 to 3: valve terminal setup and assembly with screws; valve

assembly with screws and a membrane; and soldering a capacitor using soldering tin,
assisted by a collaborative robot holding the soldering board. Taken from [10].

Additionally, the dataset includes 3D object bounding boxes; however, unlike

Bimacs [1], it does not provide spatiotemporal relation information. Following the

approach in [3], we derive these relations from the bounding boxes in order to construct

the graph representations of the dataset.

In both datasets, there might be noisy object detections and, consequently, incorrect

relations between those objects. To mitigate this issue, we set an empirical threshold to

filter out such relations. Specifically, if any two objects are too far apart, we remove the

edge between them.
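
A minimal sketch of such a distance-based edge filter is given below. The thesis does not specify how the distance is measured or which threshold is used, so the Euclidean distance between object centers, the `max_distance` value, and all names are assumptions for illustration.

```python
import math


def filter_distant_edges(centers, edges, max_distance):
    """Keep only edges whose endpoint objects lie within max_distance of each other."""
    kept = []
    for src, dst in edges:
        if math.dist(centers[src], centers[dst]) <= max_distance:
            kept.append((src, dst))
    return kept


# Hypothetical object centers in meters.
centers = {
    "hand": (0.0, 0.0, 0.0),
    "cup": (0.1, 0.0, 0.0),
    "saw": (2.0, 0.0, 0.0),
}
edges = [("hand", "cup"), ("hand", "saw")]
filtered = filter_distant_edges(centers, edges, max_distance=0.5)
```

In this toy scene the hand-saw edge is dropped because the two centers are 2 m apart, while the hand-cup edge is kept.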

4.2 Training Setup

We optimize the proposed FGSE network by minimizing the cross-entropy loss averaged

over the input window as given in Equation 4.1. Note that majority voting is not applied

during training.

$$\mathcal{L}_{CE} = -\frac{1}{W} \sum_{w=0}^{W-1} y_{\tau-w}^{\top} \cdot \log\left(\hat{y}_{\tau-w}\right) \tag{4.1}$$

where $y$ represents the one-hot-encoded ground truth, $\hat{y}$ is the prediction vector after

the softmax activation, and $(\cdot)$ denotes the dot product between these two vectors.
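
For concreteness, Equation 4.1 can be evaluated on a toy window as in the sketch below. It uses plain Python lists instead of a deep learning framework, and the function name and the two-class toy values are assumptions; it also assumes that the probability of the true class is strictly positive.

```python
import math


def windowed_cross_entropy(targets, probabilities):
    """Cross-entropy averaged over the W time steps of one window.

    targets[w] is a one-hot list; probabilities[w] is the softmax output
    for the same time step.
    """
    window_length = len(targets)
    total = 0.0
    for y, y_hat in zip(targets, probabilities):
        # Dot product y^T log(y_hat); zero entries of y contribute nothing.
        total += sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)
    return -total / window_length


# Toy window with W = 2 and two classes.
targets = [[1, 0], [0, 1]]
probabilities = [[0.5, 0.5], [0.25, 0.75]]
loss = windowed_cross_entropy(targets, probabilities)
```

Here the loss equals the mean of -log(0.5) and -log(0.75); a window in which every true class receives probability 1 yields a loss of zero.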

We experiment with varying window lengths, denoted as W followed by the respective

value (e.g., W30 means window length of 30). Additionally, we observed that

consecutive graphs are quite similar to each other unless the action changes. Therefore,

in certain experiments, we downsampled the input sequence by a factor of 3 to accelerate

training and testing without compromising accuracy; we refer to this setting as D3. Note that

for the metric calculations (F1-macro/micro), we upsampled the predictions back to the

original temporal resolution for a fair comparison.
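
The D3 setting can be sketched as below. The thesis does not state which upsampling scheme is used, so repeating each prediction `factor` times (nearest-neighbor upsampling) is an assumption, as are the function names and toy values.

```python
def downsample(sequence, factor=3):
    """Keep every factor-th element (D3 corresponds to factor=3)."""
    return sequence[::factor]


def upsample_labels(labels, factor, original_length):
    """Repeat each predicted label factor times and trim to the original length."""
    repeated = [label for label in labels for _ in range(factor)]
    return repeated[:original_length]


frame_indices = list(range(8))            # eight input graphs
kept_indices = downsample(frame_indices)  # graphs that are actually processed
predicted = ["idle", "lift", "hold"]      # one label per processed graph
restored = upsample_labels(predicted, factor=3, original_length=len(frame_indices))
```

After upsampling, `restored` again has one label per original graph, so it can be compared frame by frame against the ground truth.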

Through empirical evaluation, we set the number of layers in both the Graph Encoder

and Sequence Encoder to 2, i.e., the parameter N in Figure 3.1 is set to 2. Further

network and training parameter details can be found in the shared source code: https://github.com/eneserdo/FGSE


Table 4.1: Manipulation Recognition results on Bimacs [1].

Methods               Real-time capable   No visual feature   F1-macro   F1-micro
ASSIGN [5,38]         ✗                   ✗                   79.5       82.3
PGCN [7]              ✗                   ✓                   81.5       86.9
UQ-TFGCN [29]         ✗                   ✓                   88.6       88.4
Dreher et al. [1]     ✓                   ✓                   63.0       64.0
H2O+RGCN [8]          ✓                   ✓                   66.0       68.0
FGSE-W30-D3 (Ours)    ✓                   ✓                   78.1       81.1
FGSE-W75-D3 (Ours)    ✓                   ✓                   80.3       82.7

4.3 Evaluation

Macro- and micro-averaged F1 scores are reported to assess each

trained model. Note that due to class imbalance, the macro-averaged F1 score is a more

reliable metric to measure the performance. Following the work in [1], we apply the

leave-one-subject-out cross-validation approach to generate six folds for both datasets.

Each fold corresponds to a different subject in the dataset.
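
To make the difference between the two averages and the fold construction concrete, the following sketch computes both F1 variants from scratch and builds leave-one-subject-out splits. The function names and the toy labels are assumptions; a library such as scikit-learn would normally be used instead.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    total_tp = total_fp = total_fn = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_class_f1) / len(per_class_f1)
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return macro, micro


def leave_one_subject_out(subjects):
    """Return (training_subjects, held_out_subject) pairs, one per subject."""
    return [([s for s in subjects if s != held_out], held_out) for held_out in subjects]


# Imbalanced toy example: the rare class "cut" is never predicted.
y_true = ["idle"] * 8 + ["cut"] * 2
y_pred = ["idle"] * 10
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
folds = leave_one_subject_out([1, 2, 3, 4, 5, 6])
```

In this toy case the micro-averaged score stays at 0.8 although one class is missed entirely, whereas the macro-averaged score drops to about 0.44, which illustrates why the macro average is the more informative metric under class imbalance.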

4.4 Results

In this section, we present the results for two variants of our model with window lengths

of 30 and 75. Table 4.1 compares the recognition performance of our model with other

relevant models on the Bimacs [1] dataset. We separate the benchmarked models based

on their real-time capabilities, i.e., online versus offline models.

Among the online models (e.g., [8] and [1] in Table 4.1), our model (FGSE-W75-D3)

achieves a significant improvement over the previous state-of-the-art model [8], surpassing

it by 14.3 and 14.7 percentage points in terms of F1-macro and F1-micro scores, respectively.

Compared to the offline models (e.g., [5,7,38] and [29] in Table 4.1) that take the entire

video at once, i.e., access the complete context and relations between the manipulations,

our model (FGSE-W75-D3) achieves results comparable to [5,7] when the window length is

increased (W=75). We note, however, that the incorporation of visual features

in [5] contradicts the original purpose of scene graphs. Scene graphs are designed

to represent objects independently of their appearances or shapes, thereby making

manipulation recognition more generalizable.

The offline model UQ-TFGCN [29] attains the highest performance among all models;

however, it has the drawback of having the highest number of parameters (20.1M),

which is 74 times more than our model. Similarly, PGCN [7] has 21 times more

parameters (5.4M) than our model.

Figure 4.3: An example run from the test set of Bimacs [1], in which a person pours
water from a bottle into a cup, and then drinks it. The top three rows show the

ground-truth labels, predictions, and vote count in majority voting for the left hand, and
the next three rows correspond to the right hand. Each color represents different

actions: idle, approach, lift, hold, pour, place, retreat, drink. Layout
adapted from [1].

Figure 4.3 presents an illustrative sample from the Bimacs dataset and its qualitative

analysis. In addition to the ground truth and the predicted labels for the sample

video, we also included the vote counts in majority voting, which can be related

to the confidence level of the model. In all predictions, our model demonstrates a

high degree of confidence, except for instances involving transitions between distinct

manipulations.

We also provide a qualitative example where our model performs poorly. As

illustrated in Figure 4.4, some actions are incorrectly detected and, for some parts of the

video, over-segmentation is observed.

Table 4.2 reports the obtained recognition results on the CoAx dataset [10]. Our model

yields a new state-of-the-art score, improving on the nearest competitor [8] by 5.6 percentage points in

terms of F1-macro. Note that the results of Dreher et al. [1] on the CoAx dataset are

taken from [8].

Figure 4.4: An example result where our model performs poorly. As can be seen, some
actions are incorrectly detected and, for some parts of the video, over-segmentation is
observed.

Table 4.2: Manipulation Recognition results on CoAx [10].

Methods F1-macro F1-micro
Dreher et al. [1] 60.0 70.0
H2O+RGCN [8] 87.0 90.0
FGSE-W30 (Ours) 90.7 92.8
FGSE-W75-D3 (Ours) 92.6 94.9

A comparison of the W30 and W75 variants of our model in Table 4.1 and Table 4.2

indicates that slightly relaxing the real-time constraints, i.e.,

increasing the window length, leads to improved performance by allowing the model

to capture a larger local context. A further analysis on the impact of window length is

given in the discussion section.

Regarding the runtime performance, with 269K parameters and 4.8 GFLOPs (on

average), our proposed model FGSE achieves approximately 66 FPS on an Intel

i9-12900K CPU with an NVIDIA GeForce RTX 3060 GPU, indicating that it is

lightweight enough to run in real-time even on a low-end GPU card.

To summarize, the results indicate that our model achieves a new state-of-the-art

performance among real-time capable models. Moreover, it demonstrates promising

performance even when compared to offline models, especially given its extremely

parameter-efficient design relative to [29] and [7].

4.5 Qualitative Evaluation

Figure 4.5: Our graph extraction pipeline. Each intermediate step is visualized in
between the boxes.

We recorded a proof-of-concept video to evaluate the model in a real-world setting,

where a robot, controlled via teleoperation, assists a human in preparing a generic dish.

An RGB-D video was captured using a ZED camera. To obtain the corresponding

graphs, we constructed a pipeline using state-of-the-art, off-the-shelf tools, as illustrated

in Figure 4.5.

We follow a graph extraction procedure similar to that described for Bimacs [1]. In that

work, approximately 5.4k images were first manually annotated and then used to train a

YOLOv3 model, which automatically labeled the remaining images. Adopting a similar

strategy, we fine-tune a YOLOv11 model for object annotation using a human-annotated

subset of the data. The trained detector is then applied to predict 2D bounding boxes for

objects across the dataset. Using these bounding boxes together with the corresponding

RGB images, we employ the SAM2 [39] model to obtain object segmentation masks.

By incorporating the associated depth frames, we reconstruct scene-level point clouds

and apply the object masks to estimate 3D bounding boxes for each object.
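
The last step of this object pipeline can be sketched as follows. The thesis does not state how the 3D boxes are fitted, so the pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, the axis-aligned box, and the function name are all assumptions made for illustration; the toy intrinsics and depth values are not real camera parameters.

```python
def mask_to_3d_bbox(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels with a pinhole model and return an axis-aligned 3D box.

    mask and depth are row-major 2D lists of equal shape; depth is in meters.
    Returns (min_xyz, max_xyz), or None if no masked pixel has a valid depth.
    """
    points = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if inside and z > 0:  # skip pixels without a valid depth reading
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    if not points:
        return None
    mins = tuple(min(p[i] for p in points) for i in range(3))
    maxs = tuple(max(p[i] for p in points) for i in range(3))
    return mins, maxs


# Toy 2x2 image: three masked pixels, one of which has no depth reading.
mask = [[1, 1], [0, 1]]
depth = [[1.0, 2.0], [1.0, 0.0]]
bbox = mask_to_3d_bbox(mask, depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Pixels with missing depth are ignored here, which hints at why unreliable depth measurements directly affect the quality of the resulting boxes.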

For hand annotations, standard object detection or segmentation models prove

inadequate. Instead, we employ the AlphaPose [40] model, which can reliably localize

hands. Based on the detected hand keypoints, we fit 2D bounding boxes for the hands

and then follow a pipeline similar to that used for objects to obtain their corresponding

3D bounding boxes.
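
Fitting a 2D box around detected hand keypoints can be sketched as below. The `(x, y, confidence)` keypoint format, the confidence threshold, the pixel margin, and the function name are assumptions; the output format of AlphaPose is not reproduced here.

```python
def keypoints_to_bbox(keypoints, min_confidence=0.0, margin=0.0):
    """Return (x_min, y_min, x_max, y_max) around sufficiently confident keypoints."""
    kept = [(x, y) for x, y, conf in keypoints if conf >= min_confidence]
    if not kept:
        return None
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)


# Hypothetical keypoints in pixels; the last one is a low-confidence outlier.
hand_keypoints = [(10.0, 20.0, 0.9), (30.0, 25.0, 0.8), (100.0, 100.0, 0.1)]
hand_box = keypoints_to_bbox(hand_keypoints, min_confidence=0.5, margin=5.0)
```

The resulting 2D box can then be passed through the same mask and depth steps as the object boxes.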

We then aggregate the three-dimensional bounding box information for both objects and

hands and extract ESEC [3] relationships using a rule-based approach that leverages the

spatiotemporal arrangement of the 3D bounding boxes.
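
As a flavor of what such a rule can look like, the sketch below implements a single contact-style check between two axis-aligned 3D boxes. It is a deliberately simplified, hypothetical rule: the actual relation set follows [3] and is not reproduced here, and the tolerance value and function name are assumptions.

```python
def boxes_in_contact(box_a, box_b, tolerance=0.0):
    """True if two axis-aligned boxes overlap or lie within tolerance on every axis.

    Each box is given as (min_xyz, max_xyz).
    """
    (min_a, max_a), (min_b, max_b) = box_a, box_b
    return all(
        min_a[i] - tolerance <= max_b[i] and min_b[i] - tolerance <= max_a[i]
        for i in range(3)
    )


# Hypothetical boxes in meters.
hand = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
cup = ((1.02, 0.0, 0.0), (2.0, 1.0, 1.0))
bottle = ((3.0, 0.0, 0.0), (4.0, 1.0, 1.0))

hand_touches_cup = boxes_in_contact(hand, cup, tolerance=0.05)
hand_touches_bottle = boxes_in_contact(hand, bottle, tolerance=0.05)
```

A small tolerance absorbs measurement noise in the box estimates; without it, the 2 cm gap between hand and cup in this toy scene would already break the contact.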

Finally, the resulting graph representations are fed into our model. A selection of frames,

along with the predicted action labels for each hand, is shown in Figure 4.6. The video

can be accessed via the project webpage: https://air.cs.itu.edu.tr/projects/fgse.html

Figure 4.6: Sample frames from the qualitative evaluation of a video recorded in our
lab. The predicted manipulation labels for each hand are shown in the top-left corner
of each frame.

Note that after observing some erroneous graph data, we also used this pipeline to

re-extract graphs from the original Bimacs dataset in an effort to obtain higher-quality

representations. However, this attempt was unsuccessful, and the newly generated

graphs were not used. The details of this attempt are provided in Appendix A.


5. DISCUSSION

In this chapter, we present a series of focused analyses to better understand our model’s

performance and design choices, including the impact of temporal window length,

comparison with an RGB-only baseline, the contribution of majority voting, pooling

method comparisons, and sequence-encoder ablations.

5.1 Impact of Window Length

Figure 5.1: Conversion of a sequence of graphs into a temporally concatenated graph representation.

We hypothesized that, due to the factorized encoder design, our model would scale

better in the temporal dimension than prior approaches that

concatenate the input graphs temporally [1,8].

As shown in Table 5.1, our experimental findings on Bimacs [1] reveal that the

performance of our model substantially improves when the window length is doubled

from 10 to 20 graphs (from 72.2 to 78.3 F1-macro). Beyond this point, although the performance continues to

increase, the rate of improvement slows down, which can be interpreted as 20 graphs

being sufficient to recognize most of the manipulations, and feeding in more graphs

does not dramatically enhance recognition performance.

A similar improvement trend is also visible in the CoAx dataset [10]. As indicated

in the last row in Table 5.1, our model demonstrated an improvement of 9.1 points in

terms of F1-macro when the number of graphs increased from 10 to 40. The results

indicate that our network scales better temporally by design.

Table 5.1: The F1-macro scores as window length increases on Bimacs [1] and CoAx [10].

Dataset   Method                       W=10   W=20   W=30   W=40
Bimacs    Dreher et al. [1]            63.0   49.6   51.0   N/A
Bimacs    Dreher et al. [1] (scaled)   63.0   51.2   42.9   N/A
Bimacs    FGSE (Ours)                  72.2   78.3   78.6   79.9
CoAx      FGSE (Ours)                  83.1   87.9   90.7   92.2

In this table, we compare our model only with the real-time capable model proposed

by Dreher et al. [1], since the source code of H2O+RGCN [8] is not yet publicly

available. As mentioned above, the model in [1] constructs a single graph

through temporal concatenation, i.e., it adds edges between

the same objects in consecutive graphs along the temporal axis, as illustrated in Figure

5.1. This design scales poorly as the temporal length of the input grows, since

graph neural networks can only propagate information to nodes up to n hops away, where n

is the number of layers. The first row in Table 5.1 shows that the performance of the model

in [1] worsened as the window length increased, even though the input then contains more

information. While adding more layers could mitigate this issue, it may also lead

to over-smoothing [31,32]. To examine this, we doubled and tripled the number of

processing steps in the model of Dreher et al. [1], and as indicated in the second row, this approach

resulted in a similar performance degradation. Note that the first three rows in Table 5.1

only show results for the first fold due to the high computational load of training [1].
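
The n-hop limitation can be illustrated with a small breadth-first search over the temporal edges of a concatenated graph. This is a simplified sketch under stated assumptions: it tracks a single object with one node per time step, treats temporal edges as undirected, and ignores the spatial edges within each graph; the function names are hypothetical.

```python
from collections import deque


def temporal_chain_edges(window_length):
    """Temporal edges linking the same object in consecutive graphs."""
    return [(t, t + 1) for t in range(window_length - 1)]


def reachable_within(edges, source, num_hops):
    """Nodes whose features can reach source after num_hops message-passing rounds."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == num_hops:
            continue  # do not expand beyond the hop budget
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen


# With n = 2 layers, the last time step only "sees" the two preceding steps.
short_window = reachable_within(temporal_chain_edges(10), source=9, num_hops=2)
long_window = reachable_within(temporal_chain_edges(30), source=29, num_hops=2)
```

In both cases only three time steps are reachable, so enlarging the window from 10 to 30 graphs adds input that a two-layer network of this kind cannot propagate to the final node. A sequence encoder that attends over all per-graph embeddings is not bound by this hop limit.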

5.2 Comparison with an RGB-Only Model

Considering the remarkable progress in RGB-based recognition models, a

reasonable question might be how such a model would perform on the Bimacs dataset [1].

ViViT Model 2 [9] was chosen for comparison due to architectural similarity, i.e., it has

spatial and temporal encoders analogous to our Graph Encoder and Sequence Encoder.

In this ViViT model, we used a pre-trained spatial encoder and trained the temporal

encoder from scratch. To make the comparison fair, we also employed a sliding window

approach with majority voting at test time. Due to the high computational cost, in

this experiment, we only evaluated on the first fold.

Table 5.2: Comparison with an RGB-only model (W30-D3) on Bimacs [1].

Model               F1-macro   F1-micro
ViViT-Model 2 [9]   63.5       64.1
FGSE (Ours)         78.3       82.6

As shown in Table 5.2, the F1 scores of the ViViT model are considerably lower than those of

our proposed model (by 14.8 F1-macro and 18.5 F1-micro points). We believe that this

underperformance is inherently related to RGB-based approaches. It is well known that

RGB-based models require a significant amount of training data to learn from

high-dimensional raw image data.

In our case, even though the network is partially pre-trained, the Bimacs dataset may not

be sufficient for such a model, despite our effort to minimize the number of parameters

in the temporal encoder part of the network. This poor generalization performance

in small dataset settings makes the RGB-based models infeasible for HRC scenarios

in which, for instance, a very specific cooking-related manipulation is supposed to

be learned with very few data points to help the robot efficiently collaborate with the

human user.

On the other hand, semantically rich symbolic scene graph-based methods are expected

to be better at generalization from a few data points thanks to the very low dimensional

representation space. In such a semantic representation, details irrelevant to the

manipulation, such as varying light conditions and background clutter, which might pose

a significant challenge for RGB-based methods, are naturally disregarded. For

instance, the same pouring manipulation executed with different objects (e.g., a cup

versus a bottle) might be unrecognizable by the RGB-based model due to the shape and

appearance changes of the objects in the scene. One might argue that scene graphs

also depend on an object recognition model and thus merely shift

the burden of generalizability to the object detector. However, object detectors are

specifically trained to identify objects with varying visual features, which makes them

inherently more robust. Given that manipulation recognition is inherently object-centric,

where the temporal relationships between objects and their environment matter more

than instance-specific properties like object appearance or geometry, it is reasonable to

break down the manipulation recognition task into object detection and graph-based

recognition steps.

To conclude, our experimental findings in Table 5.2 reveal that RGB-only models

underperform on object-centric manipulation datasets with a limited number of samples,

such as Bimacs. This observation underscores the limitations of such models in HRC

scenarios.

5.3 Contribution of Majority Voting

Table 5.3: Impact of the sliding window with majority voting on Bimacs [1] (D3).

Methods F1-macro F1-micro
Center of window 76.9 79.2
Single Pred. 70.0 73.2
Majority voting 78.1 81.1

As an ablation study, we measure the impact of majority voting. As a first alternative,

we take the average of final embeddings after the Sequence Encoder and use a single

linear layer to predict a label that corresponds to the last graph in the window. More

formally, the Sequence Encoder combined with a linear layer predicts:

$$\mathrm{SE}^{L}_{\phi}\left(z_{h,\tau-(W-1)}, \cdots, z_{h,\tau}\right) = \hat{y}_{\tau} \tag{5.1}$$

As a second alternative approach, we only use the label at the window’s center, i.e.,

$\tilde{y}_{\tau} = \hat{y}^{W/2}_{\tau}$, without altering the loss function or applying majority voting. The results

in Table 5.3 indicate that majority voting improves F1-macro by 8.1 points over the

single-prediction variant and by 1.2 points over the center-of-window variant.

5.4 Comparison with Other Pooling Methods

Table 5.4: The comparison with alternative pooling methods (W30-D3).

Methods F1-macro F1-micro
Global mean pooling 75.6 79.5
Top-k pooling [41] 77.1 80.7
SAGPool [42] 75.5 79.7
Hand-Pooling (Ours) 78.1 81.1

To demonstrate the effectiveness of the proposed Hand Pooling operation, we compare

it against several widely used pooling strategies from the literature. As reported in

Table 5.4, Hand Pooling outperforms naive global mean pooling (by 2.5 F1-macro points), which

aggregates all node features by simple averaging and therefore ignores the structural

and semantic importance of individual nodes.

Table 5.5: The comparison with alternatives of Sequence Encoder on Bimacs [1]
(W30-D3).

Seq. Enc. Variants F1-macro F1-micro
No Encoder 61.2 69.1
LSTM 69.4 74.1
BiLSTM 77.1 80.7
Encoder-only Transformer 78.1 81.1

We also compare our method with more advanced pooling techniques, including Top-k

pooling [41] and SAGPool [42]. Top-k pooling selects a subset of nodes based on

learned importance scores, while SAGPool employs self-attention mechanisms to

adaptively retain informative nodes during pooling. Despite their increased modeling

capacity, both approaches are outperformed by Hand Pooling in our experiments.

In addition to its superior performance, Hand Pooling has the practical advantage of

incurring zero additional computational cost, as it does not rely on learnable parameters

or auxiliary scoring networks, unlike Top-k pooling and SAGPool. This makes Hand

Pooling both an effective and efficient alternative for graph-level representation learning

in our setting.
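
The contrast between the two parameter-free pooling variants can be sketched as follows. This is an illustration based only on the description that Hand Pooling exclusively returns hand embeddings; the node-type strings, the list-based embedding layout, and the function names are assumptions, not the thesis implementation.

```python
def global_mean_pool(node_embeddings):
    """Average all node embeddings of one graph into a single vector."""
    dim = len(node_embeddings[0])
    count = len(node_embeddings)
    return [sum(row[i] for row in node_embeddings) / count for i in range(dim)]


def hand_pool(node_embeddings, node_types, hand_types=("left_hand", "right_hand")):
    """Return only the embeddings of hand nodes; no parameters and no scoring network."""
    return [emb for emb, kind in zip(node_embeddings, node_types) if kind in hand_types]


# Toy graph with two hand nodes and one object node (2-dimensional embeddings).
embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 5.0]]
types = ["left_hand", "right_hand", "cup"]

mean_vector = global_mean_pool(embeddings)
hand_vectors = hand_pool(embeddings, types)
```

Mean pooling blends the object node into a single vector, whereas the hand-only selection keeps one embedding per hand and involves no arithmetic at all, which is consistent with the zero additional cost noted above.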

5.5 Ablations with Sequence Encoder

To quantify the contribution of the Sequence Encoder in our model, we compare

it with classical recurrent architectures. The first row of Table 5.5 shows that

removing the Sequence Encoder completely and relying solely on the Graph Encoder

reduces the F1-macro score from 78.1 to 61.2. This indicates that the Sequence Encoder

contributes to the performance by extracting the local context information. On the other

hand, LSTM-based approaches could not reach the performance of the Encoder-only

Transformer network, although bidirectional LSTM shows promising results.

5.6 Limitations

Despite these promising results, there are certain limitations that can be viewed in

terms of model architecture and the chosen graph representation. From an architectural

perspective, the model lacks an explicit long-term memory mechanism for handling

extended action sequences, limiting its ability to capture long-range dependencies and

consequential relationships between actions. Incorporating a memory-based mechanism,

such as the use of dedicated memory tokens as in LSTR [20], could help preserve and

exploit long-term temporal context.

Additionally, the quality of the scene graphs is highly dependent on the extraction

process from RGB-D data, which is susceptible to noise, mainly due to unreliable

depth measurements and imperfect object detection. Once introduced, such errors

can propagate through the graph representation and are difficult to recover from. As

an alternative, scene graph generation methods or conventions that are less sensitive

to noise and do not rely on depth information could be explored. Furthermore, the

architecture could be extended to explicitly model detection uncertainty, making it more

robust to errors in the input data.

Finally, because the Hand Pooling operation assumes that manipulation actions are

carried out by hands, our model’s performance degrades when the hands are not visible

in the scene or when the action is non-prehensile.

6. CONCLUSIONS

In this thesis, we addressed the challenge of real-time manipulation action recognition

for human-robot collaboration scenarios, where both computational efficiency and

semantic understanding are essential. We introduced a novel Factorized Graph Sequence

Encoder (FGSE) network that effectively decouples spatial and temporal feature

extraction through its Graph Encoder and Sequence Encoder modules, combined with a

parameter-free Hand Pooling operation.

Our approach leverages scene graph representations based on spatiotemporal ESEC

relations, which provide a semantically rich yet computationally efficient abstraction

of manipulation scenes. This design choice filters out irrelevant visual variations such

as lighting conditions, background clutter, and object appearances, enabling the model

to focus on the relational cues that are most informative for recognizing manipulation

actions. The factorized architecture overcomes the temporal scalability limitations

of prior methods that rely on temporally concatenated graphs, allowing our model

to effectively capture longer temporal contexts without suffering from information

propagation bottlenecks or over-smoothing.

Extensive experiments on the Bimacs and CoAx datasets demonstrate that FGSE

achieves state-of-the-art performance among real-time capable models, surpassing

the previous best approach by 14.3 and 5.6 percentage points in F1-macro score, respectively.

Furthermore, our model achieves results comparable to offline models that process

entire videos at once, despite operating under real-time constraints and having

significantly fewer parameters. Comprehensive ablation studies validate our design

choices, confirming the contributions of the factorized encoder design, Hand Pooling

operation, and majority voting mechanism to the overall performance. Additionally, our

comparison with ViViT Model 2 provides evidence that RGB-only approaches struggle

on object-centric manipulation datasets with limited training samples, underscoring the

value of graph-based representations in human-robot collaboration settings.

As future research directions, we plan to extend the FGSE architecture to skeleton-based

whole-body manipulation tasks, which would enable the recognition of more complex

human actions beyond hand-object interactions. To mitigate the noise in graph

extraction, incorporating estimation confidence scores into the model could improve

robustness against uncertain detections. Alternatively, employing deep learning models

to generate scene graphs directly from point cloud data represents a promising research

direction that could bypass the error-prone intermediate steps of the current pipeline.

Finally, exploring self-supervised or few-shot learning strategies could further enhance

the model’s adaptability to novel manipulation actions with minimal training data.

REFERENCES

[1] Dreher, C.R.G., Wächter, M. and Asfour, T. (2020). Learning Object-Action
Relations from Bimanual Human Demonstration Using Graph Networks,
IEEE Robotics and Automation Letters, 5(1), 187–194.

[2] Aksoy, E.E., Orhan, A. and Wörgötter, F. (2017). Semantic decomposition
and recognition of long and complex manipulation action sequences,
International Journal of Computer Vision, 122, 84–115.

[3] Ziaeetabar, F., Kulvicius, T., Tamosiunaite, M. and Wörgötter, F. (2018).
Recognition and prediction of manipulation actions using Enriched
Semantic Event Chains, Robotics and Autonomous Systems, 110, 173–188.

[4] Akyol, G., Sariel, S. and Aksoy, E.E. (2021). A Variational Graph Autoencoder
for Manipulation Action Recognition and Prediction, 20th International
Conference on Advanced Robotics (ICAR), IEEE, pp.968–973.

[5] Morais, R., Le, V., Venkatesh, S. and Tran, T. (2021). Learning Asynchronous
and Sparse Human-Object Interaction in Videos, Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pp.16041–16050.

[6] Qiao, T., Men, Q., Li, F.W.B., Kubotani, Y., Morishima, S. and Shum,
H.P.H. (2022). Geometric Features Informed Multi-person Human-object
Interaction Recognition in Videos, European Conference on Computer
Vision (ECCV).

[7] Xing, H. and Burschka, D. (2022). Understanding Spatio-Temporal Relations in
Human-Object Interaction using Pyramid Graph Convolutional Network,
IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp.5195–5201.

[8] Lagamtzis, D., Schmidt, F., Seyler, J., Dang, T. and Schober, S. (2023).
Exploiting Spatio-Temporal Human-Object Relations Using Graph Neural
Networks for Human Action Recognition and 3D Motion Forecasting,
IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp.7832–7838.

[9] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M. and Schmid, C. (2021).
ViViT: A Video Vision Transformer, International Conference on Computer
Vision (ICCV).

[10] Lagamtzis, D., Schmidt, F., Seyler, J.R. and Dang, T. (2022). CoAx: Collaborative
action dataset for human motion forecasting in an industrial workspace,
ICAART (3), pp.98–105.

[11] Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M.
and Fei-Fei, L. (2015). Image retrieval using scene graphs, Proceedings
of the IEEE conference on computer vision and pattern recognition,
pp.3668–3678.

[12] Xu, D., Zhu, Y., Choy, C.B. and Fei-Fei, L. (2017). Scene graph generation by
iterative message passing, Proceedings of the IEEE conference on computer
vision and pattern recognition, pp.5410–5419.

[13] Li, H., Zhu, G., Zhang, L., Jiang, Y., Dang, Y., Hou, H., Shen, P., Zhao, X.,
Shah, S.A.A. and Bennamoun, M. (2024). Scene graph generation: A
comprehensive survey, Neurocomputing, 566, 127052.

[14] Wörgötter, F., Aksoy, E.E., Krüger, N., Piater, J., Ude, A. and Tamosiunaite, M.
(2013). A Simple Ontology of Manipulation Actions based on Hand-Object
Relations, IEEE Transactions on Autonomous Mental Development, 5(2),
117–134.

[15] Kipf, T.N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks,
arXiv preprint arXiv:1609.02907.

[16] Shi, Y., Huang, Z., Feng, S., Zhong, H., Wang, W. and Sun, Y. (2021). Masked
Label Prediction: Unified Message Passing Model for Semi-Supervised
Classification, Z.H. Zhou, editor, Proceedings of the Thirtieth International
Joint Conference on Artificial Intelligence, IJCAI-21, International Joint
Conferences on Artificial Intelligence Organization, pp.1548–1554, main
Track.

[17] Zhang, B., Wang, L., Wang, Z., Qiao, Y. and Wang, H. (2016). Real-time action
recognition with enhanced motion vector CNNs, Proceedings of the IEEE
conference on computer vision and pattern recognition, pp.2718–2726.

[18] Cob-Parro, A.C., Losada-Gutiérrez, C., Marrón-Romera, M., Gardel-Vicente,
A. and Bravo-Muñoz, I. (2024). A new framework for deep learning
video based Human Action Recognition on the edge, Expert Systems with
Applications, 238, 122220.

[19] Liu, K., Liu, W., Gan, C., Tan, M. and Ma, H. (2018). T-C3D: Temporal
Convolutional 3D Network for Real-Time Action Recognition, Proceedings
of the AAAI Conference on Artificial Intelligence, 32(1).

[20] Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z. and Soatto, S. (2021).
Long Short-Term Transformer for Online Action Detection, Conference on
Neural Information Processing Systems (NeurIPS).

[21] Zhao, Y. and Krähenbühl, P. (2022). Real-time Online Video Detection with
Temporal Smoothing Transformers, European Conference on Computer
Vision (ECCV).

[22] Wang, X., Zhang, S., Qing, Z., Shao, Y., Zuo, Z., Gao, C. and Sang, N. (2021).
Oadtr: Online action detection with transformers, Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp.7565–7575.

[23] Sridhar, M., Cohn, A.G. and Hogg, D. (2008). Learning Functional
Object-Categories from a Relational Spatio-Temporal Representation, Proc.
18th European Conference on Artificial Intelligence, pp.606–610.

[24] Kjellström, H., Romero, J. and Kragić, D. (2011). Visual object-action
recognition: Inferring object affordances from human demonstration,
Comput. Vis. Image Underst., 115(1), 81–90.

[25] Yang, Y., Fermüller, C. and Aloimonos, Y. (2013). Detection of manipulation
action consequences (MAC), Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp.2563–2570.

[26] Aksoy, E.E., Abramov, A., Wörgötter, F. and Dellen, B. (2010). Categorizing
object-action relations from semantic scene graphs, IEEE International
Conference on Robotics and Automation (ICRA), pp.398–405.

[27] Kipf, T.N. and Welling, M. (2016). Variational graph auto-encoders, arXiv preprint
arXiv:1611.07308.

[28] Ren, S., He, K., Girshick, R. and Sun, J. (2016). Faster R-CNN: Towards real-time
object detection with region proposal networks, IEEE Transactions on
Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.

[29] Xing, H. and Burschka, D. (2024). Understanding human activity with uncertainty
measure for novelty in graph convolutional networks, The International
Journal of Robotics Research, 02783649241287800.

[30] Battaglia, P., Hamrick, J.B.C., Bapst, V., Sanchez, A., Zambaldi, V., Malinowski,
M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C.,
Song, F., Ballard, A., Gilmer, J., Dahl, G.E., Vaswani, A., Allen, K.,
Nash, C., Langston, V.J., Dyer, C., Heess, N., Wierstra, D., Kohli, P.,
Botvinick, M., Vinyals, O., Li, Y. and Pascanu, R. (2018). Relational
inductive biases, deep learning, and graph networks, arXiv preprint
arXiv:1806.01261.

[31] Keriven, N. (2022). Not too little, not too much: a theoretical analysis of graph
(over) smoothing, Advances in Neural Information Processing Systems, 35,
2268–2281.

[32] Rusch, T.K., Bronstein, M.M. and Mishra, S. (2023). A survey on oversmoothing
in graph neural networks, arXiv preprint arXiv:2303.10993.

[33] Dallel, M., Havard, V., Dupuis, Y. and Baudry, D. (2022). A Sliding
Window Based Approach With Majority Voting for Online Human Action
Recognition using Spatial Temporal Graph Convolutional Neural Networks,
Proceedings of the 2022 7th International Conference on Machine Learning
Technologies, ICMLT ’22, Association for Computing Machinery, New
York, NY, USA, pp.155–163.

[34] Yan, S., Xiong, Y. and Lin, D. (2018). Spatial temporal graph convolutional
networks for skeleton-based action recognition, Proceedings of the
Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth
Innovative Applications of Artificial Intelligence Conference and Eighth
AAAI Symposium on Educational Advances in Artificial Intelligence,
AAAI’18/IAAI’18/EAAI’18, AAAI Press.

[35] Ba, J.L., Kiros, J.R. and Hinton, G.E. (2016). Layer normalization, arXiv preprint
arXiv:1607.06450.

[36] Klambauer, G., Unterthiner, T., Mayr, A. and Hochreiter, S. (2017).
Self-normalizing neural networks, Advances in Neural Information
Processing Systems, 30.

[37] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need, Advances
in Neural Information Processing Systems, 30.

[38] Morais, R., Le, V., Venkatesh, S. and Tran, T. Learning Asynchronous and Sparse
Human-Object Interaction in Videos: Supplementary Material.

[39] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle,
R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V.,
Carion, N., Wu, C.Y., Girshick, R., Dollár, P. and Feichtenhofer, C.
(2024). SAM 2: Segment Anything in Images and Videos, arXiv preprint
arXiv:2408.00714.

[40] Fang, H.S., Li, J., Tang, H., Xu, C., Zhu, H., Xiu, Y., Li, Y.L. and Lu, C.
(2022). AlphaPose: Whole-Body Regional Multi-Person Pose Estimation
and Tracking in Real-Time, IEEE Transactions on Pattern Analysis and
Machine Intelligence.

[41] Gao, H. and Ji, S. (2019). Graph U-Nets, International Conference on Machine
Learning, PMLR, pp.2083–2092.

[42] Lee, J., Lee, I. and Kang, J. (2019). Self-attention graph pooling, International
conference on machin