ISTANBUL TECHNICAL UNIVERSITY ⋆ GRADUATE SCHOOL

REAL-TIME HUMAN MANIPULATION ACTION RECOGNITION WITH A FACTORIZED GRAPH SEQUENCE ENCODER

M.Sc. THESIS

Enes ERDOĞAN (504211555)

Department of Computer Engineering
Computer Engineering Programme

Thesis Advisor: Prof. Dr. Sanem SARIEL
Co-advisor: Assoc. Prof. Dr. Eren Erdal AKSOY

FEBRUARY 2026

Turkish title: AYRIŞTIRILMIŞ ÇİZGE DİZİ KODLAYICISI İLE İNSAN NESNE ETKİLEŞİMLERİNİN GERÇEK ZAMANLI OLARAK TANINMASI

Enes ERDOĞAN, an M.Sc. student of ITU Graduate School, student ID 504211555, successfully defended the thesis entitled “REAL-TIME HUMAN MANIPULATION ACTION RECOGNITION WITH A FACTORIZED GRAPH SEQUENCE ENCODER”, which he prepared after fulfilling the requirements specified in the associated legislations, before the jury whose signatures are below.

Thesis Advisor: Prof. Dr. Sanem SARIEL
Istanbul Technical University

Co-advisor: Assoc. Prof. Dr. Eren Erdal AKSOY
Halmstad University

Jury Members: Prof. Dr. Sinan KALKAN
Middle East Technical University

Assoc. Prof. Dr. Yusuf YASLAN
Istanbul Technical University

Asst. Prof. Dr. Yusuf Hüseyin ŞAHİN
Istanbul Technical University

Date of Submission: 29 December 2025
Date of Defense: 2 February 2026

To my dear family,

FOREWORD

First and foremost, I would like to express my sincere gratitude to my advisors, Sanem Sarıel and Eren Erdal Aksoy. Their confidence in me, even at times when I doubted myself, has meant more than I can fully convey. Their guidance, encouragement, and expertise throughout my master’s studies have been invaluable.

I am also grateful to be part of the AIR Lab, where I had the opportunity to learn a great deal from A. Cihan Ak and Arda İnceoğlu. I am equally thankful to my dear friends, Püren Tap and Tuğçe Temel. I feel lucky to be around such kind and fun people. There may be others whose names I have unintentionally overlooked, but whose support I deeply appreciate.

I would also like to thank M. Alpaslan Tavukçu. His positive and insightful perspective on life makes difficult times more bearable. I feel fortunate to have met such a distinguished individual.

Finally, I want to extend my heartfelt thanks to dear friends who have stood by me through the ups and downs of life: Osman M. Tekin and Doğan Turan. Their support has been a steady source of strength.

This thesis is supported by a grant from the Scientific and Technological Research Council of Turkey (TUBITAK), Grant No. 119E-436. This work was also supported by the Turkcell-Istanbul Technical University Researcher Funding Program. This research has also received funding from the Vinnova FFI project SMILE-IV (agreement no 2023-00789). Some of the computing resources used in this work were provided by the National Center for High Performance Computing of Turkey (UHeM) under grant number 4019762024.

February 2026
Enes ERDOĞAN (Research Assistant)

TABLE OF CONTENTS

FOREWORD
TABLE OF CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
ÖZET
1. INTRODUCTION
    1.1 Motivation
    1.2 Why Graph Representations?
    1.3 Current Approaches & Our Contributions
2. BACKGROUND
    2.1 Scene Graphs
    2.2 Graph Neural Networks
        2.2.1 A generic formulation
        2.2.2 Graph convolutional networks
        2.2.3 Transformer-like GNN
    2.3 Related Works: Graph-based Action Recognition Models
        2.3.1 Offline models
        2.3.2 Real-time models
3. FACTORIZED GRAPH SEQUENCE ENCODER
    3.1 Model Architecture
    3.2 Sliding Window with Majority Voting (SW-MV)
4. EXPERIMENTS
    4.1 Datasets
    4.2 Training Setup
    4.3 Evaluation
    4.4 Results
    4.5 Qualitative Evaluation
5. DISCUSSION
    5.1 Impact of Window Length
    5.2 Comparison with an RGB-Only Model
    5.3 Contribution of Majority Voting
    5.4 Comparison with Other Pooling Methods
    5.5 Ablations with Sequence Encoder
    5.6 Limitations
6. CONCLUSIONS
REFERENCES
APPENDICES
CURRICULUM VITAE

ABBREVIATIONS

AR : Action Recognition
ASSIGN : Asynchronous-Sparse Interaction Graph Networks
Bimacs : KIT Bimanual Action Dataset
CoAx : Collaborative Action Dataset
ESEC : Enriched Semantic Event Chain
FGSE : Factorized Graph Sequence Encoder
GCN : Graph Convolutional Network
GNN : Graph Neural Network
HRC : Human-Robot Collaboration
MLP : Multi-Layer Perceptron
PGCN : Pyramidal Graph Convolutional Network
RNN : Recurrent Neural Network
RT-AR : Real-Time Action Recognition
RT-MR : Real-Time Manipulation Recognition
ST-GCN : Spatial-Temporal Graph Convolutional Networks
UQ-TFGCN : Uncertainty Quantified Temporal Fusion Graph Convolution Network
ViViT : Video Vision Transformer

SYMBOLS

V : Set of nodes
E : Set of edges
G_t : Graph at time t
z : Embedding vector
W : Window size
D : Frame down-sampling ratio
y : Ground-truth label
L_CE : Cross-entropy loss

LIST OF TABLES

Table 4.1: Manipulation Recognition results on Bimacs dataset.
Table 4.2: Manipulation Recognition results on CoAx dataset.
Table 5.1: The F1-macro scores as window length increases on Bimacs dataset.
Table 5.2: Comparison with an RGB-only model on Bimacs dataset.
Table 5.3: Impact of the sliding window with majority voting on Bimacs dataset.
Table 5.4: The comparison with alternative pooling methods.
Table 5.5: The comparison with alternatives of Sequence Encoder on Bimacs dataset.

LIST OF FIGURES

Figure 1.1: The difference between Action Recognition and Real-time Action Recognition.
Figure 2.1: An example scene graph.
Figure 2.2: Example illustration that shows RGB-D data to graph representation
Figure 2.3: Illustrations for spatiotemporal relation in the graph representation.
Figure 3.1: The proposed Factorized Graph Sequence Encoder (FGSE) network.
Figure 3.2: Majority voting usage.
Figure 4.1: Overview of the Bimacs dataset.
Figure 4.2: Overview of the CoAx dataset.
Figure 4.3: An example for qualitative result.
Figure 4.4: An example result where our model performs poorly.
Figure 4.5: Our graph extraction pipeline.
Figure 4.6: Sample frames from the qualitative evaluation of a video recorded in our lab.
Figure 5.1: Sequence of graphs to temporally concatenated graph representation.
Figure A.1: 3D bounding box visualizations from Bimacs.
Figure A.2: Fold-wise performance of our model in original Bimacs dataset versus the dataset that comes from our graph extraction pipeline

REAL-TIME HUMAN MANIPULATION ACTION RECOGNITION WITH A FACTORIZED GRAPH SEQUENCE ENCODER

SUMMARY

This thesis addresses the problem of real-time manipulation action recognition, a specialized subfield of human action recognition.
Human action recognition is a fundamental computer vision problem concerned with classifying short, trimmed video clips that depict a single, well-defined human action, such as walking, running, jumping, or waving. In contrast, real-time action recognition is performed on continuous, untrimmed data streams. No prior information about action boundaries is available, and instant predictions with low latency are required. In other words, the model must continuously make predictions on input that may contain consecutive actions or, at times, no action at all. This requirement is critically important in domains such as human-robot collaboration, assistive robotics, industrial assembly lines, and intelligent surveillance systems.

This thesis further narrows the scope of real-time action recognition by focusing on manipulation actions, which involve deliberate object interactions carried out via human hands. This problem, also referred to as human-object interaction recognition in the literature, aims to instantly recognize everyday or goal-oriented hand-object interactions such as pushing a cup, cutting bread with a knife, or stirring tea with a spoon. In this context, the problem is defined as real-time manipulation recognition. Especially in settings where people and robots operate side by side, such as factory assembly lines, it is essential for robots to recognize human manipulation actions in real time to ensure both safety and efficient collaboration.

Manipulation actions are inherently object-centric, meaning that the relationships between objects and the hands are more informative than the visual appearance of individual objects. This characteristic motivates the use of semantic scene graph representations, where nodes correspond to objects and edges encode spatiotemporal relations between them. Scene graphs offer several advantages over raw RGB representations. First, they provide an abstract yet semantically rich representation that captures the underlying structure of the manipulation scene. Second, by reducing the dimensionality from high-dimensional image data to compact graph structures, they enable efficient real-time processing. Third, they naturally filter out irrelevant variability such as changes in illumination, camera viewpoint, background clutter, and object appearance, thereby improving generalization, especially when training data is limited, which is a common situation in human-robot collaboration applications where domain-specific manipulations may have only a few examples.

The graph representation employed in this work defines semantic relations describing both static properties such as spatial arrangements between objects, and dynamic properties such as how objects move relative to each other over time. These relations are computed from the three-dimensional bounding boxes of objects across consecutive frames and are stored as edge features in the resulting scene graph.

A key limitation of existing graph-based manipulation recognition models is their approach to handling temporal information. Prior real-time capable methods concatenate sequential graphs by adding temporal edges between the same objects in consecutive frames. This design inherently limits temporal scalability because graph neural networks can only propagate information to nodes within a limited neighborhood determined by the number of network layers. Increasing the number of layers to compensate leads to over-smoothing, a well-known phenomenon where node embeddings become indistinguishable, preventing the network from learning meaningful representations.

To address these challenges, this thesis introduces a novel architecture called the Factorized Graph Sequence Encoder that separates spatiotemporal feature extraction into two distinct components: a Graph Encoder and a Sequence Encoder. The Graph Encoder processes each scene graph independently using attention-based graph convolutional layers, which refine node embeddings while incorporating edge features that characterize spatial and temporal relationships. Since each graph is processed separately, the model can flexibly scale across the temporal dimension without requiring deeper graph network architectures.

A novel parameter-free operation called Hand Pooling is introduced to extract graph-level embeddings. Based on the observation that hands are the primary manipulators in any manipulation action, Hand Pooling selects only the node embeddings corresponding to hands, rather than aggregating all node embeddings as in traditional pooling methods. This focused extraction yields more discriminative graph-level representations and reduces the computational burden on the subsequent temporal encoder.

The Sequence Encoder is a transformer-based architecture that applies self-attention to the sequence of hand embeddings, enabling the model to learn temporal context across the input window. This factorized design allows the model to efficiently propagate information across graphs regardless of the temporal length of the input, achieving temporal scalability that prior approaches lack.

During inference, a sliding window strategy combined with majority voting is employed. As the window slides along the temporal axis, multiple predictions are generated for each frame, and majority voting combines these predictions into a final label. This approach resembles ensemble learning and helps reduce noisy predictions while preventing over-segmentation.

The proposed model is evaluated on two publicly available datasets covering manipulation actions in kitchen, workshop, and industrial assembly environments. These datasets include scenarios relevant to human-robot collaboration, with actions such as pouring, cutting, stirring, screwing, and assembling components. A cross-validation approach is employed to ensure robust evaluation across different subjects.

Experimental results demonstrate that the proposed model achieves significant improvements over previous state-of-the-art real-time methods on both datasets. When allowing a slightly larger window size, the model achieves results comparable to offline models that have access to entire videos at once. With a compact architecture containing only a few hundred thousand parameters and running at approximately 66 frames per second on a standard GPU, the model is lightweight enough for practical real-time deployment.

An extensive ablation study validates the design choices of the proposed architecture. The analysis of window length confirms that the proposed model successfully scales with increasing temporal context, while competing methods based on temporal graph concatenation exhibit performance degradation as the input length grows. Comparisons with alternative pooling methods demonstrate that Hand Pooling outperforms both simple averaging approaches and more sophisticated learnable pooling operations. Experiments with different sequence encoder architectures show that the transformer-based encoder outperforms recurrent alternatives, while removing the temporal encoder entirely results in significant performance drops, confirming the importance of temporal context modeling.

A comparison with an architecturally similar RGB-based model reveals the limitations of image-only approaches on object-centric manipulation datasets with limited training samples. The RGB-based model significantly underperforms compared to the proposed graph-based approach, underscoring the advantages of scene graph representations for manipulation recognition in scenarios where data efficiency is crucial.
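
As a minimal illustrative sketch, and not the implementation used in this thesis, the Hand Pooling and sliding-window majority voting steps summarized above could be written in Python as follows. All names here are hypothetical: `predict_window` stands in for the trained model and is assumed to return one label per frame of the window. The stride of one frame, the concatenation of the selected hand embeddings, and the tie-breaking rule are likewise assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def hand_pooling(node_embeddings: Sequence[Sequence[float]],
                 hand_indices: Sequence[int]) -> List[float]:
    """Parameter-free pooling that keeps only the hand-node embeddings.

    Assumption: the selected embeddings are concatenated in the order given
    by `hand_indices` to form the graph-level embedding.
    """
    pooled: List[float] = []
    for idx in hand_indices:
        pooled.extend(node_embeddings[idx])
    return pooled


def sliding_window_majority_vote(
    frames: Sequence[object],
    predict_window: Callable[[Sequence[object]], Sequence[Hashable]],
    window_size: int,
) -> List[Hashable]:
    """Slide a window over `frames` (stride 1, assumed) and vote per frame.

    `predict_window` is a hypothetical stand-in for the model; it must return
    exactly one label for every frame in the window it receives.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not frames:
        return []

    # One vote counter per frame; every window covering a frame adds a vote.
    votes: List[Counter] = [Counter() for _ in frames]
    last_start = max(len(frames) - window_size, 0)
    for start in range(last_start + 1):
        window = frames[start:start + window_size]
        for offset, label in enumerate(predict_window(window)):
            votes[start + offset][label] += 1

    # Ties are broken in favor of the label that was voted first (assumption).
    return [frame_votes.most_common(1)[0][0] for frame_votes in votes]
```

In this sketch, a frame covered by several windows receives several votes, so a single inconsistent window prediction is outvoted by its neighbors. In a streaming setting, the label of a frame would be final only once the last window covering it has been processed.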
In conclusion, this thesis presents a novel approach to real-time manipulation recognition that achieves state-of-the-art performance while maintaining computational efficiency and temporal scalability. The proposed Factorized Graph Sequence Encoder, combined with the Hand Pooling operation and sliding window with majority voting, provides an effective solution for recognizing human manipulation actions in human-robot collaboration scenarios. Future work will explore the application of this architecture to skeleton-based human whole-body manipulation tasks and investigate methods to handle noisy scene graph extraction by incorporating estimation confidence into the model.

AYRIŞTIRILMIŞ ÇİZGE DİZİ KODLAYICISI İLE İNSAN NESNE ETKİLEŞİMLERİNİN GERÇEK ZAMANLI OLARAK TANINMASI

ÖZET

This thesis addresses the problem of real-time manipulation action recognition, a specialized subfield of human action recognition. Human action recognition is a fundamental computer vision problem that aims to classify specific, individual movements, typically from short, pre-segmented video clips. In this setting, video segments containing a single, clearly defined action such as walking, running, jumping, or waving are analyzed, and the type of that action is determined. In contrast, real-time action recognition operates on continuous, untrimmed data streams. No prior information about action boundaries is available, and instant predictions must be produced with low latency. In other words, the model must continuously make predictions on input that may contain several consecutive actions or, at times, no action at all. This requirement is critically important in domains such as human-robot collaboration, service robots, industrial assembly lines, and intelligent surveillance systems.
This thesis further narrows the real-time action recognition problem by focusing on the deliberate interactions that humans carry out with objects via their hands. This problem, also referred to as human-object interaction recognition in the literature, aims to instantly recognize everyday or goal-oriented hand-object interactions such as pushing a cup, cutting bread with a knife, or stirring tea with a spoon. In this context, the problem addressed is defined as real-time manipulation recognition. Especially in environments where humans and robots work together, such as factory lines, it is critically important for safe and efficient collaboration that the robot can perceive the manipulation actions performed by the human in real time.

Real-time manipulation recognition models are expected both to operate with low latency and to learn semantically rich representations. However, methods that operate directly on raw RGB video struggle to meet these requirements because of their high-dimensional input space, large data requirements, and limited generalization ability. Particularly in human-robot collaboration scenarios, where specific manipulations are expected to be learned from a limited number of examples, RGB-based methods are observed to fail under such data scarcity. Moreover, RGB-based approaches are adversely affected by factors not directly related to the manipulation, such as lighting conditions, camera viewpoint, background clutter, and variations in object appearance. For this reason, instead of raw visual data, this work uses symbolic scene graphs that explicitly model the objects in the scene and the spatial and temporal relations between them.

Scene graphs are graph-based structures that represent the objects in a scene as nodes and the relations between these objects as edges.
In this representation, the relative positions of objects and their interactions over time are expressed through low-dimensional, semantically rich features. The specific graph representation adopted in this work comprises fourteen distinct types of semantic relations in total. These relations fall into two main categories: static and dynamic relations. Static relations describe the spatial positions of objects relative to each other and cover cases such as above, below, inside, around, and surrounding. Dynamic relations express the motion-related interactions of objects over time and include cases such as moving together, halting together, getting closer, and moving apart. These relations are extracted with a rule-based approach from the three-dimensional object bounding boxes in consecutive frames and are stored as edge features in the scene graph. This approach largely filters out the unnecessary details in RGB data, allowing the model to focus directly on the essence of the action.

In the literature, graph-based manipulation recognition studies are broadly divided into two main groups: offline and real-time. Offline methods typically process the entire video at once and can achieve high accuracy, but they incur latencies that are unacceptable for real-time systems. Real-time methods, on the other hand, mostly rely on approaches that concatenate consecutive scene graphs along the temporal axis to form a single large graph. In this approach, however, the number of graph neural network layers must be increased in order to receive information from temporally distant nodes. As the number of layers grows, the over-smoothing problem arises, in which node representations become similar to one another and lose their discriminative properties; as a result, the temporal scalability of the model is severely limited.
To overcome these problems, this work proposes a network architecture called the Factorized Graph Sequence Encoder. The proposed architecture consists of two components that process the spatial and temporal dimensions of the graph representations separately: a graph encoder and a sequence encoder. In the first stage, the scene graph at each time step is passed independently through the graph encoder; in the second stage, the resulting feature sequence is processed in its temporal context by the sequence encoder. Unlike existing real-time methods, this factorized design makes it possible to operate scalably over long time sequences without increasing the depth of the graph neural network. This property is one of the main contributions of the thesis and has been verified experimentally: model performance increases as the input window length increases.

The graph encoder uses transformer-based graph convolution layers with an attention mechanism. In message passing between nodes, these layers take into account both neighboring node features and edge features, and can assign a different importance weight to each neighbor. This yields node representations with greater expressive power than standard graph convolution layers with fixed weights.

Various pooling methods exist in the literature for converting the node representations obtained from the graph encoder, whose number can vary dynamically, into a single representation of the whole graph. However, the naive pooling methods commonly used in the literature weaken the resulting representation, while more advanced approaches introduce unnecessary computational overhead. As an alternative, this work proposes a new, parameter-free pooling method called Hand Pooling. This method selects only the node representations corresponding to the hands. Since, by the definition of manipulation, the hands are the only actors that interact with the objects in the scene, they can be regarded as the nodes carrying the most information about the action. Comparative experiments show that this simple approach, which incurs no additional computational cost, performs better than both classical pooling techniques and advanced ones with learnable parameters.

The hand representations are passed to a transformer-based sequence encoder, consisting only of the encoder component, to learn the temporal context. Through its self-attention mechanism, this structure effectively models short- and medium-term temporal relations. The model produces a separate prediction for each time step, so that action transitions are handled properly across long sequences.

To improve real-time performance, a sliding window approach and majority voting are used at inference time. With the sliding window mechanism, multiple predictions are produced for each scene graph, and these predictions are combined by majority voting to obtain the final decision. This approach, similar to ensemble learning, reduces the effect of transient erroneous predictions and prevents over-segmentation, while adding only a fixed delay that depends on the window length.

The proposed method is evaluated through extensive experiments on two different datasets. The first dataset consists of tasks performed in kitchen and workshop environments and covers fourteen atomic manipulation categories in total.
It contains manipulation actions such as approaching, lifting, placing, holding, stirring, pouring, cutting, and drinking, and twelve different objects such as a bowl, bottle, cutting board, knife, hammer, saw, and screwdriver. The other dataset focuses on industrial human-robot collaboration scenarios. Similarly, it contains ten different atomic manipulation actions and involves sixteen objects.

The results show that the proposed model significantly outperforms existing real-time methods on both datasets. Moreover, with only about 269 thousand parameters, the model has tens of times fewer parameters than the offline models in the literature. The model reaches approximately 66 frames per second even on a mid-range GPU, demonstrating that it can run in real time.

The discussion chapter of the thesis presents a comprehensive study analyzing different aspects of the model. First, the effect of the input window length on performance is investigated, and the proposed factorized encoder design is shown to improve performance as the window length increases. Whereas existing real-time methods lose performance as the window length increases, no such problem is observed with the proposed architecture. Second, a comparison is made with an RGB-based deep learning model, showing that the scene-graph-based approach generalizes much better under limited-data conditions. Although partially pretrained, the RGB-based model falls short on object-centric manipulation datasets. Third, the contribution of majority voting is examined, and this mechanism is confirmed to improve model performance substantially.
Fourth, the proposed Hand Pooling method is compared with alternative pooling techniques and is shown to outperform both simple average pooling and more complex learnable pooling methods. Finally, the contribution of the sequence encoder component is measured and found to outperform recurrent neural network based alternatives.

In conclusion, this thesis proposes a new graph-based architecture for real-time manipulation action recognition that is both computationally efficient and temporally scalable. The main contributions can be summarized as a new network architecture that factorizes spatial and temporal processing and thereby provides temporal scalability, and a parameter-free pooling method that exploits the central role of hands in manipulation actions. The findings indicate that the proposed model offers a strong alternative in human-robot collaboration scenarios.

1. INTRODUCTION

In this chapter, we begin by explaining the importance of the problem and clarifying the scope and boundaries of the task we aim to address. Next, we motivate the graph-based representation and discuss why it is well suited to our task. Finally, we provide a brief overview of existing research, highlighting how our approach sets itself apart, and present a concise outline of our key contributions.

1.1 Motivation

Action Recognition is the task of categorizing human movements based on sensory inputs that typically capture a brief and focused segment of activity. These inputs often come from sources like sequences of images or other motion data that represent a single, clearly defined action. The main goal is to analyze this short and uniform clip to determine what specific action a person is performing, such as walking, running, jumping, or waving.
On the other hand, Real-Time Action Recognition (RT-AR) is a more demanding setting that aims to identify actions with very low delay, as soon as they emerge. Unlike traditional approaches that rely on short and neatly segmented clips, this setting must operate on continuous, untrimmed streaming data. The model is required to make ongoing predictions while handling inputs that may contain multiple consecutive actions (and sometimes no action at all), without any prior information about the action boundaries. This capability is essential for intelligent systems that need to interact with or operate alongside humans in dynamic environments, such as assistive robotics, human-robot collaboration, human-computer interaction, autonomous vehicles, and real-time video surveillance.

Figure 1.1: The difference between Action Recognition and Real-time Action Recognition.

In this thesis, we focus on enabling robots to collaborate effectively with humans, whether in structured settings like factory assembly lines or more flexible environments such as kitchens. Thus, narrowing our focus within RT-AR, our goal is to detect actions centered on interacting with objects, which we refer to as Real-Time Manipulation Recognition (RT-MR). The real-time requirement is crucial in scenarios that demand immediate system responses, particularly in Human-Robot Collaboration (HRC), where people handle objects to achieve a goal with robotic assistance.

It is important to clarify terminology in this domain, since the word action can refer to general behaviors like walking, jumping, or pushing. Here, our attention is specifically on human activities that involve deliberate object interactions using the hands, such as pushing a cup, cutting bread with a knife, or stirring tea with a spoon. For this reason, we use the more precise terms manipulation or manipulation action, also known as human-object interaction in the literature.

1.2 Why Graph Representations?
RT-MR models must operate with high computational efficiency to maintain smooth performance and low latency. Within HRC settings, these recognition models are further expected to encode semantically rich and abstract knowledge [1]–[3], enabling robots to act with greater autonomy. However, relying directly on raw RGB observations presents several challenges in this regard. Such data provides no inherent semantic understanding of manipulations, and the high-dimensional nature of the representation demands large training sets and compute resources. In an HRC context, by contrast, a model might be expected to recognize, for instance, a very specific cooking action from very few data points, and to do so efficiently, so that the robot can collaborate with the human user.

Semantic scene graphs offer a way to alleviate these limitations by explicitly capturing the underlying structure of the scene. By reducing the representation to meaningful entities and relations, they both lower the dimensional burden and make it feasible to train and deploy models in real time. This abstraction also filters out irrelevant variability, including changes in illumination, camera viewpoint, background clutter, and object appearance. As a result, the model can focus on the relational cues among objects to infer the intended action and generalize more easily.

With this motivation in mind, we study the real-time recognition of human manipulation actions using symbolic scene graphs [3], where nodes represent objects and edges store semantic embeddings for spatial and/or temporal relations between objects, such as touching, being above, moving together, and getting close, among others.
1.3 Current Approaches & Our Contributions

Most existing work on graph-based manipulation recognition either overlooks the real-time constraint [4]–[7] or uses temporally concatenated graph representations that do not scale well over longer horizons [1,8], which limits the ability of these models to recognize relatively extended manipulation episodes.

To address these challenges, we introduce a new Factorized Graph Sequence Encoder network that recognizes manipulation actions in real time using the scene graph representation only. Inspired by the factorized encoder design in ViViT [9], more specifically ViViT Model 2, our model separates spatiotemporal feature extraction into a Graph Encoder and a Sequence Encoder, combined with a new Hand Pooling operation. Because our model processes each graph independently, it can flexibly scale across the temporal dimension without requiring deeper graph neural network architectures, unlike prior approaches [1,8]. Our novel parameter-free Hand Pooling operation extracts the node embeddings associated with hands, enhancing recognition performance. Moreover, we apply a sliding window strategy with majority voting to boost inference performance, introducing only a minimal constant delay.

The summary of our contributions is as follows:

• We introduce a new Factorized Graph Sequence Encoder combined with a new Hand Pooling operation that improves the F1-macro score by 14.3% and 5.6% in comparison to the nearest competitor [8] on the Bimacs [1] and CoAx [10] datasets, respectively. Furthermore, when allowing a slightly higher delay, our model achieves results comparable to offline models that process entire videos at once.
• Addressing the limitations of previous approaches, we demonstrate that our network design supports temporal scalability, meaning that as the input sequence length increases, the model performs better.
We also note that the results presented in this thesis were published at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) in 2025. Specifically, the sections on scene graphs (Section 2.1), graph neural networks (Section 2.2), qualitative evaluation (Section 4.5), and the appendix titled An Attempt to Re-Extract the Graph Data (Appendix A) were written from scratch. The remaining chapters are extended versions of the aforementioned publication.

2. BACKGROUND

In this chapter, we first provide a brief background on scene graphs and graph neural networks, and describe the computations performed by the specific GNN layer employed in this work. We then present an overview of related literature on graph-based action recognition methods.

2.1 Scene Graphs

Early computer vision systems primarily focused on recognizing objects in isolation. While object detection and classification have achieved remarkable accuracy, such representations are insufficient for capturing the rich semantics of real-world scenes. Understanding a scene often requires reasoning about how objects interact, not just which objects are present. For instance, distinguishing between a man riding a horse and a man standing next to a horse depends critically on relational information rather than object identities alone.

Scene graphs were introduced to address this limitation by providing a structured, explicit representation of objects and their relationships in a visual scene. By modeling objects, their attributes, and their pairwise relations, scene graphs enable higher-level reasoning and support downstream tasks such as semantic image retrieval, visual question answering, image captioning, and image generation. Empirical results show that incorporating relational structure leads to significant improvements over representations that rely solely on object-level or low-level visual features.
Scene graphs are commonly attributed to [11], who introduced them in the context of semantic image retrieval; they were later extended by [12], who demonstrated the benefits of contextual reasoning over scene graphs for relationship prediction.

A scene graph is a visually grounded graph representation of an image in which nodes correspond to object instances localized in the image, and directed edges represent semantic relationships between pairs of objects. Each object node is typically associated with a category label and may include attributes, while each edge encodes a predicate describing how two objects are related, such as spatial, functional, or action-based relations. By explicitly modeling objects and their pairwise relationships within a unified graph structure, scene graphs provide a structured representation that supports contextual reasoning about the contents of a visual scene. An example scene graph is provided in Figure 2.1.

Figure 2.1: An example scene graph. Taken from [12].

A wide range of alternative scene graph representations has been proposed in the literature, as surveyed in [13]. In this thesis, we adopt an ESEC-based representation, as it provides a favorable trade-off between expressive power for action discrimination and simplicity of extraction from raw video data.

ESEC-based Graph Representation

Given a human manipulation demonstration captured from a third-person viewpoint (as shown in Figure 2.2), we represent each scene as a graph following the manipulation action ontology presented in [14]. In this representation, nodes correspond to objects, and edges encode the spatiotemporal relations between them.

Figure 2.2: Example illustration of the conversion from RGB-D data to a graph representation.

Figure 2.3: Static relations include (a1) Above/Below, (a2) Around, and (a3) Inside/Surround.
Dynamic relations include (b1) Moving Together, (b2) Halting Together, (b3) Fixed-Moving Together, (b4) Getting Close, (b5) Moving Apart, and (b6) Stable. Taken from [3].

As detailed in [3], a total of 14 distinct semantic relations describe both the static or spatial properties (for example, above, below, inside) and the dynamic or temporal interactions (such as moving together, getting close, moving apart), as illustrated in Figure 2.3. These relations are computed from the 3D bounding boxes of objects across consecutive frames and are stored as binary edge features in the resulting scene graph.

It is important to recognize that any graph generation approach faces a fundamental trade-off. On one hand, the representation must be easy to extract from the data. On the other hand, it must be expressive enough to distinguish between different actions, meaning it must possess sufficient representational richness. Since the preferred graph extraction method is rule-based (as opposed to deep-learning-based methods), it is easy to work with. Regarding its expressiveness, we partially show its effectiveness in Section 5 by comparing with an RGB-based method.

Formally, a stream of graphs can be defined as $S = \{G_0, G_1, \dots, G_t\}$, where $G_t$ represents the extracted scene graph at time step $t$. At any specific time $\tau$, the extracted graph is denoted as $G_\tau = (V_\tau, E_\tau)$, where $V_\tau$ is the set of nodes, expressed as $V_\tau = \{v_\tau^i\}$. Each node $v_\tau^i$ is a one-hot-encoded object category. Similarly, $E_\tau$ denotes the set of edges, given by $E_\tau = \{e_\tau^j\}$, where each edge $e_\tau^j$ is represented as a 14-dimensional binary feature vector, i.e., $e_\tau^j \in \{0,1\}^{14}$.

2.2 Graph Neural Networks

Deep learning architectures such as Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs) have achieved remarkable success in domains including computer vision, speech recognition, and natural language processing.
A common characteristic of these architectures is the assumption that data lies on a regular and well-defined structure, such as a grid or a sequence. However, some real-world problems involve data that is inherently irregular and relational, such as social networks, molecular graphs, recommender systems, and knowledge graphs. In those domains, data points are best represented as nodes connected by edges, forming a graph structure. Traditional neural networks cannot directly process graph-structured data because graphs lack a fixed topology, have variable neighborhood sizes, and require permutation invariance. These limitations motivate the development of new neural network architectures that can directly operate on graphs while preserving their relational structure.

Graph Neural Networks (GNNs) address these challenges by extending deep learning techniques to graph domains. They enable learning on graphs by iteratively aggregating and transforming information from neighboring nodes. This allows GNNs to capture both node attributes and graph topology in a unified framework.

2.2.1 A generic formulation

At the core of most GNN models lies the message passing mechanism. Let a graph be defined as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. A generic message passing GNN updates node representations iteratively across layers. Denoting the hidden representation of node $v$ at layer $k$ as $h_v^{(k)}$, the update rule can be written as

$$h_v^{(k+1)} = \phi\left(h_v^{(k)}, \bigoplus_{u \in \mathcal{N}(v)} \psi\left(h_v^{(k)}, h_u^{(k)}, e_{uv}\right)\right) \tag{2.1}$$

where $\psi$ is the message function, $\phi$ is the update function, $e_{uv}$ denotes the edge features between nodes $u$ and $v$, and $\bigoplus$ represents a permutation-invariant aggregation operator such as summation or averaging. The neighborhood of node $v$ is denoted by $\mathcal{N}(v)$. Because the aggregation is permutation-invariant, the update does not depend on the order in which nodes are enumerated, which is a fundamental requirement for graph-based learning.
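As a minimal sketch of one message passing layer in the sense of Eq. 2.1, the following pure-Python example uses summation as the aggregator. The names `message_passing_step`, `psi_toy`, and `phi_toy`, the three-node toy graph, and the weight-free message and update functions are illustrative assumptions; a practical GNN layer would use learnable functions instead, and the edge features here are assumed to have the same dimension as the node features only to keep the toy message function simple.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Edge = Tuple[str, str]  # (u, v): a message flows from neighbor u to node v


def vec_add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def message_passing_step(
    h: Dict[str, Vector],
    edges: Dict[Edge, Vector],
    psi: Callable[[Vector, Vector, Vector], Vector],
    phi: Callable[[Vector, Vector], Vector],
) -> Dict[str, Vector]:
    """One layer of Eq. 2.1 with summation as the aggregation operator."""
    dim = len(next(iter(h.values())))
    aggregated = {v: [0.0] * dim for v in h}
    for (u, v), e_uv in edges.items():
        # Accumulate the message sent from neighbor u to node v.
        aggregated[v] = vec_add(aggregated[v], psi(h[v], h[u], e_uv))
    # Update every node from its old state and its aggregated messages.
    return {v: phi(h[v], aggregated[v]) for v in h}


# Toy, weight-free choices (assumptions, not the functions used in the thesis).
def psi_toy(h_v: Vector, h_u: Vector, e_uv: Vector) -> Vector:
    return vec_add(h_u, e_uv)  # message = neighbor state + edge feature


def phi_toy(h_v: Vector, m_v: Vector) -> Vector:
    return vec_add(h_v, m_v)  # update = old state + aggregated message


h0 = {"hand": [1.0, 0.0], "cup": [0.0, 1.0], "table": [1.0, 1.0]}
edges = {
    ("cup", "hand"): [1.0, 0.0],
    ("hand", "cup"): [1.0, 0.0],
    ("table", "cup"): [0.0, 1.0],
}
h1 = message_passing_step(h0, edges, psi_toy, phi_toy)
```

Since summation does not depend on the order of its operands, enumerating the edges in a different order yields the same updated node states, which is the order independence that the formulation relies on.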
By stacking multiple layers, each node representation gradually incorporates information from increasingly distant neighbors, allowing the network to capture higher-order structural patterns. More explicitly, in an $n$-layer GNN, node $v$ receives information from neighbors up to $n$ hops away.

2.2.2 Graph convolutional networks

One of the most widely used GNN layers is the Graph Convolutional Network (GCN) [15]. GCNs simplify the message passing mechanism by using a linear aggregation of normalized neighbor features followed by a nonlinear activation.

Let $A$ be the adjacency matrix of the graph, and let $I$ denote the identity matrix. Self-loops are added by defining $\tilde{A} = A + I$. The corresponding degree matrix is denoted by $\tilde{D}$, where $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and it is used to normalize the adjacency matrix as $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. Finally, the GCN update rule in matrix form is given by

$$H^{(k+1)} = \sigma\left(\hat{A} H^{(k)} W^{(k)}\right) \tag{2.2}$$

where $H^{(k)}$ is the node feature matrix at layer $k$, $W^{(k)}$ is a trainable weight matrix, and $\sigma(\cdot)$ is a nonlinear activation function such as ReLU.

2.2.3 Transformer-like GNN

While GCNs are computationally efficient and effective for many tasks, they implicitly assign equal importance to all neighboring nodes after normalization. This uniform weighting limits their expressiveness, particularly in graphs where some neighbors are more informative than others. To address this limitation, attention-based graph neural networks introduce mechanisms that allow the model to learn adaptive, data-dependent importance weights for neighboring nodes during aggregation, rather than treating all neighbors equally.

Another limitation of standard GCNs is that they typically operate only on scalar edge weights. This restricts their ability to model more complex relationships between entities, particularly when edges carry high-dimensional attributes, as in our graph representation. Both of these limitations can be addressed by the TransformerConv layer [16].
Inspired by the success of transformers in sequence modeling, TransformerConv extends attention mechanisms to graph-structured data using scaled dot-product attention and multi-head architectures. For a node $i$, the feature update rule is given by

$$h'_i = W_1 h_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j} \left(W_2 h_j + W_6 e_{ij}\right) \tag{2.3}$$

where $h_i \in \mathbb{R}^{d}$ denotes the input feature vector of node $i$, and $e_{ij} \in \mathbb{R}^{d_e}$ represents the edge feature between nodes $i$ and $j$. The matrices $W_1, W_2 \in \mathbb{R}^{d' \times d}$ and $W_6 \in \mathbb{R}^{d' \times d_e}$ are learnable linear transformations applied to the central node features, neighbor node features, and edge features, respectively. The attention coefficients $\alpha_{i,j}$ are computed using scaled dot-product attention as

$$\alpha_{i,j} = \underset{j \in \mathcal{N}(i)}{\mathrm{softmax}} \left( \frac{(W_3 h_i)^{\top} \left(W_4 h_j + W_6 e_{ij}\right)}{\sqrt{d'}} \right) \tag{2.4}$$

where $W_3, W_4 \in \mathbb{R}^{d' \times d}$ are learnable projection matrices that map node features to query and key representations, respectively, and $d'$ denotes the dimensionality of the attention space. The softmax operation is applied over all neighbors of node $i$, ensuring that the attention coefficients form a normalized distribution.

Through this mechanism, TransformerConv allows the model to selectively focus on the most relevant neighboring nodes while naturally incorporating rich, vector-valued edge attributes, resulting in a more expressive and flexible graph representation.

2.3 Related Works: Graph-based Action Recognition Models

In addition to the representational and architectural background, reviewing existing approaches in this domain provides useful context for the problem we address. Our focus is on recognizing human manipulation actions through scene graph representations; therefore, RGB-based action recognition models [17]–[22] fall outside the scope of this thesis and are not reviewed here. Accordingly, in the following sections we review prior work on graph-based manipulation recognition, organizing the discussion into two categories: offline models and real-time models.
2.3.1 Offline models

There exists a large corpus of work on graph-based scene representation for manipulation recognition [23]–[26]. Most of these works, however, operate offline in a batch mode. For instance, Akyol et al. [4] propose a two-headed manipulation recognition and prediction network based on Variational Graph Autoencoders [27], where reconstruction is not necessary. However, this work assumes that the key scene graphs are known a priori. Moreover, the proposed model accepts only a single graph as input and therefore lacks temporal understanding, which makes it infeasible for real-time applications.

Morais et al. [5] follow a different approach and model each entity in the scene with a state that evolves sparsely and asynchronously throughout the video as the entities interact with each other. The state is a manipulation label for humans and an affordance label for objects. Node features are derived from low-level visual features extracted using a pre-trained Faster-RCNN [28] model, and the messages between nodes are modeled as a type of attention, i.e., the cosine similarity between node features. This architecture is extended in [6] with a position-based object graph to improve its performance.

The work in [7] proposes an encoder-decoder architecture for joint learning of both manipulation recognition and temporal segmentation tasks. Their contribution involves a novel attention-based graph convolution layer to encode scene graphs and a temporal pyramidal pooling module to decode these graph embeddings into framewise labels. Spatial position information is the only cue employed as node embedding to represent skeletons and objects in the scene. The edges are dynamically created between highly correlated nodes during manipulation, except for those between skeleton joints, which are defined naturally.
Conventional 2D convolution operations are then applied to a generated $V \times T$ feature map, where $V$ is the number of objects and $T$ is the length of the video. However, this design strictly assumes that the number of nodes is constant throughout the video, which is a highly restrictive assumption for complex manipulation sequences. Building on [7], the work in [29] enhances temporal segmentation by introducing a Temporal Feature Fusion decoder while preserving feature space distances with Spectral Normalized Residual connections. However, the model in [29] is 3.7 times larger than that of [7], leading to higher computational complexity.

In contrast to these works, our model operates in real time and does not rely on any prior knowledge about the number of graph nodes or edges, nor does it require low-level RGB features.

2.3.2 Real-time models

In the context of online manipulation recognition, Dreher et al. [1] propose a model based on the graph encoder-decoder architecture [30]. They first extract graphs for each frame separately, using the spatiotemporal relations introduced in [3]. Next, in order to combine the sequential graphs, they introduce temporal connections between the same nodes in consecutive graphs. However, since a graph neural network can propagate information only to nodes at most $n$ hops away, where $n$ denotes the number of layers, this design exhibits scalability limitations when the temporal length of the input increases. One might suggest that new layers could be added to compensate, but in return over-smoothing [31,32] might occur, a well-known phenomenon in deep graph networks in which no meaningful and distinguishable node embeddings are learned.

Another recent attempt [8] proposes a joint model for manipulation recognition and manipulation-conditioned motion forecasting, with two-stage training.
Initially, the manipulation action recognition module is trained, and subsequently, to predict the motion of the objects and hands, the model employs the predicted manipulation information in addition to the current graph sequence. In this graph representation, node embeddings consist of 3D object positions concatenated with one-hot encoded object categories. Furthermore, only edges between the hands and other objects are considered, where the edge feature is simply the distance between the hands and objects. As in the case of [1], the consecutive graphs are linked with temporal edges. Consequently, the aforementioned criticisms regarding the limited temporal scalability of the model also apply to [8]. Additionally, discarding the edges between objects may prevent the model from learning more complex manipulations.

The recent work in [33] employs skeleton data and applies the sliding window with a majority voting approach on top of the Spatial-Temporal Graph Convolutional Networks (ST-GCN) [34]. The scalability issue also applies to the ST-GCN model due to the temporal concatenation of sequential graphs.

Our proposed model also differs from these real-time capable works due to our factorized encoder design, which enables temporal scaling of the network to enhance accuracy.

3. FACTORIZED GRAPH SEQUENCE ENCODER

In this chapter, we explain the proposed model architecture in detail. It is worth noting that the final design emerged through extensive empirical exploration, with numerous components such as the choice of GNN layer, parameter count, and normalization strategy evaluated through ablation studies. Some of these design choices are supported by the findings presented in Section 4. We conclude this chapter by describing how the model operates in real time using a sliding window mechanism combined with majority voting.

3.1 Model Architecture

Figure 3.1: The proposed Factorized Graph Sequence Encoder (FGSE) network.
We propose a new Factorized Graph Sequence Encoder (FGSE) network to recognize manipulation actions in real time from a stream of graph data. FGSE consists of two distinct encoder types, a Graph Encoder and a Sequence Encoder, combined with a new Hand Pooling operation, as illustrated in Figure 3.1.

Our Graph Encoder (GE) is built on a transformer-inspired graph convolutional operator called TransformerConv [16]. As described in Section 2.2.3, this layer employs attention-based message passing, enabling each node to assign different weights to information coming from its neighbors while also integrating the edge features that characterize their spatial and temporal relationships. In doing so, the layer produces refined node embeddings that reflect both the graph topology and the semantics of the relations.

To stabilize training and maintain consistent representation scaling across layers, each TransformerConv block is followed by LayerNorm [35]. This choice aligns with mainstream transformer architectures, where LayerNorm plays a key role in ensuring numerical stability and smoother gradient flow when stacking many attention-based layers.

We also apply the SELU activation function [36] after each convolutional layer. SELU is a self-normalizing activation that drives activations toward zero mean and unit variance during training. This property reduces the risk of vanishing or exploding activations as the network deepens, while eliminating the need for explicit normalization within the activation pathway. Combining SELU with LayerNorm enhances stability and helps the model converge more reliably.

Repeating this sequence of TransformerConv, SELU, and LayerNorm N times yields the complete GE module, illustrated in Figure 3.1. Through these stacked layers, the encoder progressively enriches the node embeddings, enabling the network to capture increasingly intricate relational and structural cues from the scene graph.
Also, note that the extent to which node information can propagate, measured in n-hop neighborhoods, is determined by the number of layers in the GNN.

As mentioned above, the number of nodes varies from one graph to another, which makes it difficult to pass graph representations directly into standard neural network components that expect constant-sized inputs. Pooling functions address this issue by reducing a variable-sized set of node embeddings to a constant-sized representation that can be processed by downstream modules. The simplest and most common pooling strategy is average pooling, where the final graph embedding is obtained by taking the mean of all node embeddings in the graph. This provides a straightforward, permutation-invariant summary of the entire graph. However, it has an obvious drawback: it treats every node as equally important, causing informative or task-critical nodes to be diluted by less relevant ones.

In manipulation action scenarios, hands are the main and only manipulators interacting with the objects in the scene [14]. Therefore, it is reasonable to assume that hands accumulate more descriptive embeddings for inferring the type of manipulation being performed. With this assumption, to obtain graph-level embeddings, we propose a simple and parameter-free operation, named Hand Pooling (HP), that selects the node embeddings belonging to the hands in the initial graph. The combination of these two stages can be expressed as:

$$\mathrm{HP}(\mathrm{GE}_\theta(G_\tau)) = z_{h,\tau} \tag{3.1}$$

where the Graph Encoder network, GE, is parametrized by $\theta$, and $z_{h,\tau}$ is the hand-corresponding ($h$) embedding vector pooled by HP at time $\tau$ from the corresponding scene graph $G_\tau$.

The Sequence Encoder (SE) is an encoder-only Transformer [37] that enables the model to learn temporal context by applying self-attention to the hand embeddings ($z_{h,\tau}$) in the input sequence. Finally, for each graph, a linear layer is applied to map those embeddings to manipulation labels.
Stating these two layers combined formally:

$$\mathrm{SEL}_\phi\left(z_{h,\tau-(W-1)}, \dots, z_{h,\tau}\right) = \left(\hat{y}^{\,0}_{\tau-(W-1)}, \dots, \hat{y}^{\,W-1}_{\tau}\right) \tag{3.2}$$

where SEL denotes the Sequence Encoder network (SE) followed by the linear layer (L), jointly parametrized by $\phi$, and $W$ is the input window length of the model. The model prediction $\hat{y}$ is the output vector of the Softmax layer (which is omitted in the notation for the sake of simplicity), and its superscript denotes the relative position of the prediction within the given input.

Note that, alternatively, the model could have predicted a single label for the whole input sequence by using the mean of the output embeddings or by employing, for instance, a class token. However, we observed that for long sequences this strategy significantly reduces the model's performance due to natural transitions among different types of manipulations throughout a long scenario. This is elaborated further in the discussion section.

The hallmark of our design is the separation of the Graph and Sequence Encoders. This design allows the model to efficiently pass information among graphs even when the temporal length increases, regardless of the number of layers in the GE module. In addition, our new HP operation reduces the workload of SE by exclusively returning hand embeddings. This is further discussed in Section 5.

3.2 Sliding Window with Majority Voting (SW-MV)

The FGSE network returns a manipulation label for each corresponding input graph, as depicted in Figure 3.1. During inference, we utilize a sliding window approach, which generates $W$ labels for a given graph $G_\tau$. To combine these predictions, the majority voting algorithm is leveraged, as illustrated in Figure 3.2. More formally, let $\hat{y}^{\,w}_{\tau}$ be the prediction vector, i.e., the output of the Softmax activation, for $G_\tau$ when it is the $w$-th element in the sliding window.
Thus, majority voting combines all predictions into a final one as:

$$\tilde{y}_\tau = \underset{c}{\arg\max} \sum_{w=0}^{W-1} \mathbb{1}\left(\arg\max\left(\hat{y}^{\,w}_{\tau}\right) = c\right) \tag{3.3}$$

where $\tilde{y}_\tau$ is the combined label at time $\tau$, and $\mathbb{1}$ represents the indicator function. Note that applying SW-MV, which resembles ensemble learning, helps reduce noisy predictions and prevents over-segmentation over time.

As can be noticed, applying majority voting to a sliding window with a length of $W$ introduces a delay of $W/\mathrm{FPS}$ seconds for the model output. Consequently, while a larger window enables the model to capture a richer local context, it comes at the cost of a delay proportional to $W$.

Figure 3.2: Majority voting is used to combine shifted predictions. As the window slides along the temporal axis, new predictions are generated. Here, the window size ($W$) is 5, and each colored box denotes a different predicted manipulation label.

4. EXPERIMENTS

In this chapter, we describe the complete experimental setup, including the datasets used, the model training procedure, the evaluation methodology, and a comparison of our results with leading approaches reported in the literature.

4.1 Datasets

We benchmark the proposed model on two publicly available datasets described below.

KIT Bimanual Action (Bimacs) Dataset [1] consists of 6 subjects performing 9 distinct manipulation tasks relevant to kitchen and workshop environments, with each task repeated 10 times. We borrow Figure 4.1 from [1], which illustrates three representative videos using selected frames. In total, the dataset contains 2 hours and 18 minutes of RGB-D recordings and covers 14 atomic manipulation categories: idle, approach, retreat, lift, place, hold, stir, pour, cut, drink, wipe, hammer, saw, and screw. The videos are fully annotated for each hand individually and involve interactions with 12 distinct objects, namely: cup, bowl, whisk, bottle, banana, cutting board, knife, sponge, hammer, saw, wood, and screwdriver.

Figure 4.1: Sample videos from the Bimacs dataset.
The first row presents breakfast preparation, the second row depicts a cooking task involving stirring and pouring, and the third row shows hard drive disassembly by unscrewing a screw. Taken from [1].

The Bimacs dataset already provides extracted graphs with ESEC relations [3], so we feed these graphs directly to our FGSE model. Note that since manipulations in Bimacs are labeled for each hand separately, we employ two linear layers to predict the manipulations performed by the left and right hands individually.

Collaborative Action (CoAx) Dataset [10] involves 6 subjects executing 3 industrial assembly manipulation tasks, one of which involves interaction with a collaborative robot. Each manipulation task is repeated 10 times. Similarly, we borrow Figure 4.2 from [10], which illustrates a representative video for each of those tasks. The dataset contains a total of 1 hour and 58 minutes of RGB-D video data. The dataset comprises 10 distinct manipulation actions and 16 objects, with frames annotated as action-object pairs. Although this setup yields 160 possible action-object combinations, only 23 pairs actually occur in the CoAx dataset. To reduce model complexity, we identify these existing combinations and merge each action-object pair into a single unified label. The resulting labels are: approach, grab screwdriver, plug screwdriver, grab valve, screw screwdriver, release valve, grab soldering iron, plug soldering iron, retreat, join screwdriver, grab valve terminal, plug valve terminal, place screwdriver, grab box with screws, place box with screws, grab hose, wait for robot, plug hose, grab box with membrane, grab soldering station, solder hose, release soldering station, release box with membrane.

Figure 4.2: An overview of the CoAx dataset tasks is shown.
From top to bottom, the rows depict Tasks 1 to 3: valve terminal setup and assembly with screws; valve assembly with screws and a membrane; and soldering a capacitor using soldering tin, assisted by a collaborative robot holding the soldering board. Taken from [10].

Additionally, the dataset includes 3D object bounding boxes; however, unlike Bimacs [1], it does not provide spatiotemporal relation information. Following the approach in [3], we derive these relations from the bounding boxes in order to construct the graph representations of the dataset.

In both datasets, there might be noisy object detections and, consequently, incorrect relations between those objects. To mitigate this issue, we set an empirical threshold to filter out such relations. Specifically, if any two objects are too far apart, we remove the edge between them.

4.2 Training Setup

We optimize the proposed FGSE network by minimizing the cross-entropy loss averaged over the input window, as given in Equation 4.1. Notice that majority voting is not applied during training.

\[
\mathcal{L}_{CE} = -\frac{1}{W} \sum_{w=0}^{W-1} y_{\tau-w}^{\top} \cdot \log\left(\hat{y}_{\tau-w}\right) \tag{4.1}
\]

where $y$ represents the one-hot-encoded ground truth, $\hat{y}$ is the prediction vector after the softmax activation, and $(\cdot)$ indicates the dot product between these two vectors.

We experiment with varying window lengths, denoted as W followed by the respective value (e.g., W30 means a window length of 30). Additionally, we observed that consecutive graphs are quite similar to each other unless the action changes. Therefore, in certain experiments, we downsampled the input sequence by a factor of 3 to accelerate training and testing without compromising accuracy, referring to this as D3. Note that for the metric calculations (F1-macro/micro), we upsampled the predictions back to the original temporal resolution for a fair comparison. Through empirical evaluation, we set the number of layers in both the Graph Encoder and Sequence Encoder to 2, i.e., the parameter N in Figure 3.1 is set to 2.
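The windowed loss in Equation 4.1 and the D3 resampling can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the released implementation: the function names are hypothetical, the ground truth is stored as class indices rather than one-hot vectors, and the upsampling by simple label repetition is an assumption, since the text does not state which upsampling scheme was used.

```python
import math


def window_cross_entropy(targets, probs, eps=1e-12):
    """Cross-entropy averaged over a window of W graphs (cf. Equation 4.1).

    targets: list of W integer class indices (one-hot ground truth stored as indices).
    probs:   list of W probability vectors (softmax outputs), one per graph.
    """
    assert len(targets) == len(probs) and len(targets) > 0
    # With a one-hot y, the dot product y^T log(y_hat) reduces to log(y_hat[class]).
    total = sum(-math.log(max(p[t], eps)) for t, p in zip(targets, probs))
    return total / len(targets)


def downsample(sequence, factor=3):
    """Keep every `factor`-th element; factor=3 corresponds to the D3 setting."""
    return sequence[::factor]


def upsample_labels(labels, factor, length):
    """Assumed scheme: repeat each label `factor` times, then trim to `length`."""
    repeated = [label for label in labels for _ in range(factor)]
    return repeated[:length]


if __name__ == "__main__":
    print(window_cross_entropy([0, 1], [[0.5, 0.5], [0.25, 0.75]]))
    print(downsample(list(range(7)), 3))
    print(upsample_labels(["idle", "lift", "place"], 3, 7))
```

Repeating labels keeps the predicted sequence aligned frame by frame with the original annotations, so the F1 scores are computed on the same number of samples as for models that do not downsample.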
Further network and training parameter details can be found in the shared source code (https://github.com/eneserdo/FGSE).

Table 4.1: Manipulation Recognition results on Bimacs [1].

| Methods | Real-time capable | No visual feature | F1-macro | F1-micro |
|---|---|---|---|---|
| ASSIGN [5,38] | ✗ | ✗ | 79.5 | 82.3 |
| PGCN [7] | ✗ | ✓ | 81.5 | 86.9 |
| UQ-TFGCN [29] | ✗ | ✓ | 88.6 | 88.4 |
| Dreher et al. [1] | ✓ | ✓ | 63.0 | 64.0 |
| H2O+RGCN [8] | ✓ | ✓ | 66.0 | 68.0 |
| FGSE-W30-D3 (Ours) | ✓ | ✓ | 78.1 | 81.1 |
| FGSE-W75-D3 (Ours) | ✓ | ✓ | 80.3 | 82.7 |

4.3 Evaluation

Macro- and micro-averaged F1 scores are measured to report the success of each trained model. Note that due to class imbalance, the macro-averaged F1 score is a more reliable metric to measure the performance. Following the work in [1], we apply the leave-one-subject-out cross-validation approach to generate six folds for both datasets. Each fold corresponds to a different subject in the dataset.

4.4 Results

In this section, we present the results for two variants of our model with window lengths of 30 and 75.

Table 4.1 compares the recognition performance of our model with other relevant models on the Bimacs [1] dataset. We separate the benchmarked models based on their real-time capabilities, i.e., online versus offline models. Among the online models (e.g., [8] and [1] in Table 4.1), our model (FGSE-W75-D3) achieves a significant improvement over the previous state-of-the-art model [8], surpassing it by 14.3 and 14.7 percentage points in terms of F1-macro and F1-micro scores, respectively. Compared to the offline models (e.g., [5,7,38] and [29] in Table 4.1) that take the entire video at once, i.e., access the complete context and relations between the manipulations, our model (FGSE-W75-D3) achieves results comparable to [5,7] when the window length is increased (W=75). We, however, note that the incorporation of visual features in [5] contradicts the original purpose of scene graphs.
Scene graphs are designed to represent objects independently of their appearances or shapes, thereby making manipulation recognition more generalizable. The offline model UQ-TFGCN [29] attains the highest performance among all models; however, it has the drawback of having the highest number of parameters (20.1M), which is 74 times more than our model. Similarly, PGCN [7] has 21 times more parameters (5.4M) than our model.

Figure 4.3: An example run from the test set of Bimacs [1], in which a person pours water from a bottle into a cup, and then drinks it. The top three rows show the ground-truth labels, predictions, and vote count in majority voting for the left hand, and the next three rows correspond to the right hand. Each color represents a different action: idle, approach, lift, hold, pour, place, retreat, drink. Layout adapted from [1].

Figure 4.3 presents an illustrative sample from the Bimacs dataset and its qualitative analysis. In addition to the ground truth and the predicted labels for the sample video, we also include the vote counts in majority voting, which can be related to the confidence level of the model. In all predictions, our model demonstrates a high degree of confidence, except for instances involving transitions between distinct manipulations.

We also want to give a qualitative example where our model performs poorly. As illustrated in Figure 4.4, some actions are incorrectly detected, and for some parts of the video, over-segmentation is observed.

Table 4.2 reports the obtained recognition results on the CoAx dataset [10]. Our model yields a new state-of-the-art score, improving on the nearest competitor [8] by 5.6 percentage points in terms of F1-macro. Note that the results of Dreher et al. [1] on the CoAx dataset are taken from [8].

Figure 4.4: An example result where our model performs poorly. As can be seen, some actions are incorrectly detected, and for some parts of the video, over-segmentation is observed.
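The vote counts visualized in Figure 4.3 follow directly from the majority voting rule in Equation 3.3. The snippet below is a minimal sketch of that rule under stated assumptions: the window slides with a stride of one graph, each window contributes one hard label (the argmax of the softmax output) per position, and ties are broken in favor of the label that was seen first. The function name and the tie-breaking rule are illustrative choices, not taken from the released code.

```python
from collections import Counter


def sliding_window_majority_vote(window_preds, window_length):
    """Combine per-position labels from overlapping windows (cf. Equation 3.3).

    window_preds[s] holds the `window_length` predicted class indices for the
    window covering graphs s .. s + window_length - 1 (stride 1 is assumed).
    Returns (labels, vote_counts) with one entry per graph in the sequence.
    """
    num_graphs = len(window_preds) + window_length - 1
    votes = [Counter() for _ in range(num_graphs)]
    for start, preds in enumerate(window_preds):
        assert len(preds) == window_length
        for offset, label in enumerate(preds):
            # Graph (start + offset) receives one vote from this window.
            votes[start + offset][label] += 1

    labels, counts = [], []
    for counter in votes:
        # most_common keeps first-seen order among equally frequent labels.
        label, count = counter.most_common(1)[0]
        labels.append(label)
        counts.append(count)
    return labels, counts


if __name__ == "__main__":
    preds = [[0, 0, 1], [0, 1, 1], [2, 1, 1]]
    print(sliding_window_majority_vote(preds, 3))
```

In this sketch, graphs near the start and end of a sequence are covered by fewer than W windows, so their maximum vote count is lower. In a streaming setting, the final label for a graph becomes available only once the last window containing it has been processed, which is the source of the W/FPS delay discussed in Section 3.2.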
Table 4.2: Manipulation Recognition results on CoAx [10].

| Methods | F1-macro | F1-micro |
|---|---|---|
| Dreher et al. [1] | 60.0 | 70.0 |
| H2O+RGCN [8] | 87.0 | 90.0 |
| FGSE-W30 (Ours) | 90.7 | 92.8 |
| FGSE-W75-D3 (Ours) | 92.6 | 94.9 |

A comparison of the W30 and W75 variants of our model in both Table 4.1 and Table 4.2 indicates that slightly relaxing the real-time constraints, i.e., increasing the window length, leads to improved performance by allowing the model to capture a larger local context. A further analysis of the impact of window length is given in the discussion section.

Regarding the runtime performance, with 269K parameters and 4.8 GFLOPs (on average), our proposed model FGSE achieves approximately 66 FPS on an Intel i9-12900K CPU with an NVIDIA GeForce RTX 3060 GPU, indicating that it is lightweight enough to run in real time even on a low-end GPU card.

To summarize, the results indicate that our model achieves a new state-of-the-art performance among real-time capable models. Moreover, it demonstrates promising performance even when compared to offline models, especially given its extremely parameter-efficient design relative to [29] and [7].

4.5 Qualitative Evaluation

Figure 4.5: Our graph extraction pipeline. Each intermediate step is visualized in between the boxes.

We recorded a proof-of-concept video to evaluate the model in a real-world setting, where a robot, controlled via teleoperation, assists a human in preparing a generic dish. An RGB-D video was captured using a ZED camera. To obtain the corresponding graphs, we constructed a pipeline using state-of-the-art, off-the-shelf tools, as illustrated in Figure 4.5.

We follow a graph extraction procedure similar to that described for Bimacs [1]. In that work, approximately 5.4k images were first manually annotated and then used to train a YOLOv3 model, which automatically labeled the remaining images. Adopting a similar strategy, we fine-tune a YOLOv11 model for object annotation using a human-annotated subset of the data.
The trained detector is then applied to predict 2D bounding boxes for objects across the dataset. Using these bounding boxes together with the corresponding RGB images, we employ the SAM2 [39] model to obtain object segmentation masks. By incorporating the associated depth frames, we reconstruct scene-level point clouds and apply the object masks to estimate 3D bounding boxes for each object.

For hand annotations, standard object detection or segmentation models prove inadequate. Instead, we employ the AlphaPose [40] model, which can reliably localize hands. Based on the detected hand keypoints, we fit 2D bounding boxes for the hands and then follow a pipeline similar to that used for objects to obtain their corresponding 3D bounding boxes.

We then aggregate the 3D bounding box information for both objects and hands and extract ESEC [3] relations using a rule-based approach that leverages the spatiotemporal arrangement of the 3D bounding boxes. Finally, the resulting graph representations are fed into our model. A selection of frames, along with the predicted action labels for each hand, is shown in Figure 4.6. The video can be accessed via the project webpage (https://air.cs.itu.edu.tr/projects/fgse.html).

Figure 4.6: Sample frames from the qualitative evaluation of a video recorded in our lab. In the top-left corner of each frame, the predicted manipulation labels for each hand can be seen.

Note that after observing some erroneous graph data, we also used this pipeline to re-extract graphs from the original Bimacs dataset in an effort to obtain higher-quality representations. However, this attempt was unsuccessful, and the newly generated graphs were not used. The details of this attempt are provided in Appendix A.

5. DISCUSSION

In this chapter, we present a series of focused analyses to better understand our model's performance and design choices, including the impact of temporal window length, comparison with an RGB-only baseline, the contribution of majority voting, pooling method comparisons, and sequence-encoder ablations.

5.1 Impact of Window Length

Figure 5.1: From a sequence of graphs to a temporally concatenated graph representation.

We hypothesized that due to the factorized encoder design, our model would scale better in the temporal dimension than prior approaches that concatenate the input graphs temporally [1,8]. As shown in Table 5.1, our experimental findings on Bimacs [1] reveal that the performance of our model substantially improves when the window length is doubled from 10 to 20 graphs. Beyond this point, the performance continues to increase, but the rate of improvement slows down. This can be interpreted as 20 graphs being sufficient to recognize most of the manipulations, so that feeding in more graphs does not dramatically enhance recognition performance. A similar improvement trend is also visible in the CoAx dataset [10]. As indicated in the last row of Table 5.1, our model demonstrated an improvement of 9.1 points in terms of F1-macro when the number of graphs increased from 10 to 40. The results indicate that our network is better at scaling temporally by design.

Table 5.1: The F1-macro scores as window length (W) increases on Bimacs [1] and CoAx [10].

| Dataset | Model | W=10 | W=20 | W=30 | W=40 |
|---|---|---|---|---|---|
| Bimacs | Dreher et al. [1] | 63.0 | 49.6 | 51.0 | N/A |
| Bimacs | Dreher et al. [1] (scaled) | 63.0 | 51.2 | 42.9 | N/A |
| Bimacs | FGSE (Ours) | 72.2 | 78.3 | 78.6 | 79.9 |
| CoAx | FGSE (Ours) | 83.1 | 87.9 | 90.7 | 92.2 |

In this table, we compare our model only with the real-time capable model proposed by Dreher et al. [1], since the source code of H2O+RGCN [8] is not yet publicly available.
As mentioned above, the compared model in [1] constructs a single graph through temporal concatenation, i.e., it adds edges along the temporal axis between the same objects in consecutive graphs, as illustrated in Figure 5.1. This design does not scale as the temporal length of the input grows, since graph neural networks can propagate information only to nodes up to n hops away, where n is the number of layers. The first row in Table 5.1 shows that the performance of the model in [1] worsened as the window length increased, even though the input data then contains more information. While adding more layers could mitigate this issue, it may also lead to over-smoothing [31,32]. To examine this, we doubled and tripled the number of processing steps in Dreher's model [1], and as indicated in the second row, this approach also resulted in a similar performance failure. Note that the first three rows in Table 5.1 only show results for the first fold due to the high computational load of training the model in [1].

5.2 Comparison with an RGB-Only Model

Considering the rapid progress in RGB-based recognition models, a reasonable question is how such a model would perform on the Bimacs dataset [1]. ViViT Model 2 [9] was chosen for comparison due to its architectural similarity, i.e., it has spatial and temporal encoders analogous to our Graph Encoder and Sequence Encoder. In this ViViT model, we used a pre-trained spatial encoder and trained the temporal encoder from scratch. To make the comparison fair, we also employed a sliding window approach with majority voting at test time. Due to the high computational cost, in this experiment, we only performed tests with the first fold.

Table 5.2: Comparison with an RGB-only model (W30-D3) on Bimacs [1].

| Model | F1-macro | F1-micro |
|---|---|---|
| ViViT-Model 2 [9] | 63.5 | 64.1 |
| FGSE (Ours) | 78.3 | 82.6 |

As shown in Table 5.2, the obtained F1 scores of the ViViT model are quite low compared to our proposed model. We believe that this underperformance is inherent to RGB-based approaches. As is well known, RGB-based models require a significant amount of training data to learn from high-dimensional raw image data. In our case, even though the network is partially pre-trained, the Bimacs dataset may not be sufficient for such a model, despite our effort to minimize the number of parameters in the temporal encoder part of the network. This poor generalization performance in small dataset settings makes RGB-based models infeasible for HRC scenarios in which, for instance, a very specific cooking-related manipulation is supposed to be learned from very few data points to help the robot efficiently collaborate with the human user.

On the other hand, semantically rich symbolic scene graph-based methods are expected to generalize better from a few data points thanks to their very low-dimensional representation space. In such a semantic representation, details irrelevant to the manipulation, such as varying light conditions and background clutter, are naturally disregarded, whereas they might pose a significant challenge for RGB-based methods. For instance, the same pouring manipulation executed with different objects (e.g., a cup versus a bottle) might be unrecognizable by the RGB-based model due to the shape and appearance changes of the objects in the scene.

One might argue that scene graphs also depend on an object recognition model, and that this merely shifts the burden of generalizability to the object detector. However, object detectors are specifically trained to identify objects with varying visual features, which makes them inherently more robust.
Given that manipulation recognition is inherently object-centric, where the temporal relationships between objects and their environment matter more than instance-specific properties like object appearance or geometry, it is reasonable to break down the manipulation recognition task into object detection and graph-based recognition steps.

To conclude, our experimental findings in Table 5.2 reveal that RGB-only models underperform on object-centric manipulation datasets with a limited number of samples, such as Bimacs. This observation underscores the limitations of such models in HRC scenarios.

5.3 Contribution of Majority Voting

Table 5.3: Impact of the sliding window with majority voting on Bimacs [1] (D3).

| Methods | F1-macro | F1-micro |
|---|---|---|
| Center of window | 76.9 | 79.2 |
| Single Pred. | 70.0 | 73.2 |
| Majority voting | 78.1 | 81.1 |

As an ablation study, we measure the impact of majority voting. As a first alternative (Single Pred. in Table 5.3), we take the average of the final embeddings after the Sequence Encoder and use a single linear layer to predict a label that corresponds to the last graph in the window. More formally, the Sequence Encoder combined with the linear layer predicts:

\[
\mathrm{SEL}_{\phi}\left(z_{h,\tau-(W-1)}, \cdots, z_{h,\tau}\right) = \hat{y}_{\tau} \tag{5.1}
\]

As a second alternative (Center of window in Table 5.3), we only use the label at the window's center, i.e., $\tilde{y}_{\tau} = \hat{y}^{W/2}_{\tau}$, without altering the loss function or applying majority voting.

The results in Table 5.3 indicate that majority voting improves the model's performance over both alternatives, most notably over the single-prediction variant.

5.4 Comparison with Other Pooling Methods

Table 5.4: The comparison with alternative pooling methods (W30-D3).

| Methods | F1-macro | F1-micro |
|---|---|---|
| Global mean pooling | 75.6 | 79.5 |
| Top-k pooling [41] | 77.1 | 80.7 |
| SAGPool [42] | 75.5 | 79.7 |
| Hand-Pooling (Ours) | 78.1 | 81.1 |

To demonstrate the effectiveness of the proposed Hand Pooling operation, we compare it against several widely used pooling strategies from the literature.
As reported in Table 5.4, Hand Pooling outperforms naive global mean pooling, which aggregates all node features by simple averaging and therefore ignores the structural and semantic importance of individual nodes.

We also compare our method with more advanced pooling techniques, including Top-k pooling [41] and SAGPool [42]. Top-k pooling selects a subset of nodes based on learned importance scores, while SAGPool employs self-attention mechanisms to adaptively retain informative nodes during pooling. Despite their increased modeling capacity, both approaches are outperformed by Hand Pooling in our experiments.

In addition to its superior performance, Hand Pooling has the practical advantage of incurring no additional computational cost, as it does not rely on learnable parameters or auxiliary scoring networks, unlike Top-k pooling and SAGPool. This makes Hand Pooling both an effective and efficient alternative for graph-level representation learning in our setting.

5.5 Ablations with Sequence Encoder

Table 5.5: The comparison with alternatives of Sequence Encoder on Bimacs [1] (W30-D3).

| Seq. Enc. Variants | F1-macro | F1-micro |
|---|---|---|
| No Encoder | 61.2 | 69.1 |
| LSTM | 69.4 | 74.1 |
| BiLSTM | 77.1 | 80.7 |
| Encoder-only Transformer | 78.1 | 81.1 |

To quantify the contribution of the Sequence Encoder in our model, we compare it with classical recurrent architectures. The first row of Table 5.5 shows that removing the Sequence Encoder completely and relying merely on the Graph Encoder drops the performance significantly. This clearly reveals that the Sequence Encoder contributes to the performance by extracting the local context information. On the other hand, LSTM-based approaches could not reach the performance of the Encoder-only Transformer network, although the bidirectional LSTM shows promising results.

5.6 Limitations

Despite these promising results, there are certain limitations that can be viewed in terms of model architecture and the chosen graph representation.
From an architectural perspective, the model lacks an explicit long-term memory mechanism for handling extended action sequences, limiting its ability to capture long-range dependencies and consequential relationships between actions. Incorporating a memory-based mechanism, such as the use of dedicated memory tokens as in LSTR [20], could help preserve and exploit long-term temporal context.

Additionally, the quality of the scene graphs is highly dependent on the extraction process from RGB-D data, which is susceptible to noise, mainly due to unreliable depth measurements and imperfect object detection. Once introduced, such errors can propagate through the graph representation and are difficult to recover from. As an alternative, scene graph generation methods or conventions that are less sensitive to noise and do not rely on depth information could be explored. Furthermore, the architecture could be extended to explicitly model detection uncertainty, making it more robust to errors in the input data.

Finally, because the Hand Pooling operation assumes that manipulation actions are carried out by hands, our model's performance degrades when the hands are not visible in the scene or when the action is non-prehensile.

6. CONCLUSIONS

In this thesis, we addressed the challenge of real-time manipulation action recognition for human-robot collaboration scenarios, where both computational efficiency and semantic understanding are essential. We introduced a novel Factorized Graph Sequence Encoder (FGSE) network that effectively decouples spatial and temporal feature extraction through its Graph Encoder and Sequence Encoder modules, combined with a parameter-free Hand Pooling operation.

Our approach leverages scene graph representations based on spatiotemporal ESEC relations, which provide a semantically rich yet computationally efficient abstraction of manipulation scenes.
This design choice filters out irrelevant visual variations such as lighting conditions, background clutter, and object appearances, enabling the model to focus on the relational cues that are most informative for recognizing manipulation actions. The factorized architecture overcomes the temporal scalability limitations of prior methods that rely on temporally concatenated graphs, allowing our model to effectively capture longer temporal contexts without suffering from information propagation bottlenecks or over-smoothing.

Extensive experiments on the Bimacs and CoAx datasets demonstrate that FGSE achieves state-of-the-art performance among real-time capable models, surpassing the previous best approach by 14.3 and 5.6 percentage points in F1-macro score, respectively. Furthermore, our model achieves results comparable to offline models that process entire videos at once, despite operating under real-time constraints and having significantly fewer parameters.

Comprehensive ablation studies validate our design choices, confirming the contributions of the factorized encoder design, Hand Pooling operation, and majority voting mechanism to the overall performance. Additionally, our comparison with ViViT Model 2 provides evidence that RGB-only approaches struggle on object-centric manipulation datasets with limited training samples, underscoring the value of graph-based representations in human-robot collaboration settings.

As future research directions, we plan to extend the FGSE architecture to skeleton-based whole-body manipulation tasks, which would enable the recognition of more complex human actions beyond hand-object interactions. To mitigate the noise in graph extraction, incorporating estimation confidence scores into the model could improve robustness against uncertain detections.
Alternatively, employing deep learning models to generate scene graphs directly from point cloud data represents a promising research direction that could bypass the error-prone intermediate steps of the current pipeline. Finally, exploring self-supervised or few-shot learning strategies could further enhance the model's adaptability to novel manipulation actions with minimal training data.

REFERENCES

[1] Dreher, C.R.G., Wächter, M. and Asfour, T. (2020). Learning Object-Action Relations from Bimanual Human Demonstration Using Graph Networks, IEEE Robotics and Automation Letters, 5(1), 187–194.
[2] Aksoy, E.E., Orhan, A. and Wörgötter, F. (2017). Semantic decomposition and recognition of long and complex manipulation action sequences, International Journal of Computer Vision, 122, 84–115.
[3] Ziaeetabar, F., Kulvicius, T., Tamosiunaite, M. and Wörgötter, F. (2018). Recognition and prediction of manipulation actions using Enriched Semantic Event Chains, Robotics and Autonomous Systems, 110, 173–188.
[4] Akyol, G., Sariel, S. and Aksoy, E.E. (2021). A Variational Graph Autoencoder for Manipulation Action Recognition and Prediction, 20th International Conference on Advanced Robotics (ICAR), IEEE, pp.968–973.
[5] Morais, R., Le, V., Venkatesh, S. and Tran, T. (2021). Learning Asynchronous and Sparse Human-Object Interaction in Videos, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.16041–16050.
[6] Qiao, T., Men, Q., Li, F.W.B., Kubotani, Y., Morishima, S. and Shum, H.P.H. (2022). Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos, European Conference on Computer Vision (ECCV).
[7] Xing, H. and Burschka, D. (2022). Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.5195–5201.
[8] Lagamtzis, D., Schmidt, F., Seyler, J., Dang, T. and Schober, S.
(2023). Exploiting Spatio-Temporal Human-Object Relations Using Graph Neural Networks for Human Action Recognition and 3D Motion Forecasting, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.7832–7838.
[9] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M. and Schmid, C. (2021). ViViT: A Video Vision Transformer, International Conference on Computer Vision (ICCV).
[10] Lagamtzis, D., Schmidt, F., Seyler, J.R. and Dang, T. (2022). Coax: Collaborative action dataset for human motion forecasting in an industrial workspace., ICAART (3), pp.98–105.
[11] Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M. and Fei-Fei, L. (2015). Image retrieval using scene graphs, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.3668–3678.
[12] Xu, D., Zhu, Y., Choy, C.B. and Fei-Fei, L. (2017). Scene graph generation by iterative message passing, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.5410–5419.
[13] Li, H., Zhu, G., Zhang, L., Jiang, Y., Dang, Y., Hou, H., Shen, P., Zhao, X., Shah, S.A.A. and Bennamoun, M. (2024). Scene graph generation: A comprehensive survey, Neurocomputing, 566, 127052.
[14] Wörgötter, F., Aksoy, E.E., Krüger, N., Piater, J., Ude, A. and Tamosiunaite, M. (2013). A Simple Ontology of Manipulation Actions based on Hand-Object Relations, IEEE Transactions on Autonomous Mental Development, 5(2), 117–134.
[15] Kipf, T. (2016). Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907.
[16] Shi, Y., Huang, Z., Feng, S., Zhong, H., Wang, W. and Sun, Y. (2021). Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification, Z.H. Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, International Joint Conferences on Artificial Intelligence Organization, pp.1548–1554, main Track.
[17] Zhang, B., Wang, L., Wang, Z., Qiao, Y. and Wang, H. (2016). Real-time action recognition with enhanced motion vector CNNs, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.2718–2726.
[18] Cob-Parro, A.C., Losada-Gutiérrez, C., Marrón-Romera, M., Gardel-Vicente, A. and Bravo-Muñoz, I. (2024). A new framework for deep learning video based Human Action Recognition on the edge, Expert Systems with Applications, 238, 122220.
[19] Liu, K., Liu, W., Gan, C., Tan, M. and Ma, H. (2018). T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition, Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
[20] Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z. and Soatto, S. (2021). Long Short-Term Transformer for Online Action Detection, Conference on Neural Information Processing Systems (NeurIPS).
[21] Zhao, Y. and Krähenbühl, P. (2022). Real-time Online Video Detection with Temporal Smoothing Transformers, European Conference on Computer Vision (ECCV).
[22] Wang, X., Zhang, S., Qing, Z., Shao, Y., Zuo, Z., Gao, C. and Sang, N. (2021). Oadtr: Online action detection with transformers, Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.7565–7575.
[23] Sridhar, M., Cohn, G.A. and Hogg, D. (2008). Learning Functional Object-Categories from a Relational Spatio-Temporal Representation, Proc. 18th European Conference on Artificial Intelligence, pp.606–610.
[24] Kjellström, H., Romero, J. and Kragić, D. (2011). Visual object-action recognition: Inferring object affordances from human demonstration, Comput. Vis. Image Underst., 115(1), 81–90.
[25] Yang, Y., Fermüller, C. and Aloimonos, Y. (2013). Detection of manipulation action consequences (MAC), Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.2563–2570.
[26] Aksoy, E.E., Abramov, A., Wörgötter, F. and Dellen, B. (2010).
Categorizing object-action relations from semantic scene graphs, IEEE International Conference on Robotics and Automation (ICRA), pp.398–405.
[27] Kipf, T.N. and Welling, M. (2016). Variational graph auto-encoders, arXiv preprint arXiv:1611.07308.
[28] Ren, S., He, K., Girshick, R. and Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE transactions on pattern analysis and machine intelligence, 39(6), 1137–1149.
[29] Xing, H. and Burschka, D. (2024). Understanding human activity with uncertainty measure for novelty in graph convolutional networks, The International Journal of Robotics Research, 02783649241287800.
[30] Battaglia, P., Hamrick, J.B.C., Bapst, V., Sanchez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G.E., Vaswani, A., Allen, K., Nash, C., Langston, V.J., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y. and Pascanu, R. (2018). Relational inductive biases, deep learning, and graph networks, arXiv.
[31] Keriven, N. (2022). Not too little, not too much: a theoretical analysis of graph (over) smoothing, Advances in Neural Information Processing Systems, 35, 2268–2281.
[32] Rusch, T.K., Bronstein, M.M. and Mishra, S. (2023). A survey on oversmoothing in graph neural networks, arXiv preprint arXiv:2303.10993.
[33] Dallel, M., Havard, V., Dupuis, Y. and Baudry, D. (2022). A Sliding Window Based Approach With Majority Voting for Online Human Action Recognition using Spatial Temporal Graph Convolutional Neural Networks, Proceedings of the 2022 7th International Conference on Machine Learning Technologies, ICMLT ’22, Association for Computing Machinery, New York, NY, USA, p.155–163.
[34] Yan, S., Xiong, Y. and Lin, D. (2018).
Spatial temporal graph convolutional networks for skeleton-based action recognition, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18, AAAI Press.
[35] Ba, J.L., Kiros, J.R. and Hinton, G.E. (2016). Layer normalization, arXiv preprint arXiv:1607.06450.
[36] Klambauer, G., Unterthiner, T., Mayr, A. and Hochreiter, S. (2017). Self-normalizing neural networks, Advances in neural information processing systems, 30.
[37] Vaswani, A. (2017). Attention is all you need, Advances in Neural Information Processing Systems.
[38] Morais, R., Le, V., Venkatesh, S. and Tran, T. Learning Asynchronous and Sparse Human-Object Interaction in Videos-Supplementary Material.
[39] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P. and Feichtenhofer, C. (2024). SAM 2: Segment Anything in Images and Videos, arXiv preprint arXiv:2408.00714.
[40] Fang, H.S., Li, J., Tang, H., Xu, C., Zhu, H., Xiu, Y., Li, Y.L. and Lu, C. (2022). AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[41] Gao, H. and Ji, S. (2019). Graph u-nets, international conference on machine learning, PMLR, pp.2083–2092.
[42] Lee, J., Lee, I. and Kang, J. (2019). Self-attention graph pooling, International conference on machin