Automatic gaze detection for child-robot interaction

Date
2024-07-01
Authors
Bölük, Nursena
Publisher
Graduate School
Abstract
Gaze behavior is a powerful, nonverbal form of communication and an essential indicator of attention in social interactions. Gaze behavior is especially critical for individuals with autism: because they tend to avoid looking at the face, they face significant difficulties in social interactions. Gaze estimation from scene images was used so that the gaze of individuals with autism could be detected as naturally as possible and the approach would remain adaptable to any environment. This study focuses on developing an Automatic Gaze Detection (AGD) system that detects two-dimensional (2D) gaze target points, and the regions they fall in, during interactions between children with autism and robots.

Two new datasets, ChildPlay-R and EMBOA-Gaze, were used to develop the AGD system. The ChildPlay-R dataset contains videos of children with and without autism interacting with adults, together with the children's 2D gaze target points. It was created from the open-source ChildPlay Gaze dataset by selecting videos whose environment resembles the EMBOA setting and labeling them as autism or non-autism; it comprises fifteen videos, five with autism and ten without. The EMBOA-Gaze dataset contains robot-assisted therapy videos of interaction games played by children with autism, 2D gaze target points, and the regional labels of these points ("Robot," "Therapist," and "Other"). It includes eight children with autism (six males and two females) and two typically developing children (both male) between the ages of five and ten. Since the children moved frequently during the sessions, a fisheye camera was used to capture the scene images. The EMBOA-Gaze dataset is part of the EU Erasmus+ funded EMBOA project, which aims to enhance social robot interventions for children with autism with affective computing technologies. One session in the EMBOA-Gaze dataset was labeled by two different annotators, and Cohen's Kappa analysis was used to measure the agreement between their labels; the resulting Kappa score of 0.695 indicates substantial agreement.

The AGD system consists of four main components: a Head Detection Module, an Adaptive Region Detection Module, a Customized Spatio-Temporal Gaze Architecture (C-STGA), and a Region Class Assignment Module. The Head Detection Module is trained with YOLOv8 to detect the head with all of its features, including hair. The Adaptive Region Detection Module is also trained with YOLOv8 so that it can adapt to the mobility of the actors that correspond to the regions. The C-STGA module is a fine-tuned version of the Spatio-Temporal Gaze Architecture (STGA) model with modified layers, and the Region Class Assignment Module assigns each detected gaze point to the region it belongs to.

To achieve higher success on the EMBOA-Gaze dataset, training was first performed on the ChildPlay-R dataset. In this way, the GazeFollow and Attention Target Detection weights, originally trained on adults' gaze data, were further trained on ChildPlay-R, a child gaze dataset, to enable correct detection of the gaze of children with autism. To bring the C-STGA module to its optimum state, the effect of initializing the models with three different weight configurations (GazeFollow, Attention Target Detection, and ChildPlay-R) was evaluated.
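The abstract does not give implementation details for the Region Class Assignment Module. The sketch below shows one minimal way such an assignment could work, assuming the Adaptive Region Detection Module returns axis-aligned bounding boxes per frame; the function name, box format, and fallback rule are illustrative assumptions, not the thesis code.

# Minimal sketch of a region class assignment step, assuming the
# Adaptive Region Detection Module yields axis-aligned boxes per frame.
# All names and the fallback rule are illustrative, not the thesis code.

from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def assign_region(gaze_point: Tuple[float, float],
                  region_boxes: Dict[str, Box]) -> str:
    """Map a 2D gaze target point to "Robot", "Therapist", or "Other"."""
    x, y = gaze_point
    for label, (x1, y1, x2, y2) in region_boxes.items():
        if x1 <= x <= x2 and y1 <= y <= y2:
            return label
    return "Other"  # point falls outside every detected region

# Example: a gaze point landing inside the robot's bounding box.
boxes = {"Robot": (100.0, 80.0, 220.0, 260.0),
         "Therapist": (400.0, 60.0, 640.0, 420.0)}
print(assign_region((150.0, 120.0), boxes))  # -> "Robot"

Because the region boxes are re-detected every frame rather than fixed, an assignment of this kind stays valid even when the robot, therapist, and child move around the room, which matches the stated motivation for making the region detector adaptive.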
The models were evaluated on the ChildPlay-R and EMBOA-Gaze datasets, with both direct testing and testing after training, and the C-STGA and STGA models were compared. The evaluation showed that the optimum result is achieved by initializing the C-STGA model with the ChildPlay-R weights. The analysis revealed significant improvements in Area Under the Curve (AUC) values for the C-STGA model compared to the STGA model: the C-STGA model reached an AUC of 77% in detecting children's 2D gaze targets, while the success rates in region detection were 82% for the robot region, 90% for the therapist region, and 76% for other areas. Combined performance measurements for detecting 2D gaze target points under the different configurations are summarized in tables and illustrated in graphs, providing a comparative analysis for each configuration.
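The abstract does not state how the AUC was computed. In gaze-following work descended from GazeFollow, AUC is commonly obtained by scoring the predicted gaze heatmap against a ground-truth map derived from the annotated target point; the sketch below illustrates that convention under the assumption of a single positive pixel. The heatmap size and function name are hypothetical.

# Hedged sketch of an AUC metric as commonly computed in
# gaze-following work: score a predicted gaze heatmap against a
# binary ground-truth map built from the annotated target point.
# The single-positive-pixel binarization here is an assumption.

import numpy as np
from sklearn.metrics import roc_auc_score

def gaze_auc(pred_heatmap: np.ndarray,
             target_xy: Tuple[int, int]) -> float:
    """AUC of the predicted heatmap, treating the annotated 2D gaze
    target pixel as the single positive and all others as negatives."""
    gt = np.zeros_like(pred_heatmap, dtype=np.uint8)
    x, y = target_xy
    gt[y, x] = 1  # one positive cell at the labeled gaze point
    return roc_auc_score(gt.ravel(), pred_heatmap.ravel())

# Example with a random heatmap on a 64x64 output grid.
rng = np.random.default_rng(0)
heatmap = rng.random((64, 64))
print(f"AUC: {gaze_auc(heatmap, (30, 20)):.3f}")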
Description
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
Keywords
gaze, deep learning, social robot, human-robot interaction