DA4HI: A deep learning framework for facial emotion recognition in affective systems for children with hearing impairments (Graduate School, 2023-11-23) Gürpınar, Cemal; Köse, Hatice; Arıca, Nafiz; 504122504; Computer Engineering

The study of facial emotions has important implications for fields such as psychology, neuroscience, and computer science, including the recognition and interpretation of facial expressions to improve communication between humans and computers and to help people with particular needs, such as the elderly, children with Autism Spectrum Disorder (ASD), or children with hearing impairment. Facial expressions allow humans to observe one another's emotions with ease. Facial Expression Recognition (FER) is the identification of general emotional states such as happiness, sadness, and anger, whereas an action unit (AU) is the smallest facial movement that can be visually distinguished. AU detection identifies the distinct muscle movements associated with various facial expressions and thus allows a more fine-grained examination of them. FER and AU detection are therefore two interrelated yet separate problems in emotion recognition: FER classifies the overall expression, while AU detection examines the specific AUs that contribute to that expression.

Because children with special needs, including children with hearing impairments, express their emotions differently, their emotional states can be more challenging to interpret than those of adults. Yet psychological studies, child-machine interaction systems, and social robots for children all rely on children's emotions as one of the most important cues for evaluating their cognitive development. Many FER and AU detection systems use models pre-trained on adult data, but few are trained on data from children, whose facial morphology differs from that of adults. A major reason is that children's datasets are scarce: children are a sensitive population, and accessing their data requires additional procedures. Models trained only on images of children may therefore not be sufficiently robust and general.

The motivation of this thesis is to develop and implement a model for recognizing the facial emotions of children with hearing impairment, to be run in real time on Pepper, a socially assistive humanoid robot platform, within the "RoboRehab: Assistive Audiology Rehabilitation Robot" project (TUBITAK 118E214). Spontaneous facial data of children with hearing impairments was gathered in a study in which the children interacted with a Pepper humanoid robot and with a tablet-based game. In both experimental conditions, the children's responses were captured by a video camera positioned in their direct line of sight. Frames showing notable emotional intensity were selected from the videos, and annotators labeled the resulting images as neutral, negative, or positive; 18 distinct action units were also annotated in the same frames.
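The two kinds of annotation correspond to two different learning targets, which is worth making concrete before the experiments that follow. A minimal sketch (the class ordering and the active AU indices are illustrative, not the dataset's actual schema):

```python
import numpy as np

# Hypothetical annotation for one selected video frame.
EMOTIONS = ["negative", "neutral", "positive"]  # mutually exclusive classes (FER)
NUM_AUS = 18                                    # 18 annotated action units (multi-label)

# FER target: a single class index.
emotion_label = EMOTIONS.index("positive")      # -> 2

# AU target: an 18-dimensional multi-hot vector, one bit per action unit.
au_label = np.zeros(NUM_AUS, dtype=np.float32)
au_label[[0, 5, 11]] = 1.0  # e.g. three AUs active (indices are illustrative)
```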
One research question addressed in this thesis is whether emotion recognition works better when facial expressions are detected directly from an image, or when the facial action units in the image are detected first and then used to recognize the emotion. For the experiment on recognizing emotions directly from images, a pre-trained Convolutional Neural Network (CNN) was fine-tuned via transfer learning to improve recognition of hearing-impaired children's facial expressions (a minimal fine-tuning sketch is given after this discussion). Since human faces have different morphological structures across age groups, the contribution of transfer learning from typical adults and from typically developing children to hearing-impaired children was explored. The AffectNet dataset was used for adults and the CAFE dataset for typically developing children. The CAFE dataset was mapped onto three emotion categories, namely positive, negative, and neutral, to align with the categories in the hearing-impaired children's dataset. Because AffectNet contains far more images than the CAFE and hearing-impaired children's datasets, the study also analyzed how training the AffectNet model with the eight basic emotions (angry, disgust, contempt, fear, happy, neutral, sad, and surprise) versus the three emotions (positive, negative, and neutral) affected performance. These experiments showed that fine-tuning a model pre-trained on adult datasets containing the eight basic emotions contributed positively to the facial expression recognition performance for hearing-impaired children.

For the emotion recognition experiment based on AU detection, a contrastive learning-based domain adaptation method using a Siamese network was proposed, to compensate for the limited availability of data on hearing-impaired children and to enhance facial AU detection performance. A related research question was how effective domain adaptation from adults, who have different facial morphology but abundant data, is compared to adaptation from children, who have similar facial morphology but much less data. In Siamese networks, it is important to identify positive and negative pairs. The distinction is unambiguous when labels are mutually exclusive, but it becomes increasingly uncertain when each image carries non-mutually exclusive labels. To benefit from existing methodologies that rely on contrastive loss, limiting assumptions must be imposed. The straightforward rule treats a pair of images as positive if they share identical labels and negative otherwise; for AU-labeled data, this means the same AUs must be detected in both images for the pair to be positive. This rule is overly pessimistic: children's data is difficult to access and therefore scarce, facial AU detection is a challenging problem, and requiring identical AU sets shrinks the small dataset even further. A less strict strategy treats an image pair as positive if the two images share at least one label, but this is also far from ideal (both rules are made precise in the second sketch below).
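As a concrete illustration of the transfer-learning setup described above, the following PyTorch sketch loads a pre-trained backbone, replaces its classification head for the three target classes, and sets up fine-tuning. The choice of torchvision's EfficientNet-B0, the learning rates, and the layer indexing are assumptions for illustration only, not the thesis's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a model pre-trained on a large source domain (ImageNet here;
# the thesis additionally trains on AffectNet/CAFE before the HIC data).
model = models.efficientnet_b0(
    weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1
)

# Replace the classification head for the 3 target classes
# (positive, negative, neutral).
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, 3)

# Fine-tune with a lower learning rate on the pre-trained backbone than on
# the new head (illustrative values, not the thesis's hyperparameters).
optimizer = torch.optim.Adam([
    {"params": model.features.parameters(), "lr": 1e-5},
    {"params": model.classifier.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()
```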
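The two baseline pair-labeling rules discussed above can be stated precisely over multi-hot AU vectors. A minimal sketch (function names are hypothetical):

```python
import numpy as np

def strict_positive(au_a: np.ndarray, au_b: np.ndarray) -> bool:
    """Strict rule: positive only if both images have identical AU sets."""
    return bool(np.array_equal(au_a, au_b))

def loose_positive(au_a: np.ndarray, au_b: np.ndarray) -> bool:
    """Loose rule: positive if the images share at least one active AU."""
    return bool(np.any(np.logical_and(au_a, au_b)))

# Example: AU sets with a partial overlap (multi-hot over 18 AUs).
a = np.zeros(18); a[[0, 5, 11]] = 1
b = np.zeros(18); b[[5, 11, 14]] = 1
assert not strict_positive(a, b)  # not identical -> negative under the strict rule
assert loose_positive(a, b)       # shared AUs  -> positive under the loose rule
```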
Instead, given that facial AU detection is a multi-label classification task, a novel smoothing parameter, denoted $\beta$, was proposed to modulate the impact of comparable samples on the loss function of the contrastive learning approach (one plausible formulation is sketched below). The findings indicate that using children's data (Child Affective Facial Expressions, CAFE) for domain adaptation yields better performance than using adult data (the Denver Intensity of Spontaneous Facial Action, DISFA). Furthermore, adopting the smoothing parameter $\beta$ noticeably improves recognition success.

In relation to this, an additional question was explored within the thesis: since datasets containing children's facial expressions, especially those of hearing-impaired children, are limited and difficult to access, which approach yields better results, transfer learning or domain adaptation? To address this question, we used the Hearing-Impaired Children (HIC) dataset, which has been meticulously annotated by multiple experts with both AU and emotion labels. In the domain adaptation setting, the Siamese network model was trained on data from typically developing children, namely the CAFE dataset, to identify the action units (AUs) shown by children in the HIC dataset. The top portion of the Siamese network was then used as the AU classifier, and an artificial neural network (ANN) was built on top of it to identify the emotional states of the children in the HIC dataset (a sketch of this second stage follows below). For transfer learning, the EfficientNet-B0 model was first trained on the ImageNet and CAFE datasets and then fine-tuned on the HIC dataset. In this work, the HIC dataset, consisting of positive, negative, and neutral emotions, is used to train the deep learning models. The models trained with transfer learning and with domain adaptation produced comparable results.

The experts who annotated the HIC dataset assigned to the neutral class any emotion that could not be categorized as either positive or negative; surprise, for instance, fits neither category and was therefore placed in the neutral class. In the domain adaptation concept, however, where AUs are detected first and emotions are classified afterwards, the neutral emotion was assumed to exhibit no action units. In the RoboRehab project, we examined the ability to recognize positive and negative emotions in hearing-impaired children, and therefore investigated the performance of detecting only positive and negative emotions with both transfer learning and domain adaptation. On the HIC dataset restricted to positive and negative emotions, the domain adaptation strategy yielded much better results than transfer learning.
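The abstract does not give the exact formulation of the smoothing parameter $\beta$. One plausible reading is sketched below: the graded overlap between two images' AU label vectors softens the hard 0/1 pair target of a standard contrastive loss. All details here (the Jaccard similarity, the margin, and how $\beta$ enters) are assumptions for illustration, not the thesis's actual loss:

```python
import torch
import torch.nn.functional as F

def au_similarity(au_a, au_b, eps=1e-8):
    """Jaccard overlap between two batches of multi-hot AU vectors, in [0, 1]."""
    inter = torch.sum(au_a * au_b, dim=-1)
    union = torch.sum(torch.clamp(au_a + au_b, max=1.0), dim=-1)
    return inter / (union + eps)

def smoothed_contrastive_loss(emb_a, emb_b, au_a, au_b, beta=0.5, margin=1.0):
    """Contrastive loss whose hard 0/1 pair target is softened by beta.

    beta = 0 recovers the strict binary rule (positive only for identical
    AU sets); beta = 1 uses the graded AU overlap as the target. This
    formulation is an illustrative guess, not the thesis's actual loss.
    """
    dist = F.pairwise_distance(emb_a, emb_b)
    hard = (au_a == au_b).all(dim=-1).float()   # strict identical-labels rule
    soft = au_similarity(au_a, au_b)            # graded overlap in [0, 1]
    target = (1.0 - beta) * hard + beta * soft  # beta-smoothed pair target
    pos_term = target * dist.pow(2)
    neg_term = (1.0 - target) * F.relu(margin - dist).pow(2)
    return (pos_term + neg_term).mean()
```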
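For the AU-then-emotion pipeline, the second stage can be illustrated with a small feed-forward network that maps the 18 predicted AU activations to the three emotion classes; under the assumption stated above, an all-zero AU vector corresponds to the neutral class. The hidden size and depth are illustrative choices, not the thesis's actual architecture:

```python
import torch.nn as nn

# Stage 2 of the AU-then-emotion pipeline: a small ANN mapping 18 AU
# activations to 3 emotion classes (negative, neutral, positive).
emotion_head = nn.Sequential(
    nn.Linear(18, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
)
# Input: the AU classifier's outputs (e.g., per-AU sigmoid probabilities);
# per the thesis's assumption, an all-zero AU vector should map to neutral.
```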