LEE - Computer Engineering - Master's Degree
Lightweight facial expression recognition systems for social robots (Graduate School, 2024-07-01)
Biçer, Erhan ; Köse, Hatice ; 504211512 ; Computer Engineering

The motivation of this study is to develop resource-efficient, lightweight facial emotion recognition (FER) frameworks for resource-limited devices such as social robots. These robots can serve as intermediary agents that support children's health and educational activities, and introducing affect-aware capabilities makes their interactions more natural and effective. Given the limited computational resources of social robots, real-time performance can only be achieved with efficient FER models.

FER frameworks generally follow a two-step approach: (1) apply pre-processing to the data; (2) feed the pre-processed data to the FER model to obtain the emotion output. Before exploring lightweight FER solutions, efficiency in pre-processing is investigated; in particular, the effect of the face masking process is examined. The pre-processing steps in the face masking experiments are:

1. augmenting the data (rotation, horizontal flipping, and shifting);
2. detecting the face;
3. cropping the face using the coordinates of the detected face;
4. aligning the face using facial landmarks;
5. applying an illumination correction technique;
6. applying a face mask using the detected landmarks.

In this study, steps 4 through 6 of this flow are collectively referred to as "masking" (the masking step itself is sketched in code below).

To assess whether masking provides a significant improvement for FER systems, VGG-Face is trained and evaluated with 15 different hyperparameter combinations under different lighting correction techniques, on both the masked and unmasked versions of the CAFE dataset; action unit detection is also performed, and 3-fold cross-validation is carried out. The Shapiro-Wilk test is applied to the masked and unmasked results of each illumination condition group. Since normality is not achieved in the paired results, a non-parametric test is used to determine whether there is a significant difference (see the SciPy sketch below). The Kruskal-Wallis H test results indicate that, under all illumination conditions, the null hypothesis that the population medians of the masked and unmasked data are equal cannot be rejected, as the p-values exceed the 0.05 significance level. There is therefore insufficient evidence to conclude that the masked version of the dataset causes a significant difference in accuracy. Based on these results, face masking is omitted from future real-time robotic applications, and masking is likewise not used in the other experiments within this thesis.

After exploring the efficiency of the pre-processing steps, efficient solutions for FER itself are explored. To this end, two main approaches are used: (1) model pruning and (2) knowledge distillation. The motivation for pruning is to remove unnecessary weights by setting their values to zero (sparsification). Knowledge distillation, on the other hand, aims to transfer knowledge from a complex network to a comparably lighter one. The effect of pruning is explored on the CAFE dataset for both emotion and action unit detection. Throughout model training, after each step, the model weights are ordered by magnitude and the weights with the lowest magnitudes are set to zero to meet the sparsity requirement. In the pruning experiments, the initial sparsity is set to 0.5 and the desired final sparsity to 0.8 (this schedule is expressed in code in the pruning sketch below).
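As an illustration of the masking step (steps 4-6 above), the following is a minimal sketch of a landmark-based face mask. The thesis abstract does not name a landmark detector or the exact mask shape, so the convex-hull approach and the function names here are assumptions:

```python
import cv2
import numpy as np

def apply_face_mask(face_img: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Zero out everything outside the convex hull of the facial landmarks.

    face_img:  aligned face crop (H x W x C), after illumination correction
    landmarks: N x 2 array of (x, y) landmark coordinates, detector-agnostic
    """
    mask = np.zeros(face_img.shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    return cv2.bitwise_and(face_img, face_img, mask=mask)
```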
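The statistical procedure used to compare the masked and unmasked results can be sketched with SciPy as follows; the accuracy arrays are random placeholders standing in for the real per-configuration results from the 15 hyperparameter combinations:

```python
import numpy as np
from scipy.stats import shapiro, kruskal

rng = np.random.default_rng(0)
# Placeholder accuracies for one illumination condition group; the real
# values come from the 15 hyperparameter combinations described above.
masked_acc = rng.uniform(0.80, 0.92, size=15)
unmasked_acc = rng.uniform(0.80, 0.92, size=15)

# Normality check per group (the thesis reports normality was not achieved).
for name, acc in (("masked", masked_acc), ("unmasked", unmasked_acc)):
    stat, p = shapiro(acc)
    print(f"Shapiro-Wilk {name}: W={stat:.3f}, p={p:.3f}")

# Non-parametric comparison of the two groups' medians.
h_stat, p_value = kruskal(masked_acc, unmasked_acc)
print(f"Kruskal-Wallis H={h_stat:.3f}, p={p_value:.3f}")
if p_value > 0.05:
    print("Cannot reject equal medians: no significant effect of masking.")
```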
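The magnitude-pruning schedule (initial sparsity 0.5, final sparsity 0.8, with the magnitude mask re-applied during training) can be expressed as follows, assuming the TensorFlow Model Optimization toolkit; the thesis does not name the library it used, and the tiny stand-in backbone and step counts are placeholders:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Polynomial sparsity schedule ramping from 50% to 80% zeroed weights.
# begin_step/end_step are placeholders; they depend on dataset size and epochs.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.5,
    final_sparsity=0.8,
    begin_step=0,
    end_step=1000,
)

# Tiny stand-in for the VGG-Face backbone with a 7-class emotion head.
base_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(7, activation="softmax"),
])

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=schedule
)
pruned_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # 0.00001, as in the thesis
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# UpdatePruningStep re-applies the magnitude mask after every training step.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(train_ds, epochs=..., callbacks=callbacks)

# strip_pruning removes the pruning wrappers, leaving only the sparse weights.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```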
VGG-Face is used for the pruning experiments, with 3-fold cross-validation. The model is fine-tuned with learning rates of 0.00001 and 0.000001. Pruning yields accuracies of 0.8913 and 0.8090 for learning rates of 0.00001 and 0.000001, respectively; without pruning, the corresponding accuracies are 0.8831 and 0.8472. These results show that with a learning rate of 0.00001, pruning does not negatively affect model performance, whereas with the smaller learning rate (0.000001) accuracy decreases. Hence, the generalization advantage that pruning brings can be exploited with a well-balanced learning rate, so 0.00001 is used in the action unit (AU) pruning experiments. For AU detection, accuracies of 0.5786 and 0.5608 are achieved with and without pruning, respectively. The resulting AU model is tested on the hearing-impaired children (HIC) dataset; overall, AU detection performance is similar with and without pruning. With pruning, the resulting AU model reaches 71.91% sparsity in the pruned layers and 35.16% sparsity in the overall model. The storage size of the model is reduced from 118.3 MB to 59.8 MB in Keras format, and by quantizing the weights and converting to a .tflite network, the model can be reduced to about 15 MB. With these outcomes, the model can be fine-tuned on HIC to improve performance.

In addition to weight pruning, knowledge distillation is applied. In knowledge distillation, the knowledge of a complex model is transferred into a small network; the complex network in this training scheme is called the teacher network, while the small network is called the student network. Distillation is achieved through a distillation loss that pushes the student model to produce outputs similar to the teacher's, computed as the Kullback-Leibler divergence between the softened softmax outputs of the two networks. In addition, a focal loss is introduced on top of the standard cross-entropy loss to force the model to account for label imbalance when updating weights; a sketch of this combined objective is given below.

To develop a robust teacher network for FER, a weight-freezing experiment is first performed, which reveals that the last convolutional block should remain trainable to get the most out of transfer learning in terms of accuracy. This model reaches 61% test accuracy on AffectNet. The model, called Affect-FER, forms the basis of the teacher networks in the knowledge distillation experiments. A pruned-teacher scenario is also tested, to reveal whether a pruned teacher network can improve the accuracy of the student network. To this end, model pruning experiments are performed on AffectNet; the most balanced result, 59.37% test accuracy, is achieved with a learning rate of 0.00001. For the pruned-teacher experiments, this pruned model, named Affect-FER-P, is used as the pre-trained basis of the teacher network.

The rate of compressibility is also compared across models. The hypothesis that a pruned model can be compressed more effectively than an unpruned one is tested using gzip compression. The gzipped sizes of the unpruned models are 82.59 MB and 82.60 MB for learning rates of 0.00001 and 0.0001, respectively, while the gzipped sizes of the pruned models are 36.96 MB and 40.99 MB. The findings indicate that model pruning enhances the compressibility of the model; a small sketch of this check follows the distillation example below.
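The combined distillation objective described above can be sketched as follows. The focal gamma and the convention of weighting the hard-label term by alpha (rather than the soft term) are assumptions, while alpha = 0.3 and temperature = 3 mirror the best-performing setting reported later:

```python
import tensorflow as tf

def focal_loss(y_true, y_pred, gamma=2.0):
    """Focal modulation of categorical cross-entropy to down-weight easy
    examples; gamma=2.0 is a common default, not a value from the thesis."""
    ce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    p_t = tf.reduce_sum(y_true * y_pred, axis=-1)  # probability of the true class
    return tf.pow(1.0 - p_t, gamma) * ce

def distillation_loss(y_true, teacher_logits, student_logits,
                      alpha=0.3, temperature=3.0):
    """Blend of hard-label focal loss and KL divergence between the
    temperature-softened teacher and student distributions."""
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    kl = tf.keras.losses.KLDivergence()(soft_teacher, soft_student)
    hard = focal_loss(y_true, tf.nn.softmax(student_logits))
    # temperature**2 rescales the soft-target gradients (Hinton et al., 2015).
    return alpha * tf.reduce_mean(hard) + (1.0 - alpha) * (temperature ** 2) * kl
```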
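The compressibility check itself is straightforward; a minimal sketch, with hypothetical file names, is:

```python
import gzip
import os
import shutil

def gzipped_size_mb(model_path: str) -> float:
    """Gzip a saved model file and report the compressed size in MB."""
    gz_path = model_path + ".gz"
    with open(model_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(gz_path) / (1024 ** 2)

# Pruned weights compress far better because long runs of zeros are highly
# redundant (36.96 MB vs 82.59 MB in the thesis). File names are hypothetical:
# print(gzipped_size_mb("affect_fer.h5"), gzipped_size_mb("affect_fer_pruned.h5"))
```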
For the lightweight student model, two architectures, LITEFER-V1 and LITEFER-V2, are developed. LITEFER-V1 is a shallow CNN with two convolutional blocks (4x4x32 and 3x3x16 filter shapes, respectively), followed by a single max pooling layer and two fully connected layers with 16 and 7 neurons. LITEFER-V2 instead uses depthwise separable convolutions: in place of a regular convolution layer, the convolution operation is factorized into a depthwise convolution followed by a pointwise convolution. LITEFER-V2 consists of four depthwise separable convolution blocks (7x7x32, 9x9x64, 3x3x32, and 5x5x64, respectively), followed by a single max pooling layer and two fully connected layers with 16 and 7 neurons; a sketch of this architecture is given below.

Teacher models for adult and child FER are obtained by fine-tuning the Affect-FER model on CK+ and CAFE, respectively; for the pruned-teacher scenario, Affect-FER-P is fine-tuned on CK+. LITEFER-V2 outperforms LITEFER-V1 on the CK+ dataset for all alpha and temperature hyperparameters (alpha: 0.3/0.4, temperature: 3/10). The best LITEFER-V1 result is 82.49%, while LITEFER-V2 achieves 89.69% accuracy, with its best performance at alpha 0.3 and temperature 3. The student network trained in the pruned-teacher (Affect-FER-P) scenario does not improve on the performance achieved with LITEFER-V2. On the other hand, with further experiments on the fully connected layers, the accuracy obtained with LITEFER reaches up to 90.53%. For the CAFE experiments, the student model from the most balanced fold of the CK+ experiments, with the best-performing hyperparameters, is selected and trained with Affect-FER as the teacher model. The best-performing models achieve 79.43% accuracy and 77.27% F1 score on average. With these knowledge distillation experiments, knowledge transfer from a teacher to a student smaller than 1 MB (445.24 KB) is achieved.

The inference speed of the proposed model ("LITEFER") is measured and compared with the complex teacher model in both the standard Keras (.h5) format and the TensorRT format. This speed analysis is performed on a notebook with an RTX 3060 Laptop GPU and an i7-12700H CPU; using a single image, the average inference time over 100 inferences is reported (see the benchmarking sketch below). In Keras format, LITEFER has a latency of 37 ms on GPU and 33 ms on CPU, so its CPU performance actually exceeds its GPU performance. LITEFER also surpasses the teacher network's speed in Keras format: it is nearly 5.7 frames per second faster on GPU and nearly 19.7 frames per second faster on CPU. With TensorRT, LITEFER achieves 5.8 ms latency versus the teacher's 6.4 ms, and the difference becomes more apparent in throughput: LITEFER reaches 173.82 FPS while the teacher network reaches 156 FPS. Batch prediction performance is also analyzed: with batches of 128 images, LITEFER achieves 10 ms latency and 3213 FPS throughput, outperforming the teacher network's 106.2 ms latency and 301 FPS throughput. Thus, predictions in TensorRT format are faster than in the standard Keras format, and the difference in runtime efficiency between the lightweight LITEFER model and the relatively complex VGG-Face-based teacher model is clearly observed.
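Based on the description above, LITEFER-V2 can be sketched in Keras as follows. The abstract specifies only the filter shapes and layer counts, so the input resolution, activations, padding, and the exact contents of each "block" are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of LITEFER-V2: four depthwise separable convolution blocks
# (7x7x32, 9x9x64, 3x3x32, 5x5x64), one max pooling layer, and two
# fully connected layers with 16 and 7 units.
litefer_v2 = tf.keras.Sequential([
    layers.Input(shape=(48, 48, 1)),  # assumed grayscale face-crop size
    layers.SeparableConv2D(32, 7, padding="same", activation="relu"),
    layers.SeparableConv2D(64, 9, padding="same", activation="relu"),
    layers.SeparableConv2D(32, 3, padding="same", activation="relu"),
    layers.SeparableConv2D(64, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(7, activation="softmax"),  # 7 emotion classes
])
litefer_v2.summary()
```

Each SeparableConv2D layer performs exactly the depthwise-then-pointwise factorization described above, which is what keeps the parameter count (and the resulting file size) small.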
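The single-image benchmarking protocol (average of 100 inferences, after warm-up) can be sketched as follows for the Keras format; the warm-up call and input shape are assumptions, and TensorRT timing would use its own runtime instead:

```python
import time
import numpy as np
import tensorflow as tf

def mean_latency_ms(model: tf.keras.Model, input_shape, n_runs: int = 100) -> float:
    """Average single-image inference latency over n_runs, in milliseconds."""
    image = np.random.rand(1, *input_shape).astype(np.float32)
    model.predict(image, verbose=0)  # warm-up: graph build / memory allocation
    start = time.perf_counter()
    for _ in range(n_runs):
        model.predict(image, verbose=0)
    return (time.perf_counter() - start) / n_runs * 1000.0

# e.g. mean_latency_ms(litefer_v2, (48, 48, 1)), using the shapes assumed above
```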