On real-world face super-resolution and face image synthesis evaluation

Publisher

Graduate School

Abstract

The advancements in deep learning have brought revolutionary changes across various fields, particularly in computer vision and natural language processing. Convolutional neural networks (CNNs) have emerged as the primary deep neural network architecture in computer vision, delivering outstanding performance in many tasks, such as classification and image synthesis. With the introduction of Transformers in natural language processing, researchers have shifted their attention to exploring their potential in computer vision. The Vision Transformer (ViT) stands out as a successful application of Transformers in the vision domain. Furthermore, the scalability of Transformers has led to the development of foundation models with strong generalization capability.

Face recognition, a fundamental problem in biometrics, has made significant progress thanks to deep learning. Despite these advancements, a major challenge remains: the performance gap between controlled environments and real-world applications. This gap arises from a domain mismatch, since real-world images are not captured under the controlled conditions of typical training datasets. Real-world images have often been subjected to degradations such as low resolution, blur, and occlusions, and this quality difference significantly hurts face recognition systems trained on datasets that lack these complexities. Consequently, face recognition approaches have struggled to cope with real-world low-resolution images. Some researchers have addressed this by operating directly on low-resolution images or by bridging the gap between the low-resolution and high-resolution domains in the embedding space. Others have tackled the problem through image synthesis, such as enhancing low-resolution images to high resolution or degrading high-resolution images to low resolution, with Generative Adversarial Network (GAN)-based methods commonly used for both tasks.

Real-world super-resolution addresses the quality and resolution enhancement of low-resolution images captured in real-world scenarios. A crucial bottleneck in this field is the need for large datasets of paired low-resolution and high-resolution images. Collecting such pairs is expensive and time-consuming, since finding a high-resolution counterpart of a real-world low-resolution image is difficult. To address this, some studies generate low-resolution counterparts from high-resolution images using synthetic degradation pipelines that simulate real-world degradations, while others train a generator network to mimic real-world degradations instead of using a hand-crafted pipeline. In the latter methods, the real-world low-resolution images sit idle once the degradation generator is trained. To make better use of them, we propose a two-step method called Residual Consistency that incorporates real-world low-resolution images directly into training, exploiting their valuable domain information. We achieve this by reconstructing the input low-resolution image using two image enhancement variants together. We conducted experiments using two similar studies and their datasets, comparing our method against theirs, and evaluated the results with the Fréchet Inception Distance (FID) alongside visual inspection of the outputs.
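As a rough illustration of the synthetic degradation pipelines mentioned above, the sketch below follows a common blur, downsample, noise, and JPEG recipe from the literature; the function name and all parameter ranges are illustrative assumptions, not the pipelines used in the studies compared here.

```python
import numpy as np
import cv2

def synthetic_degrade(hr, scale=4):
    """Toy degradation pipeline: blur -> downsample -> noise -> JPEG.
    `hr` is an 8-bit image; parameter ranges are illustrative only."""
    # Gaussian blur with a randomly chosen kernel width
    sigma = np.random.uniform(0.5, 3.0)
    img = cv2.GaussianBlur(hr, (0, 0), sigmaX=sigma)
    # Bicubic downsampling to the target low resolution
    h, w = hr.shape[:2]
    img = cv2.resize(img, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    # Additive Gaussian noise with a random standard deviation
    img = img.astype(np.float32) + np.random.normal(0.0, np.random.uniform(1.0, 10.0), img.shape)
    img = np.clip(img, 0, 255).astype(np.uint8)
    # JPEG compression at a random quality factor
    quality = int(np.random.uniform(30, 95))
    _, buf = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)
```

The abstract does not spell out the exact formulation of Residual Consistency; the following is a minimal PyTorch sketch of one plausible reading, in which two enhancement variants and a learned degradation generator (all names here are hypothetical) must jointly reconstruct the real-world low-resolution input, so that the real low-resolution images contribute their domain information to training.

```python
import torch.nn.functional as F

def residual_consistency_loss(x_lr, enhance_a, enhance_b, degrade):
    """Hypothetical loss on a batch of real-world LR images `x_lr`.

    enhance_a, enhance_b -- two enhancement (super-resolution) variants
    degrade              -- learned degradation generator (HR -> LR)
    """
    hr_a, hr_b = enhance_a(x_lr), enhance_b(x_lr)
    # Each enhanced image, degraded back, should reconstruct the input.
    rec = F.l1_loss(degrade(hr_a), x_lr) + F.l1_loss(degrade(hr_b), x_lr)
    # The two enhancement paths should also agree with each other.
    cons = F.l1_loss(hr_a, hr_b)
    return rec + cons
```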
Even though we could not surpass the FID scores of the officially published models, we achieved competitive results. Furthermore, the FID scores of the two low-resolution enhancement models are very similar, supporting the proposed consistency. However, we noticed discrepancies between the numerical metric results and the perceived quality upon visual inspection: the visual quality did not fully correlate with the FID scores. This inconsistency prompted a more in-depth investigation into the limitations of current evaluation metrics for face image synthesis.

Many early assessments of image synthesis relied on human judgment, where individuals tried to differentiate between real and generated images; models were rated by how much they confused the assessors, with greater confusion indicating more realistic outputs. As synthesis methods have grown more popular, human assessment has become hard to scale, creating the need for automatic quantitative evaluation approaches. These approaches often rely on a pre-trained feature extractor, with metrics computed on the features this network extracts. Recently, concerns have been raised about whether the most commonly used feature extraction network is suitable for evaluating face image synthesis, given its training data. Considering this debate and the inconsistencies in our real-world super-resolution results, we shifted our focus to analyzing the behavior of feature extractor networks for face image synthesis evaluation.

We conducted a comprehensive study using various feature extractor networks and metrics, analyzing diverse datasets of real and synthetic images. We investigated the effect of $L_2$ normalization, the models' attention during feature extraction, and the distribution of the extracted features. Ideally, a reference human assessment would indicate which networks' features produce results closest to human judgment, for example by comparing orderings; lacking such a reference, we adopted several assumptions about realism to guide our analysis. We found that $L_2$ normalization can alter the preferences of the models and thus affect the assessment, and that different networks attend to distinct regions of the face image during evaluation. Interestingly, the metrics favored synthetic datasets over real ones, a trend we also observed in the distribution analysis of the extracted features. These findings indicate that existing metrics built on pre-trained feature extractors may not be entirely suitable for reflecting image realism in the context of facial images, as in the super-resolution study above.

In summary, this thesis addressed the challenges of working with real-world low-resolution face images. We proposed the Residual Consistency method, which incorporates real-world low-resolution images into the training process to better exploit low-resolution domain information for super-resolution. We also discovered discrepancies between the evaluation metrics used and the perceived image quality, which led us to a comprehensive investigation of feature extractor networks and metrics that revealed limitations in automatic face image synthesis assessment.
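For reference, FID fits a Gaussian to the features of each image set and measures the Fréchet distance between them:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right),$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the features extracted from the real and generated images. A minimal NumPy/SciPy sketch follows, with an optional $L_2$ normalization switch of the kind examined above; the function name and the `l2_normalize` flag are ours, not the thesis's.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen, l2_normalize=False):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets.
    `l2_normalize` projects features onto the unit sphere first; such
    preprocessing can change which model a metric prefers."""
    if l2_normalize:
        feats_real = feats_real / np.linalg.norm(feats_real, axis=1, keepdims=True)
        feats_gen = feats_gen / np.linalg.norm(feats_gen, axis=1, keepdims=True)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g).real  # drop numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```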
In light of these findings, future work will concentrate on developing a more functional and explainable method for evaluating face image synthesis, covering both unconditional (random) generation and conditional tasks such as super-resolution. This effort will be supported by a broader study of feature extractor behavior, expanding the set of feature extractor networks, evaluation metrics, and result analysis methodologies. Additionally, we will work on improving the quality of the generated high-resolution images while preserving the input identity's features in the Residual Consistency method.

Description

Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024

Subject

human-computer interaction, image processing
