Cross-domain one-shot object detection by online fine-tuning

dc.contributor.advisor Günsel, Bilge
dc.contributor.author Onur, İrem Beyza
dc.contributor.authorID 504211318
dc.contributor.department Telecommunication Engineering
dc.date.accessioned 2024-12-02T06:52:55Z
dc.date.available 2024-12-02T06:52:55Z
dc.date.issued 2024-06-26
dc.description Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024
dc.description.abstract Object detection aims to identify and locate objects within an image or video frame. Recently, deep learning-based object detectors have made significant contributions to the field, thanks to their capability to process and learn from large volumes of data. However, their detection performance depends heavily on large labeled datasets, which are essential for generalizing effectively to previously unseen (novel) or rare classes. This restriction is addressed by the recent development of few-shot object detection (FSOD) and one-shot object detection (OSOD) techniques, which aim to detect novel classes using a few samples or a single sample of a previously unseen class, enabling rapid adaptation without extensive labeled data. The FSOD and OSOD paradigms, which allow models to adapt to novel classes from a small sample size, follow two main lines: methods based on meta-learning and methods based on transfer learning. Meta-learning approaches aim to learn a generalizable set of parameters on data-abundant base classes through episodic training, so that the base knowledge can be transferred to data-scarce novel classes during inference. Transfer learning-based methods instead fine-tune a model pretrained on extensive datasets on the limited number of new examples, applying techniques such as freezing specific layers or adaptive learning rate scheduling. In both FSOD and OSOD, the model is trained on data-abundant base classes and then fine-tuned either on both the base and novel classes or on the novel classes alone. This methodology makes these techniques well suited to still images, where the task is to recognize objects from limited data.
Although FSOD and OSOD methods enable quick adaptation to novel objects, their performance depends strongly on the training domain, which is typically composed of still images. Due to domain shift, the conventional setup suffers significant performance degradation in one-shot or few-shot object detection under cross-domain evaluation. Moreover, most recent studies remain limited to performance gaps between different image domains rather than between image and video domains. Differing from existing work, this thesis focuses on the significant performance gap observed in cross-domain evaluations from the still-image domain to the video domain, which has been largely overlooked in the literature. In video-domain evaluations, the goal of the detection model is to successfully detect the target object, introduced at the beginning of the video sequence, within the subsequent frames. Within the scope of this thesis, we work with OSOD rather than FSOD models, because object detection in video sequences primarily hinges on the model's ability to adapt to the target object from a single example; OSOD models, designed for exactly this setting, are therefore more suitable than FSOD models, which require several examples to adapt to the target object. In particular, OSOD models aim to classify and localize a target object in an image using a single representation of it known as a query shot. This is achieved through a template-matching algorithm that detects all instances of the query shot's class within the target image.
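As an illustration of this query-shot-conditioned, template-matching formulation, the following minimal sketch shows only the inference contract; the detector interface and method names are hypothetical and do not correspond to a specific model.

```python
# Illustrative OSOD inference contract: one query shot of a novel class in,
# boxes and scores for all instances of that class in the target image out.
def one_shot_detect(detector, query_shot, target_image):
    query_feat = detector.extract_features(query_shot)      # template embedding
    target_feat = detector.extract_features(target_image)   # search-image features
    return detector.match(query_feat, target_feat)          # boxes + matching scores
```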
The paradigm enables the model to adapt to the specific appearance of the target object from a single sample, in contrast to FSOD, where the model adapts to novel classes using a small number of samples. Beyond the scarcity of examples of the target object to be detected throughout the video frames, temporal challenges and motion changes also pose substantial difficulties for OSOD methods. An OSOD model's ability to handle affine motion changes and maintain temporal consistency is crucial for adequate performance on video sequences. Moreover, OSOD demands a higher level of generalization and adaptability, since the model must adapt to a novel class from a single sample alone. The majority of OSOD models are trained and evaluated on the same domain, in which the data distributions and class characteristics are quite similar between the training and evaluation sets. Although recent studies have examined cross-domain evaluation of OSOD models, they have primarily focused on evaluations across different still-image domains rather than between still-image and video domains, and the models remain vulnerable to the severe shifts in data distribution that this domain gap introduces.
In this thesis, we aim to demonstrate and analyze the reasons behind the performance gap that OSOD models experience in cross-domain scenarios. To do this, we evaluate a state-of-the-art (SOTA) OSOD model, BHRL, trained on the MS COCO dataset from the still-image domain, on the VOT-LT 2019 dataset, which presents the challenging context of the video domain. For a fair evaluation, we include only the video frames in which the target object is present. To alleviate the performance degradation, we take BHRL, the SOTA OSOD model utilizing multi-level feature learning, as the baseline and propose three novel OSOD frameworks that integrate an online fine-tuning scheme and a query shot update mechanism into its inference architecture. The proposed frameworks are evaluated with the mAP.50 metric and results are reported per class: classes included in the training phase are treated as base classes, while those not included are treated as novel classes. The proposed OSOD frameworks are summarized below.
OSCDA w/o CDQSS: The initial appearance of the target object is taken as the query shot, and the model is online fine-tuned on this single query shot once, at the beginning of the inference phase. The model then tries to detect all instances of the relevant class within the video frames. Although fine-tuning is a conventional transfer learning approach, its integration with the multi-level feature learning of BHRL improves detection mAP.50 by 14% on all classes; a minimal sketch of this fine-tune-once scheme is given after this summary.
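The fine-tune-once scheme of OSCDA w/o CDQSS can be sketched as follows, under assumed interfaces: the detector object, its matching_loss method, the optimizer choice, and the step count are all hypothetical illustrations rather than the thesis or BHRL implementation.

```python
# Minimal sketch of OSCDA w/o CDQSS (hypothetical interface, illustrative
# hyperparameters): fine-tune once on the initial query shot, then detect.
import torch

def oscda_without_cdqss(detector, query_shot, frames, steps=10, lr=1e-4):
    # Online fine-tuning: adapt the detector to the target's initial appearance.
    optimizer = torch.optim.SGD(detector.parameters(), lr=lr)
    detector.train()
    for _ in range(steps):
        loss = detector.matching_loss(query_shot)  # assumed detection loss on the query shot
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Inference: detect all instances of the query class in each subsequent frame.
    detector.eval()
    with torch.no_grad():
        return [detector(query_shot, frame) for frame in frames]
```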
OSCDA w/ CDQSS: A major limitation of OSCDA w/o CDQSS in video object detection is its vulnerability to rapid appearance changes of the target object, which stems from the risk of overfitting to the query shot representing the target's initial appearance. To overcome this drawback, CDQSS, an adaptive query shot selection module, is integrated into the baseline architecture in addition to fine-tuning. Through CDQSS, the query shot is updated with the model's own detections, selected according to their objectness scores and localization consistency across frames, and the model is continuously fine-tuned on the query shots chosen by CDQSS during the inference phase; a sketch of this selection rule is given after this summary. This enables unsupervised online fine-tuning that copes with the rapid appearance changes of the target object caused by affine motion throughout the video frames; the process is called unsupervised fine-tuning because it relies solely on the model's detections rather than on ground truth. CDQSS provides an additional 6% improvement in mAP.50 on all classes.
SACDA: To take advantage of extra shots without leaving the one-shot detection setting, we propose online fine-tuning BHRL on the initial query shot together with its synthetically generated variations, referred to as augmented shots. Like OSCDA w/o CDQSS, SACDA conducts fine-tuning only once, at the beginning of inference, after which the model tries to detect all instances of the relevant class throughout the subsequent frames. SACDA aims to adapt to quick changes in the target object's appearance, such as flipped or rotated versions, without relying on the continuous fine-tuning of the second framework (OSCDA w/ CDQSS). SACDA improves BHRL's mAP.50 score by 14%, matching the improvement achieved by OSCDA w/o CDQSS; however, it significantly outperforms the previous frameworks on specific sequences such as ballet, group2, and longboard, by 14%, 28%, and 46%, respectively. These sequences share challenges such as extreme and rapid changes in illumination, as well as changes in the scale and rotation of the target objects. Given that SACDA is designed to enhance the OSOD model's robustness to variations in target and scene appearance, these significant gains indicate its effectiveness. The achieved performance improvements demonstrate the proposed methods' effectiveness in tackling the domain shift challenges faced in cross-domain evaluations for video object detection.
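The CDQSS selection rule described above can be sketched as follows. The thresholds, tensor layout, and function name are illustrative assumptions, not the thesis implementation; only the criterion itself, objectness score combined with localization consistency against the previous query shot's box, follows the description.

```python
# Hypothetical sketch of CDQSS query shot selection: a detection replaces the
# current query shot only if its objectness score is high enough and its box
# overlaps (IoU) the previous query location, i.e. it is temporally consistent.
import torch
from torchvision.ops import box_iou

def select_query_shot(boxes, scores, prev_box, score_thr=0.8, iou_thr=0.5):
    # boxes: (N, 4) xyxy tensor; scores: (N,) objectness; prev_box: (4,) tensor.
    if boxes.numel() == 0:
        return None                                       # keep the old query shot
    consistency = box_iou(boxes, prev_box[None])[:, 0]    # IoU with previous location
    valid = (scores >= score_thr) & (consistency >= iou_thr)
    if not valid.any():
        return None                                       # no confident, consistent detection
    return boxes[torch.argmax(scores * valid)]            # best valid detection becomes the shot
```

Similarly, SACDA's augmented shots can be sketched with standard image transforms; the particular set of flips and rotations below is an assumption chosen to match the appearance changes mentioned above, not the exact augmentation set used in the thesis.

```python
# Illustrative generation of SACDA's augmented shots from the single query shot.
import torchvision.transforms.functional as TF

def augmented_shots(query_shot, angles=(90, 180, 270)):
    shots = [query_shot, TF.hflip(query_shot), TF.vflip(query_shot)]   # flips
    shots += [TF.rotate(query_shot, angle) for angle in angles]        # rotations
    return shots   # the detector is fine-tuned once on this whole set
```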
dc.description.degree M.Sc.
dc.identifier.uri http://hdl.handle.net/11527/25689
dc.language.iso en_US
dc.publisher Graduate School
dc.sdg.type Goal 9: Industry, Innovation and Infrastructure
dc.subject Object detection
dc.subject Nesne tespiti
dc.subject Big data
dc.subject Büyük veri
dc.title Cross-domain one-shot object detection by online fine-tuning
dc.title.alternative Çevrimiçi ince-ayar ile tek-örnekli çapraz-alan nesne tespiti
dc.type Master Thesis
Files
Original bundle
Name: 504211318.pdf
Size: 17.74 MB
Format: Adobe Portable Document Format