Effect of semi-supervised self-data annotation on video object detection performance

Akman, Vefak Murat
Süreli Yayın başlığı
Süreli Yayın ISSN
Cilt Başlığı
Graduate School
Access to annotated data is more crucial than ever when deep learning frameworks replace traditional machine learning methodologies. Even if the method is robust, training performance can be inadequate if the data has poor quality. Some methods were developed to address data-related issues. These methods, however, have a negative impact on algorithm complexity and processing cost. Errors related to human factors, such as misclassification or inaccurate labeling, should also be considered. Multiple steps in the data annotation process cost time and money. These steps can be listed as follows. Data gathering, annotation and formatting according to deep learning model architecture. Unfortunately, these steps are still not fully set to a standard and the whole process comes with a lot of difficulties. In this study, the effect of semi-supervised data annotation on video object detection is analysed by using the Soft Teacher algorithm. Soft Teacher is a Swin-Transformer backboned semi-supervised learning method which has a major advantage on overcoming limited data. Swin Transformer is a type of vision transformer. It creates hierarchical feature maps by merging image patches in deeper layers and has linear computation complexity to input image size. As a such, it can be used as a general-purpose backbone for tasks like classification and object detection. In Soft Teacher, there are two types of models; the Student model and the Teacher model. The Teacher model performs pseudo-labeling on weak augmented unlabeled images and the Student model is trained on both labelled and strong augmented unlabeled images while updating the Teacher model. Soft Teacher model was trained with open-source COCO data set that consists of 80 labels. The data set contains 118287 train, 123403 unlabeled and 5000 validation images, was created by the human. The Soft Teacher was trained with percent of 1, 5, 10 and 100 labelled data respectively. Then, using those trained Soft Teacher models, new data was created from the same raw data and some of the state-of-the-art object detection algorithms were trained with newly annotated data. To compare results, these object detection models were also trained with manual annotated data. The model trained with human data was shown to be less successful than the other in terms of mAPs. However, the model that was trained with self annotated data produced more false positives. Because, the trained model can perform mislabeling when generating new data. In conclusion, the results suggest that semi-supervised data annotation degrades the detection performance in expense of huge amounts of training time savings.
Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2022
Anahtar kelimeler
deep learning, derin öğrenme