LEE - Computer Engineering - Master's Degree

Permanent URI for this collection

http://hdl.handle.net/11527/19207


Effect of semi-supervised self-data annotation on video object detection performance

(Graduate School, 2022-06-22) Akman, Vefak Murat ; Töreyin, Behçet Uğur ; 704191017 ; Computer Sciences

Access to annotated data is more crucial than ever now that deep learning frameworks are replacing traditional machine learning methodologies. Even if the method is robust, training performance can be inadequate if the data is of poor quality. Some methods have been developed to address data-related issues; however, they increase algorithmic complexity and processing cost. Errors related to human factors, such as misclassification or inaccurate labeling, should also be considered. The data annotation process involves multiple steps that cost time and money: data gathering, annotation, and formatting according to the deep learning model architecture. Unfortunately, these steps are still not fully standardized, and the whole process comes with many difficulties. In this study, the effect of semi-supervised data annotation on video object detection is analysed using the Soft Teacher algorithm. Soft Teacher is a semi-supervised learning method with a Swin Transformer backbone, and its major advantage is coping with limited labelled data. Swin Transformer is a type of vision transformer. It creates hierarchical feature maps by merging image patches in deeper layers and has computational complexity that is linear in the input image size. As such, it can be used as a general-purpose backbone for tasks like classification and object detection. Soft Teacher has two models: the Student model and the Teacher model. The Teacher model performs pseudo-labeling on weakly augmented unlabeled images, and the Student model is trained on both labelled and strongly augmented unlabeled images while updating the Teacher model. The Soft Teacher model was trained on the open-source COCO data set, which covers 80 object categories. The data set contains 118,287 training, 123,403 unlabeled and 5,000 validation images, and was annotated by humans. The Soft Teacher was trained with 1%, 5%, 10% and 100% of the labelled data, respectively.
Then, using those trained Soft Teacher models, newly annotated data was created from the same raw data, and some state-of-the-art object detection algorithms were trained with it. For comparison, these object detection models were also trained with manually annotated data. The model trained with human-annotated data was shown to be less successful than the other in terms of mAP. However, the model trained with self-annotated data produced more false positives, because the trained model can mislabel objects when generating new data. In conclusion, the results suggest that semi-supervised data annotation degrades detection performance in exchange for large savings in training time.
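
The teacher-student loop described in this abstract can be illustrated with a minimal sketch. It assumes, as in mean-teacher-style methods, that the Teacher is updated as an exponential moving average (EMA) of the Student's weights and that pseudo-labels are kept only above a confidence threshold. The abstract does not state the update rule, the threshold, or how the labelled subsets were drawn, so the function names, the `momentum` and `score_threshold` values, and the dictionary-based weights below are hypothetical and are not the thesis implementation.

```python
import random


def ema_update(teacher, student, momentum=0.999):
    """Return new teacher weights as an EMA of the student weights.

    Both arguments are plain dicts mapping a parameter name to a float
    (a stand-in for real tensors; assumption for illustration only).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def filter_pseudo_labels(detections, score_threshold=0.9):
    """Keep only teacher detections confident enough to act as pseudo-labels."""
    return [d for d in detections if d["score"] >= score_threshold]


def sample_labelled_subset(image_ids, fraction, seed=0):
    """Pick a reproducible random subset of images to treat as labelled."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not image_ids:
        return []
    k = max(1, round(len(image_ids) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(image_ids, k))
```

Under these assumptions, calling `sample_labelled_subset(ids, 0.01)`, `sample_labelled_subset(ids, 0.05)` and `sample_labelled_subset(ids, 0.10)` would produce splits analogous to the 1%, 5% and 10% labelled settings, with the remaining images treated as unlabeled.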

Fight recognition from still images in the wild

(Graduate School, 2022-06-22) Aktı, Şeymanur ; Ekenel, Hazım Kemal ; 504191539 ; Computer Engineering

Violence in general is a sensitive subject and can have a negative impact on both the people involved and witnesses. Fighting is one of the most common types of violence and can be defined as an act in which individuals intend to harm each other physically. Such situations may not be encountered often in daily life; however, violent content on social media is also a major concern for users. Since violent acts, and fights in particular, are considered anomalous or intriguing by some, people tend to record these scenes and upload them to their social media accounts. Similarly, news agencies also regard them as newsworthy material in some cases. As a result, fighting scenes frequently become available on social media platforms. Some users may be sensitive to this kind of media content, and children, who can be harmed by the aggressive nature of fight scenes, also use social media. These facts make it necessary to detect violent content on social media and to limit its distribution. There are some systems that focus on violence and fight recognition in visual data. However, these works mostly propose methods for other domains, such as movies and surveillance cameras, and the social media case remains unexplored. Furthermore, even though most of the fight scenes shared on social media are video sequences, there is also a non-negligible amount of image data depicting violent fighting. However, no work tackles fight recognition from still images instead of videos. Thus, in this thesis, the problem of fight recognition from still images is investigated. To this end, a novel dataset named Social Media Fight Images (SMFI) was first collected from social media images. The dataset was collected from Twitter and Google Images, and some frames were included from the NTU CCTV-Fights video dataset. The fight samples were chosen from among samples recorded in uncontrolled environments.
In order to crawl a large amount of data, different keywords were used in various languages. The non-fight samples were also chosen from the data crawled from social media in order to keep the domain consistent across classes. The dataset is made publicly available by sharing the links to the images. Several image classification methods were applied to the Social Media Fight Images dataset. First, Convolutional Neural Networks (CNNs) were employed for the task and their performance was assessed. Then, a recent approach, the Vision Transformer (ViT), was used to classify the fight and non-fight images. The comparison showed that the Vision Transformer gives better results on the dataset, achieving higher accuracy with less overfitting. A further experiment investigated the effect of varying dataset size on the performance of the model. This was considered necessary because data shared on social media may be deleted in the future, and it is not always possible to retrieve the whole dataset. So, the model was trained on different partitions of the dataset, and the results showed that, although using more data is better, the model could still give satisfactory performance even in the absence of 60% of the dataset. Following the successful results on fight recognition from still images, another experimental study was conducted on the classification of video-based datasets using a single frame from each sample. The experiment included four video-based fight datasets, and the results showed that three of them could be successfully classified without using any temporal information. This indicates that there might be a dataset bias in these three datasets, where the visual difference between classes is high. Cross-dataset experiments also supported this hypothesis: models trained on these video datasets perform poorly on the other fight recognition datasets.
Nonetheless, the network trained on the proposed SMFI dataset achieved promising accuracy on other datasets as well, showing that the dataset generalizes the fight recognition problem better than the others.
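
Two of the experimental setups in this abstract, training on reduced partitions of the dataset and classifying a video from a single frame, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how partitions were drawn or which frame was taken from each video, so the class-balanced sampling, the choice of the middle frame, and the names `stratified_subset` and `middle_frame_index` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(samples, keep_fraction, seed=0):
    """Keep a fraction of (item_id, label) pairs, sampled separately per class.

    Sampling per class keeps the fight / non-fight ratio roughly unchanged
    when part of the dataset is no longer retrievable.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    by_label = defaultdict(list)
    for item_id, label in samples:
        by_label[label].append(item_id)
    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        ids = by_label[label]
        k = max(1, round(len(ids) * keep_fraction))
        subset.extend((item_id, label) for item_id in rng.sample(ids, k))
    return subset


def middle_frame_index(num_frames):
    """Index of the single frame used to represent a video (assumed: the middle one)."""
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    return num_frames // 2
```

With these helpers, `stratified_subset(samples, 0.4)` would simulate the case where 60% of the dataset is absent, and `middle_frame_index` selects one frame per video so that a still-image classifier can be applied without temporal information.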

Memory-based approaches to problems in probabilistic modeling

(Graduate School, 2022-10-25) Akgül, Abdullah ; Ünal, Gözde ; 504201504 ; Computer Engineering

Deep neural networks are an accepted solution to many problems in deep learning; however, their application to safety-critical areas such as health care is still an active research topic. To be employed in such fields, deep neural networks are expected to fit the in-domain data set, provide calibrated predictions on problematic regions of the target domain, and separate out-of-domain queries. Even though these expectations have been studied extensively, the studies are highly fragmented. Therefore, no existing model meets these requirements simultaneously. Continual Learning (CL) is a framework that aims to learn numerous tasks sequentially. An ideal CL method should adapt to new tasks without forgetting previous ones. However, neural networks suffer from catastrophic forgetting, a performance drop on previously learned tasks caused by a newly learned task. Yet CL is crucial for building intelligent systems capable of adapting to environmental change. Because of this, CL is an active research topic, but research on CL focuses mainly on image classification tasks, and there is limited work on time-series classification tasks. Moreover, there is no work yet on multi-modal dynamics modeling. In this thesis, we employ an external memory to deal with problems in probabilistic modeling. Our solutions to these problems can be summarized as follows: i) Evidential Turing Processes (ETP): We define total calibration for the first time. After investigating two Bayesian paradigms, the Bayesian Model and the Evidential Bayesian Model, we introduce the Complete Bayesian Model (CBM), which unifies the two paradigms. We develop ETP as an instance of CBMs with neural episodic memory. We build a pipeline to evaluate the models' performance for total calibration. We compare our solution, the ETP member of CBMs, with state-of-the-art members of other paradigms, and we also provide an ablation study.
We investigate the models' performance on five real-world data sets, comprising one time-series classification task and four image classification tasks. Furthermore, we test the models on corrupted versions of different data sets. We use four metrics: test error for prediction accuracy, Expected Calibration Error (ECE) for in-domain calibration, Negative Log-Likelihood (NLL) for model fit, and area under the ROC curve for out-of-domain detection. We report that only the ETP excels in all three aspects of total calibration simultaneously. ii) Continual Dynamic Dirichlet Process (CDDP) for Continual Learning of Multi-modal Dynamics: We introduce a new problem, CL of multi-modal dynamics. Since the problem is novel, we create a baseline from existing methods. For this new problem, we introduce a novel solution that employs an external memory to transfer knowledge between tasks. We curate a pipeline for this newly introduced problem, in which new tasks arrive sequentially and each task has a certain number of samples from different modes. Differences in task order may lead to different results in CL setups; therefore, we change the task order for each replication. We also generate synthetic data sets and adapt time-series classification data sets to evaluate the models' performance on this problem. We compare the models' performance using Normalized Mean Squared Error as a measure of prediction accuracy and NLL as a measure of Bayesian model fit that quantifies uncertainty. We show that our approach, CDDP, compares favorably to the established parameter transfer approach in CL of multi-modal dynamical systems. To sum up, in this thesis we show through experiments that an external memory architecture can be used both for calibrating neural networks for use in safety-critical areas and for CL of multi-modal dynamics.
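
Two of the metrics named in this abstract, Expected Calibration Error and Negative Log-Likelihood, can be sketched in a few lines for the classification case. This is a generic illustration rather than the evaluation pipeline of the thesis: the equal-width binning with 10 bins, the half-open bin edges, the clamping constant `eps`, and both function names are assumptions made here.

```python
import math


def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    Bins are equal-width and half-open, [lo, hi); a confidence of exactly
    1.0 is placed in the last bin (assumption for this sketch).
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def negative_log_likelihood(true_class_probs, eps=1e-12):
    """Mean of -log p(true class); probabilities are clamped to avoid log(0)."""
    if not true_class_probs:
        raise ValueError("input must be non-empty")
    return -sum(math.log(max(p, eps)) for p in true_class_probs) / len(true_class_probs)
```

A lower value is better for both metrics. A model that is always 95% confident but correct only 75% of the time would, under this binning, have a calibration error of about 0.2.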


Browsing LEE - Computer Engineering - Master's Degree by Subject "deep learning"
