A video dataset of incidents & video-based incident classification
dc.contributor.advisor | Ekenel, Hazım Kemal | |
dc.contributor.author | Sesver, Duygu | |
dc.contributor.authorID | 765019 | |
dc.contributor.department | Computer Engineering Programme | |
dc.date.accessioned | 2025-04-24T11:34:51Z | |
dc.date.available | 2025-04-24T11:34:51Z | |
dc.date.issued | 2022 | |
dc.description.abstract | Nowadays, the occurrence of natural disasters such as fires, earthquakes, and floods has increased worldwide. Detecting incidents and natural disasters becomes especially important when action is needed, and social media is one of the data sources that allows them to be observed thoroughly and immediately. There are many studies on incident detection in the literature; however, most of them use still-image and text datasets. The number of video-based datasets is limited, and the existing video-based studies cover only a small number of class labels. Motivated by the lack of publicly available video-based incident datasets, a diverse dataset with a large number of classes is collected, which we named the Video Dataset of Incidents (VIDI). The collected dataset has 43 classes, exactly the same as the ones in the previously published Incidents Dataset, and, as in the Incidents Dataset, it includes both positive and negative samples for each class. The dataset contains 8,881 videos in total: 4,534 positive samples and 4,347 negative samples, with approximately 100 videos per class for both the positive and negative sets. Video duration is ten seconds on average. YouTube is used as the source of the videos, and the video clips are collected manually. Positive examples consist of natural disasters, such as landslides, earthquakes, and floods; vehicle accidents, such as truck, motorcycle, and car accidents; and the consequences of these events, such as burned, damaged, and collapsed. Positive samples may carry multiple labels, meaning a video can belong to more than one class. Negative samples, on the other hand, do not contain the incident of that class; they can be instances that a model could easily confuse with it. 
For instance, a negative example of the "car accident" class can be a normal car driving, or a video of a "flooded" incident, since it contains an incident but not a "car accident". Ensuring diversity was a key aim while collecting the videos: for each positive and negative class, videos from different locations are collected. Another approach to increase diversity is to search for videos in various languages, in order to capture the styles of different cultures and to include region-specific events. Queries are made in six languages: Turkish, English, Standard Arabic, French, Spanish, and Simplified Chinese. When these languages are not sufficient, videos are also queried in other languages. After the dataset is collected, various experiments are performed on it, applying recent video and image classification architectures to both the existing image-based and the newly created video-based incident datasets. The first part of the study uses only positive samples; negative samples are included in the second part. One of the motivations was to explore the benefit of using video data instead of images for incident classification. To investigate this, the Vision Transformer (ViT) and TimeSformer architectures are trained on the positive samples of both datasets for incident classification, with top-1 and top-5 accuracies used as evaluation metrics. ViT, which is designed for images, is applied to the Incidents Dataset, and TimeSformer, which is designed for multi-frame data, is applied to the collected video-based dataset. Eight frames are sampled from each video in the collected dataset and used for the TimeSformer multi-frame training. Since neither the datasets nor the architectures are the same in these experiments, this alone would not be a fair comparison between the image and video datasets. 
Therefore, the TimeSformer architecture is also applied to the Incidents Dataset, and ViT is also applied to VIDI, with the input data adapted to the requirements of each architecture: in the video classification architecture, each image from the Incidents Dataset is treated as a single-frame video, while in the image classification architecture, the middle frame of the input video is used as an image. Finally, to show the impact of using multiple frames in incident classification, the TimeSformer architecture is also run with a single frame from the video dataset, applying the same downsampling method and using the middle frame for training. TimeSformer achieved 76.56% accuracy in the multi-frame experiment, compared to 67.37% in the single-frame experiment on the collected dataset. This indicates that using video information, when available, improves incident classification performance. In the experiments, the performance of the state-of-the-art ViT and TimeSformer architectures for incident classification is also compared against the method used in the Incidents Dataset paper, which was based on the ResNet-18 architecture. ViT and TimeSformer achieved higher accuracies than ResNet-18 on the image-based Incidents Dataset: while the ResNet-18-based model achieved 77.3% accuracy, ViT and TimeSformer reached 78.5% and 81.47% top-1 accuracy, respectively. Additionally, the performance of ViT and TimeSformer is compared on the single-frame versions of both datasets. TimeSformer achieved 67.37% and ViT 61.78% on the single-frame version of the video-based dataset, while TimeSformer reached 81.47% and ViT 78.5% on the image dataset. TimeSformer is thus found to be superior to ViT on both datasets. 
However, the results on the collected dataset are lower than those obtained on the Incidents Dataset. There could be two main reasons for this: (1) the image-based dataset contains more examples for training, so the systems can learn better models; (2) the collected dataset contains examples that are more difficult to classify. The second part of the study includes the negative samples. Using both positive and negative samples, binary classification models are trained for all classes; the main idea is to measure how well a model can detect whether or not a given incident occurs in a video. Accordingly, 43 separate models are trained. The best accuracies are achieved for the "landslide" and "dirty contamined" classes, while the lowest accuracy is obtained for the "blocked" class. Finally, one more classification experiment is run on VIDI, in which the negative samples form a 44th class: 100 videos that do not include any incident are selected from the negative samples. With these 44 classes, 72.18% accuracy is achieved. In summary, a highly diverse incident dataset with many classes is presented in this study. For the classification tasks, the performances of recent video and image classification architectures on video and image datasets are compared, and binary classification experiments are conducted for each class. | |
dc.identifier.uri | http://hdl.handle.net/11527/26912 | |
dc.language.iso | en | |
dc.publisher | Graduate School | |
dc.sdg.type | Goal 11: Sustainable Cities and Communities | |
dc.subject | Data sets | |
dc.subject | Classification | |
dc.subject | Image transformations | |
dc.subject | Error analysis | |
dc.title | A video dataset of incidents & video-based incident classification | |
dc.title.alternative | Felaket video veriseti & video-tabanlı felaket sınıflandırması | |
dc.type | Master Thesis |