Crowd localization and counting via deep flow maps
dc.contributor.advisor | Günsel, Bilge | |
dc.contributor.author | Yousefi, Pedram | |
dc.contributor.authorID | 504211329 | |
dc.contributor.department | Telecommunications Engineering | |
dc.date.accessioned | 2025-01-24T08:22:26Z | |
dc.date.available | 2025-01-24T08:22:26Z | |
dc.date.issued | 2024-06-26 | |
dc.description | Thesis (M.Sc.) -- İstanbul Technical University, Graduate School, 2024 | |
dc.description.abstract | Understanding the location, distribution pattern, and characteristics of crowds, along with the number of objects within a specific space, constitutes a critical subject known as crowd analysis. The analysis and monitoring of people in crowds hold paramount importance, particularly in areas such as security and management, for practical applications such as urban management, city planning, and preventing catastrophes. Over the years, numerous methods have been developed and introduced to address this challenge. Earlier methods relied on detection-based solutions, where each individual had to be detected and then counted, facing challenges such as occlusion which further complicated the process of detecting individual body parts and counting each individual and high processing time. Other methods that were introduced to remedy problems related with detection-based crowd counting relied on regression-based solutions, attempting to map crowd distribution patterns to the crowd count. Regression-based methods faced problems such as occlusion and low performance in highly crowded scenarios. Both approaches could only report the total number of objects or individuals and not their locations or distribution patterns. However, with advancements in the area of deep neural networks, specifically the introduction of convolutional neural networks (CNNs), CNN-based crowd counting methods have emerged. These methods aim to find a relationship between the extracted features from the input image and the ground-truth data, depicted as a color-coded density map. This density map illustrates the distribution pattern and shape of the target objects within the scene. Ground-truth density maps are generated by convolving object center coordinates with a Gaussian kernel, effectively encoding the average object sizes and the distances between the objects. 
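The ground-truth construction described above can be sketched as follows. This is an illustrative example, not the thesis implementation: the fixed `sigma` value is an assumption (geometry-adaptive kernels are also common in crowd counting), and the helper name `density_map` is hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, shape, sigma=4.0):
    """Ground-truth density map: place a unit impulse at each annotated
    object center, then convolve with a Gaussian kernel. The resulting
    map integrates to (approximately) the object count; `sigma` is an
    assumed fixed kernel width."""
    impulses = np.zeros(shape, dtype=np.float64)
    for x, y in points:                      # (col, row) center coordinates
        impulses[int(y), int(x)] += 1.0
    return gaussian_filter(impulses, sigma=sigma, mode="constant")

# Three annotated object centers in a 64x64 frame
pts = [(10, 12), (30, 40), (50, 20)]
dm = density_map(pts, (64, 64))
print(dm.sum())   # close to 3.0, the annotated object count
```

Because convolution with a normalized Gaussian preserves total mass, summing the density map recovers the count while the map itself shows where the objects are concentrated.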
This approach allows not only the counting of objects but also the visualization of their distribution patterns. In recent years, many density-based crowd counting networks have been developed, differing in accuracy and architecture. Most of these networks work with single images in the spatial domain; however, a limited number of density-based networks that operate in the temporal domain on video frames have been introduced. The network used in the present study, CANnet2s, is among the video-based deep neural networks that use density estimation. Aside from extracting features, this network estimates the flow of objects within a pair of video frames at the pixel level, within small image areas called "grids." Displacements of objects into or out of these grids are estimated, producing flow maps (maps of objects moving in a certain direction). This process results in ten flow maps, one for each possible direction. The density maps are then generated by combining these flow maps, and the total crowd count is estimated from the combined maps. CANnet2s was originally developed for counting crowds of people. Therefore, the initial phase of this study investigates the network's performance on people crowds by conducting experiments on datasets such as FDST, ShanghaiTech, and JHU-CROWD++. However, motivated by recent developments and the increased use of autonomous vehicles, the second phase of the study focuses on adapting this network to the domain of vehicle crowd counting and estimation. This phase begins with experiments on the TRANCOS cars dataset, which contains traffic jam images. However, due to limitations in image quality and camera positions in this dataset, the comprehensive WAYMO dataset is employed.
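The combination step can be sketched as below. The channel layout (8 neighbouring directions, the cell itself, and the frame boundary) is an assumption for illustration; CANnet2s defines the exact ten-direction convention, and `density_from_flows` is a hypothetical helper, not the network's code.

```python
import numpy as np

def density_from_flows(flows):
    """Combine directional flow maps into a density map.
    `flows` has shape (10, H, W): for each grid cell, the estimated
    number of objects arriving from each of ten directions (assumed
    here: 8 neighbours, the cell itself, and the frame boundary).
    Summing the incoming flows per cell yields the density; summing
    the density yields the crowd count."""
    return flows.sum(axis=0)

H, W = 4, 4
flows = np.zeros((10, H, W))
flows[4, 1, 1] = 1.0   # one object staying in cell (1, 1)
flows[0, 2, 3] = 1.0   # one object arriving in cell (2, 3) from a neighbour
density = density_from_flows(flows)
print(density.sum())   # 2.0 objects in the frame
```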
This dataset includes high-quality real-life video sequences recorded from the point of view of the vehicle driver, making it ideal for autonomous driving purposes. A subset of the dataset, comprising 140 video segments (approximately 28,000 video frames), is annotated and prepared for training and testing the network: 25 segments are used for training and the remaining segments for testing. Due to the pioneering nature of this study and the scarcity of related work on vehicle counting with the WAYMO dataset, the still-image-based counterpart of CANnet2s, the CANnet network, is also trained and tested for comparative analysis. Throughout this research, CANnet2s consistently demonstrated superior performance. It exhibited a smaller mean absolute error (MAE) of 5.46 compared to CANnet's 7.74, despite being trained for fewer epochs (150 versus CANnet's 500). Additionally, CANnet2s showed a 3 dB increase in peak signal-to-noise ratio (PSNR) over CANnet, yielding density maps with higher levels of detail and enhanced quality. In the second phase of this research, the WAYMO dataset segments are meticulously labeled and categorized based on various scene characteristics, including weather conditions and vehicle crowds. Attribute-based network performance reports are then generated, highlighting the efficacy of CANnet2s, particularly in challenging scenarios. Once again, CANnet2s demonstrated its superiority, reaffirming its effectiveness across diverse conditions and environments. To further boost the performance of CANnet2s, transfer learning is employed: a model pre-trained on the TRANCOS cars dataset serves as the baseline for training CANnet2s on the WAYMO dataset. This approach halved the required training time, achieving the desired network performance after just 35 epochs.
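The two metrics used in the comparison can be sketched as follows; these are the standard definitions, not code from the thesis, and the `peak` handling is an assumption (density-map PSNR conventions vary).

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return float(np.mean(np.abs(pred - gt)))

def psnr(pred_map, gt_map, peak=None):
    """Peak signal-to-noise ratio between density maps, in dB.
    Higher PSNR means the predicted map is closer to the ground truth.
    `peak` defaults to the ground-truth maximum (an assumed convention)."""
    pred, gt = np.asarray(pred_map, float), np.asarray(gt_map, float)
    mse = np.mean((pred - gt) ** 2)
    peak = gt.max() if peak is None else peak
    return float(10.0 * np.log10(peak ** 2 / mse))

print(mae([10, 12, 7], [9, 15, 7]))   # (1 + 3 + 0) / 3, i.e. about 1.33
```

Since PSNR is logarithmic in the mean squared error, the reported 3 dB gain corresponds to roughly halving the MSE between the predicted and ground-truth density maps.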
The outcome was an improvement in MAE, particularly evident in one of the most challenging segments of the WAYMO dataset, depicting a blurry, highly occluded scene, where the MAE decreased by 98 percent and the output density maps closely mirrored the ground-truth data. Furthermore, the study examines the impact of modifications to the CANnet2s architecture and network elements by experimenting with different kernel sizes and investigating the effect of input video frame dimensions on processing time. Adjusting the kernel sizes of the pyramid pooling section of the CANnet2s architecture improved the network's performance on the TRANCOS dataset in both learning speed and error rate: the required training time dropped from 90 epochs to 10 while the MAE fell from 2.4 to 2.1, making CANnet2s's performance on the TRANCOS dataset the second best in the benchmark table. The study also explores the feasibility of multi-object crowd estimation, focusing on simultaneously detecting and counting both vehicles and people in video frames, which is crucial for identifying these objects as the main obstacles from the driver's viewpoint. This exploration represents the early stages of research in this area. The results of this study show promising outcomes for practical applications such as a pre-processing step in autonomous vehicles, road and urban transportation management by city authorities, and general crowd estimation. | |
dc.description.degree | M.Sc. | |
dc.identifier.uri | http://hdl.handle.net/11527/26274 | |
dc.language.iso | en_US | |
dc.publisher | Graduate School | |
dc.sdg.type | Goal 3: Good Health and Well-being | |
dc.sdg.type | Goal 17: Partnerships to achieve the Goal | |
dc.subject | Road vehicles | |
dc.subject | Arazi taşıtları | |
dc.subject | Deep learning | |
dc.subject | Derin öğrenme | |
dc.subject | Convolutional neural networks | |
dc.subject | Evrişimli sinir ağları | |
dc.subject | Motion estimation | |
dc.subject | Hareket kestirimi | |
dc.subject | Autonomous vehicles | |
dc.subject | Otonom araçlar | |
dc.subject | Distance estimation | |
dc.subject | Uzaklık kestirimi | |
dc.subject | Unmanned vehicles | |
dc.subject | İnsansız araçlar | |
dc.title | Crowd localization and counting via deep flow maps | |
dc.title.alternative | Derin öğrenme ile çıkarılan hareket haritaları kullanılarak nesne kalabalıklarının tespiti ve sayımı | |
dc.type | Master Thesis |