Efficient deep learning approaches for signal and image analysis applications

Date
2024-10-10
Authors
Koyun, Onur Can
Publisher
Graduate School
Abstract
This thesis develops parameter-efficient, computationally efficient, and data-efficient deep learning models for resource-constrained environments such as mobile devices and edge computing, as well as for tasks that otherwise require significant computational resources. It presents novel architectures and methods that improve efficiency while maintaining high performance across several domains: image classification, segmentation, small object detection, video object detection, and Raman spectroscopy analysis.

Chapter 1 introduces the growing demand for efficient deep learning models, particularly in real-world applications where computational resources are limited. The chapter provides a detailed literature review on deep learning efficiency, focusing on model compression techniques, efficient architectures, and approaches that reduce the computational complexity of deep neural networks. It also discusses the challenges of applying deep learning models to tasks such as video object detection and spectral analysis.

Chapter 2 presents HaLViT, a novel architecture aimed at reducing the parameter count of Vision Transformers (ViTs). By leveraging the row and column spaces of weight matrices, HaLViT achieves up to a 50% reduction in parameters without significant loss of accuracy. The method is validated on the ImageNet-1K dataset and further evaluated through transfer learning experiments on object detection using the COCO dataset. HaLViT performs competitively with conventional transformer-based models while offering substantial computational savings. The chapter also explores low-rank residual weights as a way to improve the parameter efficiency of deep neural networks, showing that this technique allows significant parameter reduction without sacrificing model expressiveness.
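The parameter saving from sharing a weight matrix and adding low-rank residuals can be illustrated with simple counting. The sketch below is not the thesis's formulation: it assumes, purely for illustration, that several layers reuse one full matrix and each adds its own residual of the form `A_i @ B_i`. The function names and the chosen sizes (768, rank 16) are hypothetical.

```python
def dense_params(d_in, d_out, n_layers):
    """Parameters when every layer owns a full d_in x d_out weight matrix."""
    return n_layers * d_in * d_out


def low_rank_residual_params(d_in, d_out, n_layers, rank):
    """Parameters for one shared full matrix plus a low-rank residual per layer.

    Assumed form (hypothetical): W_i = W_shared + A_i @ B_i,
    with A_i of shape (d_in, rank) and B_i of shape (rank, d_out).
    """
    shared = d_in * d_out
    residual_per_layer = rank * (d_in + d_out)
    return shared + n_layers * residual_per_layer


if __name__ == "__main__":
    # Two 768 x 768 layers, as an arbitrary example size.
    full = dense_params(768, 768, 2)
    reduced = low_rank_residual_params(768, 768, 2, rank=16)
    print(f"dense: {full}, shared + low-rank: {reduced}, ratio: {reduced / full:.3f}")
```

With two layers sharing one matrix, the count approaches half of the dense total as the rank shrinks, which is the regime the abstract's "up to 50%" figure suggests.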
Chapter 3 addresses small object detection in aerial images, which is crucial for applications such as surveillance, precision agriculture, and urban management. The task is difficult because of scale variation, occlusion, dense object distributions, and class imbalance. The chapter introduces Focus&Detect, a two-stage framework that uses a Gaussian Mixture Model (GMM) to identify "focal regions" where small objects are densely clustered. Concentrating the detection network's resources on these regions reduces the computational load and improves detection accuracy. The Incomplete Box Suppression (IBS) technique is introduced to handle overlapping bounding boxes in focal regions, further improving performance. The proposed framework outperforms existing methods on the VisDrone and UAVDT datasets in detecting small objects.

Chapter 4 proposes RamanFormer, a transformer-based model designed for the analysis of Raman spectroscopy data. Raman spectroscopy is widely used for material identification and mixture analysis, but accurately quantifying the components of a mixture is often difficult, especially in noisy or low-concentration scenarios. RamanFormer combines self-attention mechanisms, convolutional layers, and global average pooling to process complex spectral data and predict component ratios in chemical mixtures. The model is evaluated on a dataset of binary and ternary chemical mixtures and shows significant improvements over traditional methods such as Least Squares and over deep learning-based models such as ResNet50 and MLP. RamanFormer remains robust in noisy environments and with low-concentration components, which supports its use for real-world spectroscopic analysis in fields such as material science, forensics, and food safety.
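For context on the Least Squares baseline mentioned above, the following is a minimal sketch of how component ratios are commonly estimated with it. It assumes a linear mixing model (the mixture spectrum is a weighted sum of pure-component spectra) and uses tiny synthetic spectra. The function names are hypothetical, no non-negativity constraint is applied, and this is not the evaluation code from the thesis.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares_ratios(pure_spectra, mixture):
    """Fit the mixture as a linear combination of pure spectra and normalize.

    pure_spectra: list of k reference spectra (lists of intensities).
    mixture: measured spectrum on the same wavenumber grid.
    Solves the normal equations (S^T S) c = S^T y, then scales c to sum to 1.
    """
    k = len(pure_spectra)
    gram = [
        [sum(p * q for p, q in zip(pure_spectra[i], pure_spectra[j])) for j in range(k)]
        for i in range(k)
    ]
    rhs = [sum(p * y for p, y in zip(pure_spectra[i], mixture)) for i in range(k)]
    coeffs = solve_linear(gram, rhs)
    total = sum(coeffs)
    return [c / total for c in coeffs]


if __name__ == "__main__":
    # Synthetic toy spectra: a binary mixture of 30% A and 70% B.
    pure_a = [1.0, 0.0, 2.0, 0.0]
    pure_b = [0.0, 3.0, 0.0, 1.0]
    mixture = [0.3 * a + 0.7 * b for a, b in zip(pure_a, pure_b)]
    print(least_squares_ratios([pure_a, pure_b], mixture))
```

A fit like this is exact on noise-free, perfectly linear data. The abstract's point is that noisy or low-concentration spectra are where such a baseline struggles and a learned model can help.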
Chapter 5 focuses on improving the computational efficiency of video object detection by leveraging the Coding Tree Units (CTUs) of the H.265 video compression codec. Video object detection poses additional challenges compared to image-based detection, such as motion blur, occlusion, and changes in camera focus. The chapter introduces SieveNet, a dynamic deep learning model that processes video frames using the CTU structure of the H.265 codec. The model applies more layers to blocks with high-frequency content and fewer layers to low-frequency blocks, thereby reducing the computational load. Experimental results on the ImageNet VID dataset show that SieveNet achieves a mean Average Precision (mAP) of 36.9, comparable to the 38.2 of ResNet-101, at almost half the computational cost measured in floating-point operations (FLOPs). The chapter demonstrates that dynamic layer processing based on the CTU structure can balance detection accuracy and computational efficiency.

Chapter 6 concludes the thesis by summarizing its key contributions and findings. These include parameter-efficient architectures such as HaLViT and computationally efficient models such as SieveNet, which provide scalable and practical solutions for resource-constrained environments. The findings suggest that efficient deep learning models can be applied to small object detection, video object detection, and Raman spectroscopy analysis without compromising performance. The research also highlights the value of compressed-domain features, such as those from the H.265 codec, for improving computational efficiency.
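The idea of spending compute according to the codec's block structure can be illustrated with a toy cost model. The sketch below rests on assumptions not stated in the abstract: that a block's quadtree partition depth serves as a proxy for high-frequency content, that the layer budget grows linearly with that depth, and that every layer costs the same per block. The function names and numbers are hypothetical and do not describe SieveNet's actual routing.

```python
def layers_for_block(partition_depth, max_partition_depth, max_layers, min_layers=1):
    """Map a block's partition depth to a layer budget (assumed linear rule).

    Depth 0 (unsplit, smooth block) gets min_layers; the deepest split gets max_layers.
    """
    if not 0 <= partition_depth <= max_partition_depth:
        raise ValueError("partition_depth out of range")
    extra = (max_layers - min_layers) * partition_depth // max_partition_depth
    return min_layers + extra


def relative_cost(block_layers, max_layers):
    """Fraction of full-depth compute used, assuming equal cost per layer per block."""
    if not block_layers:
        raise ValueError("block_layers must not be empty")
    used = sum(min(n, max_layers) for n in block_layers)
    return used / (max_layers * len(block_layers))


if __name__ == "__main__":
    # Hypothetical partition depths for four blocks of one frame.
    depths = [0, 0, 1, 3]
    budgets = [layers_for_block(d, max_partition_depth=3, max_layers=8, min_layers=2)
               for d in depths]
    print(budgets, relative_cost(budgets, max_layers=8))
```

In this toy frame, most blocks are smooth and receive shallow processing, so the total cost falls well below running every block at full depth.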
Future research directions include further optimizing these models for a wider range of applications, integrating additional compressed domain features, and exploring the application of transformer-based models to other types of spectroscopic data and video understanding tasks. The thesis sets the stage for making deep learning more scalable and accessible, reducing computational overheads while maintaining competitive accuracy and robustness in real-world applications.
Description
Thesis (Ph.D.) -- Istanbul Technical University, Graduate School, 2024
Keywords
Non-linear optimization, Convolutional neural networks, Image processing, Artificial neural networks