LEE - Computer Engineering - Master's
Permanent URI for this collection
Recent Submissions
1 - 5 of 83
-
Multimodal medical visual question answering: Knowledge spaces and semantic segmentation for improved and explainable AI (Graduate School, 2025-06-11)

This thesis introduces a comprehensive framework aimed at advancing the performance and interpretability of Medical Visual Question Answering (VQA) systems. These systems are designed to automatically respond to clinically relevant questions based on medical images. Despite notable developments in the field, existing models often suffer from two critical limitations: a lack of structured, domain-specific knowledge for Large Language Models (LLMs) and limited explainability in the generated answers. To address these challenges, this research proposes a novel architecture that integrates multimodal knowledge space pretraining with a semantic segmentation-guided explainability module, offering improvements in both answer accuracy and interpretability.

The proposed work comprises three core contributions. First, a multimodal knowledge space is constructed using data from the Slake, VQA-RAD, and PathVQA datasets. Biomedical entities and their interrelations are extracted using ScispaCy and the Unified Medical Language System (UMLS). These entities are embedded using PubMedBERT for textual features and BiomedCLIP alongside a custom-built volumetric image encoder (GLIMS) for visual features. A new Balanced Multimodal Contrastive (BaMCo) learning strategy is introduced to pretrain this knowledge space. BaMCo simultaneously optimizes a contrastive learning objective and a classification objective, promoting better alignment of semantically similar multimodal features while mitigating the impact of class imbalance, a common issue in medical datasets.

Second, an intra-class image encoder, GLIMS, is proposed. It integrates dilated convolutions, Swin Transformer bottlenecks, and multi-scale attention mechanisms. Since volumetric image encoders are widely used in semantic segmentation tasks, GLIMS is tested on the BraTS2021 and BTCV datasets for glioblastoma and multi-organ segmentation, respectively. In both cases, the model outperforms existing architectures, including Swin UNETR, TransBTS, and nnU-Net, achieving Dice scores of 92.14% and 84.50% on the two datasets, respectively.

Third, a knowledge-enhanced Medical VQA system is developed. This system retrieves relevant entities and intra-class image features, encoded by GLIMS, from the pretrained knowledge space and integrates them into the input prompt for a transformer-based language model, such as LLaMA 3.2 or GPT-2 XL. This enriched prompt allows the language model to generate answers that are more contextually informed and clinically relevant, without requiring extensive in-domain pretraining. The use of intra-class images further enhances the model's ability to generalize across similar cases, providing additional visual grounding during inference.

Finally, an explainability mechanism is introduced through a semantic segmentation task. The MedSAM model is employed to produce segmentation maps that highlight the regions of the input image most relevant to the generated answer. This auxiliary task, guided by a dedicated segmentation token ([SEG]), enables the model to provide spatial explanations alongside textual responses, offering transparency that is vital for clinical acceptance. The proposed methods are validated on three benchmark Medical VQA datasets.
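The abstract does not spell out the BaMCo objective in code; the following PyTorch sketch illustrates, under stated assumptions, how a joint contrastive-plus-class-balanced objective of this kind can be structured. The class name, the inverse-frequency weighting, and the mixing factor `alpha` are illustrative choices, not the thesis' exact formulation.

```python
import torch
import torch.nn.functional as F

class BaMCoStyleLoss(torch.nn.Module):
    """Illustrative joint objective: a CLIP-style contrastive term for
    aligning text and image embeddings plus a class-balanced
    cross-entropy term. A hypothetical reconstruction, not the thesis'
    exact BaMCo formulation."""

    def __init__(self, class_counts, temperature=0.07, alpha=0.5):
        super().__init__()
        # Inverse-frequency weights to counter class imbalance (assumed scheme).
        weights = 1.0 / torch.as_tensor(class_counts, dtype=torch.float)
        self.register_buffer("ce_weights",
                             weights / weights.sum() * len(class_counts))
        self.temperature = temperature
        self.alpha = alpha  # mixing factor between the two terms

    def forward(self, text_emb, image_emb, logits, labels):
        # Contrastive term: matched text/image rows describe the same entity.
        text_emb = F.normalize(text_emb, dim=-1)
        image_emb = F.normalize(image_emb, dim=-1)
        sim = text_emb @ image_emb.t() / self.temperature  # (B, B) similarities
        targets = torch.arange(sim.size(0), device=sim.device)
        contrastive = (F.cross_entropy(sim, targets) +
                       F.cross_entropy(sim.t(), targets)) / 2
        # Class-balanced classification term.
        classification = F.cross_entropy(logits, labels, weight=self.ce_weights)
        return self.alpha * contrastive + (1 - self.alpha) * classification
```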
The model achieves an exact match accuracy of 85.8% on the Slake dataset, outperforming or matching several state-of-the-art models that rely on extensive domain-specific pretraining. On VQA-RAD and PathVQA, the model achieves 76.7% and 60.0% accuracy, respectively. These results demonstrate the effectiveness of the knowledge space and its ability to enhance performance across diverse medical imaging domains. Smaller models, such as LLaMA 1B, also perform competitively, emphasizing the value of the knowledge-driven approach even when computational resources are limited. Furthermore, qualitative assessments indicate that the segmentation maps generated during VQA inference accurately highlight clinically significant regions, enhancing the interpretability of the model's decisions. The quantitative results indicate that when the LLaMA 3B model is trained jointly on the VQA and segmentation tasks, the segmentation branch achieves a Dice score of 62%. Ablation studies show that the BaMCo loss significantly improves both alignment and answer quality. Compared to baseline methods, the proposed system achieves improvements of 3.11% in BLEU-1, 3.65% in ROUGE-1, 0.16% in BERTScore, and 2.16% in overall VQA accuracy. These findings validate the architectural design and the complementary roles of knowledge space pretraining and segmentation-based explanation.

In conclusion, the thesis demonstrates that the integration of structured multimodal knowledge and explainability mechanisms leads to substantial improvements in both the accuracy and transparency of Medical VQA systems. The proposed architecture enables general-purpose language models to perform specialized medical reasoning tasks without heavy reliance on domain-specific pretraining, and the addition of segmentation maps provides visual justifications essential for clinical use. Future work could extend the models with patient clinical reports, laboratory results, and temporally related data. Finally, user studies conducted in collaboration with specialist physicians will reveal the practical impact of the model on clinical users in more detail.
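As a rough illustration of the knowledge-retrieval step described above, the sketch below fetches nearby entities from a pretrained knowledge space and folds them into the language-model prompt. The `knowledge_space.nearest` interface and the entity fields are hypothetical stand-ins for whatever API the actual system exposes.

```python
def build_enriched_prompt(question, image_features, knowledge_space, k=5):
    """Hypothetical sketch of prompt enrichment: retrieve the k entities
    nearest to the image features in the pretrained knowledge space and
    prepend them to the question handed to the language model."""
    entities = knowledge_space.nearest(image_features, top_k=k)  # assumed API
    context = "; ".join(f"{e.name} ({e.semantic_type})" for e in entities)
    return (
        "Relevant biomedical entities (UMLS): " + context + "\n"
        "Question: " + question + "\nAnswer:"
    )
```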
-
Vision-based detection and recognition of maritime objects for autonomous surface navigation (İTÜ Lisansüstü Eğitim Enstitüsü, 2015-06-13)

Unmanned Surface Vehicles (USVs) are robotic systems that can operate in maritime environments without direct human intervention, either autonomously or under remote control through a ground control station. Their use in fields such as environmental monitoring, military surveillance, reconnaissance, and search and rescue has driven a significant transformation in the maritime sector. The critical role these vehicles play in such operations increases the demand for advanced technologies that will further extend their capabilities. Despite their autonomous route planning and maneuvering capabilities, one of the greatest challenges USVs face is the accurate and consistent detection and classification of objects that may form obstacles on the water surface. This requirement is of high importance for mission success, navigational safety, and maritime traffic management.

However, the maritime environment involves numerous environmental variables that make it difficult to achieve high accuracy in object detection and classification. Factors such as changing weather conditions, wave motion, sun glare, and varying light levels directly affect image quality and can reduce the reliability of perception systems. Moreover, the diversity of vessels in terms of size, shape, material composition, and motion patterns makes the classification task even more complex. Although sensors such as radar and LIDAR are widely used in various autonomous systems, they often fail to perform effectively in maritime environments. Radar systems struggle in particular to detect small, low-profile objects and can generate false alarms due to wave-induced reflections. LIDAR, in turn, cannot provide reliable data because of the low reflectivity and variable nature of the water surface, and it also suffers performance losses in range-demanding settings such as the open sea. For these reasons, camera-based perception systems offer a more advantageous alternative for maritime environments. Cameras deliver detailed visual information at high resolution, enabling more precise detection of small objects, human figures, and other potential hazards. In addition, the ability to apply machine learning and deep learning methods to image data increases flexibility and accuracy in object detection and classification.

Camera-based deep learning solutions provide effective answers to these maritime challenges. In particular, fast and accuracy-oriented object detection algorithms such as YOLO (You Only Look Once) stand out for real-time applications. YOLOv5, an optimized version of YOLO, is well suited to small-object detection in dynamic maritime environments thanks to its lightweight structure and its balance of speed and accuracy. Furthermore, the TPH-YOLOv5 model integrates the Transformer Prediction Head (TPH) and the Convolutional Block Attention Module (CBAM) to further improve small-object detection performance, enriching feature representations through attention mechanisms. As a result, small-object detection in low-resolution and cluttered maritime scenes becomes more reliable. Another important development in deep learning is the Vision Transformer (ViT) architecture.
Unlike traditional Convolutional Neural Networks (CNNs), ViTs divide images into small patches and process these patches as tokens, analogous to words in natural language processing. The relationships among these tokens are modeled with the self-attention mechanism, so that long-range dependencies between different regions of the image can be learned. Architectures such as the Data-Efficient Image Transformer (DeiT), developed to improve data efficiency, the computation-efficiency-oriented Swin Transformer, and ConvNeXt, which blends CNN and Transformer approaches, have extended the vision transformer family to a broader range of applications.

In this thesis, a two-stage approach is developed to address object detection and ship classification for USVs in maritime environments. The study uses MODS and MARVEL, two comprehensive datasets covering the object detection and classification tasks, respectively. In the first stage, YOLOv5-based object detection models are trained and evaluated on the MODS dataset. MODS offers rich annotations over stereo camera images, covering both dynamic obstacles (ships, people) and static ones (such as buoys). The labels in the dataset were converted to YOLO format to prepare them for model training. The TPH-YOLOv5 model was selected for object detection and configured specifically to improve small-object detection performance. The TPH (Transformer Prediction Head) design integrates transformer modules into the prediction layers so that objects at different scales can be learned more effectively. In addition, the CBAM (Convolutional Block Attention Module) mechanism steers the network's attention toward visually important regions. In the evaluations, the developed TPH-YOLOv5 model achieved high precision (89.4%), high recall (83%), and mean average precision (86.3%), delivering notable success on small and hard-to-detect objects.

In the second stage, ship classification was carried out on the MARVEL dataset. MARVEL contains a total of 26 different vessel subtypes grouped into five main categories. Imbalances in the dataset were addressed with various data augmentation techniques. In the classification stage, several vision-transformer-based models, including ViT, DeiT, Swin Transformer, and ConvNeXt-v2, were trained and tested, and their accuracies were compared with CNN-based models such as ResNet50 and ResNet101. The results showed that the DeiT model performed best, with an accuracy of 92.87%. However, when FPS (frames per second) values are taken into account, ViT-based models were observed to have lower inference speeds than CNN architectures (for example, ResNet50: 52.66 FPS versus DeiT: 1.31 FPS). This indicates that ViT-based models require optimization before they can be used in real-time applications. In maritime applications, particularly in the environmental awareness and obstacle avoidance tasks of USVs, decisions must be made within milliseconds. It is therefore not enough for the detection and classification models to offer high accuracy; low latency, that is, a high inference rate (FPS), is also a critical requirement. Although transformer-based models provide high accuracy, their inference times were found to be insufficient for real-time applications.
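To make the patch-token idea above concrete, here is a minimal PyTorch sketch of how an image is split into the flattened, non-overlapping patches a ViT treats as tokens; the 224-pixel image and 16-pixel patch sizes are assumed defaults, and the linear projection and position embeddings that follow in a real ViT are omitted.

```python
import torch

def patchify(images, patch_size=16):
    """Split a batch of images (B, C, H, W) into flattened patches,
    i.e. the token sequence a ViT consumes (illustrative sketch)."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # (B, C, H, W) -> (B, C, H/P, W/P, P, P)
    patches = images.unfold(2, patch_size, patch_size) \
                    .unfold(3, patch_size, patch_size)
    # Reorder and flatten each P x P x C patch into one token vector.
    patches = patches.permute(0, 2, 3, 1, 4, 5) \
                     .reshape(b, -1, c * patch_size * patch_size)
    return patches  # each row is one "visual word" / token

tokens = patchify(torch.randn(1, 3, 224, 224))  # -> shape (1, 196, 768)
```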
To overcome this inference-speed problem, both structured and unstructured pruning methods were applied. Pruning experiments on the DeiT model in particular yielded a meaningful reduction in model parameters and an improvement of roughly 11% in inference speed, at the cost of only a small accuracy drop of 0.39%, showing that the performance-speed trade-off was optimized successfully. Nevertheless, it was concluded that the inference speeds of transformer-based models remain lower than those of CNN-based models and that further optimization is required for real-time USV operations.

The thesis addresses not only performance improvement but also the explainability of model decisions. To make the decision processes of vision-transformer-based models transparent, an attention rollout visualization method was developed. By showing which image regions the model focuses on during classification, this method makes the reasons behind correct and incorrect classifications easier to understand. It was found that in correct predictions the model focuses on the target object, whereas in incorrect predictions it often attends to irrelevant areas such as the water surface or the background.

In conclusion, this thesis contributes to the development of highly accurate and interpretable AI-based systems that can operate both safely and effectively in maritime environments for USVs. The findings lay an important foundation for building faster and better-optimized transformer-based systems in the future. Although the developed models achieved high accuracy during training and testing, generalization ability is a critical factor for systems that will operate in real maritime environments. While the MODS and MARVEL datasets used for training cover a range of sea conditions, it is impossible to represent every environmental variable. Extreme scenarios in particular, such as very rough seas, intense sun glare, or low-visibility conditions, may degrade model performance. For full reliability under real-world conditions, field tests and re-evaluation of the models in different maritime environments are clearly necessary.
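The attention rollout visualization mentioned above follows a standard recipe (average attention over heads, add the identity for the residual connection, renormalize, and multiply through the layers). A minimal sketch follows; how the per-layer attention maps are collected from the model is an assumption about the surrounding code.

```python
import torch

def attention_rollout(attentions):
    """Aggregate per-layer self-attention maps into one relevance map.
    attentions: list of (heads, tokens, tokens) tensors, one per layer,
    as returned e.g. by a ViT run with attention outputs enabled."""
    result = torch.eye(attentions[0].size(-1))
    for attn in attentions:
        attn = attn.mean(dim=0)                       # average over heads
        attn = attn + torch.eye(attn.size(-1))        # residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)  # renormalize rows
        result = attn @ result                        # propagate through layers
    # Row 0 (the [CLS] token) shows how strongly each image patch
    # contributes to the final classification decision.
    return result[0, 1:]  # reshape to the patch grid for visualization
```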
-
Secure communication for MUM-T: a blockchain and lightweight cryptography framework (ITU Graduate School, 2025-07-27)

Manned-Unmanned Teaming (MUM-T) systems are becoming increasingly central to modern defense and surveillance operations, enabling the coordinated deployment of manned aerial vehicles (MAVs) and unmanned aerial vehicles (UAVs) in complex, mission-critical scenarios. While these systems offer significant tactical advantages, they also introduce unique challenges in maintaining secure, reliable, and real-time communication. In particular, MUM-T environments demand a delicate balance between high data integrity, low-latency responsiveness, and computational efficiency, requirements that conventional cryptographic and network security architectures often struggle to fulfill, especially on the resource-constrained platforms typical of UAVs.

In response to these challenges, this thesis proposes a novel hybrid communication security framework designed specifically for MUM-T systems. The proposed architecture integrates two complementary technologies: a Proof-of-Authority (PoA) based blockchain to ensure tamper-resistant logging of mission-critical data, and an XOR-based lightweight cryptographic authentication scheme to facilitate high-frequency, low-latency control messaging. By decoupling data assurance and real-time control into two dedicated communication layers, the framework seeks to maximize both operational integrity and responsiveness without overburdening the limited computational resources of UAVs.

The research is anchored in a clear hypothesis: that the combined use of a PoA blockchain and lightweight symmetric encryption can provide a secure and efficient communication backbone suitable for MUM-T applications. To evaluate this hypothesis, the thesis employs both analytical modeling and simulation-based validation across two environments. The first, a custom Python-based simulation, models transaction generation, XOR-based encryption, and PoA block validation with fine-grained timing control. This simulation tracks end-to-end latency, throughput, and computational load under realistic scheduling parameters, revealing that the PoA component maintains transaction latencies well below 750 milliseconds, while the XOR mechanism introduces only microsecond-level processing delay, making both viable for real-world deployment. The second simulation, developed in OMNeT++, models a network-level implementation of the proposed framework, capturing the behavior of MAV and UAV nodes as modular entities within a time-accurate, message-driven environment. This simulation validates system-level coordination, concurrent operation of both communication layers, and the effectiveness of the hierarchical topology under realistic operating constraints. Collected scalar and vector metrics confirm that mission-critical blocks are consistently generated and broadcast by the MAV, while UAVs engage in continuous transaction and control message generation. Time-series data further demonstrate a stable and scalable increase in system activity, reinforcing the framework's responsiveness and throughput capacity. Together, the simulation results confirm that the proposed hybrid framework meets its design objectives. It provides a scalable, efficient, and secure communication model that is well suited to the hierarchical structure and operational demands of MUM-T systems.
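As a toy illustration of the lightweight layer described above, the sketch below applies XOR masking with a pre-shared session key to a control message; applying the same operation twice restores the plaintext, which is what keeps the per-message cost at the microsecond level. The message format and key size are assumptions, and plain XOR masking is only safe while keys stay fresh and secret, so this is illustrative rather than a hardened design.

```python
import os
from itertools import cycle

def xor_mask(message: bytes, key: bytes) -> bytes:
    """XOR each message byte with the repeating session key.
    The operation is its own inverse: masking twice yields the plaintext."""
    return bytes(m ^ k for m, k in zip(message, cycle(key)))

# Illustrative use with a hypothetical pre-shared 16-byte session key.
session_key = os.urandom(16)
control_msg = b"WAYPOINT:41.0082,28.9784;ALT:120"
ciphertext = xor_mask(control_msg, session_key)
assert xor_mask(ciphertext, session_key) == control_msg
```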
By leveraging the strengths of both blockchain and lightweight encryption, the framework supports the integrity and auditability of mission data while ensuring that time-sensitive control messages are delivered with minimal delay. This thesis contributes to the growing field of secure autonomous system communication by offering a practical architecture that bridges the gap between real-time responsiveness and high-assurance data integrity. Future work may involve exploring more advanced lightweight encryption schemes, incorporating adaptive key management, or extending the consensus model to support partially decentralized architectures. Additional simulation scenarios involving adversarial threats or mission disruption could further strengthen the robustness and real-world applicability of the system.
-
Enriching item feature representations in session-based recommendation with global item graphs (ITU Graduate School, 2025-07-01)

With the advancements in artificial intelligence (AI), many industries that rely on human interaction have begun to utilize these powerful AI models. E-commerce stands out as one of the industries where modeling customer behavior is crucial. In e-commerce, it is very important to recommend items that will attract customers' attention and result in their purchase. A user's shopping experience is significantly influenced by how efficiently they can discover items that align with their preferences within a given time period. Different types of shopping behavior can call for different types of recommendation models. Hence, there are several types of recommender systems: collaborative filtering, content-based filtering, and hybrid systems that combine both approaches. Collaborative filtering methods predict users' interests by analyzing preferences from multiple users, operating under the assumption that users who have liked similar things will continue to like similar things in the future. Content-based filtering methods, by contrast, analyze items and item features, then match them to the user's profile to make recommendations. They build a profile for each user based on the features of items the user has interacted with and suggest new items with similar characteristics. In addition, various types of recommender systems have been developed to accommodate different data characteristics and application requirements.

Session-Based Recommender Systems (SBRS) represent one of the most popular subtypes of recommender systems, focusing specifically on a short period of user interactions to generate recommendations. Unlike traditional methods such as collaborative and content-based filtering, which primarily rely on long-term historical user and item data, SBRS are designed to capture the short-term, dynamic user preferences that emerge during a specific session. This approach proves especially beneficial in scenarios where historical user data is lacking or where user preferences shift rapidly. By focusing solely on the sequence and context of user actions in a session, SBRS can provide more relevant recommendations, enhancing the overall user experience. SBRS are particularly prevalent on e-commerce platforms, where users often interact with items over short periods. In today's fast-paced digital environment, user interests can change rapidly, and the relevance of historical data may diminish quickly. For instance, consider a user who logs into an e-commerce application and purchases a mobile phone after viewing several related items. Post-purchase, the user's interest in mobile phones may wane, rendering past interaction data less effective for future recommendations. Traditional recommendation methods, which rely heavily on long-term user profiles, might not adapt swiftly to such shifts in user intent. In contrast, SBRS focus on the user's current session, capturing immediate preferences and providing timely, context-aware recommendations that align with the user's interests. Although SBRS are well suited to modeling short-term user interests, they have their limitations, including the cold-start problem and limited item diversity. The cold-start problem arises when a model lacks adequate historical data, limiting its ability to learn effectively and generate accurate recommendations.
Additionally, limited item diversity often emerges as models tend to favor popular items, resulting in repetitive recommendations and a lack of variety. To deal with these problems, graph-based recommender systems, or SBRS that leverage graph structures, have been introduced. In this study, we introduce a novel approach to enhance SBRS by enriching item representations through graph-based aggregation techniques and storing this information in the item's feature vector. The core idea is to capture and embed item-item relationships directly into the feature vectors of individual items, thereby incorporating contextual information derived from their interactions within the item network. By constructing an item graph where nodes represent items and edges represent relationships, we apply graph aggregation methods to enrich each item's features. This process enables each item to incorporate information coming from its neighborhood. Such enriched representations are particularly beneficial for items with limited interaction data, commonly referred to as "cold items". Integrating these enriched item feature vectors into SBRS aims to improve recommendation accuracy for the tested models, Session-based Recommendation with Graph Neural Networks (SR-GNN) and Long Short-Term Memory (LSTM), especially in sessions involving cold items. By providing a more comprehensive understanding of each item's context within the broader item network, the system can generate more relevant and personalized recommendations, enhancing the overall user experience.

Experiments are conducted on the Dressipi 2022 dataset, originally released for the RecSys 2022 Challenge. Experimental evaluations show that the proposed method improves overall recommendation accuracy for both the SR-GNN and LSTM models. Beyond overall recommendation accuracy, the proposed feature enrichment technique significantly improves the performance of SR-GNN on cold sessions (sessions that contain cold, unpopular items).

Looking ahead, there are several promising directions to further explore the impact of our feature enrichment approach. One area involves applying this method across various recommendation models, such as collaborative filtering, content-based filtering, and hybrid systems. Each model type may uniquely benefit from enriched item representations, potentially leading to improved recommendation accuracy. Additionally, evaluating the effectiveness of our approach on diverse datasets would help measure its generalizability and robustness. Another avenue worth investigating is the integration strategy for the enriched features: rather than embedding them directly into the item's feature vector, other, perhaps better, ways to store them may exist. In conclusion, SBRS are crucial for e-commerce, offering effective methods to represent and model short-term customer behavior. Several problems, such as cold start and limited item diversity, still exist, but there are different methods to tackle those challenges. Utilizing graphs is one of the most common approaches to the cold-start problem. One promising solution involves leveraging graph-based methods to enrich item representations. By incorporating contextual information from item-item relationships, these approaches can enhance the representation of less popular or new items, thereby improving recommendation quality in sessions involving such items.
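A minimal sketch of the neighborhood-aggregation idea described above, assuming the item graph is stored as an adjacency list and using a simple mean over neighbor vectors with a mixing weight `beta`; the function name and parameters are illustrative, not the study's exact aggregation scheme.

```python
import numpy as np

def enrich_item_features(features, adjacency, beta=0.5):
    """Blend each item's vector with the mean of its neighbors' vectors.
    features:  dict item_id -> np.ndarray feature vector
    adjacency: dict item_id -> list of neighboring item_ids
    beta:      weight kept on the item's own features (assumed parameter)."""
    enriched = {}
    for item, vec in features.items():
        neighbors = [features[n] for n in adjacency.get(item, []) if n in features]
        if neighbors:
            # Pull contextual information in from the item's neighborhood.
            enriched[item] = beta * vec + (1 - beta) * np.mean(neighbors, axis=0)
        else:
            enriched[item] = vec  # isolated items keep their own vector
    return enriched
```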
Continued advancements in both SBRS and graph-based techniques are expected to further enhance recommendation systems. This progress will contribute to more personalized and satisfying shopping experiences for users, ultimately benefiting both consumers and e-commerce platforms.
-
Effect of expanding bounding box annotations on small object detection performance in aerial imagery (ITU Graduate School, 2025-06-13)

The use of Unmanned Aerial Vehicles (UAVs), commonly referred to as drones, has increased significantly in recent years across diverse military and civilian applications, from reconnaissance and border patrol to precision agriculture and disaster relief. A key enabling technology for automating these tasks is Artificial Intelligence (AI), particularly for object detection in aerial images. However, this task is complicated by unique challenges such as object occlusion, extreme variations in viewing angle and object scale, and a high prevalence of small objects. Detecting these small objects, often defined as those smaller than 32 x 32 pixels, is especially difficult because of their limited visual information. While most research has focused on developing novel model architectures or specialized loss functions, the impact of bounding box annotation strategies has remained largely underexplored. This thesis addresses this gap by proposing and systematically evaluating a simple yet powerful method: improving small object detection by expanding bounding box annotations to incorporate critical contextual information.

Standard object detection training employs tightly fitted bounding boxes that, while precise, often exclude valuable information about an object's complete shape and its immediate surroundings. This exclusion is particularly detrimental for small objects, where every pixel is crucial. The method investigated in this thesis overcomes this by expanding the ground-truth bounding boxes of small objects during the training phase. This approach enriches the training data by forcing the model to learn from a broader context, capturing more of the object's features and its relationship with the background, thereby creating more robust feature representations.

To validate the effectiveness of this approach, we conducted comprehensive experiments on two challenging, large-scale aerial imagery datasets: VisDrone and DIOR. Both are characterized by a high proportion of small objects (60% in VisDrone, 68% in our DIOR subset), making them ideal testbeds. Our experiments first focused on optimizing the expansion strategy, revealing that a fixed-pixel increment outperformed proportional methods. We identified optimal expansion values tailored to each dataset's characteristics: a 20-pixel expansion for the high-resolution images in VisDrone and a 5-pixel expansion for the standard-resolution images in DIOR. The results were significant, demonstrating an absolute increase in mean Average Precision (mAP) of up to 10.5% on VisDrone. The improvement was even more pronounced for classes dominated by small instances, with the Average Precision (AP) for the pedestrian class surging by 21%. Qualitative analysis confirmed that these gains stem from two factors: a substantial increase in newly detected True Positives and a reduction in False Positives due to more effective Non-Maximum Suppression (NMS) in dense scenes. To confirm the generalizability and robustness of the bounding box expansion strategy, we tested it across a diverse set of state-of-the-art architectures, including various YOLO models (YOLOv8m, YOLOv10, YOLOv11), the Transformer-based RT-DETR, and the specialized small-object detector TPH-YOLOv5. Performance gains were consistently observed across all detectors, proving the model-agnostic nature of our approach.
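The fixed-pixel expansion is straightforward to reproduce as a training-time preprocessing step; the sketch below expands only boxes under the small-object threshold and clips at image borders. The 32-pixel threshold and the 20/5-pixel increments follow the values reported above, while the function name, the (x_min, y_min, x_max, y_max) box format, and padding each side by the full increment are assumptions.

```python
def expand_small_boxes(boxes, img_w, img_h, pad=20, small_thresh=32):
    """Expand ground-truth boxes of small objects by a fixed pixel margin.
    boxes: list of (x_min, y_min, x_max, y_max) in pixels.
    pad:   fixed-pixel increment (20 for VisDrone, 5 for DIOR per the thesis).
    Applied only during training; inference is untouched."""
    expanded = []
    for x1, y1, x2, y2 in boxes:
        w, h = x2 - x1, y2 - y1
        if w < small_thresh and h < small_thresh:  # only touch small objects
            x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
            x2, y2 = min(img_w, x2 + pad), min(img_h, y2 + pad)
        expanded.append((x1, y1, x2, y2))
    return expanded
```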
Crucially, because the expansion is a data preprocessing step, it introduces zero computational overhead during inference. This makes the method highly practical for real-world deployment on resource-constrained UAVs, where both accuracy and speed are paramount. In summary, this thesis presents a practical, computationally efficient, and effective solution to a persistent challenge in aerial object detection. By demonstrating that a simple expansion of bounding box annotations can significantly enhance detection accuracy, this work provides a valuable contribution that can be readily integrated into existing training pipelines to improve the performance of aerial surveillance and monitoring systems. The experimental results robustly validate this strategy across multiple datasets and state-of-the-art models, establishing it as a valuable tool for the computer vision community.