LEE - Geomatics Engineering - Doctorate
Item: Deep learning based road segmentation from multi-source and multi-scale data (Graduate School, 2023-05-12) Öztürk, Ozan ; Şeker, Dursun Zafer ; 501162611 ; Geomatics Engineering

Roads are geographical objects that are the subject of many application areas, such as city planning, traffic management, disaster management, and military interventions. The success of these applications depends on how quickly and accurately road information can be obtained. Researchers have mostly used satellite and/or aerial photographs as data sources in these studies and have focused on the automatic acquisition of road information. Although successful results have been obtained with artificial intelligence (AI)-based approaches, which have become widely used in recent years, automatic segmentation of roads from remote sensing data is still considered a difficult and important problem because of the complex and irregular structure of road networks.

AI has been developed to enable computers to realize human abilities such as reasoning, perception, and problem-solving. The most basic expectation is that AI can overcome problems for which traditional approaches are insufficient. As a recent trend in AI, deep learning (DL) methods establish a more complex relationship with the data and distinguish its hidden features more accurately. DL is data-driven, and the quality, number, and variety of training data directly affect model performance. For this purpose, comprehensive datasets such as MNIST, COCO, and ImageNet have been published. However, the number of datasets containing geographic details is limited compared to others, and such datasets represent only the characteristics of the regions where they were created. Models trained with these datasets can therefore distinguish details only at the level that can be learned from this limited data, and it is extremely difficult for such models to predict roads effectively in regions with complex road networks, such as Istanbul.

This thesis aims to overcome the data gap in road segmentation studies with DL algorithms, to produce datasets representative of the study region, and finally to use data obtained from different sources together, overcoming the problems encountered in existing research that uses only optical images. The thesis is divided into five main parts. The introduction provides a general overview of the subject, including comprehensive information on current studies and the motivation of the thesis.

In the second part, a fast, accurate, and comprehensive road-dataset production infrastructure was created using a web map service to overcome data-related problems. For this purpose, service providers whose maps can be edited according to user requests were utilized. Using the Static API feature of the Google Maps Platform, a data generation program was developed in the Python programming language. In this program, the properties of the mask images corresponding to the satellite images were defined with JavaScript code, and an automatic static map style was created for road segmentation. With this program, the desired number of images can be generated randomly or as a sequence, at fixed image sizes and within the boundaries of specified test regions.
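The abstract does not reproduce the program itself, but the core idea of pairing a satellite tile with a styled road mask from the Google Maps Static API can be sketched as follows. This is a minimal illustration, not the thesis code: the API key is a placeholder, and the style rules (hide everything, then draw road geometry in white on black) are an assumed configuration, since the actual style JSON used in the thesis is not given.

```python
import requests

BASE_URL = "https://maps.googleapis.com/maps/api/staticmap"
API_KEY = "YOUR_API_KEY"  # placeholder; a real Maps Platform key is required

# Illustrative style rules: hide all map elements, paint the background
# black, then draw road geometry in solid white, yielding a binary road
# mask. The style actually used in the thesis is not given in the abstract.
MASK_STYLES = [
    "feature:all|element:all|visibility:off",
    "feature:all|element:geometry|color:0x000000",
    "feature:road|element:geometry|visibility:on|color:0xffffff",
]

def fetch_pair(lat, lon, zoom=16, size=512):
    """Download a satellite tile and its styled road-mask counterpart."""
    common = {"center": f"{lat},{lon}", "zoom": zoom,
              "size": f"{size}x{size}", "key": API_KEY}
    sat = requests.get(BASE_URL, params={**common, "maptype": "satellite"})
    # Style rules are passed as repeated 'style' query parameters.
    mask = requests.get(BASE_URL, params={**common, "maptype": "roadmap",
                                          "style": MASK_STYLES})
    return sat.content, mask.content  # PNG bytes

sat_png, mask_png = fetch_pair(41.015, 28.979)  # a point in Istanbul
```

Looping such a request over random or gridded centers inside a test-region boundary reproduces the random/sequential generation modes described above.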
The Google Maps Platform does not, however, provide geographic information about the images it serves. To overcome this deficiency, georeferencing of the satellite images and the corresponding masks was added to the program; one possible implementation is sketched at the end of this item.

In the third part of the thesis, an Istanbul road dataset was created, since road segmentation studies require datasets that represent the characteristics of the region being tested. Istanbul's road network is still developing along with an ever-increasing population. Because the city contains different road types and land-use details, it can supply the data diversity that DL applications require, and its changing, evolving structure makes it one of the most important regions to observe and analyze continuously. To examine the contributions of different satellite-image resolutions and different mask generalization levels in road segmentation studies, images at zoom levels 14, 15, 16, and 17 were generated from Google Maps. Consequently, 10,000 optical images and road-mask images were produced for each zoom level over the test regions in Istanbul.

To test the performance of the generated dataset in DL models, the deep residual U-Net architecture was used. Examination of the training metrics and the models' predictions showed that the Istanbul dataset achieved successful results in segmenting road pixels at each zoom level. In addition, the DeepGlobe and Massachusetts datasets, which are widely preferred in road segmentation studies, were included in the analysis to test the prediction performance of models trained with datasets generated outside the study region.
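As noted above, georeferencing of the downloaded images and masks was added to the generation program. The abstract does not say how this was done; a minimal sketch of one standard approach is given below, which writes an ESRI world file in Web Mercator (EPSG:3857) coordinates for an image centered on a known point at a known zoom level. The tile size, file naming, and choice of a world file rather than embedded GeoTIFF tags are illustrative assumptions.

```python
import math

R = 6378137.0  # WGS 84 / Web Mercator earth radius (metres)

def lonlat_to_mercator(lon, lat):
    """Project geographic coordinates to EPSG:3857 metres."""
    x = math.radians(lon) * R
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

def write_worldfile(path, lon, lat, zoom, size=512):
    """Write an ESRI world file (.pgw) georeferencing a size x size
    Static Maps image centred on (lon, lat) at the given zoom level.
    One standard approach; not necessarily the thesis implementation."""
    # Metres per pixel in EPSG:3857 at this zoom (256-pixel base tile).
    res = 2 * math.pi * R / (256 * 2 ** zoom)
    cx, cy = lonlat_to_mercator(lon, lat)
    # World files reference the centre of the top-left pixel.
    x0 = cx - res * (size / 2 - 0.5)
    y0 = cy + res * (size / 2 - 0.5)
    with open(path, "w") as f:
        # Order: x pixel size, rotations, -y pixel size, x origin, y origin.
        f.write(f"{res}\n0.0\n0.0\n{-res}\n{x0}\n{y0}\n")

write_worldfile("tile_41.015_28.979.pgw", 28.979, 41.015, zoom=16)
```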
Item: Deep learning-based building segmentation using high-resolution aerial images (Graduate School, 2022-10-05) Sarıtürk, Batuhan ; Şeker, Dursun Zafer ; 501142612 ; Geomatics Engineering

With the advancements in satellite and remote sensing technologies and the development of urban areas, building segmentation and extraction from high-resolution aerial images, and the generation of building maps, have become important and popular research topics. Technological developments have made large numbers of high-resolution images increasingly accessible and convenient data sources, and because these aerial images cover large areas, they are very useful for accurate building segmentation and building-map generation. As one of the key features of the urban database, buildings are fundamental to human livelihood. Building maps therefore play a significant role in various geoscience-related applications such as illegal building detection, change detection, population estimation, land use/land cover analysis, disaster management, and topographic and cadastral map production. Nonetheless, obtaining accurate and reliable building maps from high-resolution aerial images is still challenging for various reasons, including complex backgrounds; differences in building size, shape, and color; noisy data; roof-type diversity; and many other topological difficulties. Improving the efficiency and accuracy of building segmentation and extraction from high-resolution aerial images therefore remains a focus and a hot topic among researchers in the field.

Over the past years, various methods have been used to achieve automatic building segmentation from aerial images. Earlier studies used traditional image processing methods such as object-based, shadow-based, or edge-based methods. The low-level features and metrics used by these methods, such as color, spectrum, length, texture, edge, shadow, and height, can vary under different conditions such as atmospheric state, light, scale, and sensor quality. These methods generally take manually extracted features and apply classifiers or conventional machine learning techniques to achieve building segmentation. However, manually extracting these features is costly, time-consuming, labor-intensive, and requires extensive experience and prior knowledge. Although these methods made some progress over time, they have serious shortcomings, such as low accuracy, low generalization ability, and complex processing.

With technological developments and the availability of large datasets, deep learning-based approaches, especially Convolutional Neural Networks (CNNs), have gained a lot of attention from researchers and have surpassed conventional methods in terms of accuracy and efficiency. CNNs can extract relevant features directly from the input data and make predictions using fully connected layers. Many CNN architectures, such as LeNet, AlexNet, VGGNet, GoogLeNet, and ResNet, have been used over the years. However, CNNs perform regional divisions and use computationally expensive fully connected layers. These patch-based CNNs have achieved exceptional success, but because they rely on small patches around the targets to perform predictions and ignore the relations between them, they cannot provide accurate integrity and spatial continuity of building features. To improve performance and overcome these problems, Long et al.
proposed Fully Convolutional Networks (FCNs). Instead of the fully connected layers of CNNs, FCNs use convolution layers, which greatly improve prediction accuracy. FCNs output feature maps at the size of the input images and perform pixel-based segmentation through their encoder-decoder structure. However, because FCNs have just one upsampling layer, much information is lost in the decoder path. Despite their success, FCNs also have limitations, such as computational complexity and a large number of parameters. To overcome these shortcomings, various variants have been proposed over the years, such as SegNet, U-Net, and Feature Pyramid Networks (FPN). These CNN-based approaches have achieved successful results on image segmentation tasks, but they also have bottlenecks. For example, the use of fixed-size convolutions results in a local receptive field: by design, these models are successful at extracting local context but have a low ability to extract global context. Approaches proposed to overcome these shortcomings include attention mechanisms, residual connections, and architectures of different depths.

The Transformer was first used in natural language processing (NLP) and was later applied to computer vision tasks. In 2020, the Vision Transformer (ViT) approach was proposed for computer vision studies and obtained successful results on the ImageNet dataset. CNNs are successful at identifying local features but, due to their structure, are insufficient at identifying global features; Transformers can compensate for these shortcomings through their attention mechanisms. ViT-based methods can extract global information but ignore spatially detailed context. In addition, Transformers use all the pixels in vector operations when working with large images, and therefore require large amounts of memory and are computationally inefficient.

The main objective of this thesis is to investigate, evaluate, and compare different CNN-based and Transformer-based approaches for building segmentation from high-resolution aerial images, and to propose a modernized CNN approach that deals with the mentioned shortcomings. The thesis is composed of four papers addressing these objectives.

In the first paper, four U-Net-based architectures, shallower and deeper versions of the U-Net, were generated to perform building segmentation from high-resolution aerial images, and they were compared with each other and with the original U-Net. The models were trained and tested on datasets prepared from the Inria Aerial Image Labeling Dataset and the Massachusetts Buildings Dataset. On the Inria test set, the Deeper 1 U-Net architecture provided the highest F1 score (0.79) and IoU score (0.65), followed by the Deeper 2 and U-Net architectures. On the Massachusetts test set, the U-Net architecture provided an F1 score of 0.79 and an IoU score of 0.66, followed by Deeper 2 and Shallower 1. The successful results obtained with the Deeper 1 and Deeper 2 architectures show that deeper architectures can provide better results even when data are limited. Additionally, the Shallower 1 architecture performed not far behind the deep architectures at less computational cost, which makes it useful for geographic applications.
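All four papers rank models primarily by F1 and intersection over union (IoU). For binary building masks these reduce to simple pixel counts, as in the following minimal numpy sketch (the function and variable names are illustrative, not taken from the papers):

```python
import numpy as np

def binary_metrics(pred, target, threshold=0.5, eps=1e-7):
    """IoU and F1 (Dice) for a binary segmentation mask pair.

    pred   -- predicted probabilities, array of shape (H, W)
    target -- ground-truth mask of 0s and 1s, same shape
    """
    p = (pred >= threshold).astype(np.float64)
    t = target.astype(np.float64)
    tp = (p * t).sum()        # true positive pixels
    fp = (p * (1 - t)).sum()  # false positive pixels
    fn = ((1 - p) * t).sum()  # false negative pixels
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return iou, f1

# Toy usage: a perfect prediction gives IoU = F1 = 1.0 (up to eps).
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1
print(binary_metrics(mask, mask))
```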
In the second paper, U-Net and FPN architectures utilizing four different backbones (ResNet, ResNeXt, SE-ResNeXt, and DenseNet) and an Attention Residual U-Net approach were generated and compared. The publicly available Inria Aerial Image Labeling Dataset and Massachusetts Buildings Dataset were used to train and test the models. The Attention Residual U-Net model achieved the highest F1 score (0.8154), IoU score (0.7102), and test accuracy (94.51%) on the Inria test set. On the Massachusetts test set, the FPN DenseNet-121 model achieved the highest F1 score (0.7565) and IoU score (0.6188), and the Attention Residual U-Net model achieved the highest test accuracy (92.43%). It was observed that FPN with a DenseNet backbone can be a better choice when working with small datasets, whereas the Attention Residual U-Net approach achieved higher success when a sufficiently large dataset was provided.

In the third paper, a total of twelve CNN-based models (U-Net, FPN, and LinkNet architectures utilizing an EfficientNet-B5 backbone; the original U-Net; SegNet; FCN; and six different Residual U-Net approaches) were generated, evaluated, and compared. The Inria Aerial Image Labeling Dataset was used to train the models, and three datasets (the Inria Aerial Image Labeling Dataset, the Massachusetts Buildings Dataset, and the Syedra Archaeological Site Dataset) were used to evaluate them. On the Inria test set, Residual-2 U-Net achieved the highest F1 and IoU scores (0.824 and 0.722, respectively). On the Syedra test set, LinkNet-EfficientNet-B5 achieved F1 and IoU scores of 0.336 and 0.246. On the Massachusetts test set, Residual-4 U-Net achieved F1 and IoU scores of 0.394 and 0.259. The evaluation showed that models using residual connections are more successful than models using conventional convolution structures. It was also observed that the LinkNet architecture gave good results on the Syedra test set, which has different characteristics from the other two datasets, and could be a good option for future studies involving archaeological sites.

In the fourth paper, a total of ten CNN and Transformer models (the proposed Residual-Inception U-Net (RIU-Net); U-Net; Residual U-Net; Attention Residual U-Net; U-Net-based models implementing Inception, Inception-ResNet, Xception, and MobileNet as backbones; Trans U-Net; and Swin U-Net) were generated, and building segmentation from high-resolution satellite images was carried out. The Massachusetts Buildings Dataset and the Inria Aerial Image Labeling Dataset were used for training and evaluation. On the Inria dataset, RIU-Net achieved the highest IoU score, F1 score, and test accuracy (0.6736, 0.7868, and 92.23%, respectively). On the Massachusetts Small dataset, Attention Residual U-Net achieved the highest IoU and F1 scores (0.6218 and 0.7606), and Trans U-Net reached the highest test accuracy (94.26%). On the Massachusetts Large dataset, Residual U-Net accomplished the highest IoU and F1 scores (0.6165 and 0.7565), and Attention Residual U-Net attained the highest test accuracy (93.81%). The results showed that the RIU-Net approach was significantly more successful on the Inria dataset than the other models, while on the Massachusetts datasets, Residual U-Net, Attention Residual U-Net, and Trans U-Net gave more successful results.
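A recurring finding across the third and fourth papers is that residual connections improve segmentation over plain convolution stacks. The following minimal PyTorch sketch shows the kind of residual convolution block such Residual U-Net variants build on; the framework choice, channel sizes, and layer layout are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """A minimal residual block for a U-Net encoder/decoder stage:
    two 3x3 conv-BN-ReLU layers plus an identity (or 1x1-projected)
    shortcut, so gradients can bypass the convolution stack."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection when channel counts differ, identity otherwise.
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.shortcut(x))

# Toy usage: one encoder stage on a 3-band 256x256 aerial patch.
block = ResidualConvBlock(3, 64)
out = block(torch.randn(1, 3, 256, 256))  # shape: (1, 64, 256, 256)
```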