Deep learning-based building segmentation using high-resolution aerial images

Sarıtürk, Batuhan
Graduate School
With advances in satellite and remote sensing technologies and ongoing urban development, building segmentation and extraction from high-resolution aerial images and the generation of building maps have become important and popular research topics. As technology has developed, large numbers of high-resolution images have become increasingly accessible and convenient data sources. Because they can image large areas, these aerial images are very useful for accurate building segmentation and building map generation. Buildings are among the most important features of an urban database and are fundamental to human livelihood. For this reason, building maps play a significant role in various geoscience-related applications such as illegal building detection, change detection, population estimation, land use/land cover analysis, disaster management, and topographic and cadastral map production. Nonetheless, obtaining accurate and reliable building maps from high-resolution aerial images is still a challenging task for reasons such as complex backgrounds; differences in building size, shape, and color; noisy data; roof type diversity; and many other topological difficulties. Therefore, improving the efficiency and accuracy of building segmentation and extraction from high-resolution aerial images remains an active research focus in the field. Over the past years, various methods have been used to achieve automatic building segmentation from aerial images. In earlier studies, traditional image processing methods such as object-based, shadow-based, or edge-based methods were used. The low-level features and metrics used with these methods, such as color, spectrum, length, texture, edge, shadow, and height, can vary under different conditions such as atmospheric state, light, scale, and sensor quality.
These methods generally take manually extracted features and apply classifiers or conventional machine learning techniques to achieve building segmentation. However, manually extracting these features is costly, time-consuming, and labor-intensive, and it requires considerable experience and prior knowledge. Although these methods have made some progress over time, they have serious shortcomings such as low accuracy, low generalization ability, and complex processing. With technological developments and the availability of large datasets, deep learning-based approaches, especially Convolutional Neural Networks (CNNs), have gained a lot of attention from researchers and have surpassed conventional methods in terms of accuracy and efficiency. CNNs can extract relevant features directly from the input data and make predictions using fully connected layers. Many CNN architectures such as LeNet, AlexNet, VGGNet, GoogLeNet, and ResNet have been used over the years. However, CNNs perform regional divisions and use computationally expensive fully connected layers. These patch-based CNNs have achieved exceptional success, but because they rely on small patches around the targets to make predictions and ignore the relations between patches, they cannot preserve the integrity and spatial continuity of building features. To improve performance and overcome these problems, Long et al. proposed Fully Convolutional Networks (FCNs). In place of the fully connected layers of CNNs, FCNs use convolution layers, which greatly improves prediction accuracy. FCNs output feature maps at the size of the input images and perform pixel-based segmentation through their encoder-decoder structure. However, much information is lost in the decoder path because FCNs have just one upsampling layer. Despite their success, FCNs also have limitations such as computational complexity and a large number of parameters.
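The replacement of fully connected layers by convolutions can be illustrated with a minimal sketch in plain Python rather than a deep learning framework. A 1x1 convolution applies the same linear classifier at every pixel, so the output keeps the spatial size of its input and yields one label per pixel. The function names, the two-channel toy feature map, and the weights are hypothetical and chosen only for illustration; a real FCN would learn its weights and would also upsample coarse feature maps back to the input resolution.

```python
# Sketch only: a 1x1 convolution as a per-pixel linear classifier.
# All names and numbers here are illustrative assumptions, not from the thesis.

def conv1x1(feature_map, weights, biases):
    """Score every pixel of an H x W x C feature map with K class filters.

    weights is K x C and biases has length K; the result is H x W x K.
    """
    return [
        [
            [sum(w * x for w, x in zip(w_k, pixel)) + b_k
             for w_k, b_k in zip(weights, biases)]
            for pixel in row
        ]
        for row in feature_map
    ]


def label_map(scores):
    """Pick the highest-scoring class index at each pixel."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in scores]


# Toy 2 x 3 feature map with C = 2 channels per pixel.
features = [
    [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]],
    [[0.6, 0.4], [0.1, 0.9], [0.8, 0.2]],
]
# K = 2 classes: index 0 = background, index 1 = building (hypothetical weights).
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]

mask = label_map(conv1x1(features, weights, biases))
print(mask)  # one class index per pixel, same height and width as the input
```

Because no layer in this sketch depends on a fixed input size, the same weights could be applied to a feature map of any height and width, which is the property that a fully connected layer lacks.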
To overcome these shortcomings, several variants have been proposed over the years, such as SegNet, U-Net, and Feature Pyramid Networks (FPN). These CNN-based approaches have achieved successful results on image segmentation tasks, but they also have bottlenecks. For example, the use of fixed-size convolutions results in a local receptive field. By design, they are successful at extracting local context but have a limited ability to extract global context. To address these limitations, approaches such as attention mechanisms, residual connections, and architectures of different depths have been proposed and implemented. The Transformer was first used in natural language processing (NLP) and was later applied to computer vision tasks. In 2020, the Vision Transformer (ViT) was proposed for computer vision studies and obtained successful results on the ImageNet dataset. CNNs are successful at identifying local features, but their structure makes them insufficient at identifying global features. Transformers can compensate for this shortcoming through attention mechanisms. In ViT-based methods, global information can be extracted, but spatially detailed context is ignored. In addition, Transformers use all the pixels in vector operations when working with large images, and therefore require large amounts of memory and are computationally inefficient. The main objective of this thesis is to investigate, evaluate, and compare different CNN-based and Transformer-based approaches for building segmentation from high-resolution aerial images, and to propose a modernized CNN approach that addresses the shortcomings mentioned above. This thesis is composed of four papers dealing with these objectives.
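The memory cost of attention on large images can be made concrete with a rough back-of-the-envelope sketch. It assumes ViT-style tokenization into non-overlapping 16 x 16 patches and 32-bit floating-point attention weights, and it counts only a single attention matrix for one head and one image. The helper names and the example image sizes are hypothetical; the abstract does not give the settings used in the thesis.

```python
# Sketch only: how self-attention cost grows with image size.
# Assumes ViT-style 16 x 16 patches as tokens; image sizes are examples.

def num_tokens(image_size, patch_size=16):
    """Number of non-overlapping square patches (tokens) in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_matrix_bytes(tokens, bytes_per_value=4):
    """Memory for one tokens x tokens attention matrix (one head, one image)."""
    return tokens * tokens * bytes_per_value


for size in (224, 512, 1024):
    n = num_tokens(size)
    mib = attention_matrix_bytes(n) / 2**20
    print(f"{size}x{size} px -> {n} tokens, {mib:.2f} MiB per attention matrix")
```

Doubling the image side quadruples the number of tokens and multiplies the size of each attention matrix by sixteen, which is why global attention becomes expensive on large aerial tiles.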
In the first paper, four U-Net-based architectures, which are shallower and deeper versions of U-Net, were generated to perform building segmentation from high-resolution aerial images, and they were compared with each other and with the original U-Net. The models were trained and tested on datasets prepared from the Inria Aerial Image Labeling Dataset and the Massachusetts Buildings Dataset. On the Inria test set, the Deeper 1 U-Net architecture provided the highest F1 score (0.79) and IoU score (0.65), followed by the Deeper 2 and U-Net architectures. On the Massachusetts test set, the U-Net architecture provided an F1 score of 0.79 and an IoU score of 0.66, followed by Deeper 2 and Shallower 1. The successful results obtained with the Deeper 1 and Deeper 2 architectures show that deeper architectures can provide better results even when the amount of data is limited. Additionally, the Shallower 1 architecture performs not far behind the deeper architectures at a lower computational cost, which makes it useful for geographic applications. In the second paper, U-Net and FPN architectures utilizing four different backbones (ResNet, ResNeXt, SE-ResNeXt, and DenseNet) and an Attention Residual U-Net approach were generated and compared. The publicly available Inria Aerial Image Labeling Dataset and Massachusetts Buildings Dataset were used to train and test the models. On the Inria test set, the Attention Residual U-Net model has the highest F1 score (0.8154), IoU score (0.7102), and test accuracy (94.51%). On the Massachusetts test set, the FPN DenseNet-121 model has the highest F1 score (0.7565) and IoU score (0.6188), and the Attention Residual U-Net model has the highest test accuracy (92.43%). It was observed that FPN with a DenseNet backbone can be a better choice when working with small datasets. On the other hand, the Attention Residual U-Net approach achieved higher success when a sufficiently large dataset was provided.
In the third paper, a total of twelve CNN-based models (U-Net, FPN, and LinkNet architectures utilizing the EfficientNet-B5 backbone, the original U-Net, SegNet, FCN, and six different Residual U-Net approaches) were generated, evaluated, and compared. The Inria Aerial Image Labeling Dataset was used to train the models, and three datasets (Inria Aerial Image Labeling Dataset, Massachusetts Buildings Dataset, and Syedra Archaeological Site Dataset) were used to evaluate the trained models. On the Inria test set, Residual-2 U-Net has the highest F1 and IoU scores, with 0.824 and 0.722, respectively. On the Syedra test set, LinkNet-EfficientNet-B5 has F1 and IoU scores of 0.336 and 0.246. On the Massachusetts test set, Residual-4 U-Net has F1 and IoU scores of 0.394 and 0.259. The results show that the models using residual connections are more successful than the models using conventional convolution structures. It was also observed that the LinkNet architecture gave good results on the Syedra test set, whose characteristics differ from those of the other two datasets, and could be a good option for future studies involving archaeological sites. In the fourth paper, a total of ten CNN and Transformer models (the proposed Residual-Inception U-Net (RIU-Net), U-Net, Residual U-Net, Attention Residual U-Net, U-Net-based models implementing Inception, Inception-ResNet, Xception, and MobileNet as backbones, Trans U-Net, and Swin U-Net) were generated, and building segmentation from high-resolution satellite images was carried out. The Massachusetts Buildings Dataset and the Inria Aerial Image Labeling Dataset were used for training and evaluating the models. On the Inria dataset, RIU-Net achieved the highest IoU score, F1 score, and test accuracy, with 0.6736, 0.7868, and 92.23%, respectively.
On the Massachusetts Small dataset, Attention Residual U-Net achieved the highest IoU and F1 scores, with 0.6218 and 0.7606, and Trans U-Net reached the highest test accuracy, with 94.26%. On the Massachusetts Large dataset, Residual U-Net achieved the highest IoU and F1 scores, with 0.6165 and 0.7565, and Attention Residual U-Net attained the highest test accuracy, with 93.81%. The results showed that the RIU-Net approach was considerably more successful than the other models on the Inria dataset. On the Massachusetts datasets, Residual U-Net, Attention Residual U-Net, and Trans U-Net gave more successful results.
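For readers unfamiliar with the reported metrics, the following is a minimal sketch of how IoU, F1 score, and pixel accuracy are commonly computed for a binary building mask. It uses the standard textbook definitions on a toy pair of masks. The abstract does not state how the thesis aggregated scores across images, so the function names, the toy data, and the handling of empty masks are assumptions made only for illustration.

```python
# Sketch only: standard binary segmentation metrics on toy masks.
# 1 = building pixel, 0 = background. Names and data are illustrative.

def confusion_counts(pred, truth):
    """Count true/false positives and negatives over two equally sized masks."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def iou(tp, fp, fn):
    """Intersection over union of the predicted and true building pixels."""
    denom = tp + fp + fn
    # Convention chosen here: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def accuracy(tp, fp, fn, tn):
    """Share of all pixels that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

tp, fp, fn, tn = confusion_counts(pred, truth)
print(f"IoU={iou(tp, fp, fn):.3f}  F1={f1_score(tp, fp, fn):.3f}  "
      f"accuracy={accuracy(tp, fp, fn, tn):.3f}")
```

When both scores are computed from the same set of counts, they are tied by F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU. Scores averaged per image or per tile need not satisfy this identity exactly.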
Thesis (Ph.D.) -- Istanbul Technical University, Graduate School, 2022
Keywords
convolutional neural networks, image segmentation, satellite images