Building Detection and Counting in Convoluted Areas Using Multiclass Datasets with Unmanned Aerial Vehicles (UAVs) Imagery
1. Introduction
Remote sensing images have a wide range of applications, including monitoring and counting wildlife [1] , detecting and classifying vegetation in grasslands and heavily forested areas [2] , mining land cover classification [3] , flood extent mapping [4] , and multi-feature building extraction in urban areas [5] . Most of these applications rely on high-resolution images, which can be used for pixel-level classification known as semantic segmentation [6] [7] [8] . Higher-resolution Unmanned Aerial Vehicle (UAV) images can produce better semantically segmented images [9] , which help in understanding the contents of the given images. Despite their higher spatial resolution, however, UAV images lack the spectral information of ground objects that is inherently available in satellite imagery, which poses difficulties in segmentation [10] .
Deep learning methodologies have been studied extensively in remote sensing [11] . With the rise of Deep Convolutional Neural Networks (DCNNs), they have been widely used in object detection and segmentation [10] . Architectures like UNet, based on DCNNs, can be used to extract spatial features of regions of interest. Building region detection can also be treated as a feature detection problem, in which the features of buildings are extracted using a Convolutional Neural Network (CNN) in an encoder-decoder architecture [10] .
To tackle this problem, many deep learning architectures [12] [13] extract building masks from given satellite imagery. Building extraction can also be treated as an instance segmentation problem [14] , where building localization is followed by segmentation. However, this method is computationally intensive because the segmentation must be run for every detected building instance. Alternatively, we can treat this as a semantic segmentation problem [15] , but this gives only a vague representation of the building count. Especially among convoluted buildings, identifying individual buildings is cumbersome, and the detected buildings lack clear boundaries between one another. While existing methodologies [16] [17] for building segmentation work well in developed countries, where buildings are typically well-defined and distinct, they may not be suitable for developing countries, where buildings are often disorganized and complex, making them difficult even for humans to differentiate. This poses challenges when automating building segmentation for humanitarian aid, population estimation, or urban density calculations.
Existing methodologies have treated building detection as a binary classification problem between building interiors and the background [18] . Since buildings have a clear boundary between their inner and outer regions, we can instead treat this as a multi-class classification problem in which the building boundary, the inner region, and the background are each treated as separate classes. This converts the existing binary segmentation problem into multi-class segmentation, and it can be applied to all existing datasets with minor modifications. The use of edge detection helps to separate merged buildings into individual buildings.
By treating building segmentation as multi-class segmentation, we increased the building count while preserving the inner building regions. An erosion operation from OpenCV [19] is applied to the building mask, producing inner regions; the new class for the multiclass dataset is defined as the difference between the original building mask and these inner regions. Furthermore, we can reuse existing model architectures, which does not significantly increase computational complexity. Moreover, since the features of the building exterior are already learned during binary segmentation, this change does not hinder network convergence; instead, the network converges faster to similar precision and recall metrics.
In summary, we experimented with multiple architectures with joint contour and structure learning. Our experiments showed a clear gain in building detection efficiency in often complex and convoluted areas, increasing precision and recall on datasets such as Opencities [20] and the Inria building dataset [21] , along with our own dataset. The rest of the paper is organized as follows: Section 2 summarizes previous work on building detection. Section 3 describes the proposed method, while Section 4 provides implementation details. In Section 5, experimental results are presented, followed by Sections 6, 7, and 8, which contain discussions, challenges, and conclusions, respectively.
2. Literature Review
Building segmentation from satellite imagery has been rigorously studied over the past decade [9] [10] [18] . The abundance of data has led to the use of numerous machine-learning methods for building detection [12] [13] . The majority of building detection techniques can be classified as either classical or deep learning approaches. While visually identifying various simple and complex building patterns may be easy for humans, classifying buildings in remote sensing imagery based on their diverse patterns and styles proves challenging for classical machine learning algorithms. Traditional remote sensing image processing methods like the Support Vector Machine (SVM) do not perform well on UAV images, as they require training samples covering the many variations in building datasets [22] . Buildings have also been extracted from high-resolution images using Normalized Difference Vegetation Index (NDVI) indices [23] , and further custom indices such as the Morphological Building Index (MBI) have been developed [24] . Such methods are prone to errors due to variations in buildings' characteristics and properties. Building extraction from top-view imagery has likewise been shown to be sensitive to building complexity [25] . Building segmentation from high-resolution images has also been performed using binary mathematical morphology (MM) operators [26] . Furthermore, direct building extraction by ensembling models trained on multi-spectral images, the OpenStreetMap (OSM) dataset, and RGB images has been explored [27] . Buildings can also be extracted by combining results from two models trained on high-resolution satellite imagery and Digital Surface Model (DSM) data from LIDAR [28] .
The use of convolutional neural networks has produced state-of-the-art results in object identification, classification, and segmentation [29] . Deep learning architectures like Mask RCNN [30] , based on region proposals, are used for object detection and instance segmentation, while architectures like UNet [31] , originally developed for small datasets of biomedical images, are used for semantic segmentation. UNet can extract a feature mask from an input image that matches the original image in spatial resolution. Further, features extracted in the early layers are appended to later layers, preserving information for prediction. Despite being trained on relatively few images, the UNet architecture consistently achieves state-of-the-art results in medical image segmentation [32] . Both medical and satellite images share the issue of data deficiency for segmentation [33] [34] . UNet has therefore been widely adopted for satellite imagery; for example, it achieved the leading score in the SpaceNet building detection competition [35] .
To enhance building detection and segmentation, some authors have employed multiple UNet architectures with an attention layer to generate the final mask [36] . There, instance segmentation of buildings is performed using multiple UNets, where separate models learn building contours, building regions, and background, which are merged afterward. MAP-Net has also been used for building segmentation [37] , where the authors learn both building edges and inner building regions from the same image.
3. Proposed Method
3.1. Model Architecture
We divided the ground truth into single-class and multi-class building datasets, effectively increasing the number of classes. To create the multi-class dataset from the single-class dataset, we applied an erosion operation to the building mask using a 15 × 15 pixel kernel, as specified by Equation (1). This process generated eroded regions, which were assigned to a new inner-building class, while the difference between these inner regions and the original mask was designated as another new class. We thus divided each image into three distinct classes: background, building edges, and inner building regions. This allows the segmentation architecture to learn edge information separately, thereby increasing the count of identified buildings. The erosion operation [19] is defined as
$$A \ominus B = \{\, z \mid (B)_z \subseteq A \,\} \tag{1}$$

where $A$ is the binary building mask, $B$ is the 15 × 15 structuring element, and $(B)_z$ denotes $B$ translated by $z$.
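As a concrete illustration, the following is a minimal sketch of this mask-to-multiclass conversion using OpenCV and NumPy; the function name and the integer class encoding are our own choices for illustration:

```python
import cv2
import numpy as np

def to_multiclass(mask: np.ndarray) -> np.ndarray:
    """Convert a binary building mask (values 0/255) into three classes:
    0 = background, 1 = building edge, 2 = inner building region."""
    kernel = np.ones((15, 15), np.uint8)   # 15 x 15 structuring element B
    inner = cv2.erode(mask, kernel)        # eroded inner regions (Equation (1))
    edge = cv2.subtract(mask, inner)       # original mask minus inner regions
    multiclass = np.zeros(mask.shape, dtype=np.uint8)
    multiclass[edge > 0] = 1
    multiclass[inner > 0] = 2
    return multiclass
```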
To validate our methods, we experimented with UNet [31] , UNet++ [38] , DeepLabV3 [39] , and DeepLabV3+ [40] architectures using efficientnet-b0 [41] as the encoder, pre-trained on the ImageNet dataset [42] .
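The paper does not name the exact implementation used; a plausible sketch of how these four architectures could be instantiated with the segmentation_models.pytorch library (an assumption on our part) is:

```python
import segmentation_models_pytorch as smp

# Shared configuration: efficientnet-b0 encoder pre-trained on ImageNet,
# three output classes (background, building edge, inner region).
common = dict(encoder_name="efficientnet-b0",
              encoder_weights="imagenet",
              classes=3)

models = {
    "UNet": smp.Unet(**common),
    "UNet++": smp.UnetPlusPlus(**common),
    "DeepLabV3": smp.DeepLabV3(**common),
    "DeepLabV3+": smp.DeepLabV3Plus(**common),
}
```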
We used the UNet architecture as the baseline for experimentation (Figure 1). The network consists of an encoding network on the left side and a decoder network on the right side. The input image is convolved with 3 × 3 convolution layers followed by ReLU, with 2 × 2 max-pooling at stride 2 for downsampling. During upsampling, 2 × 2 up-convolution is used, followed by 3 × 3 convolution and ReLU. During concatenation, zero padding is used to match the feature size, and a 1 × 1 convolution layer is used to match the number of features.

Figure 1. Overall system architecture: multi-class segmentation.
3.2. Loss Function and Evaluation Metrics
We used dice loss [43] instead of Intersection over Union (IOU) [44] as the loss function, as it yields higher consistency between the predicted segmentation mask and the labels without favoring common regions. The dice loss [43] is given by
$$\mathcal{L}_{Dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{2}$$

where $X$ is the predicted mask and $Y$ is the ground-truth mask.
We employed cross-entropy loss [31] in addition to dice loss to detect differences between the predicted and original masks: binary cross-entropy loss was utilized for binary segmentation (Equation (3)), while cross-entropy loss was used for multi-class segmentation (Equation (4)). To assess the predicted results, we used the dice coefficient [43] to measure similarity on the building datasets. Additionally, the building masks were assessed using precision (Equation (5)) and recall (Equation (6)) [37] [45] , computed from the number of building counts in each image slice.
$$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\big] \tag{3}$$

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c} \tag{4}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$

where $\hat{y}$ denotes the predicted probability, $y$ the ground truth, and $TP$, $FP$, and $FN$ the true-positive, false-positive, and false-negative building counts, respectively.
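A minimal PyTorch sketch of the combined objective follows; equal weighting of the dice and cross-entropy terms is an assumption, as the paper does not state the weights:

```python
import torch
import torch.nn.functional as F

def dice_loss(probs: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft dice loss over probability maps, following Equation (2)."""
    probs = probs.flatten(1)
    target = target.flatten(1)
    intersection = (probs * target).sum(dim=1)
    dice = (2 * intersection + eps) / (probs.sum(dim=1) + target.sum(dim=1) + eps)
    return 1 - dice.mean()

def multiclass_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Dice loss plus cross-entropy (Equation (4)).
    logits: (N, C, H, W); target: (N, H, W) integer class labels."""
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    return dice_loss(probs, onehot) + F.cross_entropy(logits, target)
```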
4. Implementation Details
4.1. Training Pipeline
We used an image size of 224 × 224, resized from the original image size of 1024 × 1024. A batch size of 8 was used for 20 epochs with the SGD optimizer and an initial learning rate of 0.001, decayed by a factor of 10 after every 30% of the total epochs. We used a threshold of 100 pixels of area as the minimum building size.
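A sketch of this optimization schedule under our reading of the decay rule (milestones at 30%, 60%, and 90% of the 20 epochs); the stand-in model is for illustration only:

```python
import torch

model = torch.nn.Conv2d(3, 3, kernel_size=1)   # stand-in for the segmentation model
epochs = 20
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# decay by 10x at each 30% of the epochs, i.e. after epochs 6, 12, and 18
milestones = [int(epochs * frac) for frac in (0.3, 0.6, 0.9)]
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)

for epoch in range(epochs):
    # ... training loop over batches of size 8 would go here ...
    scheduler.step()
```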
4.2. Training Configuration
All models use efficientnet-b0 as the encoder, pre-trained on the ImageNet dataset [42] . We implemented the models in Python with PyTorch [46] , using two NVIDIA RTX 3090 GPUs, a 10th-generation Intel i9 CPU, and 32 GB of RAM.
We also parallelized training using the PyTorch data parallelism [46] pipeline, which enabled us to experiment with larger models and larger batch sizes. Although it did not improve training time, it allowed us to train larger models. During the training phase, each batch was split across the GPUs and the results were merged.
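A minimal sketch of this setup; the use of nn.DataParallel rather than DistributedDataParallel is our assumption:

```python
import torch

model = torch.nn.Conv2d(3, 3, kernel_size=1)   # stand-in for the segmentation model
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model, device_ids=[0, 1])
if torch.cuda.is_available():
    model = model.cuda()
# During each forward pass, the batch is split across the GPUs and the
# outputs are gathered back onto the default device.
```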
4.3. Datasets
The building segmentation experiments were carried out on UAV imagery of Kathmandu, Nepal. The area covered by this dataset is sparsely populated with medium-sized houses, along with tunnels used for farming in the region. We also used the regional dataset from the Nima region of Accra, Ghana [20] , which has densely packed buildings; this region is represented by tier_1_source_acc_d41d81 in the Open Cities tier 1 sample datasets. Identifying individual building instances there is arduous even for humans due to the limited space between building roof structures. Furthermore, to validate our methodology, we also utilized the Inria Aerial Image Labeling Dataset [21] .
While the Kathmandu datasets were sliced into 850 × 850 px RGB images, the Open Cities datasets were sliced into 1024 × 1024 px RGB images, and the Inria aerial images were randomly cropped to 224 × 224 px since they were less magnified. This variation in image sizes and cropping methods ensured that the datasets were appropriately prepared for the building segmentation task in the different regions. A corresponding mask was created from each slice, and the rasterization error produced while creating the mask images was kept to a minimum. The resulting samples were split into training and testing sets at an 80%/20% ratio, with 20% of the training set held out for validation; all images were resized during training.
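A sketch of this split; the sample identifiers, the use of scikit-learn, and the random seed are all illustrative assumptions:

```python
from sklearn.model_selection import train_test_split

slices = [f"slice_{i:04d}" for i in range(1000)]   # hypothetical (image, mask) slice IDs
# 80% train / 20% test, then 20% of the training portion held out for validation
train, test = train_test_split(slices, test_size=0.20, random_state=42)
train, val = train_test_split(train, test_size=0.20, random_state=42)
```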
4.4. Data Augmentation
We used the Albumentations [47] library for dataset augmentation, involving rotation, brightness and contrast manipulation, and RGB shift operations. Augmentation increases the diversity of datasets, improving the generalization capability of the model [48] and leading to better building detection performance. Images were rotated with a 50% probability, while brightness and contrast variations ranged from 0 to 20%. Additionally, horizontal and vertical flips, as well as color shifts within the range of 0 to 20%, were applied with a 50% probability.
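A sketch of this pipeline in the Albumentations API; the exact limit values are our reading of the stated 0-20% ranges:

```python
import albumentations as A

transform = A.Compose([
    A.Rotate(p=0.5),                                    # rotation with 50% probability
    A.RandomBrightnessContrast(brightness_limit=0.2,    # brightness/contrast up to 20%
                               contrast_limit=0.2, p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RGBShift(r_shift_limit=20, g_shift_limit=20,      # color shift up to ~20%
               b_shift_limit=20, p=0.5),
])
# usage: augmented = transform(image=image, mask=mask)
```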
4.5. Post Processing
Once images are predicted using the above pipeline, they are post-processed to remove noisy building regions. We apply binary thresholding and create contours from the segmentation mask using the marching squares algorithm [49] via the skimage [50] library. Each contour is then processed with the Ramer-Douglas-Peucker algorithm [51] , which minimizes the number of contour points. A predicted contour is considered a building if its pixel area exceeds 100 pixels; this threshold is derived from the smallest building contour area in the given dataset. Once building instances are identified, we count them toward precision and recall only if they cover an area greater than the minimum building threshold and have an IOU score greater than 0.5. In all the following cases, building recall and building precision are calculated from the number of buildings under these assumptions.
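A sketch of this post-processing chain with skimage; the binarization threshold of 0.5 and the simplification tolerance are assumptions:

```python
import numpy as np
from skimage import measure

def extract_building_contours(prob_mask, min_area=100, tolerance=2.0):
    """Threshold, marching-squares contour extraction, Ramer-Douglas-Peucker
    simplification, and a minimum-area filter."""
    binary = (prob_mask > 0.5).astype(np.uint8)
    contours = measure.find_contours(binary, level=0.5)   # marching squares [49]
    buildings = []
    for contour in contours:
        simplified = measure.approximate_polygon(contour, tolerance)  # RDP [51]
        rows, cols = simplified[:, 0], simplified[:, 1]
        # polygon area in pixels via the shoelace formula
        area = 0.5 * abs(np.dot(cols, np.roll(rows, 1)) - np.dot(rows, np.roll(cols, 1)))
        if area > min_area:                                # 100-pixel threshold
            buildings.append(simplified)
    return buildings
```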
5. Experiments & Results
We trained UNet, UNet++, DeepLabV3, and DeepLabV3+ individually on the Kathmandu regional datasets, the Opencities datasets [20] , and the Inria Aerial Image Labeling Dataset [21] , with images resized to 224 × 224, using efficientnet-b0 as an encoder pre-trained on ImageNet, for 20 epochs. All models' building precision and building recall scores had plateaued by 20 epochs. Data were grouped into binary segmentation and multiclass classification, the latter consisting of background, building contour, and inner building segments.
The segmentation results on the Kathmandu region, the Open Cities dataset, and the Inria building dataset are displayed in Figures 2-4, each with four columns. The number of detected buildings has increased, and there is a distinctly clearer separation between building boundaries in the multi-class scenario compared to the single-class scenario.
Figure 2. Using UNet: Ground Image (first column), Ground Truth Mask (second column), Predicted mask in a single channel (third column), Predicted mask with three channels (last column) on the Kathmandu region.
Similarly, the evaluation metrics for the Kathmandu region, the Opencities dataset, and the Inria Aerial Image Labeling Dataset on the different architectures are shown in Tables 1-3. The dice loss is computed by comparing the actual building pixels with the predicted building pixels. Building precision and building recall are determined by comparing the building count in the actual image with that in the predicted image. IOU and accuracy are calculated on pixel counts for comparison with existing results.
In terms of building recall for multi-class segmentation, UNet++ demonstrated the best performance, achieving a recall rate of 72% on the Opencities dataset and 67% on the Inria dataset. On the other hand, UNet achieved a higher recall rate of 76% on the Kathmandu regional dataset. Notably, when transitioning from a single class to a multiclass setup, DeepLabV3+ exhibited the most significant improvement in building recall, with an increase of 20%. These findings highlight the effectiveness of the different architectures in accurately identifying and segmenting buildings across the datasets.

Figure 3. Using UNet: Ground Image (first column), Ground Truth Mask (second column), Predicted mask in a single channel (third column), Predicted mask with three channels (last column) on the Open Cities Dataset.
All model architectures achieved an average accuracy of 95% on both single and multiclass datasets, which is comparable to the existing leaderboard results on the Inria Building dataset. The IOU values obtained were also comparable. These results indicate that our models performed at a similar level of accuracy to the established benchmarks in the field.

Figure 4. Using UNet: Ground Image (first column), Ground Truth Mask (second column), Predicted mask in a single channel (third column), Predicted mask with three channels (last column) on the Inria Building Dataset.
We compared F1 scores among the model architectures on the Inria Building Dataset in Table 5, evaluated at an IOU (Intersection over Union) threshold of 0.5, the same threshold used in [10] . The comparison revealed that multi-class datasets yielded improved F1 scores compared to single-class datasets: the DeepLabV3+ model exhibited the highest improvement of 12% in F1 score, while the UNet model showed a comparatively lower improvement of 3%.
Table 1. Results on the Kathmandu region.

Table 2. Results on the Opencities dataset.

Table 3. Results on the Inria building dataset.

Table 4. Building count metrics using UNet architecture on different datasets.

Table 5. Building instance F1 score on the Inria building dataset.
6. Discussion
We conducted inference on a randomly selected subset of the test dataset to calculate the building count metrics presented in Table 4. The table reveals a significant reduction in false negative cases when utilizing a multi-class dataset as opposed to a single-class dataset, resulting in improved building recall. Conversely, false positive cases increased, which can be attributed to the absence of a comprehensive ground truth dataset and to nearby similar structures. To validate our model's generalization, we trained it on the Inria building dataset and achieved IOU and accuracy metrics comparable to the Inria leaderboard [52] . This confirms that our methodology can be employed without compromising IOU and accuracy on the leaderboard, while simultaneously enhancing overall building recall and precision.

The results indicate that the UNet model performed well in building segmentation tasks across different datasets, and that the use of multi-class datasets consistently improved the building count. However, there is a trade-off between false negatives and false positives, with some datasets experiencing an increase in false positives under the multi-class dataset. Nonetheless, the improved building count indicates the multi-class dataset's effectiveness in capturing the complexity and diversity of buildings in different urban environments. Overall, the results highlight the importance of dataset selection and the potential benefits of utilizing multi-class datasets in building segmentation tasks with UNet.
The comparison of F1 scores across model architectures in Table 5 highlights the improvement achieved with multi-class datasets. To specifically isolate the enhancement between single- and multi-class datasets, we trained all of our models for 20 epochs, unlike the MAP-Net and Joint Learning approaches. This allowed us to assess the impact of dataset composition on the F1 score and validate the benefits of multi-class datasets in the context of building segmentation. It also opens up the possibility of experimenting with MAP-Net and Joint Learning on multi-class datasets.
When evaluating our model on the Inria building dataset, we achieved comparable performance in terms of metrics like IOU and accuracy to the leaderboard [52] . This indicates that our methodology can effectively generalize to the dataset used for benchmarking. Additionally, in the experimental results with other datasets such as the Kathmandu Dataset and Opencities dataset, we observed similar trends and performance patterns, further highlighting the generalization potential of our approach across different datasets. These findings demonstrate the robustness and adaptability of our methodology, enabling accurate building segmentation across multiple datasets while maintaining comparable performance to the established leaderboard.
However, due to the lack of similar studies using a multi-class dataset that utilizes not just the background and building instances but also building contours as a class to improve model efficacy, we could not directly compare our results with existing studies on the same basis.
7. Challenges and Limitations
Dataset bias in our study arises from two main factors: the similarity of regions surrounding the buildings and the limited availability of comprehensive ground truth data. The presence of similar structures or elements in the surrounding regions poses a challenge for our model, leading to an increase in false positive cases; this occurs when neighboring structures exhibit visual characteristics or patterns resembling buildings. Additionally, the lack of accurate annotations for certain buildings or structures in the vicinity contributes to dataset bias, affecting the precision of our model's predictions. Addressing these biases requires careful consideration and further research to overcome the challenges posed by regional similarities and the need for improved ground truth data in building segmentation tasks. Another particular limitation of this study is that we could not experiment with the multi-class dataset on architectures like MAP-Net and Joint Learning presented in [9] , in which case our study would have been directly comparable to [9] .
8. Conclusion
In this paper, we experimented with the building detection and counting problem using multiple architectures (UNet, UNet++, DeepLabV3, and DeepLabV3+) on single- and multi-class datasets and calculated the evaluation metrics. Using the multi-class method instead of ensembling networks reduces computational complexity and inference time by making a single model learn both features: building contours and inner building segments. Building recall increased substantially in convoluted regions, which helps decrease the human effort required during the labeling process. Moreover, to our knowledge, our work can be used to detect, segment, and count building instances in convoluted areas more effectively than other existing methods.
Acknowledgements
The authors would like to thank IKebana Solutions LLC for providing the time and resources required for this research project and for their constant support throughout the project. The authors thank their colleagues at IKebana for fruitful discussions, support, and feedback. The authors also thank the reviewers for their time and valuable comments on the work.